AWS Database Blog

Analyze Amazon Neptune Graphs using Amazon SageMaker Jupyter Notebooks

Whether you’re creating a new graph data model and queries, or exploring an existing graph dataset, it can be useful to have an interactive query environment that allows you to visualize the results. In this blog post we show you how to achieve this by connecting an Amazon SageMaker notebook to an Amazon Neptune database. Using the notebook, you load data into the database, query it and visualize the results.

Amazon Neptune is a fast and reliable graph database. It’s ideal when your query workloads require navigating connections and leveraging the strength, weight, or quality of the relationships between entities.

Amazon SageMaker is a fully managed platform for building, training, and deploying machine learning models. In this blog post we use SageMaker for its ability to provide hosted Jupyter notebooks. With just a few clicks you can create a Jupyter notebook, connect it to Neptune, and start querying your database.

Solution overview

The solution presented in this blog post creates the following resources:

  • Neptune VPC with three subnets and a VPC S3 endpoint
  • Neptune cluster comprising a single r4.xlarge instance, with appropriate subnet, parameter and security groups
  • IAM role that allows Neptune to load data from S3
  • SageMaker Jupyter notebook instance with IPython Gremlin extension modules, Gremlin console, and some sample notebook content

These resources work together as follows:

  1. Your Neptune database’s endpoint is provisioned in a new VPC in your account.
  2. SageMaker’s Jupyter notebook is hosted in an Amazon SageMaker VPC.
  3. SageMaker creates an Elastic Network Interface (ENI) in your Neptune VPC that allows your notebook to connect to your Neptune database.
  4. Notebook content is loaded from Amazon S3 into the notebook using a SageMaker lifecycle configuration script.
  5. Neptune allows you to bulk load data from an Amazon S3 bucket (this can be a different bucket from the one used to store the notebook content).
  6. To access files in S3, Neptune uses a VPC S3 endpoint in your Neptune VPC.

Launch the Neptune-SageMaker stack

Launch the Neptune-SageMaker stack from the AWS CloudFormation console by choosing the Launch Stack button for your Region. Acknowledge that AWS CloudFormation will create IAM resources, and then choose Create.

The Neptune and SageMaker resources described here incur costs. With SageMaker-hosted notebooks, you pay only for the Amazon EC2 instance that hosts the notebook. In this blog post, we use an ml.t2.medium instance, which is eligible for the AWS Free Tier.

The stack can be launched in the following AWS Regions: US East (N. Virginia), US East (Ohio), US West (Oregon), EU (Ireland), EU (London), and EU (Frankfurt).

The solution creates four stacks: neptune-sagemaker-base-stack, neptune-base-stack, add-iam-role-to-neptune, and neptune-sagemaker-nested-stack. These are described in more detail later in this post.

Start your notebook instance

After the stacks have been created, open the Amazon SageMaker console and from the left-hand menu select Notebook instances. Choose Open in the Actions column.

In the Jupyter window, open the Neptune directory, and then the Getting-Started directory.

Browse and run the content

The Getting-Started directory contains three notebooks:

  • 01-Introduction.ipynb
  • 02-Labelled-Property-Graph.ipynb
  • 03-Social-Network-Recommendations.ipynb

 The first two introduce Amazon Neptune and the property graph data model. The third contains an executable example of a social network recommendation engine. When you run the content, the notebook populates Neptune with a sample social network dataset, and then issues several queries to generate People-You-May-Know (PYMK) recommendations.

To see this in action, open 03-Social-Network-Recommendations.ipynb and run each of the cells in turn, or choose Run All from the Cell dropdown menu. You should see the results of each query printed below each query cell.

Create your own notebook

Now that you’ve seen an example of querying Neptune from a Jupyter notebook, you’re ready to create your own notebook.

To create a new notebook, in the Jupyter window, choose New and select conda_python3.

Run neptune.py

Our solution installs a Python helper module, neptune.py, in a util directory. This helper module makes it easy to create traversal sources, which act as the starting points for queries, and to load data into Neptune. At the beginning of your notebook script, run this helper module:

%run 'util/neptune.py'

You might have to modify the path to neptune.py, depending on where your new notebook is relative to the util directory.

Drop existing data

If you’ve run other notebooks against your Neptune cluster, it will likely contain some existing data. To remove this data, run the following command:

neptune.clear()

How does this command know which cluster to connect to? The CloudFormation templates that created the Neptune cluster and SageMaker notebook also populated the notebook environment with a couple of environment variables, NEPTUNE_CLUSTER_ENDPOINT and NEPTUNE_CLUSTER_PORT. These variables contain the details of the Neptune cluster. By default, all the neptune.py helper methods use these environment variables, but you can override this behavior.

neptune.clear(neptune_endpoint=<cluster-endpoint>, neptune_port=<port>)

Create some data

You can insert data into Neptune from your notebook in two different ways. You can use a Gremlin client to create vertices and edges by submitting queries to the online Gremlin endpoint, or you can bulk load data from an Amazon S3 bucket.

To insert some data from your notebook using Gremlin, create a traversal source in your notebook script, and then issue a query.

g = neptune.graphTraversal() 
g.addV('Person').property('name', 'Jane Smith').next()

The neptune.graphTraversal() helper method creates a remote connection and binds the variable g to a traversal source. This traversal source acts as the starting point for your queries. You don’t need to initialize a new traversal source for every query. You can reuse g throughout your notebook.
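
For example, you can issue further queries with the same g. Here is a quick sketch that reuses the property and label from the snippet above:

# Read back the vertex created above
g.V().has('name', 'Jane Smith').valueMap().next()

# Count all Person vertices
g.V().hasLabel('Person').count().next()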

As an alternative to creating data online via your notebook script, you can bulk load data into Neptune from S3. Our helper module makes it easy to use Neptune’s bulk loader API.

To bulk load property graph data into Neptune, the data must be formatted as CSV, with edges and vertices in separate files. The S3 bucket containing the source files must be in the same region as the Neptune cluster.
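
As a rough illustration of this format (the property columns and values below are made up; the ~id, ~label, ~from, and ~to fields are the ones the loader requires), a pair of source files might look like this:

vertices.csv:

~id,~label,name:String
p1,Person,Jane Smith
p2,Person,John Doe

edges.csv:

~id,~from,~to,~label
e1,p1,p2,friend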

You can start a bulk load from your notebook with the following command:

neptune.bulkLoad('s3://your-bucket-${AWS_REGION}/path-to-your-files/')

If you include the ${AWS_REGION} placeholder in your S3 path, the bulkLoad() helper method will replace this with the name of the AWS region in which your Neptune cluster and SageMaker notebook are running.

bulkLoad() blocks until the load is complete. For large loads, use the asynchronous version of the method, bulkLoadAsync(), and then check the load status using bulkLoadStatus().

status_url = neptune.bulkLoadAsync('<s3-path-to-your-files>')
(status, response) = neptune.bulkLoadStatus(status_url)

The load is complete when status is LOAD_COMPLETED. Put these two lines in different cells in your notebook, so that you can trigger the load once, but check the status repeatedly.
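
If you prefer to poll from a single cell instead of re-running the status cell by hand, a minimal sketch might look like the following (it assumes bulkLoadStatus() reports the status as the string LOAD_COMPLETED when the load finishes):

import time

status_url = neptune.bulkLoadAsync('<s3-path-to-your-files>')
while True:
    (status, response) = neptune.bulkLoadStatus(status_url)
    if status == 'LOAD_COMPLETED':
        break
    time.sleep(10)  # wait before checking the load status again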

Run some queries

To query Neptune with Gremlin from Jupyter use the gremlinpython package, which implements Gremlin within the Python language. There are a couple of things to remember when using gremlinpython:

  • Python reserved words – as, in, and, or, is, not, from, and global – must be postfixed with an underscore.
  • You must use a terminal action – next(), nextTraverser(), toList(), toSet(), or iterate() – to submit a traversal to the Gremlin server.

The query below illustrates using postfix underscores with reserved words, and the use of a terminal action.

g.V().as_('a').in_().as_('b').select('a','b').toList()
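
Putting these rules together, a friends-of-friends traversal of the kind behind the PYMK example earlier might look like the following sketch. The person label, the knows edge label, and the name Terry are illustrative assumptions rather than the notebook’s actual schema, and Order.decr assumes a TinkerPop 3.3.x version of gremlinpython:

from gremlin_python.process.traversal import P, Order, Scope, Column

# Candidate friends of friends, excluding existing friends and the person themselves
recommendations = (g.V().has('person', 'name', 'Terry').as_('me')
    .both('knows').aggregate('friends')
    .both('knows')
    .where(P.neq('me')).where(P.without('friends'))
    .groupCount().by('name')
    .order(Scope.local).by(Column.values, Order.decr)
    .toList())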

Using the Gremlin console

As an alternative to writing queries in a notebook, you can use the Gremlin console to interact with Neptune. Our SageMaker setup installs the Gremlin console on your notebook instance. To use the console, open your Jupyter notebook instance, choose New, and then choose Terminal.

With the terminal open, go to the tools/apache-tinkerpop-gremlin-console-3.3.3 directory, and start the console:
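
Assuming the standard Apache TinkerPop distribution layout, you start the console with its launch script:

bin/gremlin.sh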

Over to you

You can reuse the dataset and assets created in this blog post in several different ways.

What if I want to reuse an existing Neptune cluster with SageMaker?

No problem. Instead of running the root CloudFormation template (neptune-sagemaker-base-stack.json), run the neptune-sagemaker-nested-stack.json template directly. You’ll need to supply the following additional parameters to this template (a boto3 sketch of supplying them follows the list):

  • NeptuneClusterEndpoint – Cluster endpoint of your existing Neptune cluster. You can get this information from the cluster details tab of your Neptune cluster.
  • NeptuneClusterPort – Port of your existing Neptune cluster.
  • NeptuneClusterVpc – VPC ID of the VPC in which your Neptune cluster is running. You can get this information from the instance details tab of your Neptune cluster.
  • NeptuneClusterSubnetId – ID of one of the subnets in which your Neptune cluster is running. You can get this information from the instance details tab of your Neptune cluster.
  • NeptuneClientSecurityGroup – A VPC security group with access to your Neptune cluster. Leave empty only if the Neptune cluster allows access from anywhere.
  • NeptuneLoadFromS3RoleArn – ARN of the IAM role that allows Amazon Neptune to access Amazon S3 resources. This ARN is used when a notebook populates the database by submitting a load request to the loader API.
  • NotebookContentS3Locations – Comma-separated S3 locations of the notebooks to install into the notebook instance.
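
As a sketch of how these parameters might be supplied programmatically, you could call AWS CloudFormation with boto3. The stack name, template URL, and all parameter values below are placeholders, and this assumes the template is hosted in an S3 bucket you control:

import boto3

cfn = boto3.client('cloudformation')

cfn.create_stack(
    StackName='neptune-sagemaker-notebook',
    TemplateURL='https://s3.amazonaws.com/your-bucket/neptune-sagemaker-nested-stack.json',
    Capabilities=['CAPABILITY_IAM'],  # the stack creates IAM resources
    Parameters=[
        {'ParameterKey': 'NeptuneClusterEndpoint', 'ParameterValue': 'your-cluster.cluster-xxxxxxxx.us-east-1.neptune.amazonaws.com'},
        {'ParameterKey': 'NeptuneClusterPort', 'ParameterValue': '8182'},
        {'ParameterKey': 'NeptuneClusterVpc', 'ParameterValue': 'vpc-0123456789abcdef0'},
        {'ParameterKey': 'NeptuneClusterSubnetId', 'ParameterValue': 'subnet-0123456789abcdef0'},
        {'ParameterKey': 'NeptuneClientSecurityGroup', 'ParameterValue': 'sg-0123456789abcdef0'},
        {'ParameterKey': 'NeptuneLoadFromS3RoleArn', 'ParameterValue': 'arn:aws:iam::123456789012:role/NeptuneLoadFromS3'},
        {'ParameterKey': 'NotebookContentS3Locations', 'ParameterValue': 's3://your-bucket/notebooks/'},
    ],
)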

If you’re using an existing cluster, you may want to create a snapshot of your database before you drop data and start a notebook. If you have a large existing dataset, we recommend creating a new instance rather than trying to drop the data. This solution does not work with a Neptune cluster that has had IAM database authentication enabled.

Reusing the CloudFormation templates with your own notebooks

You can reuse the CloudFormation templates included with this blog post to run your own notebook content against a new Neptune cluster or an existing Neptune cluster. Simply replace the NotebookContentS3Locations parameter value with the S3 location of your own notebook content. If you leave the parameter empty, the templates will create an empty Jupyter instance with all the necessary IPython extensions, plus the Gremlin console, pre-installed.

Details of the CloudFormation stacks

The following CloudFormation stacks are included with this solution:

  • neptune-sagemaker-base-stack.json – This is the root stack.
  • neptune-base-stack.json – This stack is supplied as part of the Neptune Quick Start. The stack creates a new VPC with three subnets, a Neptune database cluster, and the necessary Neptune subnet, database parameter and security groups. The template also creates a VPC S3 endpoint, and an IAM role that allows Neptune to access S3 content. However, the stack does not attach the IAM role to the Neptune cluster.
  • add-iam-role-to-neptune.json – This stack creates a custom CloudFormation resource that uses an AWS Lambda function to attach the S3 access IAM role created by the previous template to the Neptune cluster.
  • neptune-sagemaker-nested-stack.json – This stack creates a SageMaker Jupyter notebook instance in an Amazon SageMaker VPC. It creates a network interface in the Neptune VPC to enable traffic between the notebook instance and your Neptune cluster. The template installs the Gremlin console, some IPython Gremlin extension modules, and the specified notebook content into your notebook instance.

SageMaker notebook access to Neptune

Access to the Neptune Gremlin endpoint is restricted to clients situated in the same VPC. SageMaker, however, creates Jupyter notebook instances in an Amazon SageMaker VPC.

To allow SageMaker to connect to Neptune, choose the optional VPC configuration, specifying the Neptune VPC and one of the Neptune subnets when you create the notebook instance. SageMaker then creates an elastic network interface in the specified subnet, thereby connecting the notebook instance to the Neptune VPC.
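
The CloudFormation stack handles this for you, but as an illustration of the underlying call, the same VPC configuration could be supplied when creating a notebook instance with boto3. All names, IDs, and the role ARN below are placeholders:

import boto3

sagemaker = boto3.client('sagemaker')

sagemaker.create_notebook_instance(
    NotebookInstanceName='neptune-notebook',
    InstanceType='ml.t2.medium',
    RoleArn='arn:aws:iam::123456789012:role/your-sagemaker-execution-role',
    SubnetId='subnet-0123456789abcdef0',        # a subnet in the Neptune VPC
    SecurityGroupIds=['sg-0123456789abcdef0'],  # security group with access to Neptune
)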

The neptune-sagemaker-nested-stack.json CloudFormation stack creates a new SageMaker security group that it associates with the subnet in the Neptune VPC. You can override this behavior by supplying the ID of an existing security group that has inbound access to the Neptune VPC security group.

Automatically associating an S3 IAM role with Neptune

Loading data from an S3 bucket requires an AWS Identity and Access Management (IAM) role that has access to the bucket. When you supply the ARN of this role to the loader API, Neptune assumes the role in order to load the data.

The neptune-base-stack.json CloudFormation stack creates this IAM role. However, the Neptune DBCluster CloudFormation resource doesn’t provide a property that allows this role to be associated with the cluster when the cluster is created. This is why you use another CloudFormation template, add-iam-role-to-neptune.json, with a custom CloudFormation resource that uses an AWS Lambda function to associate the role with the cluster. The custom resource is based on the Stelligent examples in GitHub, but with the Lambda Python code inlined in the Lambda CloudFormation resource. Inlining the code avoids having to deploy a code package to S3 buckets in multiple Regions, because the bucket containing a Lambda deployment package must reside in the same AWS Region in which the Lambda function is created.

Our Python-based Lambda function uses boto3 to call Neptune’s AddRoleToDBCluster resource management API.

import boto3

# Attach the S3-access IAM role (created by neptune-base-stack.json) to the Neptune cluster
client = boto3.client('neptune')
client.add_role_to_db_cluster(
    DBClusterIdentifier=dbClusterId,
    RoleArn=iamRoleArn
)

Environment variables and the neptune.py helper module

The SageMaker lifecycle configuration script included with this example creates several environment variables that you can access from your Python scripts (see the short example after the list below):

  • NEPTUNE_CLUSTER_ENDPOINT – Neptune cluster endpoint that was supplied to the sagemaker-neptune.json template.
  • NEPTUNE_CLUSTER_PORT – Neptune cluster port that was supplied to the sagemaker-neptune.json template.
  • NEPTUNE_LOAD_FROM_S3_ROLE_ARN – S3 role ARN that was supplied to the sagemaker-neptune.json template.
  • AWS_REGION – The AWS region in which the SageMaker notebook and Neptune database are running.
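
For example, you can read them with os.environ in any notebook cell:

import os

endpoint = os.environ['NEPTUNE_CLUSTER_ENDPOINT']
port = os.environ['NEPTUNE_CLUSTER_PORT']
print(endpoint, port)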

The lifecycle configuration script also installs a neptune.py helper module into the Neptune notebook directory on your Jupyter notebook instance. This module includes several methods:

  • clear() – Drops the data in the Neptune database.
  • bulkLoad() – Blocking call that loads data into Neptune from S3. If you supply a ${AWS_REGION} placeholder in the S3 path, bulkLoad() replaces it with the region supplied to the method, or with the value of the AWS_REGION environment variable. If you publish the data to be loaded into Neptune to region-specific S3 buckets in each of the Neptune regions, you can take advantage of this placeholder when writing your own notebook content.
  • bulkLoadAsync() – Triggers the bulk load process and immediately returns a URL you can use to check the status of the load.
  • bulkLoadStatus() – Given a bulk load status_url, this method checks the progress of the load and returns a (status, jsonresponse) tuple. The load is complete when status is LOAD_COMPLETED.
  • graphTraversal() – Creates a graph traversal source that you can bind to a variable (for example, g) and use to refer to the graph in subsequent Gremlin queries.

For each method, you can supply a Neptune cluster endpoint and port. If you don’t, the method will use the values of the NEPTUNE_CLUSTER_ENDPOINT and NEPTUNE_CLUSTER_PORT environment variables.

Conclusion

Amazon Neptune allows you to store and query highly connected data. With Amazon SageMaker-hosted Jupyter notebooks, you can easily connect to, query, and visualize your Neptune graph.

In this blog post we’ve provided you with AWS CloudFormation templates that make it easy to spin up a Neptune cluster and a SageMaker notebook environment so you can create your own graph data models and queries, and notebook content.

If you have any questions or comments about this blog post, feel free to use the comments section below.


About the Authors

Ian Robinson is an architect with the Database Services Customer Advisory Team. He is a coauthor of ‘Graph Databases’ and ‘REST in Practice’ (both from O’Reilly) and a contributor to ‘REST: From Research to Practice’ (Springer) and ‘Service Design Patterns’ (Addison-Wesley).

Kelvin Lawrence is a Principal Data Architect in the Database Services Customer Advisory Team focused on Amazon Neptune and many other related services. He has been working with graph databases for many years, is the author of the book “Practical Gremlin” and is a committer on the Apache TinkerPop project.