AWS Database Blog

Migrating a Neo4j graph database to Amazon Neptune with a fully automated utility

Amazon Neptune is a fully managed graph database service that makes it easy to build and run applications that work with highly connected datasets. You can benefit from the service’s purpose-built, high-performance, scalable, and reliable graph database engine when you migrate data from your existing self-managed graph databases, such as Neo4j.

This post shows you how to migrate from Neo4j to Amazon Neptune by using an example AWS CDK app that utilizes the neo4j-to-neptune command-line utility from the Neptune tools GitHub repo. The example app completes the following tasks:

  • Sets up and configures Neo4j and Amazon Neptune databases
  • Exports the movies graph from the example project on the Neo4j website as a CSV file
  • Converts the exported data to the Amazon Neptune bulk load CSV format by using the neo4j-to-neptune utility
  • Imports the converted data into Amazon Neptune

Architecture

The following architecture diagram shows the building blocks that you need to build a loosely coupled app for the migration. The app automates the creation of the following resources:

  • An Amazon EC2 instance to download and install a Neo4j graph database, and Apache TinkerPop Gremlin console for querying Amazon Neptune. This instance acts both as the migration source and as a client to run AWS CLI commands, such as copying exported files to an Amazon S3 bucket and loading data into Amazon Neptune.
  • An Amazon S3 bucket from which to load data into Neptune.
  • An Amazon Neptune DB cluster with one graph database instance.

Running the migration

Clone the AWS CDK app from the GitHub repo. After ensuring that you meet the prerequisites, follow the instructions there to run the migration.
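
A typical sequence looks like the following (a sketch; the repository URL placeholder and the deploy script name are assumptions, so defer to the repo’s README):

$ git clone <CDKAppRepoURL>    # placeholder: use the repo linked in this post
$ cd <repo-directory>
$ npm install                  # install the CDK app's dependencies
$ npm run deploy               # assumption: counterpart of the repo's npm run destroy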

The app automates the migration of the Neo4j movies graph database to Amazon Neptune. After you run the app successfully, you see an output similar to the following screenshot in your terminal:

Record the values, such as NeptuneEndpoint, to use in later steps.

The app provisions the Neo4j and Amazon Neptune databases and performs the migration. The following sections explain how the app provisions and runs the migration, and show you how to use the Gremlin console on the EC2 instance to query Neptune to validate the migration.

Migration overview

The AWS CDK app automates three essential phases of the migration:

  1. Provision AWS infrastructure
  2. Prepare for the migration
  3. Perform the migration

Provisioning AWS infrastructure

When you run the app, it creates the following resources in your AWS account.

Amazon VPC and subnets

The app creates an Amazon VPC denoted by VPCID. You must create Neptune clusters in a VPC, and you can only access their endpoints within that VPC. To access your Neptune database, the app uses an EC2 instance that runs in the same VPC to load data and run queries. The app also creates two /24 public subnets, one in each of two Availability Zones.

EC2 instance

A single EC2 instance denoted by EC2Instance performs the following functions:

  • Downloads and installs the Neo4j Community Edition graph database (version 4.0.0)
  • Runs AWS CLI commands to copy local files to Amazon S3
  • Runs AWS CLI commands to load data into Neptune
  • Runs Apache TinkerPop Gremlin commands to query and verify the data migration to Neptune

S3 bucket

The app creates a single S3 bucket, denoted by S3BucketName, to hold data exported from Neo4j. The app triggers a bulk load of this data from the bucket into Neptune.

Amazon S3 gateway VPC endpoint

The app creates a Neptune database cluster in a public subnet inside the VPC. To make sure that Neptune can access and download data from Amazon S3, the app also creates a gateway-type VPC endpoint for Amazon S3. For more information, see Gateway VPC Endpoints.
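
If you were to create this endpoint yourself, the equivalent AWS CLI call would look something like the following (illustrative only; the app provisions it for you, and the IDs are placeholders):

$ aws ec2 create-vpc-endpoint \
    --vpc-id <VPCID> \
    --service-name com.amazonaws.<AWSRegion>.s3 \
    --route-table-ids <RouteTableID>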

A single-node Neptune cluster

This is the destination in this migration—the target Neptune graph database denoted by NeptuneEndpoint. The app loads the exported data into this database. You can use the Gremlin console on the EC2 instance to query the data.
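
You can confirm that the cluster is available with a quick AWS CLI check; a sketch:

$ aws neptune describe-db-clusters \
    --query 'DBClusters[].{Endpoint:Endpoint,Status:Status}'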

Required AWS IAM roles and policies

To allow access to AWS resources, the app creates all the IAM roles and policies necessary to perform the migration.
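
For example, the role that Neptune assumes to read the exported files from Amazon S3 (referenced as NeptuneTrustedS3Role in the bulk load step later in this post) needs a trust policy along the following lines. This is a sketch of what the app sets up, not its exact policy document:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "rds.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}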

Preparing for the migration

After provisioning the infrastructure, the app automates the steps shown in the diagram below:

Create a movie graph in Neo4j

The app uses bootstrapping shell scripts to install and configure Neo4j Community Edition 4.0.0 on the EC2 instance. The scripts then load the Neo4j movies graph into this database.
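
From the command line, loading a Cypher script into the database looks roughly like the following (a sketch, assuming a movies.cypher file and default credentials; the app’s bootstrap scripts handle this step for you):

$ cat movies.cypher | cypher-shell -u neo4j -p <password>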

Export the graph data to a CSV file

The app uses the following Neo4j Cypher script to export all nodes and relationships into a comma-delimited file:

CALL apoc.export.csv.all('neo4j-export.csv', {d:','});

The exported file is saved to the following location:

/var/lib/neo4j/import/neo4j-export.csv

As part of automating the Neo4j configuration, the app installs the APOC library, which contains procedures for exporting data from Neo4j, and edits the neo4j.conf file with the following code so that APOC can write to a file on disk:

apoc.export.file.enabled=true

The app also whitelists the APOC procedures in the neo4j.conf file so that they can be called. See the following code:

dbms.security.procedures.unrestricted=apoc.*
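
If you were applying these two settings manually on the instance, the edits would look something like the following (illustrative; the configuration path and service name can vary by installation):

$ echo "apoc.export.file.enabled=true" | sudo tee -a /etc/neo4j/neo4j.conf
$ echo "dbms.security.procedures.unrestricted=apoc.*" | sudo tee -a /etc/neo4j/neo4j.conf
$ sudo systemctl restart neo4j    # restart so the new settings take effect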

Performing the migration

In this phase, the app migrates the data to Neptune. This includes the following automated steps.

Transform Neo4j exported data to Gremlin load data format

The app uses the neo4j-to-neptune command-line utility to transform the exported data to a Gremlin load data format with a single command. See the following code:

$ java -jar neo4j-to-neptune.jar convert-csv -i /var/lib/neo4j/import/neo4j-export.csv -d output --infer-types

The neo4j-to-neptune utility creates an output folder and copies the results to separate files: one each for vertices and edges. The utility has two required parameters: the path to the Neo4j export file (/var/lib/neo4j/import/neo4j-export.csv) and the name of a directory (output) where the converted CSV files are written. There are also optional parameters that allow you to specify node and relationship multi-valued property policies and turn on data type inferencing. For example, the --infer-types flag tells the utility to infer the narrowest supported type for each column in the output CSV as an alternative to specifying the data type for each property. For more information, see Gremlin Load Data Format.
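
To give a sense of the target format, a converted vertices file starts with a header row that combines Neptune’s ~id and ~label system columns with typed property columns, along the following lines (illustrative rows, not the utility’s exact movies output):

~id,~label,name:String,born:Int
0,Person,"Keanu Reeves",1964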

The neo4j-to-neptune utility addresses differences between the Neo4j and Neptune property graph data models. Neptune’s property graph is very similar to Neo4j’s, including support for multiple labels on vertices and for multi-valued properties (sets but not lists). Neo4j allows homogeneous lists of simple types that contain duplicate values to be stored as properties on both nodes and edges. Neptune, on the other hand, provides set and single cardinality for vertex properties, and single cardinality for edge properties. The neo4j-to-neptune utility provides policies to migrate Neo4j node list properties that contain duplicate values into Neptune vertex properties, and Neo4j relationship list properties into Neptune edge properties. For more information, see the GitHub repo.

Copy the output data to Amazon S3

The conversion creates two files, edges.csv and vertices.csv, in the output folder. The app copies these files to the S3 bucket created specifically for this purpose. See the following code:

$ aws s3 cp /output/ s3://<S3BucketName>/neo4j-data --recursive
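
You can confirm that the upload succeeded with a quick listing:

$ aws s3 ls s3://<S3BucketName>/neo4j-data/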

Load data into Neptune

The final step of the automated migration calls the Neptune bulk loader API to load the edges and vertices into Neptune. See the following code:

curl -X POST \
    -H 'Content-Type: application/json' \
    <NeptuneLoaderEndpoint> -d '
    { 
      "source": "s3://<S3BucketName>/neo4j-data", 
      "format": "csv",  
      "iamRoleArn": "arn:aws:iam::<AWSAccount>:role/<NeptuneTrustedS3Role>", 
      "region": "<AWSRegion>", 
      "failOnError": "FALSE"
    }'
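
The loader responds with a loadId that you can poll to track the progress of the load; a typical status check looks like the following (the loadId value is a placeholder from the loader’s response):

$ curl -G '<NeptuneLoaderEndpoint>/<loadId>'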

For more information, see Loading Data into Amazon Neptune.

Verifying the migration

After the automated steps are complete, you are ready to verify that the migration was successful.

Amazon Neptune is compatible with Apache TinkerPop3 and Gremlin 3.4.5. This means that you can connect to a Neptune DB instance and use the Gremlin traversal language to query the graph.

To verify the migration, complete the following steps:

  1. Connect to the EC2 instance after it passes both status checks.
    For more information, see Types of Status Checks.
  2. Use the value of NeptuneEndpoint to execute the following command:
    $ docker run -it -e NEPTUNE_HOST=<NeptuneEndpoint> sanjeets/neptune-gremlinc-345:latest
  3. At the prompt, execute the following command to send all your queries to Amazon Neptune:
    :remote console
  4. Execute the following command to see the number of vertices migrated:
    g.V().count()

    The following screenshot shows the output of the command g.V().count().
    You can now, for example, run a simple query that returns all the movies in which Tom Cruise acted.
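
    A traversal along the following lines does this (a sketch, assuming the movies graph’s Person and ACTED_IN labels and the title property carry over from Neo4j unchanged):

    g.V().has('Person', 'name', 'Tom Cruise').out('ACTED_IN').values('title')

    The following screenshot shows the intended output.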

Cleaning up

After you run the migration, clean up all the resources the app created with the following code:

npm run destroy

Conclusion

Neptune is a fully managed graph database service that makes it easy to focus on building great applications for your customers instead of worrying about database management tasks like hardware provisioning, software patching, setup, configuration, or backups. This post demonstrated how to migrate Neo4j data to Neptune in a few simple steps.

About the Author

Sanjeet Sahay is a Sr. Partner Solutions Architect with Amazon Web Services.