Moving to the cloud: Migrating Blazegraph to Amazon Neptune

Throughout the lifespan of a graph database application, the application itself tends to have only basic requirements, namely a functioning W3C-standard SPARQL endpoint. However, as graph databases become embedded in critical business applications, both the business and its operations teams require much more. Critical business infrastructure must not only function, but also be highly available, secure, scalable, and cost-effective. These requirements are driving the desire to move from on-premises or self-hosted solutions to a fully managed graph database solution such as Amazon Neptune.

Neptune is a fast, reliable, fully managed graph database service that makes it easy to build and run business-critical graph database applications. Neptune is a purpose-built, high-performance graph database engine optimized for storing billions of relationships and querying the graph with millisecond latency. Neptune is designed to be highly available, with read replicas, point-in-time recovery, continuous backup to Amazon Simple Storage Service (Amazon S3), and replication across Availability Zones. Neptune is secure with support for AWS Identity and Access Management (IAM) authentication, HTTPS-encrypted client connections, and encryption at rest. Neptune also provides a variety of instance types, including low-cost instances targeted at development and testing, which provide a predictable, low-cost, managed infrastructure.

When choosing to migrate from current on-premises or self-hosted graph database solutions to Neptune, what’s the best way to perform this migration?

This post demonstrates how to migrate from the open-source RDF triplestore Blazegraph to Neptune by completing the following steps:

Provision AWS infrastructure. Begin by provisioning the required AWS infrastructure using an AWS CloudFormation template.
Export data from Blazegraph. This post examines the two main methods for exporting data from Blazegraph: via SPARQL CONSTRUCT queries or using the Blazegraph Export utility.
Import the data into Neptune. Load the exported data files into Neptune using the Neptune Workbench and the Neptune bulk loader.

This post also examines the differences you need to be aware of while migrating between the two databases. Although this post is targeted at those migrating from Blazegraph, the approach is generally applicable for migration from other RDF triplestore databases.

Architecture

Before covering the migration process, let’s examine the fundamental building blocks of the architecture used throughout this post. This architecture consists of four main components:

Blazegraph instance – The instance contains the data to migrate. This may be a self-managed Amazon Elastic Compute Cloud (Amazon EC2) instance, an on-premises server, or a local installation.
S3 bucket – You configure the bucket to load data into Neptune.
Neptune DB cluster – The Neptune DB cluster has at least a single writer instance (read instances are optional).
Neptune Workbench – You use the workbench to run the bulk load and validate the results.

The following diagram summarizes these resources and illustrates the solution architecture.

Provisioning the AWS infrastructure

Although it’s possible to construct the required AWS infrastructure manually through the AWS Management Console or CLI, this post uses a CloudFormation template to create the majority of the required infrastructure.

Navigate to the page Using an AWS CloudFormation Stack to Create a Neptune DB Cluster.
Choose Launch Stack in your preferred Region.
Set the required parameters (stack name and EC2SSHKeyPairName); you can also set the optional parameters to ease the migration process:

- For AttachBulkloadIAMRoleToNeptuneCluster, choose true.
  This parameter allows for creating and attaching the appropriate IAM role to your cluster to allow for bulk loading data.
- For NotebookInstanceType, choose your preferred instance type.
  This parameter creates a Neptune Workbench notebook that you use to run the bulk load into Neptune and validate the migration.

Choose Next.
Set any preferred stack options.
Choose Next.
Review your options and select both check boxes to acknowledge that AWS CloudFormation may require additional capabilities and choose Create stack as shown in the following image.
The stack creation process can take a few minutes.
When the stack is complete, create an Amazon S3 VPC endpoint.
You now have a gateway endpoint. The following image shows its configuration.
With the endpoint configured, you have completed provisioning the AWS Infrastructure and are ready to export data from Blazegraph.
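
If you prefer to script the S3 VPC endpoint step rather than use the console, the following is a minimal sketch using the boto3 `create_vpc_endpoint` call. The helper names and the example IDs are hypothetical; you need the VPC ID and route table IDs of the VPC created by your stack, and the second function requires AWS credentials and the boto3 package.

```python
def build_s3_gateway_endpoint_request(region, vpc_id, route_table_ids):
    """Build the keyword arguments for an S3 gateway endpoint request."""
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        # S3 gateway endpoints use a Region-specific service name
        "ServiceName": f"com.amazonaws.{region}.s3",
        "RouteTableIds": list(route_table_ids),
    }


def create_s3_gateway_endpoint(region, vpc_id, route_table_ids):
    """Create the endpoint and return its ID (calls AWS, so it needs credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    request = build_s3_gateway_endpoint_request(region, vpc_id, route_table_ids)
    response = ec2.create_vpc_endpoint(**request)
    return response["VpcEndpoint"]["VpcEndpointId"]


# Example with placeholder IDs (replace with the values from your own stack):
# create_s3_gateway_endpoint("us-east-1", "vpc-0123456789abcdef0", ["rtb-0123456789abcdef0"])
```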

Exporting data from Blazegraph

The process of exporting data from Blazegraph involves three steps:

Export the data using CONSTRUCT or the Blazegraph Export utility.
Create an S3 bucket to store the data.
Upload your exported files to the S3 bucket.

Exporting the data

The first step is exporting the data out of Blazegraph in a format that’s compatible with the Neptune bulk loader. For more information about supported formats, see RDF Load Data Formats.

Depending on how the data is stored in Blazegraph (triples or quads) and how many named graphs are in use, Blazegraph may require that you perform the export process multiple times and generate multiple data files. If the data is stored as triples, you need to run one export for each named graph. If the data is stored as quads, you may choose to either export data in N-Quads format or export each named graph in a triples format. For this post, you export a single namespace as N-Quads, but you can repeat the process for additional namespaces or desired export formats.
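
If you choose to export each named graph separately in a triples format, you need one query per graph. The following sketch shows one way to generate those queries; the function names are hypothetical, and the character check is only a basic guard against malformed IRIs, not a full IRI validator.

```python
# Characters that may not appear inside a SPARQL IRI reference (<...>)
_FORBIDDEN_IRI_CHARS = set('<>"{}|^`\\ ')


def list_graphs_query():
    """Query that returns every named graph containing at least one triple."""
    return "SELECT DISTINCT ?g WHERE { GRAPH ?g { ?s ?p ?o } }"


def construct_query_for_graph(graph_iri):
    """Build a CONSTRUCT query that exports the triples of one named graph."""
    if any(ch in _FORBIDDEN_IRI_CHARS for ch in graph_iri):
        raise ValueError(f"Not a usable graph IRI: {graph_iri!r}")
    return (
        "CONSTRUCT { ?s ?p ?o } "
        f"WHERE {{ GRAPH <{graph_iri}> {{ ?s ?p ?o }} }}"
    )
```

You would first run the query from `list_graphs_query()` against the Blazegraph endpoint, then run one `construct_query_for_graph(...)` export per returned graph, writing each result to its own file.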

There are two recommended methods for exporting data from Blazegraph. Which one you choose depends on whether the application needs to be online and available during the migration.

If it must be online, we recommend using SPARQL CONSTRUCT queries. With this option, you need to install, configure, and run a Blazegraph instance with an accessible SPARQL endpoint.

If the application is not required to be online, we recommend using the Blazegraph Export utility. With this option, you must download Blazegraph, and the data file and configuration files need to be accessible, but the server doesn’t need to be running.

SPARQL CONSTRUCT queries

SPARQL CONSTRUCT queries are a feature of SPARQL that return an RDF graph matching the specified query template. For this use case, you use them to export your data one namespace at a time using the following query:

CONSTRUCT WHERE { hint:Query hint:analytic "true" . hint:Query hint:constructDistinctSPO "false" . ?s ?p ?o }

Although a variety of RDF tools to export this data exist, the easiest way to run this query is by using the REST API endpoint provided by Blazegraph. The following script demonstrates how to use a Python (3.6+) script to export data as N-Quads:

import requests

# Configure the URL here: e.g. http://localhost:9999/sparql
url = "http://localhost:9999/sparql"
payload = {'query': 'CONSTRUCT WHERE { hint:Query hint:analytic "true" . hint:Query hint:constructDistinctSPO "false" . ?s ?p ?o }'}
# Set the export format to be N-Quads
headers = {
    'Accept': 'text/x-nquads'
}
# Run the HTTP request and stop if the server reports an error
response = requests.post(url, headers=headers, data=payload)
response.raise_for_status()
# Write the results to a file; the context manager closes the file handle
with open("export.nq", "w", encoding="utf-8") as f:
    f.write(response.text)

If the data is stored as triples, you need to change the ‘Accept’ header parameter to export data in an appropriate format (N-Triples, RDF/XML, or Turtle) using the values specified on the GitHub repo.
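
The script above holds the entire response in memory before writing it to disk, which can be a problem for large datasets. As a hedged alternative, the following sketch streams the response straight to a file using only the Python standard library. The function names are hypothetical, and the URL is the same example endpoint used above.

```python
import shutil
import urllib.parse
import urllib.request


def build_export_request(url, query, accept="text/x-nquads"):
    """Build a form-encoded POST request for a SPARQL endpoint."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    headers = {
        "Accept": accept,
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def stream_export(url, query, out_path, accept="text/x-nquads"):
    """Run the query and copy the response to out_path in chunks."""
    request = build_export_request(url, query, accept)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out_file:
        shutil.copyfileobj(response, out_file)


# Example (requires a running Blazegraph endpoint):
# stream_export("http://localhost:9999/sparql", "CONSTRUCT WHERE { ?s ?p ?o }", "export.nq")
```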

Although performing this export using the REST API is one way to export your data, it requires a running server and sufficient server resources to process this additional query overhead. This isn’t always possible, so how do you perform an export on an offline copy of the data?

For those use cases, you can use the Blazegraph Export utility to get an export of the data.

Blazegraph Export utility

Blazegraph contains a utility method to export data: the ExportKB class. This utility facilitates exporting data from Blazegraph, but unlike the previous method, the server must be offline while the export is running. This makes it the ideal method to use when you can take the application offline during migration, or the migration can occur from a backup of the data.

You run the utility via a Java command line from a machine that has Blazegraph installed but not running. The easiest way to run this command is to download the latest blazegraph.jar release located on GitHub. Running this command requires several parameters:

log4j.primary.configuration – The location of the log4j properties file.
log4j.configuration – The location of the log4j properties file.
outdir – The output directory for the exported data. Files are written as gzip-compressed (.gz) files in a subdirectory named per knowledge base.
format – The desired output format followed by the location of the RWStore.properties file. If you’re working with triples, you need to change the -format parameter to N-Triples, Turtle, or RDF/XML.

For example, if you have the Blazegraph journal file and properties files, export data as N-Quads with the following code:

java -cp blazegraph.jar \
       com.bigdata.rdf.sail.ExportKB \
       -outdir ~/temp/ \
       -format N-Quads \
       ./RWStore.properties

Upon successful completion, you see a message similar to the following output:

Exporting kb as N-Quads on /home/ec2-user/temp/kb
Effective output directory: /home/ec2-user/temp/kb
Writing /home/ec2-user/temp/kb/kb.properties
Writing /home/ec2-user/temp/kb/data.nq.gz
Done

No matter which option you choose, you can successfully export your data from Blazegraph in a Neptune-compatible format. You can now move on to migrating these data files to Amazon S3 to prepare for bulk load.
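
Before uploading, it can be worth recording how many statements each export file contains so that you have a number to compare against after the load. Because N-Quads and N-Triples are line-based formats with one statement per line, a simple line count is enough. The following is a small sketch with a hypothetical helper name; it handles both plain and gzip-compressed files.

```python
import gzip


def count_statements(path):
    """Count statements in an N-Quads or N-Triples file (plain or .gz)."""
    opener = gzip.open if path.endswith(".gz") else open
    count = 0
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            stripped = line.strip()
            # Skip blank lines and comment lines, which carry no statement
            if stripped and not stripped.startswith("#"):
                count += 1
    return count


# Example: count_statements("/home/ec2-user/temp/kb/data.nq.gz")
```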

Creating an S3 bucket

With your data exported from Blazegraph, the next step is to create a new S3 bucket. This bucket holds the data files exported from Blazegraph for the Neptune bulk loader to use. Because the Neptune bulk loader requires low latency access to the data during load, this bucket needs to be located in the same Region as the target Neptune instance. Other than the location of the S3 bucket, no specific additional configuration is required.

You can create a bucket in a variety of ways:

On the Amazon S3 console – For instructions, see Creating a bucket.
Via the AWS CLI – For instructions, see Using high-level (s3) commands with the AWS CLI.

Programmatically using the AWS SDK – The following code uses the Python boto3 SDK to create your S3 bucket:

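import boto3

region = '<insert region name>'
bucket_name = '<insert bucket name>'
s3_client = boto3.client('s3', region_name=region)
if region == 'us-east-1':
    # us-east-1 is the default location and rejects an explicit LocationConstraint
    s3_client.create_bucket(Bucket=bucket_name)
else:
    location = {'LocationConstraint': region}
    s3_client.create_bucket(Bucket=bucket_name, CreateBucketConfiguration=location)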

You use the newly created S3 bucket location to bulk load the data into Neptune.

Uploading data files to Amazon S3

The next step is to upload your data files from your export location to this S3 bucket. As with the bucket creation, you can do this in the following ways:

On the Amazon S3 console – For instructions, see Uploading an object to a bucket.
Via the AWS CLI – For instructions, see Using high-level (s3) commands with the AWS CLI.

Programmatically – See the following code:

import boto3

bucket_name = '<insert bucket name>'
s3 = boto3.resource('s3')
# Upload the local export file, using the same name as the S3 object key
s3.meta.client.upload_file('export.nq', bucket_name, 'export.nq')

Although this example code only loads a single file, if you exported multiple files, you need to upload each file to this S3 bucket.
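
When an export produces several files, a short loop saves repeating the upload call by hand. The following sketch assumes your export files end in `.nq` or `.nq.gz` and live in a single directory; the function names and the optional key prefix are hypothetical. The planning step is separate from the upload so that you can review the list before anything is sent.

```python
from pathlib import Path


def plan_uploads(export_dir, prefix="", patterns=("*.nq", "*.nq.gz")):
    """Return sorted (local_path, s3_key) pairs for the export files found."""
    clean_prefix = prefix.strip("/")
    plan = []
    for pattern in patterns:
        for path in Path(export_dir).glob(pattern):
            key = f"{clean_prefix}/{path.name}" if clean_prefix else path.name
            plan.append((path, key))
    return sorted(plan, key=lambda item: item[1])


def upload_exports(bucket_name, export_dir, prefix=""):
    """Upload every planned file (calls AWS, so it needs credentials and boto3)."""
    import boto3  # imported lazily so plan_uploads works without boto3

    s3 = boto3.client("s3")
    for local_path, key in plan_uploads(export_dir, prefix):
        s3.upload_file(str(local_path), bucket_name, key)
```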

After loading all the files in your S3 bucket, you’re ready for the final task of the migration: importing data into Neptune.

Importing data into Neptune

Because you exported your data from Blazegraph and made it available via Amazon S3, your next step is to import the data into Neptune. Neptune has a bulk loader that loads data faster and with less overhead than performing load operations using SPARQL. The bulk loader process is started by a call to the loader endpoint API to load data stored in the identified S3 bucket into Neptune. This loading process happens in three steps:

Neptune bulk load
Data loading
Data validation

The following diagram illustrates how we will perform these steps in our AWS infrastructure.

Running the Neptune bulk load

You begin the import process by making a request into Neptune to start the bulk load. Although this is possible via a direct call to the loader REST endpoint, you must have access to the private VPC in which the target Neptune instance runs. You could set up a bastion host, SSH into that machine, and run the cURL command, but Neptune Workbench is an easier method.
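
For reference, a direct call to the loader endpoint is an HTTP POST with a small JSON body. The following sketch builds that body and sends it with the Python standard library. It assumes the default Neptune port of 8182, a machine inside the cluster's VPC, and a cluster without IAM database authentication (otherwise the request must also be signed). The function names and placeholder values are hypothetical.

```python
import json
import urllib.request


def build_loader_request(source, iam_role_arn, region, rdf_format="nquads"):
    """Build the JSON body for a Neptune bulk load request."""
    return {
        "source": source,            # for example, s3://bucket_name/file_name
        "format": rdf_format,        # nquads matches the export used in this post
        "iamRoleArn": iam_role_arn,  # role that lets Neptune read the bucket
        "region": region,            # Region of the S3 bucket
        "failOnError": "FALSE",      # keep loading past bad records
    }


def start_bulk_load(neptune_endpoint, payload, port=8182):
    """POST the request to the loader endpoint and return the load ID."""
    request = urllib.request.Request(
        f"https://{neptune_endpoint}:{port}/loader",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["payload"]["loadId"]
```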

Neptune Workbench is a preconfigured Jupyter notebook, hosted as an Amazon SageMaker notebook, with several Neptune-specific notebook magics installed. These magics simplify common Neptune interactions, such as checking the cluster status, running SPARQL queries and Gremlin traversals, and running a bulk loading operation.

To start the bulk load process, use the %load magic, which provides an interface to run the Neptune loader API.

On the Neptune console, choose Notebooks.
Select aws-neptune-blazegraph-to-neptune.
Choose Open notebook.
After the notebook opens, you are redirected to the running instance of the Jupyter notebook, as seen in the following image.
From here, you may either select an existing notebook or create a new one using the Python 3 kernel.
In your open notebook, open a cell, enter %load, and run the cell.
You must now set the parameters for the bulk loader.
For Source, enter the location of your source file (s3://{bucket_name}/{file_name}).
For Format, choose the appropriate format, which in our example is nquads.
For Load ARN, enter the ARN for the IAMBulkLoad role. (This information is located on the IAM console under Roles.)
Choose Submit.

The result contains the status of the request. Bulk loads are long-running processes; this response doesn’t mean that the load is complete, only that it has begun. This status updates periodically to provide the most recent loading job status until the job is complete. When loading is complete, you receive notification of the job status.
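
If you start the load outside the workbench, you can poll the loader status yourself with a GET request to `/loader/{loadId}`. The following sketch shows the polling logic only. It assumes the status response carries an `overallStatus` object with a `status` field, and `fetch_status` is a hypothetical callable that you supply to perform the actual HTTP request and return the parsed JSON.

```python
import time

# Statuses that mean the job has not finished yet (assumed from the loader API)
IN_FLIGHT_STATUSES = {"LOAD_NOT_STARTED", "LOAD_IN_QUEUE", "LOAD_IN_PROGRESS"}


def summarize_load_status(status_response):
    """Reduce a loader status response to the fields needed for polling."""
    overall = status_response["payload"]["overallStatus"]
    status = overall["status"]
    return {
        "status": status,
        "finished": status not in IN_FLIGHT_STATUSES,
        "succeeded": status == "LOAD_COMPLETED",
        "total_records": overall.get("totalRecords"),
    }


def wait_for_load(fetch_status, poll_seconds=30, sleep=time.sleep):
    """Poll until the load reaches a terminal status, then return the summary."""
    while True:
        summary = summarize_load_status(fetch_status())
        if summary["finished"]:
            return summary
        sleep(poll_seconds)
```

Any terminal status other than `LOAD_COMPLETED` is treated as unsuccessful here, so check the returned `status` value to see why a load stopped.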

After the loading job completes successfully, your data is in Neptune and you’re ready to move on to the final step of the import process: validating the data migration.

Validating the data load

As with any data migration, you can validate that the data migrated correctly in several ways. These tend to be specific to the data you’re migrating, the confidence level required for the migration, and what is most important in the particular domain. In most cases, these validation efforts involve running queries that compare the before and after data values.

To make this easier, the Neptune Workbench notebook has a magic (%%sparql) that simplifies running SPARQL queries against your Neptune cluster. See the following code.

%%sparql

SELECT * WHERE {
	?s ?p ?o
} LIMIT 10

This Neptune-specific magic runs SPARQL queries against the associated Neptune instance and returns the results in tabular form.
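
One simple before-and-after check is to compare statement counts per named graph. The following sketch assumes you run the same counting query against both Blazegraph and Neptune and receive results in the standard SPARQL JSON results format; the function names are hypothetical. Matching counts don't prove the data is identical, so treat this as a first check rather than complete validation.

```python
# Counts the statements in each named graph
PER_GRAPH_COUNT_QUERY = """
SELECT ?g (COUNT(*) AS ?count)
WHERE { GRAPH ?g { ?s ?p ?o } }
GROUP BY ?g
"""


def counts_from_results(results):
    """Turn SPARQL JSON results for the query above into {graph: count}."""
    counts = {}
    for binding in results["results"]["bindings"]:
        counts[binding["g"]["value"]] = int(binding["count"]["value"])
    return counts


def diff_counts(source_counts, target_counts):
    """Return {graph: (source_count, target_count)} for graphs that differ."""
    mismatches = {}
    for graph in sorted(set(source_counts) | set(target_counts)):
        before = source_counts.get(graph, 0)
        after = target_counts.get(graph, 0)
        if before != after:
            mismatches[graph] = (before, after)
    return mismatches
```

An empty dictionary from `diff_counts` means every named graph has the same number of statements in both databases.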

Blazegraph to Neptune compatibility

The last thing you need to investigate is any application changes that you may need to make due to the differences between Blazegraph and Neptune. Luckily, both Blazegraph and Neptune are compatible with SPARQL 1.1, meaning that you can change your application configuration to point to your new Neptune SPARQL endpoint, and everything should work.

However, as with any database migration, several differences exist between the implementations of Blazegraph and Neptune that may impact your ability to migrate. The following major differences require changes to queries, the application architecture, or both, as part of the migration process:

Full-text search – In Blazegraph, you can either use internal full-text search or external full-text search capabilities through an integration with Apache Solr. If you use either of these features, stay informed of the latest updates on the full-text search features that Neptune supports. For more information, see Amazon Neptune Full-Text Search Using Amazon OpenSearch Service.
Query hints – Both Blazegraph and Neptune extend SPARQL using the concept of query hints. During a migration, you need to migrate any query hints you use. For more information about the latest query hints Neptune supports, see SPARQL Query Hints.
Inference – Blazegraph supports triples mode inference as a configurable option, but doesn’t support inference in quads mode. If you require inference for your use case, several forward-chained options may be suitable. AWS is interested in learning about use cases for inference. You can add comments or reach out to the Neptune team via the Amazon Neptune discussion forums.
Geospatial search – Blazegraph supports the configuration of namespaces that enable geospatial support. If you use this feature in Blazegraph, you need to consider alternatives within Neptune. You can reach out to the Neptune team on the Amazon Neptune discussion forums.
Multi-tenancy – Blazegraph supports multi-tenancy within a single database. In Neptune, multi-tenancy is supported by either storing data in named graphs and using FROM NAMED clauses in SPARQL queries (or USING NAMED clauses in SPARQL Update requests), or by creating a separate database cluster for each tenant.
Federation – Neptune currently supports SPARQL 1.1 federation to locations accessible to the Neptune instance, such as within the private VPC, across VPCs, or to external internet endpoints. Depending on the specific setup and required federation endpoints, you may need some additional network configuration.
Blazegraph standards extensions – Blazegraph includes multiple extensions to both the SPARQL and REST API standards. Neptune is compatible with the standards specifications only, so you need to migrate any use of these extensions.
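
To find queries that rely on Blazegraph-specific features, you can scan your application's query strings for telltale markers. The following sketch is a rough heuristic, not a complete compatibility checker: it looks for the `hint:` prefix used in the export query earlier in this post and, as an assumption, for the `bigdata.com` namespace that Blazegraph extensions commonly use. The function and marker names are hypothetical.

```python
import re

# Heuristic markers only; extend this table for the extensions you actually use
BLAZEGRAPH_MARKERS = {
    "query hint": re.compile(r"\bhint:\w+"),
    "Blazegraph namespace": re.compile(r"https?://www\.bigdata\.com/", re.IGNORECASE),
}


def find_blazegraph_extensions(query):
    """Return the sorted labels of the markers found in a SPARQL query string."""
    return sorted(
        label for label, pattern in BLAZEGRAPH_MARKERS.items() if pattern.search(query)
    )
```

Running this over the queries in your codebase gives you a short list to review by hand before you point the application at the Neptune endpoint.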

However, Neptune offers several additional features that Blazegraph doesn’t offer:

High availability and scalability – Neptune stores all data across multiple Availability Zones to prevent data loss. Neptune also scales to as many as 15 read replicas to handle database traffic.
Automated backup and restore – Neptune is configured for automated backup with a defined backup retention period.
Monitoring – Neptune exposes a wide variety of operational and performance metrics to Amazon CloudWatch. For more information, see Monitoring Neptune Using Amazon CloudWatch.
Security – Neptune is configured to encrypt data both in transit and at rest. Neptune also integrates with AWS Key Management Service (AWS KMS) to provide access control of encryption keys, and integrates with IAM to provide user authentication and authorization via Signature Version 4 signing of requests.
Cost control – Unlike self-managed or EC2 instances, which have fixed costs, Neptune allows for cost management by providing a variety of instance sizes, including smaller low-cost options aimed at development and testing workloads. Additionally, you can stop Neptune instances for up to 7 days, during which time you aren’t charged for database instance hours.
Streams – Neptune Streams simplifies the integration of other systems by capturing a deduplicated stream of changes made to the graph for consumption by downstream systems.

Summary

This post examined the process for migrating from an on-premises or self-hosted Blazegraph instance to a fully managed Neptune database. A migration to Neptune not only satisfies the development requirements of many applications, it also satisfies the operational requirements of business-critical applications. Additionally, this migration unlocks many advantages, including cost optimization, better integration with native cloud tools, and a lower operational burden.

It’s our hope that this post provides you with the confidence to begin your migration. If you have any questions, comments, or other feedback, we’re always available through your Amazon account manager or via the Amazon Neptune Discussion Forums.

About the Author

Dave Bechberger is a Sr. Graph Architect with the Amazon Neptune team. He used his years of experience working with customers to build graph database-backed applications as inspiration to co-author “Graph Databases in Action” by Manning.