AWS Database Blog

Triple your knowledge graph speed with RDF linked data and openCypher using Amazon Neptune Analytics

There are numerous publicly available Resource Description Framework (RDF) datasets that cover a wide range of fields, including geography, life sciences, cultural heritage, and government data. Many of these public datasets can be linked together by loading them into an RDF-compatible database. For example:

  • DBpedia – An RDF representation of the structured data from Wikipedia
  • YAGO – A large multilingual RDF knowledge base derived from other datasets, including Wikipedia, WordNet, and GeoNames, containing over 90 million statements about the real world
  • European Data Portal – Public datasets published by different institutions across the European Union, many of which are in RDF format

RDF is a framework for representing information about resources in a graph form. Unlike traditional databases that store data in tables, RDF structures data as statements of fact called triples, consisting of a subject, predicate, and object. This simple yet powerful model allows for the creation of intricate networks of data.

An example of a triple could be:

<:Kevin> <:worksFor> <:Amazon>

This single statement can evolve into a graph by adding more statements that relate to the first statement:

<:Kevin> <:worksFor> <:Amazon> 
<:Amazon> <:hasHeadOfficeIn> <:Seattle>
<:Sophie> <:worksWith> <:Kevin>

From this simple RDF graph, we can compute facts that weren’t explicitly stated by following the relationships. For example, we can infer that Sophie works for Amazon, and that both Kevin and Sophie work for a company whose head office is in Seattle.

Because of this simple, natural way of recording data, RDF is inherently interoperable. It’s straightforward to add more statements that describe the same concepts, and the result is a graph. RDF graph practitioners use this mechanism to link their own data to multiple publicly available RDF datasets. These landscapes of data are known as linked data.

SPARQL is the standard query language for RDF databases. Although powerful, SPARQL is sometimes criticized for missing some useful features, such as pathfinding.

openCypher is a popular graph query language (originally designed for labeled-property graphs, a different graph model from RDF) that enables intuitive and expressive querying, including pathfinding and pattern matching.
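To make this concrete, the fact we inferred earlier (that Sophie works for Amazon) can be expressed as a two-hop openCypher pattern. The following is a minimal sketch, assuming the three example triples were loaded under a hypothetical http://example.org/ namespace and that relationship types use the same prefix syntax shown for node labels later in this post:

%%oc

PREFIX ex: <http://example.org/>

// Find everyone who works with someone, along with that person's employer.
// Over the toy graph above, this returns Sophie paired with Amazon.
MATCH (colleague)-[:ex::worksWith]->(person)-[:ex::worksFor]->(company)
RETURN colleague, company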

In this post, we demonstrate how to build knowledge graphs with RDF linked data and openCypher using Amazon Neptune Analytics.

Amazon Neptune Analytics is a memory-optimized graph database engine for analytics that supports running openCypher queries on RDF data. With Neptune Analytics, you can load massive RDF graphs in seconds and use all the features of openCypher, such as pathfinding, on your RDF datasets.
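For example, the following is a minimal pathfinding sketch that looks for routes of up to three hops between two airports. It uses the Air Routes IATA property and hasOutboundRouteTo edge label that appear later in this post; treat the exact labels and property names as assumptions until you have inspected the loaded data:

%%oc

PREFIX nepo: <http://neptune.aws.com/ontology/airroutes/>

// Find up to five paths of at most three route hops between
// two airports identified by their IATA codes.
MATCH p = (origin:nepo::Airport)-[:nepo::hasOutboundRouteTo*1..3]->(dest:nepo::Airport)
WHERE origin.nepo::IATA = "LHR" AND dest.nepo::IATA = "SEA"
RETURN p
LIMIT 5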

Solution overview

To implement the solution, we create a Neptune Analytics graph using an import task, which loads the first dataset at creation time, and then load a second dataset using a batch load. Finally, we run queries and graph algorithms in Neptune graph notebooks using openCypher.

The solution uses data from the following sources:

  • Air Routes, an RDF graph that describes airports, airlines, and the routes that airlines use to travel around the globe. The data originates from the OpenFlights Airports Database.
  • GeoNames, a public RDF geographical database that covers over 11 million place names of different types, ranging from countries and continents to roads, streams, and even some buildings. We use a subset that includes cities, countries, and their metadata.

The following diagram illustrates the solution architecture.

OC over RDF architecture

The solution uses the following services:

  • Neptune Analytics – Neptune Analytics is an analytics database engine that can quickly analyze large amounts of graph data in memory.
  • Amazon SageMaker – We run the Neptune workbench in a Jupyter or JupyterLab notebook on SageMaker. The Neptune graph notebook offers an interactive development environment (IDE) for building graph applications.
  • Amazon Simple Storage Service (Amazon S3) – Amazon S3 is an object storage service offering industry-leading scalability, data availability, security, and performance.

There are five high-level steps to complete:

  1. Create a Neptune Analytics graph, initializing it with Air Routes RDF data
  2. Create a Neptune graph notebook and connect it to the new graph
  3. Run openCypher queries over the new Air Routes graph in the notebook
  4. Batch load GeoNames RDF data from Amazon S3 into Neptune Analytics using the neptune.load procedure
  5. Run openCypher queries and Neptune graph algorithms on the linked RDF data from both sources in the notebook

Prerequisites

You should have completed all the prerequisites to access AWS services with the AWS Command Line Interface (AWS CLI) and the AWS Management Console, with permissions to create resources in Neptune, Amazon S3, SageMaker, and related services.

No further knowledge of the tools and technologies in this post is needed to follow this walkthrough, but a basic understanding of RDF, SPARQL, and openCypher is advised.

Customers are responsible for the costs of running the solution. For help with estimating costs, refer to the AWS Pricing Calculator.

Create a Neptune Analytics graph with the AWS CLI

Complete the following steps to create a Neptune Analytics graph:

  1. Create an AWS Identity and Access Management (IAM) role with Amazon S3 access. Take note of the new role ARN because you’ll need it in the next step.
  2. Start a task to create a graph using an import from Amazon S3 by entering the following command in the AWS CLI, replacing the values where necessary. This creates the graph endpoint while simultaneously loading the first of the two RDF datasets, Air Routes.
    aws neptune-graph create-graph-using-import-task \ 
        --graph-name airroutes-geo \ 
        --region <aws region> \ 
        --public-connectivity \ 
        --source "s3://aws-neptune-customer-samples-<aws region>/airroutes-rdf/airroutes.nt" \ 
        --role-arn "<new IAM role ARN from step 1>" \ 
        --format "NTRIPLES" \ 
        --blank-node-handling "convertToIri"
  3. From the response, make a note of the task ID.
    {
        "graphId": "<graph ID>",
        "taskId": "<task ID>",
        "source": " s3://aws-neptune-customer-samples-<aws region>/airroutes-rdf/airroutes.nt ",
        "format": "NTRIPLES",
        "roleArn": "<new IAM role ARN>",
        "status": "INITIALIZING"
    }
  4. Using the task ID, check the progress of the creation process by running the following command.
    aws neptune-graph get-import-task --task-identifier <task ID>

    Repeat until the status in the response changes to COMPLETED:

    {
        "graphId": "<graph ID>", 
        "taskId": "<task ID>", 
        "source": " s3://aws-neptune-customer-samples-<aws region>/airroutes-rdf/airroutes.nt ",
        "format": "NTRIPLES",
        "roleArn": "<new IAM role ARN>",
        "status": "COMPLETED"
    }
  5. Enter the following command to get a summary of the contents of the graph.
    aws neptune-graph get-graph-summary --graph-identifier <graph ID>

    The response contains many details about the graph, including lists of the edge labels (RDF predicates) and node labels (RDF class types). Neptune Analytics recognizes RDF classes as property graph nodeLabels, and RDF predicates as property graph edgeLabels:

    {
      "graphSummary": {
        "numNodes": 59056,
        "numEdges": 238535,
        "nodeLabels": [
          "<http://neptune.aws.com/ontology/airroutes/Airport>",
          "<http://neptune.aws.com/ontology/airroutes/Plane>" 
          ...
        ],
        "edgeLabels": [
          "<http://neptune.aws.com/ontology/airroutes/locatedIn>",
          "<http://neptune.aws.com/ontology/airroutes/hasOutboundRouteTo>"
          ...
        ],
        "totalNodePropertyValues": 211701,
        "totalEdgePropertyValues": 0
        ...
      }
    }

The following figure illustrates the Air Routes ontology and schema.

AirRoutes ontology

Create a Neptune notebook and connect it to the graph

Complete the following steps:

  1. On the Neptune console, choose Graphs in the navigation pane.
  2. Select the graph identifier of your new graph.
  3. Note the endpoint under Connectivity & Security.
  4. Create a Neptune Analytics notebook using AWS CloudFormation, providing the endpoint from the previous step for GraphEndpoint in the CloudFormation stack details.
  5. On the Neptune console, choose Notebooks in the navigation pane.
  6. Select your new notebook and on the Actions menu, choose Open Jupyter.

Run openCypher queries over the RDF graph

In the Neptune notebook, choose the New menu, then choose Python 3.

New notebook

Select the notebook name and rename it to oc-over-rdf-blogpost.

Notebook name

List all the RDF classes or LPG label types and count their instances

Create a new cell and insert the following query. This query will find a list of labels and the number of instances for each label in the dataset.

%%oc

MATCH (n)
WITH labels(n) AS lbls
UNWIND lbls AS lbl
RETURN lbl AS Label, COUNT(*) AS Count
ORDER BY Count DESC;

Run the cell and observe the results, as shown in the following screenshot.

Class counts

This query result shows all of the RDF class types as node labels and a count of the number of instances of each of them.

For reference, the following code is the equivalent query in SPARQL.

SELECT ?class (count(?s) as ?count)
WHERE {
    ?s a ?class
}
GROUP BY ?class

Return all the properties for an RDF resource or LPG node

Create a new cell and insert the following query to find the node for London Heathrow Airport.

%%oc

PREFIX nepo: <http://neptune.aws.com/ontology/airroutes/>
PREFIX rdfs:   <http://www.w3.org/2000/01/rdf-schema#>

MATCH (airport: nepo::Airport)
WHERE airport.nepo::IATA = "LHR"
RETURN airport

Run the cell and observe the results.

{
  "~id": "<http://neptune.aws.com/ontology/airroutes/resource#AIRPORT_507>",
  "~entityType": "node",
  "~labels": [
    "<http://neptune.aws.com/ontology/airroutes/Airport>"
  ],
  "~properties": {
    "<http://neptune.aws.com/ontology/airroutes/ICAO>": "EGLL",
    "<http://neptune.aws.com/ontology/airroutes/IATA>": "LHR",
    "<http://neptune.aws.com/ontology/airroutes/longitude>": -0.461941,
    "<http://neptune.aws.com/ontology/airroutes/latitude>": 51.4706,
    "<http://www.w3.org/2000/01/rdf-schema#label>": "London Heathrow Airport"
  }
}

The result shows the node for London Heathrow Airport, its labels, and properties.

For reference, the following code is the equivalent query in SPARQL.

PREFIX nepo: <http://neptune.aws.com/ontology/airroutes/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT DISTINCT ?p ?o WHERE {
  BIND("LHR" AS ?iata)
  {
    ?airport ?p ?o;
      nepo:IATA ?iata.
    FILTER(ISLITERAL(?o))
  }
  UNION
  {
    BIND(rdf:type AS ?p)
    ?airport ?p ?o;
      nepo:IATA ?iata.
  }
}

Explore the Air Routes RDF graph using openCypher

We have created a notebook with many example queries to help you explore running openCypher over the Air Routes RDF graph.

In the Neptune workbench, choose the Jupyter logo to return to the root folder. To open the 01-Air-Routes notebook, navigate to Neptune/02-Neptune-Analytics/04-OpenCypher-Over-RDF/01-Air-Routes.ipynb.

Air Routes notebook

If you have been following this post, you have already loaded the Air Routes RDF graph, so you can skip the load steps.

Walk through the rest of the notebook, running queries and reviewing the responses and visual network graph diagrams.

Load GeoNames RDF data into the graph

Next, load another RDF graph, a subset of GeoNames that includes cities and countries, to create a graph of linked data.

In the Neptune workbench, choose the Jupyter logo to navigate to the root folder and select the oc-over-rdf-blogpost notebook.

oc-over-rdf-blogpost notebook

Create a new cell and run the following query to load the GeoNames graph.

%%oc

CALL neptune.load(
    {
        source: "s3://aws-neptune-customer-samples-<aws region>/geonames-rdf/geonames-cities-countries.nt",
        region: "<aws region>",
        format: "ntriples",
        failOnError: true,
        blankNodeHandling: "convertToIri"
    }
)

You can now run openCypher queries that return linked data from the Air Routes and GeoNames graphs.

Run the first cell again to return a list of labels and the number of instances for each label in the dataset.

%%oc

MATCH (n)
WITH labels(n) AS lbls
UNWIND lbls AS lbl
RETURN lbl AS Label, COUNT(*) AS Count
ORDER BY Count DESC;

The result of the query now contains GeoNames and Air Routes data, as shown in the following table.

Label                                                   Count
<http://neptune.aws.com/ontology/airroutes/Route>       37505
<http://www.geonames.org/ontology#Feature>              28696
<http://schema.org/City>                                28444
<http://neptune.aws.com/ontology/airroutes/Airport>      7811
<http://neptune.aws.com/ontology/airroutes/City>         6955
<http://neptune.aws.com/ontology/airroutes/Airline>      6162
<https://www.geonames.org/ontology#Code>                  690
<http://www.w3.org/2002/07/owl#Restriction>               636
<http://neptune.aws.com/ontology/airroutes/Country>       356
<http://schema.org/Country>                               252

Run a linked data query

Create a new cell and run the following query to find countries from both datasets that have the same ISO country code.

%%oc

PREFIX nepo: <http://neptune.aws.com/ontology/airroutes/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX gn: <http://www.geonames.org/ontology#>
PREFIX schema: <http://schema.org/>

MATCH (gn_country:schema::Country), (ar_country:nepo::Country)
WHERE gn_country.gn::countryCode = ar_country.nepo::ISO
RETURN gn_country.gn::name, ar_country.rdfs::label

The result shows countries from both datasets linked by ISO country code, as shown in the following table.

#    gn_country.gn::name    ar_country.rdfs::label
1    Saudi Arabia           Saudi Arabia
2    Mayotte                Mayotte
3    Mozambique             Mozambique
4    Madagascar             Madagascar
5    Afghanistan            Afghanistan
6    Pakistan               Pakistan
7    Bangladesh             Bangladesh
8    Turkmenistan           Turkmenistan
9    Tajikistan             Tajikistan
10   Sri Lanka              Sri Lanka
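
Once the two datasets are joined on ISO country code, a single pattern can walk from one dataset into the other. The following is a minimal sketch that traverses from an Air Routes airport up to its country and across to the matching GeoNames country; it assumes airports reach their country through one or two nepo::locatedIn edges, which the edge labels in the earlier graph summary suggest but which you should verify against the loaded data:

%%oc

PREFIX nepo: <http://neptune.aws.com/ontology/airroutes/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX gn: <http://www.geonames.org/ontology#>
PREFIX schema: <http://schema.org/>

// Walk from an airport to its country in Air Routes, then across
// to the matching GeoNames country by ISO country code.
MATCH (airport:nepo::Airport)-[:nepo::locatedIn*1..2]->(ar_country:nepo::Country),
      (gn_country:schema::Country)
WHERE gn_country.gn::countryCode = ar_country.nepo::ISO
RETURN airport.rdfs::label, gn_country.gn::name
LIMIT 10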

Explore the Air Routes and GeoNames RDF graph using openCypher

In the Neptune workbench, choose the Jupyter logo to return to the root folder. To open the 02-Air-Routes-GeoNames notebook, navigate to Neptune/02-Neptune-Analytics/04-OpenCypher-Over-RDF/02-Air-Routes-GeoNames.ipynb.

AirRoutes-GeoNames notebook

Step through the 02-Air-Routes-GeoNames notebook to learn how to create more complex queries that span the two datasets.
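
Step 5 of the solution also calls for running Neptune graph algorithms on the linked data. As one example, the following minimal sketch ranks airports with PageRank, so the best-connected hubs in the route network surface first. The call shape follows the Neptune Analytics neptune.algo.pageRank procedure; check the current Neptune Analytics documentation for the exact arguments and defaults:

%%oc

PREFIX nepo: <http://neptune.aws.com/ontology/airroutes/>

// Compute PageRank and return the ten highest-ranked airports;
// well-connected hub airports should rank highest.
MATCH (airport:nepo::Airport)
CALL neptune.algo.pageRank(airport)
YIELD rank
RETURN airport.nepo::IATA, rank
ORDER BY rank DESC
LIMIT 10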

Clean up

To clean up the resources you created, complete the following steps:

  1. To delete the Neptune Analytics graph, first remove deletion protection using the update-graph command in the AWS CLI:
    aws neptune-graph update-graph \
        --graph-id <graph ID> --no-deletion-protection \
        --region <aws region>
  2. Then run the delete-graph command:
    aws neptune-graph delete-graph --graph-id <graph ID>
  3. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  4. Select the stack that you used to create the Neptune Analytics notebook.
  5. On the Actions menu, choose Delete stack.
  6. Confirm the deletion when prompted.

Summary

In this post, we showed you how to load an RDF dataset into a Neptune Analytics graph using the AWS CLI, and how to query that data with openCypher. We then showed how to load a public dataset and query both datasets in a linked data graph.

Now you can load RDF graphs into Neptune Analytics, link them to your own data, and even run graph algorithms over all the linked data together. There are many large, public RDF datasets available, and you can load them into Neptune Analytics in seconds.

Visit the openCypher over RDF demo application to see an example architecture, run analytics queries, and browse a network graph visualization that uses the same Air Routes RDF model.


About the Author

Charles Ivie is a Senior Graph Architect with the Amazon Neptune team at AWS. As a highly respected expert within the knowledge graph community, he has been designing, leading, and implementing graph solutions for over 15 years.