AWS Database Blog

Build and deploy knowledge graphs faster with RDF and openCypher

Amazon Neptune Analytics now supports openCypher queries over RDF graphs.

When you build an application that uses a graph database such as Amazon Neptune, you typically face a technology choice at the start. There are two types of graphs, Resource Description Framework (RDF) graphs and labeled property graphs (LPGs), and the one you pick determines which query languages you can use: RDF graphs are queried with SPARQL, and LPGs are queried with Gremlin and openCypher. Users and developers who are new to graph technology are often confused by having to make this choice, and those with more experience have wondered why it isn’t possible to use openCypher over RDF. The same choice applies to data ingestion, where users may want to ingest a combination of LPG and RDF data into the same graph.

This choice between graph models is a manifestation of a rift that runs through the entire graph industry. The reasons for the division are manifold: technology limitations, lack of awareness of the other technology, and sometimes an almost religious conviction. RDF, with its standardized serialization formats, global identifiers, and the availability of Linked Open Data sets, is of particular value for data architects who seek to build, integrate, and interchange graph data. Application development teams, on the other hand, often prefer LPG query languages for interacting with graphs, because of their intuitive syntax, the maturity of their developer ecosystems (client drivers, programming language integration, and so on), and graph-specific features such as built-in support for path extraction and algorithms.

Amazon Neptune is a scalable, managed graph database service that supports both graph models. On the Neptune team, our goal has been to make the adoption of graph technology easier for our customers, and we wanted to remove the limitations imposed by the technology choices described above. This is why we started OneGraph, an initiative to bring the two worlds together so that our customers can benefit from the best of both. The initiative aims to provide graph interoperability: first, the ability to use graph query languages regardless of your choice of graph model, and ultimately, a state in which the aforementioned division becomes less and less significant. This goal is challenging, but we’re making steady progress toward it. Note that we aren’t alone in wanting to better align RDF and LPGs: the World Wide Web Consortium’s ongoing work on RDF-star will also contribute to this goal, and was originally motivated by the desire to add features to RDF that LPGs already provide.

In Neptune Analytics, we now offer the ability to run openCypher queries over RDF graphs. This new feature is attractive for several reasons:

  • Knowledge graphs invariably use features or concepts available in RDF, and while you could implement those with LPGs, RDF gives you those natively. Examples of these include ontologies (for which you need an ontology definition language) and external data sources (for which you need easy graph merging, something that is part of the RDF specification). Introducing the ability to access RDF data from an LPG application makes these RDF features readily available. The new feature makes it possible to access all the “Linked Open Data” datasets out there (such as Wikidata, the RDF version of Wikipedia).
  • SPARQL lacks path discovery, that is, the ability to identify which path was taken during a path traversal. This is something we constantly hear Neptune customers ask for. SPARQL and RDF would also benefit from better support for composite datatypes.
  • Sometimes, there are situations where RDF- and LPG-based systems need to be integrated (for example, a corporate merger), and this integration can now be done in data.

Also, openCypher greatly influenced the just-released ISO GQL standard, and there is a clear path from openCypher to GQL, a language that we believe will become commonplace in the industry in the next few years. As a query language, openCypher is declarative just like SPARQL, but its syntax was designed to be more familiar: openCypher queries are like ASCII art representing graph structures and the conditions to be matched in a graph.

In this post, we show you how to use openCypher with RDF graphs.

Writing openCypher queries

To write openCypher queries that operate over RDF, slight extensions to the query syntax were needed. These are mostly just syntactic conventions, which allows the reuse of standard tooling such as parsers and syntax highlighters. The following examples demonstrate the correspondence between SPARQL and openCypher, using queries against our Air Routes sample data (airports, airlines, routes, and so on; note that the data file is large, 75 MB). The ontology for Air Routes looks like the following figure:

[Figure: Air Routes ontology]

For now, we’re interested in airports and routes. We will find the name and the International Air Transport Association (IATA) airport code of an airport whose International Civil Aviation Organization (ICAO) airport code is “KMHT” (Manchester-Boston Regional Airport in New Hampshire, USA):

PREFIX nepo: <http://neptune.aws.com/ontology/airroutes/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?name ?iata_code {
    ?airport a nepo:Airport ;
        nepo:ICAO "KMHT" ;
        nepo:IATA ?iata_code ;
        rdfs:label ?name
}

In openCypher the same query can now be expressed as the following:

PREFIX nepo: <http://neptune.aws.com/ontology/airroutes/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

MATCH (airport: nepo::Airport)
WHERE airport.nepo::ICAO = "KMHT"
RETURN airport.rdfs::label, airport.nepo::IATA

With our small syntax extensions, writing queries against RDF data is easy and feels natural. Note how we chose to write qualified names (prefix-shortened Internationalized Resource Identifiers, or IRIs) using a double-colon syntax so as not to clash with existing openCypher syntactic conventions. We have also added a way to declare namespace prefixes, using the same syntax as in SPARQL. It’s also possible to use long-form IRIs instead of their prefix-shortened versions, enclosed in backticks (`) like this:

MATCH (airport: `<http://neptune.aws.com/ontology/airroutes/Airport>`)
WHERE airport.`<http://neptune.aws.com/ontology/airroutes/ICAO>` = "KMHT"
RETURN airport.`<http://www.w3.org/2000/01/rdf-schema#label>`,
       airport.`<http://neptune.aws.com/ontology/airroutes/IATA>`

Now, let’s try a SPARQL query that will give us all the airports that are destinations of routes that originate from KMHT:

PREFIX nepo: <http://neptune.aws.com/ontology/airroutes/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?destination_name {
    ?airport a nepo:Airport ; nepo:ICAO "KMHT" .
    ?route nepo:source ?airport ; nepo:destination ?destination .
    ?destination rdfs:label ?destination_name
}

Using openCypher, the query looks like this:

PREFIX nepo: <http://neptune.aws.com/ontology/airroutes/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

MATCH (airport: nepo::Airport {nepo::ICAO: "KMHT"})
      -[: nepo::source]-(: nepo::Route)-[: nepo::destination]
      -(destination: nepo::Airport)
RETURN destination.rdfs::label AS destination_name;
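The two query pairs suggest a simple reading of RDF data from the openCypher side: rdf:type triples behave like node labels, triples with literal objects behave like properties, and triples whose object is another resource behave like edges. The following Python sketch makes that reading explicit. It is an assumption inferred from the examples in this post, not a description of how Neptune stores or maps data, and the urn:example: identifiers and the Literal and Node classes are hypothetical.

```python
from dataclasses import dataclass, field

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
NEPO = "http://neptune.aws.com/ontology/airroutes/"


@dataclass(frozen=True)
class Literal:
    """Marks an object value as a literal rather than an IRI."""
    value: str


@dataclass
class Node:
    labels: set = field(default_factory=set)
    properties: dict = field(default_factory=dict)


def lpg_view(triples):
    """Split triples into nodes (with labels and properties) and edges."""
    nodes, edges = {}, []
    for subject, predicate, obj in triples:
        node = nodes.setdefault(subject, Node())
        if predicate == RDF_TYPE:
            node.labels.add(obj)  # class IRI read as a label
        elif isinstance(obj, Literal):
            # Literal read as a property (a repeated predicate overwrites).
            node.properties[predicate] = obj.value
        else:
            nodes.setdefault(obj, Node())
            edges.append((subject, predicate, obj))  # IRI object read as an edge
    return nodes, edges


# Hypothetical identifiers; only the nepo: terms come from the post.
MHT, ROUTE = "urn:example:airport/MHT", "urn:example:route/1"
triples = [
    (MHT, RDF_TYPE, NEPO + "Airport"),
    (MHT, NEPO + "ICAO", Literal("KMHT")),
    (ROUTE, RDF_TYPE, NEPO + "Route"),
    (ROUTE, NEPO + "source", MHT),
]
nodes, edges = lpg_view(triples)
```

Under this reading, the pattern (airport: nepo::Airport {nepo::ICAO: "KMHT"}) selects the MHT node by its label and property, and -[: nepo::source]- follows the single edge that connects it to the route.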

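Finally, let’s find all the airports you would have to fly through to get from LHR (London Heathrow Airport) to HEL (Helsinki Vantaa Airport in Finland) in at most two hops. This is considered path discovery and isn’t possible with a single SPARQL query. The query relies on the nepo::hasOutboundRouteTo edge, which we create in the next section.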

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX nepo: <http://neptune.aws.com/ontology/airroutes/>

MATCH (origin:nepo::Airport {nepo::IATA: 'LHR'})
MATCH (destination:nepo::Airport {nepo::IATA: 'HEL'})

MATCH p=(origin)-[:nepo::hasOutboundRouteTo*..2]-(destination)

UNWIND nodes(p) as airportStops
WITH p, collect(airportStops.rdfs::label) as `Routes:LHR-HEL`
RETURN `Routes:LHR-HEL`

The results look like the following:

Routes:LHR-HEL
['London Heathrow Airport', 'Helsinki Vantaa Airport']
['London Heathrow Airport', 'Helsinki Vantaa Airport']
['London Heathrow Airport', 'Split Airport', 'Helsinki Vantaa Airport']
['London Heathrow Airport', 'Barcelona International Airport', 'Helsinki Vantaa Airport']
['London Heathrow Airport', 'Adolfo Suárez Madrid–Barajas Airport', 'Helsinki Vantaa Airport']
…
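To make concrete what a variable-length pattern such as *..2 asks for, the following Python sketch enumerates every undirected path of one or two edges between two nodes, using each edge at most once per path (the relationship-uniqueness rule that openCypher applies within a single pattern). The edge list is a hypothetical toy graph rather than the Air Routes data, and the function is an illustration, not how Neptune evaluates the query.

```python
def paths_up_to(edges, origin, destination, max_hops=2):
    """Enumerate undirected paths of 1..max_hops edges from origin to destination.

    Each edge may be used at most once per path.
    """
    results = []

    def extend(node, visited_nodes, used_edges):
        if used_edges and node == destination:
            results.append(list(visited_nodes))
        if len(used_edges) == max_hops:
            return
        for index, (source, target) in enumerate(edges):
            if index in used_edges:
                continue
            # Follow the edge in either direction (undirected pattern).
            if source == node:
                next_node = target
            elif target == node:
                next_node = source
            else:
                continue
            extend(next_node, visited_nodes + [next_node], used_edges | {index})

    extend(origin, [origin], frozenset())
    return results


# Hypothetical directed hasOutboundRouteTo edges between three airports.
edges = [("LHR", "HEL"), ("HEL", "LHR"), ("LHR", "SPU"), ("SPU", "HEL")]
paths = paths_up_to(edges, "LHR", "HEL")
print(paths)
```

In this toy graph the direct connection is reported twice, because the two directed edges between LHR and HEL are distinct edges that both satisfy an undirected pattern. That is one way a result list can contain the same sequence of airports more than once.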

Modifying RDF with openCypher

Modifying RDF data with openCypher in Neptune Analytics follows the same pattern as modifying LPG data, with the same namespace prefix extension described above. Consider the following query, which finds paths between source and destination airports by hopping over a Route node.

PREFIX nepo: <http://neptune.aws.com/ontology/airroutes/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

MATCH (airport: nepo::Airport {nepo::IATA: 'NWI'})
      -[: nepo::source]-(: nepo::Route)-[: nepo::destination]
      -(destination: nepo::Airport {nepo::IATA: 'HEL'})
RETURN destination.rdfs::label AS destination_name;

We can use a modification query to create a new edge, nepo::hasOutboundRouteTo, between the source and destination airports:

PREFIX nepo: <http://neptune.aws.com/ontology/airroutes/> 
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 

MATCH (n1: nepo::Airport)
      <-[:nepo::source]-(r: nepo::Route)-[:nepo::destination]->
      (n2: nepo::Airport)
CREATE (n1)-[:nepo::hasOutboundRouteTo]->(n2)
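In terms of triples, this modification adds one direct airport-to-airport statement for every Route node that has both a source and a destination. The following Python sketch shows that effect on a small set of triples. It is an illustration under stated assumptions: the urn:example: identifiers are hypothetical, each route is assumed to have exactly one source and one destination, and the result is kept as a set, so a repeated edge collapses into one.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
NEPO = "http://neptune.aws.com/ontology/airroutes/"


def add_outbound_route_edges(triples):
    """Return the triples plus one hasOutboundRouteTo edge per complete route."""
    graph = set(triples)

    def is_a(node, class_name):
        return (node, RDF_TYPE, NEPO + class_name) in graph

    # Map each route to its source and destination airport.
    sources = {s: o for s, p, o in graph if p == NEPO + "source"}
    destinations = {s: o for s, p, o in graph if p == NEPO + "destination"}

    for route in sources.keys() & destinations.keys():
        origin, destination = sources[route], destinations[route]
        # Same type conditions as the MATCH pattern: Airport, Route, Airport.
        if is_a(route, "Route") and is_a(origin, "Airport") and is_a(destination, "Airport"):
            graph.add((origin, NEPO + "hasOutboundRouteTo", destination))
    return graph


# Hypothetical identifiers for two airports and one route.
NWI, AMS, ROUTE = "urn:example:NWI", "urn:example:AMS", "urn:example:route1"
triples = [
    (NWI, RDF_TYPE, NEPO + "Airport"),
    (AMS, RDF_TYPE, NEPO + "Airport"),
    (ROUTE, RDF_TYPE, NEPO + "Route"),
    (ROUTE, NEPO + "source", NWI),
    (ROUTE, NEPO + "destination", AMS),
]
updated = add_outbound_route_edges(triples)
```

The new edge points from the source airport to the destination airport only, which matches the direction of the arrow in the CREATE clause.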

We can now use a simpler variable-length path query, and return the paths:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
PREFIX nepo: <http://neptune.aws.com/ontology/airroutes/> 

MATCH (origin:nepo::Airport {nepo::IATA: 'NWI'})
MATCH (destination:nepo::Airport {nepo::IATA: 'HEL'})

MATCH p=(origin)-[:nepo::hasOutboundRouteTo*..2]-(destination)
RETURN p

Returning paths in this manner rather than as tabular responses can be beneficial, as graph visualization tools often recognize this format for rendering diagrams. For example, the open source projects Neptune Workbench and Graph Explorer use this format for visualizing graphs as network diagrams.

Graph algorithms

It’s now also possible to invoke Neptune Analytics graph algorithms on RDF data. The following is an example where we run the PageRank algorithm (note that we use the new relation created in the previous section):

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX nepo: <http://neptune.aws.com/ontology/airroutes/>

MATCH (airport)
CALL neptune.algo.pageRank(
       airport, {
         edgeLabels: [nepo::hasOutboundRouteTo],
         vertexLabel: nepo::Airport
       }
     ) 
YIELD rank WHERE rank > 0.003
RETURN airport.rdfs::label, rank 
ORDER BY rank DESC

The query returns results like the following:

airport.rdfs::label                                 rank
Hartsfield Jackson Atlanta International Airport    0.00396
Atatürk International Airport                       0.00367
Chicago O’Hare International Airport                0.00363
Denver International Airport                        0.00361
Dallas Fort Worth International Airport             0.00355
Domodedovo International Airport                    0.00345
Charles de Gaulle International Airport             0.00334
Frankfurt am Main Airport                           0.00326
Beijing Capital International Airport               0.00323
Amsterdam Airport Schiphol                          0.00309
Dubai International Airport                         0.00305
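For readers unfamiliar with PageRank, the following Python sketch shows the textbook power-iteration form of the algorithm on a hypothetical toy graph. It is a generic illustration only: the damping factor of 0.85, the iteration count, and the handling of nodes without outgoing edges are common defaults assumed here, and none of it describes how neptune.algo.pageRank is implemented or configured.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration over directed edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    count = len(nodes)
    rank = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / count + damping * dangling / count
        new_rank = {node: base for node in nodes}
        for source in nodes:
            for target in out_links[source]:
                # Each node splits its rank equally among its outgoing edges.
                new_rank[target] += damping * rank[source] / len(out_links[source])
        rank = new_rank
    return rank


# Hypothetical graph: three airports all fly to HUB, and HUB flies to A.
edges = [("A", "HUB"), ("B", "HUB"), ("C", "HUB"), ("HUB", "A")]
ranks = pagerank(edges)
busiest = max(ranks, key=ranks.get)
```

The ranks sum to 1 across all nodes, which is why the values in a graph with thousands of airports are small numbers such as those in the table above.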

Summary

We believe the new openCypher-over-RDF functionality will enable more expressive queries over RDF data, and some users might prefer openCypher to SPARQL for other reasons as well (for example, the aforementioned ASCII art syntax of openCypher seems to appeal to many users). Neptune Analytics, our service for running large-scale graph algorithms, already uses OneGraph, and now the next step in our journey is ready: running openCypher queries and analytical algorithms over RDF graphs. Going forward, as we develop the Neptune service, our broad goal is to offer more choices for users by enabling cross-use of query languages across the different graph models. We think the existing division is an obstacle to wider adoption of graph databases and graph computing in general.

Stay tuned for our next blog post that shows how you can mix RDF and LPG data in a single application.

Acknowledgements

The authors wish to thank Kevin Phillips and Andreas Steigmiller for their review and helpful comments.


About the authors

Ora Lassila is a Principal Graph Technologist in the Amazon Neptune graph database group. He has long experience with graphs, graph databases, ontologies, and knowledge representation. He was a co-author of the original RDF specification, and currently chairs the W3C RDF-star Working Group. He holds a PhD in Computer Science, but his daughters do not think he is a real doctor, the kind who helps people. Twitter: @oralassila

Michael Schmidt is a Principal Software Development Engineer with Amazon Web Services. In his role as lead architect for Neptune’s query optimization stack, he works both with AWS customers and internal development teams to improve Amazon Neptune’s performance, scalability, and user experience.

Willem Broekema is a Senior Software Development Engineer with the Amazon Neptune team. He believes scalable graph databases and expressive query languages are a winning combination. In his free time, he likes to play traditional jazz on the sousaphone.

Charles Ivie is a Senior Graph Architect with the Amazon Neptune team at AWS. As a highly respected expert within the knowledge graph community, he has been designing, leading, and implementing graph solutions for over 15 years.