AWS Database Blog
Benefitting from SPARQL 1.1 Federated Queries with Amazon Neptune
Amazon Neptune is a fast, reliable, fully managed graph database service that makes it easy to build and run applications that work with highly connected datasets. Neptune supports the W3C’s graph model RDF, and its query language SPARQL. SPARQL 1.1 Federated Query specifies an extension to SPARQL for running queries distributed over different SPARQL endpoints.
In this post, I show you how to use SPARQL 1.1 Federated Query in Neptune to get data about soccer teams in the UK from an external dataset, DBpedia (a well-known public dataset of Wikipedia data). Using the DBpedia publicly accessible SPARQL endpoint, I link the data from DBpedia to data that I add to the Neptune cluster.
You can use SPARQL 1.1 Federated Query to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware. With Federated Query, you can do the following:
- Get data from multiple SPARQL 1.1 endpoints and join the results into a single result set for further analysis.
- Use AWS Identity and Access Management (IAM) policies or VPC network configuration to allow different users to access different datasets, which enables more fine-grained user access.
- Shard your data into different Neptune instances for performance or compliance purposes. You may have some data on a large cluster that many users need to access frequently, and other data on a smaller cluster that’s accessed less frequently.
- Have one Neptune database that contains all the real data, and another dataset that contains metadata about the real data, such as where it came from, who created it, or when it was created (perhaps using the PROV-O ontology).
- Partition your data across several Neptune clusters for performance, compliance, or security reasons but still allow users to query the data together for their application.
- Refer to public SPARQL endpoints from a Neptune cluster to augment information stored in Neptune.
To federate a query to get data from a SPARQL 1.1 endpoint, you have to use the SERVICE keyword in your SPARQL query. See the following code:
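In outline, you wrap part of the graph pattern in a SERVICE block that names the remote endpoint. The following is a minimal sketch (the endpoint URL is a placeholder):

```sparql
SELECT ?s ?p ?o
WHERE {
  # This pattern is sent to the remote SPARQL endpoint for evaluation
  SERVICE <https://example.org/sparql> {
    ?s ?p ?o
  }
}
LIMIT 10
```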
For more information about using the SERVICE keyword, see the SPARQL 1.1 Federated Query W3C Recommendation of 21 March 2013.
This works in Neptune because Neptune implements the W3C standards for SPARQL 1.1 and RDF 1.1. You can combine the data in your Neptune database with that of other databases that conform to these same standards.
Solution overview
For this use case, I combine multiple databases, DBpedia and Neptune, to show you which airports you can use to travel to see your favorite soccer team.
The post covers the following topics:
- Setting up your Neptune cluster. If you have an existing cluster, you need to make sure it has the correct VPC configuration and can perform federated queries to external endpoints.
- If you don’t have an existing cluster, you can create one using AWS CloudFormation, with a Jupyter notebook for querying, and the relevant VPC network configuration to enable the Neptune cluster to communicate with the outside world.
- Running a SPARQL 1.1 query in Neptune that uses the SERVICE keyword to get data from the external dataset DBpedia.
- Linking your internal Neptune data to the response from DBpedia.
- Taking performance into account when using federated queries.
- Testing VPC networking.
- Federated queries across Neptune clusters in the same VPC.
- Federated queries across Neptune clusters in different VPCs.
- Federated queries across Neptune and other external SPARQL 1.1 endpoints.
Configuring your Neptune cluster
If you already have a Neptune cluster and Workbench you want to use, run the following query in your Workbench (Neptune Jupyter Notebook) to make sure you can run federated queries:
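One simple check (a sketch, not the only option) is to federate a tiny query to the public DBpedia endpoint from a %%sparql cell in the Workbench:

```sparql
%%sparql
SELECT ?airport
WHERE {
  SERVICE <https://dbpedia.org/sparql> {
    ?airport a <http://dbpedia.org/ontology/Airport> .
  }
}
LIMIT 1
```

If this returns a result, your cluster can reach external SPARQL endpoints.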
If your query returns an error, you may need to create the correct VPC network settings to allow your Neptune cluster to send outbound requests. Complete the following:
- Have a public subnet; you may find it easiest to create a new one
- Have a NAT gateway linked to your public subnet
- Configure your existing route table to target your NAT Gateway
- Create a new route table targeting the Internet Gateway
- Associate your new subnet with your new route table
If you have enabled IAM database authentication for your Neptune cluster, you must take this into consideration, because requests to an IAM-enabled endpoint must be signed. For more information about IAM security in Neptune, see Identity and Access Management in Amazon Neptune.
When you can run the preceding query successfully, you can skip the next section and start running your queries.
Creating your Neptune cluster with AWS CloudFormation
The solution presented in this post creates a stack using AWS CloudFormation with the following resources:
- A Neptune VPC with:
- Three private subnets
- One public subnet
- An Internet Gateway
- Appropriate subnet groups
- A Neptune cluster with at least one writer/reader instance
- An Amazon SageMaker Jupyter notebook instance configured to connect to Neptune
The easiest way to create a Neptune cluster with all the right VPC network configuration to allow for federated queries is to use an AWS CloudFormation script. For instructions, see Creating a New Neptune DB Cluster Using AWS CloudFormation or Manually.
When you choose your notebook instance type, a Jupyter notebook instance that you can use to run queries is created for you. For more information about notebooks, see Using the Neptune Workbench with Jupyter Notebooks.
Running a query using the SERVICE keyword to get data from DBpedia
The following diagram illustrates a simple federated query to retrieve data from DBpedia.
To run this query, complete the following steps:
- On the Neptune console, choose Notebooks.
- Choose your notebook and choose Open notebook.
- From the New drop-down menu, choose python3 or conda_python3.
You’re now ready to run your federated SPARQL 1.1 query. For this post, we write a query that retrieves some data from DBpedia’s SPARQL 1.1 endpoint, using the SERVICE keyword. Specifically, we retrieve some information about airports, their names, and their IATA identifiers.
We then filter the query to make sure we only get back names in English, so that the result set is smaller and easier to work with.
This SERVICE call sends a request to the public DBpedia SPARQL 1.1 endpoint, so there is no guarantee that it’s fully functional when you make the request. To check DBpedia’s current status, you can visit http://live.dbpedia.org/live/.
If you have any errors when running this query, there could be something wrong with the VPC setup. See the earlier section Configuring your Neptune cluster for help with VPC configuration.
- In the Jupyter notebook editor, enter the following query:
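A query along these lines might look like the following sketch; dbo:iataLocationIdentifier is assumed here as the DBpedia property holding the IATA code:

```sparql
%%sparql
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?airport ?name ?iataCode
WHERE {
  SERVICE <https://dbpedia.org/sparql> {
    ?airport a dbo:Airport ;
             dbo:iataLocationIdentifier ?iataCode ;
             rdfs:label ?name .
    # Restrict to the airport with IATA code MHT
    FILTER (STR(?iataCode) = "MHT")
    # Keep only English labels to reduce the result set
    FILTER (langMatches(lang(?name), "EN"))
  }
}
```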
In this use case, we use the SERVICE keyword to access the DBpedia SPARQL 1.1 endpoint to retrieve data about an airport with the IATA location identifier “MHT”.
- Run the query by choosing Run or pressing CTRL + ENTER.
You should see a response similar to the following screenshot:
The entirety of the response is actually coming from outside your Neptune cluster. The Neptune engine is using the SERVICE keyword to call an external SPARQL 1.1 endpoint at DBpedia. See the following code:
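Isolated, the SERVICE portion of the query is what travels to DBpedia; everything inside the braces is evaluated remotely (property names as assumed above):

```sparql
SERVICE <https://dbpedia.org/sparql> {
  ?airport a dbo:Airport ;
           dbo:iataLocationIdentifier ?iataCode ;
           rdfs:label ?name .
}
```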
Linking your internal data to the response from DBpedia
Now that you have a federated query running, you can add your data to the Neptune cluster and link it to the response from DBpedia.
You first need to add some data that you can link to the DBpedia data to your Neptune database.
Run the following SPARQL 1.1 query, which adds some data into your Neptune database about soccer teams in the UK, and the city in which they’re based (this doesn’t insert any data into the DBpedia remote database):
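The exact triples aren’t critical; a sketch of such an update follows, in which the ex: namespace and team URIs are illustrative assumptions, while dbo: and dbr: are the real DBpedia namespaces:

```sparql
%%sparql
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX dbr:  <http://dbpedia.org/resource/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex:   <http://example.org/soccer/>

INSERT DATA {
  ex:Arsenal          a ex:SoccerTeam ;
                      rdfs:label "Arsenal F.C." ;
                      dbo:city dbr:London .
  ex:Chelsea          a ex:SoccerTeam ;
                      rdfs:label "Chelsea F.C." ;
                      dbo:city dbr:London .
  ex:ManchesterUnited a ex:SoccerTeam ;
                      rdfs:label "Manchester United F.C." ;
                      dbo:city dbr:Manchester .
}
```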
You may have already noticed that some of the data uses the same URIs as those in the DBpedia data. For example, the following code shows how the dbo and dbr prefixes are used:
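These are the standard DBpedia namespace declarations; reusing them means terms such as dbo:city and dbr:London in your local data resolve to the same URIs DBpedia uses, which is what makes the cross-dataset join possible:

```sparql
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
```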
Run the following query to see the two datasets linked together in the result set:
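A sketch of such a linking query follows; the local pattern matches teams in Neptune, and the SERVICE block matches airports in DBpedia via the shared ?city URI (the ex: namespace is an illustrative assumption):

```sparql
%%sparql
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex:   <http://example.org/soccer/>

SELECT ?teamName ?airportName ?airportID
WHERE {
  # Local Neptune data: teams and the cities they are based in
  ?team a ex:SoccerTeam ;
        rdfs:label ?teamName ;
        dbo:city ?city .
  # Remote DBpedia data: airports located in the same cities
  SERVICE <https://dbpedia.org/sparql> {
    ?airport a dbo:Airport ;
             dbo:city ?city ;
             dbo:iataLocationIdentifier ?airportID ;
             rdfs:label ?airportName .
    FILTER (langMatches(lang(?airportName), "EN"))
  }
}
```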
You should see the following response: two of the sports teams are in the same city as London Heathrow Airport.
To look at the query in more detail, you can bind the variable ?airportID to “LHR”. See the following code:
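A sketch of the query with the BIND added (same assumed vocabulary as before):

```sparql
%%sparql
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex:   <http://example.org/soccer/>

SELECT ?teamName ?airportName
WHERE {
  # Fix the airport code up front
  BIND ("LHR" AS ?airportID)
  ?team a ex:SoccerTeam ;
        rdfs:label ?teamName ;
        dbo:city ?city .
  SERVICE <https://dbpedia.org/sparql> {
    ?airport a dbo:Airport ;
             dbo:city ?city ;
             dbo:iataLocationIdentifier ?airportID ;
             rdfs:label ?airportName .
    FILTER (langMatches(lang(?airportName), "EN"))
  }
}
```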
You can also try changing the value to another airport, such as “MAN” for Manchester. The following screenshot shows the changed results.
You can also change the BIND parameter to a FILTER. The following code comments the BIND and adds a FILTER:
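For example, with the BIND commented out and an equivalent FILTER inside the SERVICE enclosure:

```sparql
%%sparql
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex:   <http://example.org/soccer/>

SELECT ?teamName ?airportName
WHERE {
  # BIND ("LHR" AS ?airportID)
  ?team a ex:SoccerTeam ;
        rdfs:label ?teamName ;
        dbo:city ?city .
  SERVICE <https://dbpedia.org/sparql> {
    ?airport a dbo:Airport ;
             dbo:city ?city ;
             dbo:iataLocationIdentifier ?airportID ;
             rdfs:label ?airportName .
    FILTER (STR(?airportID) = "LHR")
    FILTER (langMatches(lang(?airportName), "EN"))
  }
}
```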
This returns the same result as before.
Taking performance into account when using federated SERVICE calls
The following query is inefficient because it retrieves 8,000 airports from the SERVICE call before filtering the results to find “LHR”:
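A sketch of that anti-pattern, with the filter applied outside the SERVICE enclosure:

```sparql
%%sparql
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?airport ?airportName
WHERE {
  SERVICE <https://dbpedia.org/sparql> {
    ?airport a dbo:Airport ;
             dbo:iataLocationIdentifier ?airportID ;
             rdfs:label ?airportName .
  }
  # This filter runs in Neptune, after every airport has
  # already been transferred from DBpedia
  FILTER (STR(?airportID) = "LHR")
}
```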
This is an inefficient query because the SERVICE call acts as a subquery that returns all the data it can find from within the SERVICE call enclosure (an enclosure is delimited by curly braces). See the following code:
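Taken on its own, the enclosure sent to DBpedia matches every airport the endpoint knows about (property names as assumed earlier):

```sparql
SERVICE <https://dbpedia.org/sparql> {
  ?airport a dbo:Airport ;
           dbo:iataLocationIdentifier ?airportID ;
           rdfs:label ?airportName .
}
```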
The result of this service call is then used by the enclosing parent enclosure as part of the overall dataset, as shown in the following code:
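A sketch of the parent enclosure, which consumes the remote results and applies the filter locally:

```sparql
WHERE {
  SERVICE <https://dbpedia.org/sparql> {
    # All matching airports come back from DBpedia
    ?airport a dbo:Airport ;
             dbo:iataLocationIdentifier ?airportID ;
             rdfs:label ?airportName .
  }
  FILTER (STR(?airportID) = "LHR")  # applied locally, in Neptune
}
```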
It’s much more efficient to put the filter inside the service call enclosure, which makes sure that the remote SERVICE does the filtering before the response is sent back to the parent enclosure. The following code filters from inside a service call enclosure:
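A sketch with the filter moved inside the enclosure, so DBpedia returns only the matching airport:

```sparql
%%sparql
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?airport ?airportName
WHERE {
  SERVICE <https://dbpedia.org/sparql> {
    ?airport a dbo:Airport ;
             dbo:iataLocationIdentifier ?airportID ;
             rdfs:label ?airportName .
    # Evaluated remotely by DBpedia before the response is sent
    FILTER (STR(?airportID) = "LHR")
  }
}
```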
You could put the filter outside the SERVICE enclosure, in the parent enclosure. The query arrives at the same response, but this is considered a bad practice for the following reasons:
- Before you filter the response for “LHR”, you get all the results from the external SERVICE call to DBpedia, when you only need a very small subset. You then have to filter all the data inside your Neptune cluster.
- This makes you a poor citizen of the Linked Open Data community because you’re using resources that others could use.
- If you use these public SPARQL endpoints too much and unwisely, you may hit quotas and be suspended from them. For more information about endpoint service limits, see Public SPARQL Endpoint.
Be sure to take care when building federated SPARQL queries; their structure can have a massive impact on performance, query times, and costs. Filter from inside the SERVICE enclosure wherever possible, which ensures you only get back the data you need from the external SPARQL 1.1 endpoint.
Testing VPC networking
To make federated queries using the SERVICE keyword across Neptune clusters, you must configure your VPC networking correctly.
You can test that VPC networking is configured correctly by running a SPARQL query that federates to itself (self-federation). Run the following query in your Jupyter notebook; as long as it doesn’t return an error, your current Neptune instance can communicate within its own VPC:
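A minimal self-federation check might look like the following; the endpoint URL is a placeholder you should replace with your own cluster endpoint and port:

```sparql
%%sparql
SELECT ?s ?p ?o
WHERE {
  # Federate back to this same cluster (self-federation)
  SERVICE <https://your-neptune-endpoint:8182/sparql> {
    ?s ?p ?o
  }
}
LIMIT 1
```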
Federated queries across Neptune clusters in the same VPC
If the Neptune clusters are located in the same VPC, you likely don’t need any additional network configuration. However, each cluster’s security group must have an inbound rule allowing connections to the exposed Neptune port (default 8182) from within the same VPC.
If A and B are in the same VPC, and assuming the default port is configured for Neptune, you need the following connections:
- Client to A – A must allow 8182 from Client
- Client to B – B must allow 8182 from Client
- A to B (federation) – B must allow 8182 from A
The following diagram illustrates this architecture.
Federated queries across Neptune clusters in different VPCs
If Neptune clusters are not in the same VPC, which is often the case, you need to set up VPC Peering.
When your network configuration is ready, you can use the SERVICE keyword to link data from different Neptune clusters, just as you can with any other SPARQL 1.1 endpoint. The following diagram illustrates this architecture.
Federating across Neptune and external SPARQL 1.1 endpoints
In this use case, you may have another Neptune database containing data about routes between airports. The following diagram shows how you can combine data from the soccer dataset with the air routes data from the second Neptune database and the airport data from DBpedia.
You can run your query from any of your Neptune clusters, but remember that the endpoints referenced in SERVICE enclosures should be remote endpoints, not the cluster you run the query on. See the following code:
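A sketch of such a three-source query follows; the second Neptune endpoint URL, the ex: namespace, and the route properties (ex:fromAirport, ex:toAirport) are all illustrative assumptions:

```sparql
%%sparql
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex:   <http://example.org/soccer/>

SELECT ?teamName ?destAirportID ?originAirportID
WHERE {
  # Local cluster: soccer teams and their home cities
  ?team a ex:SoccerTeam ;
        rdfs:label ?teamName ;
        dbo:city ?city .
  # DBpedia: airports located in those cities
  SERVICE <https://dbpedia.org/sparql> {
    ?airport a dbo:Airport ;
             dbo:city ?city ;
             dbo:iataLocationIdentifier ?destAirportID .
  }
  # Second Neptune cluster: routes that arrive at those airports
  SERVICE <https://second-neptune-endpoint:8182/sparql> {
    ?route ex:toAirport ?destAirportID ;
           ex:fromAirport ?originAirportID .
  }
}
```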
Summary
This post examined how to use SPARQL 1.1 Federated Query with Neptune. You should now be able to combine data from as many SPARQL 1.1 endpoints as you like.
Using Neptune allows you to benefit from the W3C recommendations of RDF and SPARQL for your application, and satisfies the operational requirements of business-critical applications, including cost optimization, better integration with native cloud tools, and a lower operational burden.
It’s our hope that this post provides you with the confidence to get started benefiting from SPARQL 1.1 Federated Query with Neptune. If you have any questions, comments, or other feedback, share your thoughts on the Amazon Neptune Discussion Forums.
About the Author
Charles Ivie is a Senior Graph Architect with the Amazon Neptune team at AWS. He has been designing, implementing and leading solutions using knowledge graph technologies for over ten years.