Using SPARQL explain to understand query execution in Amazon Neptune

Customers continue to want greater visibility and control over the services they use within AWS. When it comes to our database services, customer requests typically revolve around providing greater insights into the query optimization and processing within a given database. Database developers and administrators are mostly already familiar with the idea and use of database query execution plans. Motivated by customer discussions, Amazon Neptune has now added the addition of a SPARQL query explain feature.

Amazon Neptune is a fast, reliable, fully managed graph database, optimized for storing and querying highly connected data. It is ideal for online applications that rely on navigating and leveraging connections in their data.

Amazon Neptune supports W3C Resource Description Framework (RDF) graphs that can be queried using the SPARQL query language. It also supports Apache TinkerPop property graphs that can be queried using the Gremlin graph traversal and query language.

In this blog post, we dive deep into the new SPARQL explain feature and its use. We’ve also provided an example workload and configuration at the end of this post for those that want to begin experimenting with SPARQL explain today.

Understanding a SPARQL query’s runtime behavior using explain

When a SPARQL query is submitted to a Neptune cluster, the database engine forwards the query to a SPARQL query optimizer. This generates a query plan based on available statistics and heuristics. The optimizer breaks down the query by its individual triple patterns and the connection operators, and automatically reorders them to provide for optimal execution. With this type of optimization, the query developer does not need to think about the best order of evaluating the query.

In some cases, you may want to gain more insights into the evaluation order of triple patterns (and more generally, the execution plan) that has been chosen by the optimizer. This is where the new SPARQL explain feature comes into play, as it allows you to inspect the generated evaluation plan to understand its execution order.

Getting the query explain output is as easy as adding an additional parameter “explain=<MODE>” to the HTTP request. The following curl command (with variables $NEPTUNE_CLUSTER_ENDPOINT and $NEPTUNE_CLUSTER_PORT pointing to the endpoint and port of a Neptune cluster) allows us to send this query to Neptune. In this case, the SPARQL query is being passed via a text file called query1.sparql.

curl -s http://$NEPTUNE_CLUSTER_ENDPOINT:$NEPTUNE_CLUSTER_PORT/sparql?explain=dynamic \
     -d "@query1.sparql" \
     -H "Content-type: application/sparql-query" \
     -H "Accept: text/plain"

query1.sparql

PREFIX prop: <http://kelvinlawrence.net/air-routes/vocab/prop#> 
PREFIX airport: <http://kelvinlawrence.net/air-routes/data/airport#> 
SELECT DISTINCT ?via WHERE {
  ?route1 prop:from airport:FRA . 
  ?route1 prop:to ?via . 
  ?route2 prop:from ?via . 
  ?route2 prop:to airport:SEA .
}

NOTE: The query above is using a dataset containing commercial air routes. It extracts 1-stop connections from Frankfurt airport to Seattle airport. We discuss how the query works later in this post.

By adding “explain=dynamic” to the end of our HTTP request, we receive the following output. This is a breakdown of the submitted SPARQL query and its execution within Neptune:

╔════╤════════╤════════╤═══════════════════╤═══════════════════════════════════════════════════╤══════════╤══════════╤═══════════╤════════╗
║ ID │ Out #1 │ Out #2 │ Name              │ Arguments                                         │ Mode     │ Units In │ Units Out │ Ratio  ║
╠════╪════════╪════════╪═══════════════════╪═══════════════════════════════════════════════════╪══════════╪══════════╪═══════════╪════════╣
║ 0  │ 1      │ -      │ SolutionInjection │ solutions=[{}]                                    │ -        │ 0        │ 1         │ 0.00   ║
╟────┼────────┼────────┼───────────────────┼───────────────────────────────────────────────────┼──────────┼──────────┼───────────┼────────╢
║ 1  │ 2      │ -      │ PipelineJoin      │ pattern=distinct(?route2, prop:to, airport:SEA)   │ -        │ 1        │ 118       │ 118.00 ║
║    │        │        │                   │ joinType=join                                     │          │          │           │        ║
║    │        │        │                   │ joinProjectionVars=[?route2]                      │          │          │           │        ║
╟────┼────────┼────────┼───────────────────┼───────────────────────────────────────────────────┼──────────┼──────────┼───────────┼────────╢
║ 2  │ 3      │ -      │ PipelineJoin      │ pattern=distinct(?route2, prop:from, ?via)        │ -        │ 118      │ 118       │ 1.00   ║
║    │        │        │                   │ joinType=join                                     │          │          │           │        ║
║    │        │        │                   │ joinProjectionVars=[?route2, ?via]                │          │          │           │        ║
╟────┼────────┼────────┼───────────────────┼───────────────────────────────────────────────────┼──────────┼──────────┼───────────┼────────╢
║ 3  │ 4      │ -      │ PipelineJoin      │ pattern=distinct(?route1, prop:to, ?via)          │ -        │ 118      │ 10030     │ 85.00  ║
║    │        │        │                   │ joinType=join                                     │          │          │           │        ║
║    │        │        │                   │ joinProjectionVars=[?route1, ?via]                │          │          │           │        ║
╟────┼────────┼────────┼───────────────────┼───────────────────────────────────────────────────┼──────────┼──────────┼───────────┼────────╢
║ 4  │ 5      │ -      │ PipelineJoin      │ pattern=distinct(?route1, prop:from, airport:FRA) │ -        │ 10030    │ 45        │ 0.00   ║
║    │        │        │                   │ joinType=join                                     │          │          │           │        ║
║    │        │        │                   │ joinProjectionVars=[?route1]                      │          │          │           │        ║
╟────┼────────┼────────┼───────────────────┼───────────────────────────────────────────────────┼──────────┼──────────┼───────────┼────────╢
║ 5  │ 6      │ -      │ Distinct          │ vars=[?via]                                       │ -        │ 45       │ 45        │ 1.00   ║
╟────┼────────┼────────┼───────────────────┼───────────────────────────────────────────────────┼──────────┼──────────┼───────────┼────────╢
║ 6  │ 7      │ -      │ Projection        │ vars=[?via]                                       │ retain   │ 45       │ 45        │ 1.00   ║
╟────┼────────┼────────┼───────────────────┼───────────────────────────────────────────────────┼──────────┼──────────┼───────────┼────────╢
║ 7  │ -      │ -      │ TermResolution    │ vars=[?via]                                       │ id2value │ 45       │ 45        │ 1.00   ║
╚════╧════════╧════════╧═══════════════════╧═══════════════════════════════════════════════════╧══════════╧══════════╧═══════════╧════════╝

Hint: In this blog post, we always use text/plain, which generates an ASCII-based, tabular serialization. Neptune also supports HTML based output and a CSV serialization (which can conveniently be pasted into spreadsheets for further investigation). See the documentation for more details.

The explain parameter can be set to either static or dynamic. In the preceding example, the explain mode dynamic indicates that we’re also interested in dynamic aspects of the evaluation, namely the number of solutions flowing through the plan at runtime. The static parameter prints a limited view that summarizes the structure of the plan.

SPARQL explain output overview

A Neptune query plan can be understood as a pipeline of operators. Operators receive solutions from one or more downstream operators and forward solutions to (at most two) upstream operators. Our example starts with a SolutionInjection operator, followed by a sequence of PipelineJoin operations, one for each triple pattern in the query, and finally computes a Distinct, a Projection, and a TermResolution operation. The columns Out #1 and Out #2 describe how operators are connected to each other. They contain the IDs of the operators to which the given operator forwards its output. In our example, the operator pipeline is perfectly linear, each operator forwards to the next operator. But for more complex queries, Neptune may choose non-linear plans. An example of this is a Copy operator, which forward results into subplans representing the different parts of a SPARQL UNION.

Breaking it all down

Going into more details, here is how the plan above works (consult the SPARQL explain operator reference for more details about operators and their arguments). We refer to each row in the explain output by its execution ID contained within the first column from the left.

Operator ID 0

The first step in query evaluation is a SolutionInjection step. In our example, this step injects a single, so-called “universal” solution {}, which is expanded to the final result throughout the evaluation process. This step does not contribute much information in this case, it is just telling us that we do not inject any static solutions. We would do this, for instance, when SPARQL queries contain an outer VALUES clause that injects variable bindings to start out with. As indicated by columns Units In and Units Out, this step contains no input and forwards the single, universal solution.

Operator IDs 1–4

The following four steps are so-called PipelineJoin operators. They reflect the sequence in which the triple patterns in the query are evaluated. Most important here is the pattern argument, which indicates the triple pattern that is evaluated by the join. The first operator, for instance, evaluates the triple pattern (?route2, prop:to, airport:FRA), the second triple pattern (?route2, prop:from, ?via), and so on. In all cases, the patterns are prefixed with a distinct, indicating that we are interested in distinct solutions only.

NOTE: This implementation of distinct uses SPARQL’s default graph semantics. While it is not relevant in our example, the same matching triple could exist in multiple named graphs, in which case we would only extract one solution (as implied by the distinct). See the documentation on how Neptune treats named vs. default graphs for more information.

Comparing the sequence of PipelineJoins against the order of triple patterns, as specified in our query, you can see that the optimizer has changed the evaluation order. The approach chosen in the query plan is to start out with the destination, Seattle, and traverse the graph “backwards”, via the intermediate stop, to the starting Frankfurt airport. Intuitively speaking, the reason for this reordering is that Seattle is the smaller airport (in terms of connections), so traversing the graph starting from Seattle results in a smaller search space.

One important thing to know about Neptune’s PipelineJoin is that variable bindings known from previous operators are “substituted” into triple patterns for subsequent operations. Per our example:

Operator #1: The first PipelineJoin receives the universal solution as input, in which no variables are bound. In the absence of any known variable bindings, it simply looks up all (distinct) triples in the database matching the pattern (?route2, prop:to, airport:SEA). Because of this operation, variable ?route2 is bound to all routes that have airport:SEA as their target.
The Units Out column in explain indicates that there are 118 such routes. The PipelineJoin operator forwards all these routes (represented by the binding for the ?route2 variable) to the subsequent operator.

Operator #2: The next operator implements the triple pattern (?route2, prop:from, ?via). Column Units In tells us that there are 118 bindings flowing in, representing the different ?route2 bindings found in the previous step. The variables from these bindings are now substituted successively in to the triple patterns, thus extracting the ?via airports from which the given ?route2 started out. Given that a route represents a single connection between two airports, we also have 118 solutions flowing out of this operator. In each of these solutions, the ?via variable is bound to the candidate intermediate airport.
Operators #3: evaluating the pattern (?route1, prop:to, ?via), we now look for all routes ?route1 ending in airport ?via, which was bound in the previous pattern. For the 118 ?via airports, we find 10003 such routes, indicating that each of these airports in average has about ~85 incoming connected airports.
Operators #4: The last PipelineJoin implements the triple pattern (?route1, prop:from, airport:FRA). In the input that is flowing into the operator, the variable ?route1 has just recently been bound. As usual, we substitute in these candidates, thus evaluating a sequence of fully bound triple patterns. While the evaluation of such a fully bound pattern does not introduce new bindings, its evaluation pattern can be understood as “filtering” for those solutions for which the triple pattern exists in the database. In this case, we filter the set of candidate bindings to retain those that actually started in Frankfurt. Up to this point, we were just exploring the graph backwards, to retrieve all airports from which we can reach Seattle with one intermediate stop. Unsurprisingly, by applying this filter constraint, the operator brings our intermediate solution size down from 10030 Units In to only 45 Units Out.

Operator IDs 5–6

Here we compute a Distinct (operator number 5) and Projection (operator number 6) over the designated output variable ?via. The Distinct operator essentially removes all other variables from the solutions (as they have been collected throughout the PipelineJoin sequence) and only retains the distinct values found for the ?via variable.

Operator ID 7

For the sake of performance, throughout the evaluation process Neptune operates on internal identifiers for the terms (such as URIs and string literals). The final TermResolution operator performs a mapping from these internal identifiers to their lexicographical form. In regular (non-explain mode) query evaluation, the result of this final transformation is then serialized into the requested serialization format and streamed to the client.

Optimization by example: Understanding the impact of DISTINCT

One interesting aspect that becomes immediately apparent from the query plan is that the Distinct operator effectively performs no work, it has 45 solutions flowing in and the same number of solutions flowing out. Looking at our example, we’re using an open source dataset that contains all of the available commercial air routes from around the world. It contains a graph model with airports and corresponding routes between them. This is the same query (query1.sparql) that we executed earlier to demonstrate the output of the SPARQL explain feature.

query1.sparql

PREFIX prop: <http://kelvinlawrence.net/air-routes/vocab/prop#> 
PREFIX airport: <http://kelvinlawrence.net/air-routes/data/airport#> 
SELECT DISTINCT ?via WHERE {
  ?route1 prop:from airport:FRA . 
  ?route1 prop:to ?via . 
  ?route2 prop:from ?via . 
  ?route2 prop:to airport:SEA .
}

The basic schema of the Air Routes data is illustrated in the Figure below, by example of two directly connected Airports, Seattle Tacoma and Frankfurt am Main Airport. In this case, the two airports are connected via a flight route (here identified by the node “5090”). Each airport is described by a number of properties such as code, city, latitude, longitude, the country in which the airport is located.

The query checks for a route ?route1 starting from Seattle, ending at an intermediate stop identified by variable ?via, and a route ?route2 starting at the intermediate stop ?via and ending at Frankfurt.

When executing the query over the AirRoutes dataset, the query reports 45 airport identifiers, reflecting all the 45 possible ways of getting from Frankfurt to Seattle with a single stop. In the URI scheme implemented in the AirRoutes dataset, the last three letters in the identifiers represent the airport codes:

via

http://kelvinlawrence.net/air-routes/data/airport#TPE
http://kelvinlawrence.net/air-routes/data/airport#YVR
http://kelvinlawrence.net/air-routes/data/airport#KEF
http://kelvinlawrence.net/air-routes/data/airport#HND
http://kelvinlawrence.net/air-routes/data/airport#YYC

… (40 more)

In our example, two connected airports are always connected by a single route only. This means that our query produces DISTINCT results by design, even without an explicit DISTINCT operator in the SPARQL query. This is useful information as we may want to remove the DISTINCT operator in our SPARQL query to save the time and required memory for its computation. Let’s look at the explain output for the query without DISTINCT:

query2.sparql

PREFIX prop: <http://kelvinlawrence.net/air-routes/vocab/prop#> 
PREFIX airport: <http://kelvinlawrence.net/air-routes/data/airport#> 
SELECT ?via WHERE {
  ?route1 prop:from airport:FRA . 
  ?route1 prop:to ?via . 
  ?route2 prop:from ?via . 
  ?route2 prop:to airport:SEA .
}

As expected, the plan now omits the Distinct operator and thus gets by with a little less work:

curl -s http://$NEPTUNE_CLUSTER_ENDPOINT:$NEPTUNE_CLUSTER_PORT/sparql?explain=dynamic \
     -d "@query2.sparql" \
     -H "Content-type: application/sparql-query" \
     -H "Accept: text/csv"

╔════╤════════╤════════╤═══════════════════╤═══════════════════════════════════════════════════╤══════════╤══════════╤═══════════╤════════╗
║ ID │ Out #1 │ Out #2 │ Name              │ Arguments                                         │ Mode     │ Units In │ Units Out │ Ratio  ║
╠════╪════════╪════════╪═══════════════════╪═══════════════════════════════════════════════════╪══════════╪══════════╪═══════════╪════════╣
║ 0  │ 1      │ -      │ SolutionInjection │ solutions=[{}]                                    │ -        │ 0        │ 1         │ 0.00   ║
╟────┼────────┼────────┼───────────────────┼───────────────────────────────────────────────────┼──────────┼──────────┼───────────┼────────╢
║ 1  │ 2      │ -      │ PipelineJoin      │ pattern=distinct(?route2, prop:to, airport:SEA)   │ -        │ 1        │ 118       │ 118.00 ║
║    │        │        │                   │ joinType=join                                     │          │          │           │        ║
║    │        │        │                   │ joinProjectionVars=[?route2]                      │          │          │           │        ║
╟────┼────────┼────────┼───────────────────┼───────────────────────────────────────────────────┼──────────┼──────────┼───────────┼────────╢
║ 2  │ 3      │ -      │ PipelineJoin      │ pattern=distinct(?route2, prop:from, ?via)        │ -        │ 118      │ 118       │ 1.00   ║
║    │        │        │                   │ joinType=join                                     │          │          │           │        ║
║    │        │        │                   │ joinProjectionVars=[?route2, ?via]                │          │          │           │        ║
╟────┼────────┼────────┼───────────────────┼───────────────────────────────────────────────────┼──────────┼──────────┼───────────┼────────╢
║ 3  │ 4      │ -      │ PipelineJoin      │ pattern=distinct(?route1, prop:to, ?via)          │ -        │ 118      │ 10030     │ 85.00  ║
║    │        │        │                   │ joinType=join                                     │          │          │           │        ║
║    │        │        │                   │ joinProjectionVars=[?route1, ?via]                │          │          │           │        ║
╟────┼────────┼────────┼───────────────────┼───────────────────────────────────────────────────┼──────────┼──────────┼───────────┼────────╢
║ 4  │ 5      │ -      │ PipelineJoin      │ pattern=distinct(?route1, prop:from, airport:FRA) │ -        │ 10030    │ 45        │ 0.00   ║
║    │        │        │                   │ joinType=join                                     │          │          │           │        ║
║    │        │        │                   │ joinProjectionVars=[?route1]                      │          │          │           │        ║
╟────┼────────┼────────┼───────────────────┼───────────────────────────────────────────────────┼──────────┼──────────┼───────────┼────────╢
║ 5  │ 6      │ -      │ Projection        │ vars=[?via]                                       │ retain   │ 45       │ 45        │ 1.00   ║
╟────┼────────┼────────┼───────────────────┼───────────────────────────────────────────────────┼──────────┼──────────┼───────────┼────────╢
║ 6  │ -      │ -      │ TermResolution    │ vars=[?via]                                       │ id2value │ 45       │ 45        │ 1.00   ║
╚════╧════════╧════════╧═══════════════════╧═══════════════════════════════════════════════════╧══════════╧══════════╧═══════════╧════════╝

How to get started with SPARQL explain and the open source air routes dataset

In this blog post, we illustrated how to use the new SPARQL explain feature by use of sample queries over the open source Air Routes dataset. We have converted this dataset into the popular Turtles file serialization format that Neptune supports for bulk loading.

If you want to practice with the examples from this blog post, you can either load the dataset into one of your Neptune instances using Neptune’s data loader, or even more conveniently, we have prepared an AWS CloudFormation stack that creates a small stack containing a Neptune cluster and an Amazon SageMaker Notebook instance. The Notebook instance contains a Jupyter notebook containing all the example queries discussed in the preceding post. Please be aware that costs apply for the infrastructure during the time the stack is running. To set up the stack:

Open a browser and go to http://console.aws.amazon.com. Log in to your AWS account.
Once logged in, click on one of the Launch Stack buttons below. These are region-specific and will launch this example CloudFormation stack in the Region of your choice:

Region	View	Launch
US East 1 (N. Virginia)	View
US East 2 (Ohio)	View
US West 2 (Oregon)	View
EU West 1 (Ireland)	View
EU West 2 (London)	View
EU Central 1 (Frankfurt)	View

The Launch Stack links above contain parameters to auto-populate the title of the stack as “SPARQL-Explain” along with a location of the Jupyter notebook to import.
- Be sure to check the two boxes at the bottom of the page, allowing AWS CloudFormation to create IAM resources and to execute nested stacks.
- Click Create Stack. This creates a fresh Neptune Instance (r4.xlarge), an Amazon SageMaker Notebook instance containing a Jupyter notebook tied to the Neptune instance, and required infrastructure (such as a VPC, data loader role, etc.).
Once your stack is ready, browse to the Outputs of the Stack within the AWS CloudFormation Console.
- Find the SageMakerNotebook This is a link to the Amazon SageMaker Notebook instance. Click on this link to open the Jupyter interface for this instance.

Once Jupyter loads, you should see a single directory called Neptune.
- Click on the Neptune directory and then click on the sparqlexplain This should contain a single Jupyter notebook called SPARQLExplainAndQueryHints.ipynb.
- Click on this notebook file to open it.

Once the notebook has opened, you can walk through the examples by executing the cells in sequence using the “play” button at the top. If you want to change examples, you may modify existing cells (or create new ones) and re-execute them, etc.

Please don’t forget to shut down the AWS CloudFormation stack to clean up deployed infrastructure resources once you are done playing with it. You can do this by browsing back to the CloudFormation console, choosing the root stack (called SPARQL-Explain) and clicking the Delete button.

Conclusion

SPARQL explain is a great tool to understand and tune Neptune’s runtime behavior. It can be useful to:

understand the behavior of queries, for example, why queries are actually challenging to evaluate
when you seek for writing improved versions of queries. For example, by removing redundant operators, such as the DISTINCT in the example above
identifying time-space tradeoffs
or watching out for scenarios where a hand-tuned plan may improve upon the optimizer-generated plan.

In particular, when you are building higher-level graph applications on top of Neptune based on a set of SPARQL query templates, understanding and tuning the evaluation behavior of your templates via SPARQL explain can provide additional performance potentials. This ultimately improves the user experience of your application’s end users.

If you dive deeper into SPARQL explain, you notice that the examples only scratched the surface. Our official explain documentation contains an exhaustive reference for all the operators and possible arguments. As always, AWS welcomes feedback, so please leave any comments or questions below.

About the Authors

Taylor Riggan is a Sr. Specialist Solutions Architect covering Graph and In-Memory Databases with Amazon Web Services. He works with customers of all sizes to help them learn and use purpose-built, NoSQL databases.

Michael Schmidt is a Sr. Software Development Engineer with Amazon Web Services. In his role as lead architect for Neptune’s query optimization stack, he works both with AWS customers and internal development teams to improve Amazon Neptune’s performance, scalability, and user experience.