Introducing Gremlin query hints for Amazon Neptune
Amazon Neptune is a fast, reliable, fully managed graph database, optimized for storing and querying highly connected data. It is ideal for online applications that rely on navigating and leveraging connections in their data.
Amazon Neptune supports W3C RDF graphs that can be queried using the SPARQL query language. It also supports Apache TinkerPop property graphs that can be queried using the Gremlin graph traversal and query language.
In this post, we look at some new features available for Gremlin users. Specifically, we focus on the new query hints capability, which gives you greater control over how your queries are executed.
Gremlin query hints
Gremlin query hints are designed to give you more control over how queries traverse your data, and to specify whether or not you want query order to be optimized. The query hints are especially useful when you are familiar with the shape of your data.
To apply a query hint, use the following withSideEffect Gremlin step. You replace <hint-name> with the name of the applicable hint, and <hint-setting> with an appropriate value. All of the new hints begin with the Neptune# prefix.
The currently available hints and hint settings are shown in the following table. We take a close look at each one later in this post.
|Hint name||Hint setting||Action|
|noReordering||true||Neptune does not try to optimize the query order of execution. This is most useful if you know how your data is organized and do not want Neptune to try to change the order in which constraints are applied.|
|noReordering||false||Neptune tries to optimize the query order of execution, in order to apply the strongest constraints first. This is the default noReordering setting.|
|repeatMode||BFS||Traverses the graph in a breadth first fashion. This is the default traversal mode for Amazon Neptune when using the Gremlin repeat step, and it is the preferred mode for many use cases. However, it can also consume significant memory when a large number of vertices have to be examined.|
|repeatMode||DFS||Traverses the graph in a depth first fashion. This mode is most useful when performing a query to find a result that may be many levels deep from the starting point, but where every vertex at each level of depth does not need to be inspected. Because fewer vertices need to be fetched, this mode typically requires less memory, and can provide the best performance.|
|repeatMode||CHUNKED_DFS||A hybrid mode that inspects a maximum of 1,000 vertices at each depth of traversal. This mode is useful when you want to look at some of the vertices at each depth of traversal, but do not need to look at all vertices. In large graphs, this mode provides a good balance between performance and memory usage.|
See Gremlin Query Hints in the Amazon Neptune User Guide for detailed documentation and additional examples of how to use the new query hints.
When evaluating a Gremlin query, by default the Neptune query engine analyzes the query and attempts to build a query plan that minimizes the amount of work that needs to be done, optimizing it for efficient execution.
Because of this optimization, the exact order in which constraints are executed may differ from the order specified in the query.
There are some cases where you may want to disable this optimization. Consider the following example, where two constraints are specified using connected has steps.
In order to achieve the best performance and to minimize the amount of data that has to be fetched and analyzed, the Neptune engine may decide to execute the query as if it had been written as follows.
However, it’s possible that Neptune will pick a query ordering that is not optimal for your specific data and use case. If you know the characteristics of your data and want to specify an exact order for the query, you can do that by using the noReordering hint. In the preceding example, you might know that the phone_number property is unique across all vertices in your graph but that names may not be unique. Therefore, a search that uses the phone number before the name could potentially fetch less data, compared to fetching all of the people named ‘Kelvin’ before checking phone numbers for the specific number we’re interested in.
The following example includes a noReordering hint, telling Neptune not to try to re-order the query.
The repeatMode hints allow us to tell Neptune which of three possible strategies we want to use to traverse a graph.
By default, Neptune uses a breadth first search (BFS) strategy. In a BFS search, all vertices at the current depth are examined before moving to vertices one layer deeper, and so on. For many situations, this strategy works well. However, the BFS strategy is the most memory intensive because of the significant state data that has to be maintained at each depth of the search.
If the vertices we are most interested in are more likely to be found by investigating the deeper parts of a graph first, a depth first search (DFS) strategy can be used. In a DFS search, only one vertex at each given depth is examined before moving on to the next depth. This process repeats until the target is found, or some other constraint is reached, such as the maximum depth specified in the query. The DFS procedure is then repeated until every possible path has been investigated or the constraints specified in the query (such as a limit step) are achieved. When the results we need are deeper in a graph, a DFS approach is likely to start returning results to the calling application much faster than in a BFS search.
A third use case requires a hybrid approach. In this case, we want to get a good sample of vertices at each depth, but do not necessarily want to look at them all before continuing to the next depth. This hybrid strategy uses less memory than a BFS search, but it examines more data at each depth than a DFS search. To traverse a graph with this hybrid approach, Neptune offers the CHUNKED_DFS hint. A maximum of 1,000 vertices are examined at each depth as the query progresses.
To help demonstrate the Neptune#repeatMode query hints, we are going to use a small graph that represents an ordered binary tree. The following Gremlin code, which can be pasted directly into the Gremlin console when it is connected to an Amazon Neptune endpoint, creates the graph. The repeatMode hints work with any shape of graph data, but data that models a tree with some obvious depths makes it easier to visualize how they work.
The following diagram shows a pictorial representation of the small binary tree that the preceding Gremlin code creates.
The following Gremlin query fetches all of the nodes in the graph in a breadth first (BFS) mode, which is the default setting for Amazon Neptune when using the Gremlin repeat step. If you have experience with Gremlin using Apache TinkerGraph, this mode most closely models how TinkerGraph works when using repeat steps in your queries.
When the preceding query is executed using the Gremlin console connected to a Neptune instance, you see the following results. If you compare the output to the diagram of our binary tree, you can see that all of the values from each depth are found before proceeding to the next depth.
Now let’s change the query to fetch the nodes in the graph in a depth first (DFS) mode.
This time, the results look quite different. As you can see, starting from the root node (9), the query first found 5, 2, and 1. In other words, it went as deeply as possible from the root node down the first branch of the tree before coming back to the root and going down the next branch.
Our example graph is not big enough to consume much memory while being searched, so there is no need to use the CHUNKED_DFS hint. However, if we had a big tree with well over 1,000 nodes at some of the depths, a mode that uses less memory would be useful.
Let’s now consider a use case where DFS gives us the most efficient traversal. Imagine we want to find the node in the graph that represents the value 1. The following query uses a repeat…until traversal to search for the node we want. By including an emit step in the query and adding a path step at the end, we can see exactly how the search takes place.
As you might expect from the earlier examples, the 1 node is found quickly.
Let’s now run the same query, but using BFS mode.
As you can see from the following results, this time many more nodes are examined before we find the node representing our 1 value, because we search breadth first. Of course, in a tiny graph like this we can find the result we want fairly quickly using either DFS or BFS mode. However, in a graph with thousands of nodes to be inspected at each depth, DFS mode would be much more efficient.
In this post we introduced the new Gremlin query hints that are available in Amazon Neptune. You can use these hints as you create queries, and benefit from having additional control over how the Neptune query engine executes your queries.
You can find additional articles, videos, and examples on the Amazon Neptune resources web page.
About the Authors
Kelvin Lawrence is a Principal Data Architect in the Database Services Customer Advisory Team focused on Amazon Neptune and many other related services. He has been working with graph databases for many years, is the author of the book “Practical Gremlin” and is a committer on the Apache TinkerPop project.
Ian Robinson is an architect with the Database Services Customer Advisory Team. He is a coauthor of ‘Graph Databases’ and ‘REST in Practice’ (both from O’Reilly) and a contributor to ‘REST: From Research to Practice’ (Springer) and ‘Service Design Patterns’ (Addison-Wesley)