AWS Big Data Blog

Performance Tuning Your Titan Graph Database on AWS

Nick Corbett is a Big Data Consultant for AWS Professional Services

Graph databases can outperform an RDBMS and give much simpler query syntax for many use cases. In my last post, Building a Graph Database on AWS Using Amazon DynamoDB and Titan, I showed how a network of relationships can be stored and queried using a graph database. In this post, I show you how to tune the performance of your Titan database running on Amazon DynamoDB in AWS.

Earlier today, AWS released a new version of the Amazon DynamoDB Storage Backend for Titan, which is compatible with Titan v1.0.0. Titan v1.0.0 moves on from the previous version, v0.5.4, with improvement to the query execution engine and compatibility with the latest version of the Apache TinkerPop stack (v3). (See the What’s New post for details about this release.)

The following diagram shows you the stack architecture:

As you saw in my last post, your application runs Gremlin commands on the graph database to traverse the vertices and edges, in much the same way as you would run SQL against an RDBMS. Your application interacts with the TinkerPop API; either calling Titan directly or calling a TinkerPop Gremlin Server instance. You can think of the TinkerPop API in a similar way to ODBC; it gives a standard, vendor-neutral interface to a graph database and is implemented by major graph engines including Titan, Neo4J, or OrientDB.

Here’s an example of a network that is best modelled using a graph database. On the 4th July 2012, CERN announced the preliminary result of the discovery of the Higgs Boson from teams working with the Large Hadron Collider. While most of us struggle to understand what a Higgs particle is and the true implications of its discovery, the scientific community was hugely interested in this event. The sample data1 used in this post records Twitter activity between the 1st and 7th of July 2012 that includes one of the following keyword or hashtags: lhc, cern, boson, higgs.

This data can be used to understand how information flows from one user (or vertex) in the graph to another. To do this, you can run the following ‘path finder’ query that calculates all the paths between two vertices:

s=System.currentTimeMillis(); g.V.has('UserId', 46996).outE.has('tweettype', 'RT').inV.loop(3){it.object.UserId != 33833 && it.loops < 6}.path.filter{it.last().UserId == 33833}.transform{[id: it.UserId, time:System.currentTimeMillis() - s]}.iterate(); t = System.currentTimeMillis() - s;

This was adapted from the shortest path recipe in the Gremlin documentation and finds all paths between two users that are connected by re-tweets. The query starts at the vertex with UserId 46996 and will traverse a maximum of 6 edges in all directions in an attempt to find UserId 33833. You can use the number of traversals to change how easy it is for the system to calculate this query. The greater the number of traversals, the more vertices need to be considered and the more demanding the query.

When creating a Titan graph using DynamoDB, your first decision is whether to use the single-item or multi-item data model. Titan uses the BigTable paradigm to persist data and the DynamoDB Storage Backend supports these two implementations of this. Both store vertex, edge, and property in a DynamoDB table called edgestore (for more details about the tables the storage backend uses, see Titan Graph Modeling in DynamoDB).

However, the single-item model puts each vertex, together with all its properties, edges, and properties of its edges into in a single DynamoDB item in this table. The multi-item model stores each piece of information in a separate DynamoDB item and uses the hash key to link all information for a particular vertex, as shown in the following example tables. Note that the actual data stored is in binary format and not human readable.

The advantage of the single-item data model is that DynamoDB consumes fewer IOPS to read your data so it’s quicker to run queries.

To demonstrate this, the Twitter data was loaded into two graph databases, one for each data model. The path finder query was run using 4 traversals. The query against data stored in the single-item data model returned in 25% of the time, taking 1114 ms compared to 4484 ms for the multi-item format.

However, even the query against the data stored in the multi-item model was faster than running the equivalent query in an RDBMS. The same data was loaded into MySQL running in Amazon Relational Database Service (RDS) with an equivalent instance size (4 CPU / 16 GiB memory).  A SQL statement took 10204 ms to find all the paths between the two vertices.

In general, queries against a graph stored in the single-item data model outperform those stored in the multi-item data model. However, if you need to access a single property or edge, then the multi-item model is faster as the cost of reading the smaller row is less. Also, if you update a vertex or edge in a multi-item data model the cost of changing the smaller row is cheaper and quicker.

The main restriction of the single-item data model is the size of the row, as all properties, edges (both in and out), and edge properties must fit into a single DynamoDB item, currently capped at 400 KB. After it’s created, you cannot change the storage model of your graph, so you need to be sure that you won’t exceed this limit.

At first glance, the Twitter data set looks like a good candidate for the single-item data model, as the only property associated with each vertex is a UserId. However, if live data was captured and added to this model, new edges would be constantly discovered as users retweet and mention each other. Each new edge adds extra storage requirements to its parent vertex; if a user is too active, then you can’t store all their interactions in a single item. Because you don’t want to find yourself in a situation where you can’t add a new edge to a vertex, its best to use the multi-item storage model unless you are absolutely sure the single model works for you.

Using the multiple-item storage model allows your graph to have highly connected vertices or lots of properties, but it does come with a performance cost. This “storage versus performance” choice is not one that most of us would want to make and it is worth looking at the architecture and configuration of the system to see if you can tune performance.

In order to tune your graph database, you need visibility into how the system is performing. Titan uses the Open Source Metrics package to make many measurements visible at run-time. Metrics integrates with many common tools, such as Log4J, Ganglia or Graphite making it easy to get insight into your system. The DynamoDB Storage Backend extends this by adding metrics of its own, which can be enabled in the graph configuration.  By combining these with measures taken from Amazon CloudWatch, you will be able to understand what is happening in your graph database.

The obvious place to start is to increase the size of the instance that is running the Gremlin query. The graph below shows the time taken to run the 6 loop version of the path finder query on various instances in the m4 family. As you can see, increasing the instance size does improve performance but not by much, as shown on the following graph:

As you can see, moving from a m4.large (2 vCPU, 8 GiB memory) instance to an m4.2xlarge (8 vCPU, 32 GiB) only gives a 9% gain in performance when running this particular query, which shows it isn’t bound by memory or CPU. The maximum IOPS against the DynamoDB table is also shown on the graph above (orange line), taken from the DynamoDB Storage Backend metric. The edgestore table was configured to allow 750 read IOPS, but you only see between 200 and 250 when the query is run.

You can configure the DynamoDB storage backend to limit the number of IOPS it uses against each table. Specifically, by changing the storage.dynamodb.stores.edgestore.read-rate configuration parameter, you can cap the maximum IOPS that the storage backend consumes against the main table. The following graph shows the same query run as above, on an m4.xlarge instance. This time, however, the storage backend was configured to limit the read rate against the edgestore table.

The time taken to run the ten-loop query was recorded:

As the maximum read-rate of the storage backend is increased, the time taken to execute the query (blue line) shows exponential decay. This tells us that, for this particular query, the system can only achieve about 200 IOPS against the DynamoDB table. This number will vary from query to query. For example, the following query times how long it takes to build a deduped list of all vertices that can be reached within 8 edge traversals of UserId 46996:

s=System.currentTimeMillis(); a = g.V.has('UserId', 46996).out.loop(1){it.loops < 8}{true}.dedup.count();s=System.currentTimeMillis()-s; 	

Running this query on an m4.xlarge instance generated over 550 IOPS against the edgestore table, as shown in the following Amazon CloudWatch screenshot:

The tests described so far have all been run on a single EC2 instance, using the Gremlin Console (part of the TinkerPop stack) as a Titan client. This architecture works well if the client is JVM-compatible, as it can directly call the Titan libraries. The Gremlin Console is written in Groovy and directly calls the TinkerPop API of the Titan libraries.

An alternative approach is to use Gremlin Server. Gremlin Server is also part of the TinkerPop stack and provides REST and WebSockets interfaces to allow multiple clients to run queries against the database. If you choose to put a Gremlin Server instance into your architecture, you still execute the same Gremlin query but will be able to run these from non-JVM clients. You can also choose to have more memory or CPU as this instance is responsible for resolving your queries.

To get more throughput in a multi-client solution, you can scale out your Gremlin Server instances.  A standard configuration of putting the instances in an Auto Scaling group, spread over multiple Availability Zones works best. You can use an Elastic Load Balancing load balancer to manage the traffic to the instances.

If you want to use the WebSockets interface, configure your load balancer to use TCP (rather than HTTP); if you want to retain information about client IP addresses, enable Proxy Protocol Support. You need to make sure that the DynamoDB Storage Backend configuration on each Gremlin Server instance limits the number of IOPS so that the total generated by all your instances does not exceed the provisioned IOPS your DynamoDB table.

If you choose to scale out your Gremlin Server instances, you need to make sure they are in session-less mode. This means that your entire Gremlin query needs to be encapsulated in a single request to a server, which has the advantage that each server is stateless and can easily scale.

Summary

In this post, I have shown you how the storage model and instance size can affect the performance of a Titan graph database using DynamoDB for storage. The single-item data model outperforms the multiple-item data model in most cases, but comes with the cost of limiting the amount of information you can store.

I have also shown you how to get maximum performance out of a single instance running Gremlin queries and how to increase system throughput by scaling out. By using a combination of CloudWatch and the metrics that the DynamoDB Storage Backend creates, you can get insight into how your graph database is performing and make decisions on how to design your system for best performance.

References

  1. M. De Domenico, A. Lima, P. Mougel and M. Musolesi. The Anatomy of a Scientific Rumor. (Nature Open Access) Scientific Reports 3, 2980 (2013).
  2. Picture of Higgs Boson taken from CERN Document Server. Shared under creative commons license.