Building a Graph Database on AWS Using Amazon DynamoDB and Titan

At AWS re:Invent 2017, we announced the preview of Amazon Neptune, a fast and reliable graph database built for the cloud. Though this blog post still shows the benefits a graph database can deliver for certain use cases, if you are about to build an application yourself and need a graph database, you should first check out Neptune. It is fully managed and highly available, and it includes read replicas, point-in-time recovery, and continuous backups to Amazon S3. Neptune supports Apache TinkerPop Gremlin, the query language used in this post, and the W3C standards of the Resource Description Framework (RDF) 1.1 and its query language, SPARQL 1.1.

The Titan plugin in this post has been superseded by a new plugin for JanusGraph, a fork of the Titan project. Download the plugin and use it on a self-managed basis with Amazon DynamoDB.

———————————————-

Nick Corbett is a Big Data Consultant for AWS Professional Services.

You might not know it, but a graph has changed your life. A bold claim perhaps, but companies such as Facebook, LinkedIn, and Twitter have revolutionized the way society interacts through their ability to manage a huge network of relationships. However, graphs aren’t just used in social media; they can represent many different systems, including financial transactions for fraud detection, customer purchases for recommendation engines, computer network topologies, or the logistics operations of Amazon.com.

In this post, I would like to introduce you to a technology that makes it easy to manipulate graphs in AWS at massive scale. To do this, let’s imagine that you have decided to build a mobile app to help you and your friends with the simple task of finding a good restaurant. You quickly decide to build a ‘server-less’ infrastructure, using Amazon Cognito for identity management and data synchronization, Amazon API Gateway for your REST API, and AWS Lambda to implement microservices that fulfil your business logic. Your final decision is where to store your data. Because your vision is to build a network of friends and restaurants, the natural choice is a graph database rather than an RDBMS. Titan running on Amazon DynamoDB is a great fit for the job.

DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance together with seamless scalability. Recently, AWS announced a plug-in for Titan that allows it to use DynamoDB as a storage backend. This means you can now build a graph database using Titan and not worry about the performance, scalability, or operational management of storing your data.

The network that will power your app is shown below. It illustrates the three major parts of a graph: vertices (or nodes), edges, and properties.

A vertex (or node) represents an entity, such as a person or restaurant. In your graph, you have three types of vertex: customers, restaurants, and the type of cuisine served (called genre in the code examples).
An edge defines a relationship between two vertices. For example, a customer might visit a restaurant or a restaurant may serve food of a particular cuisine. An edge always has direction – it will be outgoing from one vertex and incoming to the other.
A property is a key-value pair that enriches a vertex or an edge. For example, a customer has a name or the customer might rate their experience when they visit a restaurant.
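To make the three building blocks concrete, here is a minimal in-memory sketch in Python. It is not how Titan stores data; the `Vertex` and `Edge` classes, the ids, and the sample values are all made up for illustration. Only the property names (`userId`, `restaurant_name`, `genreId`, `visit_food`) and edge labels (`visit`, `restaurant_genre`) are taken from the queries later in this post.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    """An entity in the graph, such as a customer, restaurant, or genre."""
    id: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    """A directed, labelled relationship between two vertices."""
    label: str
    out_vertex: str  # id of the vertex the edge leaves
    in_vertex: str   # id of the vertex the edge enters
    properties: dict = field(default_factory=dict)


# Hypothetical sample data: one vertex of each type.
customer = Vertex("v1", {"userId": "U1064"})
restaurant = Vertex("v2", {"restaurant_name": "Luna Cafe"})
genre = Vertex("v3", {"genreId": "Cafeteria"})

# The customer visited the restaurant and rated the food 2 (excellent).
visit = Edge("visit", customer.id, restaurant.id, {"visit_food": 2})

# The restaurant serves food of a particular cuisine.
serves = Edge("restaurant_genre", restaurant.id, genre.id)
```

Note that the rating lives on the edge rather than on either vertex: it describes the visit, not the customer or the restaurant.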

After a short time, your app is ready to be released, albeit as a minimum viable product. The initial functionality of your app is very simple: your customer supplies a cuisine, such as ‘Pizza’ or ‘Sushi’, and the app returns a list of restaurants they might like to visit.

To show how this works in Titan, you can follow the instructions in the AWS Big Data Blog’s GitHub repository to load some sample data into your own Titan database, using DynamoDB as the backend store. The data used in this example is based on a data set provided by the Machine Learning Repository at UCI¹. By default, the example uses Amazon DynamoDB Local, a small client-side database and server that mimics the DynamoDB service. This component is intended to support local development and small-scale testing, and lets you save on provisioned throughput, data storage, and transfer fees.

Interaction with Titan is through a graph traversal language called Gremlin, in much the same way as you would use SQL to interact with an RDBMS. However, whereas SQL is declarative, Gremlin is implemented as a functional pipeline; the results of each operation in the query are piped to the next stage. This gives you a degree of control over not just what results your query generates but also how it is executed. Gremlin is part of the open source Apache TinkerPop stack, which has become the de facto standard framework for graph databases and is supported by products such as Titan, Neo4j, and OrientDB.

Titan is written in Java, and the sample data is loaded through its Java API by running Gremlin commands. Your microservices running in Lambda would use the same Java API, calling through to DynamoDB to store the data. Note that the data stored in DynamoDB is compressed and not human-readable (for more information about the storage format, see Titan Graph Modeling in DynamoDB).

For the purposes of this post, however, it’s easier to use the Gremlin REPL, written in Groovy. The instructions on GitHub show you how to start your Gremlin session.

A simple Gremlin query that finds restaurants based on a type of cuisine is shown below:

gremlin> g.V.has('genreId', 'Pizzeria').in.restaurant_name

==>La Fontana Pizza Restaurante and Cafe
==>Dominos Pizza
==>Little Cesarz
==>pizza clasica
==>Restaurante Tiberius

This introduces the concept of how graph queries work; you select one or more vertices then use the language to walk (or traverse) across the graph. You can also see the functional pipeline in action as the results of each element are passed to the next step in the query. The query can be read as shown below.

Network that will power your app
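If the pipeline idea is new to you, the following Python sketch mimics the three stages of the cuisine query with chained generators. It is only an analogy, not Titan or TinkerPop code: the helper functions `has`, `in_`, and `values`, the vertex ids, and the tiny data set are all hypothetical.

```python
# Toy graph. Vertex ids and the edge list are made up for illustration.
vertices = {
    "g1": {"genreId": "Pizzeria"},
    "g2": {"genreId": "Seafood"},
    "r1": {"restaurant_name": "Dominos Pizza"},
    "r2": {"restaurant_name": "Little Cesarz"},
    "r3": {"restaurant_name": "Mariscos El Pescador"},
}

# Each edge is (out_vertex, label, in_vertex).
edges = [
    ("r1", "restaurant_genre", "g1"),
    ("r2", "restaurant_genre", "g1"),
    ("r3", "restaurant_genre", "g2"),
]


def has(vertex_ids, key, value):
    """Keep only the vertices whose property `key` equals `value`."""
    return (v for v in vertex_ids if vertices[v].get(key) == value)


def in_(vertex_ids):
    """Walk each incoming edge backwards to the vertex it came from."""
    for v in vertex_ids:
        for out_v, _label, in_v in edges:
            if in_v == v:
                yield out_v


def values(vertex_ids, key):
    """Emit one property value per vertex."""
    return (vertices[v][key] for v in vertex_ids)


# Roughly: g.V.has('genreId', 'Pizzeria').in.restaurant_name
names = list(values(in_(has(vertices, "genreId", "Pizzeria")), "restaurant_name"))
print(names)
```

Each function consumes the output of the previous one lazily, which is the same shape as the Gremlin query: select the genre vertex, step backwards along its incoming edges, then read a property.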

The query gives us five restaurants to recommend to our customer. This query would be just as easy to run if your data were stored in an RDBMS, so at this point not much is gained by using a graph database. However, as more customers start using your app and the first feature requests come in, you start to feel the benefit of your decision.

Initial feedback from your customers is good. However, they tell you that although it’s great to get a recommendation based on a cuisine, it would be better if they could receive recommendations based on places their friends have visited. You quickly add a ‘friend’ feature to the app and change the Gremlin query that you use to provide recommendations:

This query assumes that a particular user (‘U1064’) has asked us to find a ‘Cafeteria’ restaurant that their friends have visited. The Gremlin syntax can be read as shown below.

This query uses a pattern called ‘backtrack’. You make a selection of vertices and ‘remember’ them. You then traverse the graph, selecting more nodes. Finally, you ‘backtrack’ to your remembered selection and reduce it to those vertices that have a path through to your current position.
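The backtrack pattern can be sketched in plain Python as follows. This is an illustration of the idea only, not Gremlin: the `recommend` function, the adjacency dictionaries, and every id other than ‘U1064’ are invented for the example.

```python
# Hypothetical toy data: who is friends with whom, who visited what,
# and which genres each restaurant serves.
friends = {"U1064": ["U1001", "U1002"]}
visits = {"U1001": ["R1", "R2"], "U1002": ["R2", "R3"]}
genres = {"R1": ["Cafeteria"], "R2": ["Seafood"], "R3": ["Cafeteria", "Bar"]}


def recommend(user_id, genre_id):
    """Restaurants of a given genre that the user's friends have visited."""
    matches = []
    for friend in friends.get(user_id, []):
        for restaurant in visits.get(friend, []):
            # 'Remember' the restaurant before walking further.
            remembered = restaurant
            # Traverse on to the genre vertices and filter them.
            for genre in genres.get(restaurant, []):
                if genre == genre_id:
                    # 'Backtrack': emit the remembered restaurant,
                    # not the genre vertex the walk ended on.
                    matches.append(remembered)
    return matches


cafeterias = recommend("U1064", "Cafeteria")
print(cafeterias)
```

The walk ends on a genre, but the answer you want is the restaurant two steps earlier. Backtracking lets you filter on something further along the path while still returning the earlier selection.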

Again, this query could be executed in an RDBMS but it would be complex. Because you would keep all customers in a single table, finding friends would involve looping back to join a table to itself. While it’s perfectly possible to do this in SQL, the syntax can become long, especially if you want to loop multiple times; for example, how many of my friends’ friends have visited the same restaurant as me? A more important problem would be the performance. Each SQL join would introduce extra latency to the query and you may find that, as your database grows, you can’t meet the strict latency requirements of a modern app. In my test system, Titan returned the answer to this query in 38 ms, but the RDBMS where I staged the data took over 0.3 seconds to resolve, an order of magnitude difference!

Your new recommendations work well, but some customers are still not happy. Just because their friends visited a restaurant doesn’t mean that they enjoyed it; they only want recommendations for restaurants their friends actually liked. You update your app again and ask customers to rate their experience, using ‘0’ for poor, ‘1’ for good, and ‘2’ for excellent. You then modify the query to:

g.V.has('userId','U1101').out('friend').outE('visit').has('visit_food', T.gte, 1).as('x').inV.as('y').out('restaurant_genre').has('genreId', 'Seafood').back('x').transform{e, m -> [food: m.x.visit_food, name:m.y.restaurant_name]}.groupCount{it.name}.cap

==>{Restaurante y Pescaderia Tampico=1, Restaurante Marisco Sam=1, Mariscos El Pescador=2}

This query is based on a user (‘U1101’) asking for a seafood restaurant. The stages of the query are shown below.

This query shows how you can filter for a property on an edge. When you traverse the ‘visit’ edge, you filter for those visits where the food rating was greater than or equal to 1. The query also shows how you can transform results from a pipeline to a new object. You build a simple object, with two properties (food rating and name) for each ‘hit’ you have against your query criteria. Finally, the query also demonstrates the ‘groupCount’ function. This aggregation provides a count of each unique name.
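The filter, transform, and count stages map onto ordinary collection operations. The Python sketch below shows the same three steps on a hand-made list of visits; the visitor ids and ratings are hypothetical and are not taken from the sample data set, so the counts differ from the query output above.

```python
from collections import Counter

# Hypothetical 'visit' edges made by friends to seafood restaurants:
# (friend, restaurant_name, visit_food rating).
visits = [
    ("U1001", "Mariscos El Pescador", 2),
    ("U1002", "Mariscos El Pescador", 1),
    ("U1002", "Restaurante Marisco Sam", 1),
    ("U1003", "Restaurante Marisco Sam", 0),  # poor rating, filtered out
]

# Filter on the edge property, then transform each hit into a small
# object with two properties, like the transform step in the query.
hits = [
    {"food": food, "name": name}
    for _friend, name, food in visits
    if food >= 1
]

# Count hits per unique restaurant name, like groupCount{it.name}.
counts = Counter(hit["name"] for hit in hits)

best_name, best_count = counts.most_common(1)[0]
print(best_name, best_count)
```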

The net result of this query is that the ‘best’ seafood restaurant to recommend is ‘Mariscos El Pescador’, as your customer’s friends have made two visits in which they rated the food as ‘good’ or better.

The reputation of your app grows and more and more customers sign up. It’s great to take advantage of DynamoDB scalability; there’s no need to re-architect your solution as you gain more users, as your storage backend can scale to deal with millions or even hundreds of millions of customers.

Soon, it becomes apparent that most of your customers are using your app when they are out and about. You need to enhance your app so that it can make recommendations that are close to the customer. Fortunately, Titan comes with built-in geo queries. The query below imagines that customer ‘U1064’ is asking for a ‘Cafeteria’ and that you’ve captured the location of their mobile device as (22.165, -101.0):

g.V.has('userId', 'U1064').out('friend').outE('visit').has('visit_rating', T.gte, 2).has('visit_food', T.gte, 2).inV.as('x').out('restaurant_genre').has('genreId', 'Cafeteria').back('x').has('restaurant_place', WITHIN, Geoshape.circle(22.165, -101.00, 5)).as('b').transform{e, m -> m.b.restaurant_name + " distance " + m.b.restaurant_place.getPoint().distance(Geoshape.point(22.165, -101.00).getPoint())} 

==>Luna Cafe distance 2.774053451453471
==>Cafeteria y Restaurant El Pacifico distance 3.064723519030348

This query is the same as before except that there’s an extra filter:

has('restaurant_place', WITHIN, Geoshape.circle(22.165, -101.00, 5)).

Each restaurant vertex has a property called ‘restaurant_place’, which is a geo-point (a latitude and longitude). The filter restricts selection to any restaurants whose ‘restaurant_place’ is within 5 km of the customer’s current location. The part of the query that transforms the output from the pipeline is modified to include the distance to the customer. You can use this to order your recommendations so the nearest is shown first.
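To see what a ‘within 5 km’ filter followed by a nearest-first sort involves, here is a small Python sketch that uses the haversine great-circle formula. This is my own stand-in and not necessarily the calculation Titan performs internally; the restaurant names and coordinates are hypothetical, and only the customer location comes from the query above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/long points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


customer = (22.165, -101.0)

# Hypothetical restaurants with (latitude, longitude) positions.
restaurants = {
    "Restaurant A": (22.150, -101.00),
    "Restaurant B": (22.165, -100.90),
    "Restaurant C": (22.190, -101.02),
}

# Compute the distance once per restaurant, keep those within 5 km,
# then sort so that the nearest comes first.
nearby = sorted(
    (haversine_km(*customer, *place), name)
    for name, place in restaurants.items()
    if haversine_km(*customer, *place) <= 5
)

for distance, name in nearby:
    print(f"{name} distance {distance:.2f}")
```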

Your app hits the big time as more and more customers use it to find a good dining experience. You are approached by one of the restaurants, which wants to run a promotion to acquire new customers. Their request is simple – they will pay you to send an in-app advert to your customers who are friends of people who have visited their restaurant, but who haven’t visited the restaurant themselves. Relieved that your app can finally make some money, you set about writing the query. This type of query follows an ‘except’ pattern:

gremlin> x = []
gremlin> g.V.has('RestaurantId','135052').in('visit').aggregate(x).out('friend').except(x).userId.order

The query assumes that RestaurantId 135052 has made the approach. The first line defines a variable ‘x’ as an empty list. The steps of the query are shown below.

The ‘except’ pattern used in this query makes it very easy to select elements that have not been selected in a previous step. This makes queries such as the above or “who are a customer’s friends’ friends that are not already their friends” easy to resolve. Once again, you could write this query in SQL, but the syntax would be far more complex than the simple Gremlin query used above and the multiple joins needed to resolve the query would affect performance.
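In set terms, the ‘except’ pattern is a difference: collect everyone who has visited, step out to their friends, then discard anyone already collected. The Python sketch below shows that logic on made-up data. The `promotion_targets` function and all user ids are hypothetical, and the sketch also removes duplicates, which is an addition of mine rather than something the Gremlin query above does.

```python
# Hypothetical toy data: visitors per restaurant and friends per user.
visitors_of = {"135052": ["U1001", "U1002"]}
friends = {
    "U1001": ["U1002", "U1003"],
    "U1002": ["U1001", "U1004", "U1003"],
}


def promotion_targets(restaurant_id):
    """Friends of visitors who have not visited the restaurant themselves."""
    # Collect every visitor first, like aggregate(x).
    x = set(visitors_of.get(restaurant_id, []))

    targets = set()
    for visitor in x:
        for friend in friends.get(visitor, []):
            # Skip anyone who is already in x, like except(x).
            if friend not in x:
                targets.add(friend)

    # Sort the user ids, like the final order step.
    return sorted(targets)


targets = promotion_targets("135052")
print(targets)
```

It matters that all visitors are collected before any friend is checked. Otherwise a friend who also visited the restaurant could slip through simply because they had not been seen yet.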

Summary

In this post, I’ve shown you how to build a simple graph database using Titan with DynamoDB for storage. Compared to a more traditional RDBMS approach, a graph database can offer many advantages when you need to model a complex network. Your queries will be easier to understand and you may well get better performance from using a storage engine geared towards graph traversal. Using DynamoDB for your storage gives the added benefit of a fully managed, scalable repository for storing your data. You can concentrate on producing an app that excites your customers rather than managing infrastructure.

If you have any questions or suggestions, please leave a comment below.

References

¹ Blanca Vargas-Govea, Juan Gabriel González-Serna, Rafael Ponce-Medellín. Effects of relevant contextual features in the performance of a restaurant recommender system. In RecSys’11: Workshop on Context Aware Recommender Systems (CARS-2011), Chicago, IL, USA, October 23, 2011.
