AWS Open Source Blog
Why developers like Apache TinkerPop, an open source framework for graph computing
Apache TinkerPop is an open source computing framework for graph databases and graph analytic systems. Designed to appeal to software developers, TinkerPop lets developers add graph computing capabilities to their applications without worrying about developing APIs, graph processing engines, or graph algorithms. Although Apache TinkerPop is an open source project rather than a formal standard, for many programmers the protocols and frameworks defined by the TinkerPop project have become the de facto way to build graph database engines and applications such as knowledge graphs, fraud graphs, and identity graphs.
In this post, we provide an introduction to the Apache TinkerPop open source project and explain how it helps developers create and explore directed property graphs.
What is Apache TinkerPop?
Apache TinkerPop is perhaps best known as the home of the Gremlin query language along with the popular Gremlin character artwork originated by Ketrina Yim. There is a lot more to the project than a query language and a sassy gremlin, though.
Major components that make up the Apache TinkerPop project include:
- The Gremlin query language
- The Gremlin Console (provides simple REPL access to a graph)
- A Gremlin Server implementation that can be used to host remote graphs
- Serialization formats and connection protocols
- Client drivers that support a number of popular programming languages
- Support for both OLTP and OLAP access patterns
- A complete reference implementation written in Java for both server and client
- TinkerGraph, an in-memory graph database
- Reference documentation, recipes, and more
Launched in 2009, TinkerPop was created as an open source project at a time when a number of open source and commercially available graph databases already existed, each using its own APIs. The TinkerPop project founders sought to unify graph computing behind a single set of interfaces, with a common query language and toolset that would let users work with any of these databases in an agnostic fashion.
Developers could build their graph applications in a vendor-agnostic manner with TinkerPop and avoid being locked into a particular graph database choice.
Because TinkerPop included the Gremlin query language, a graph processing engine, a graph provider test suite, and immediate connection to the rest of the TinkerPop ecosystem, it became a popular framework for existing graph databases to hook into, opening up their systems to the many TinkerPop users out there. It also provided a launching point for a number of new graph databases that natively implemented TinkerPop interfaces and expanded user choices in that space.
Irrespective of whether a graph database was an existing one or a newcomer, a key benefit to the developers of these systems was that TinkerPop allowed them to focus their energies on data storage efficiencies, query optimizations, and other lower-level features. With TinkerPop, vendors could differentiate from their competitors and attract new users in their own ways, as opposed to expending resources on the graph querying and processing that TinkerPop already provided.
By the time TinkerPop 3 reached its first milestone releases in 2015, the code base was contributed to the Apache Software Foundation, where new releases along the TinkerPop 3 line continue to be produced today. Since its inception, TinkerPop 3 has produced more than 70 official releases, proving itself as a mature and well-maintained piece of software.
TinkerPop adoption
The Apache TinkerPop documentation currently lists close to 30 different implementations based on the technology. These include other open source projects such as JanusGraph, an offshoot of the Titan graph database project that was among the earliest to natively implement TinkerPop interfaces, as well as commercial offerings from well-known vendors, including Amazon, DataStax, IBM, and Microsoft. The list of projects supporting TinkerPop continues to grow to this day, with newly announced support from TIBCO and ArcadeDB so far in 2021.
A number of additional products and tools, such as graph visualization libraries, have also been created that work with TinkerPop-compatible graph stores. Many of these tools have been created outside of the TinkerPop project itself. Some are not open source but are free to use, and commercial products are also available.
Solutions based on the Apache TinkerPop framework are in production at companies and institutions around the world, including Amundsen, Netflix, and Altimeter. Amundsen is an open source data discovery and metadata engine for improving the productivity of data analysts, data scientists, and engineers when interacting with data, and it supports Apache TinkerPop as a backend graph database. Netflix is building and scaling a data lineage system to improve the reliability and efficiency of its data infrastructure; the system uses Gremlin and a REST Lineage Service against a graph database. Altimeter is an open source project (MIT License) from Tableau Software, LLC that scans AWS resources and links these resources into a graph, and it supports both Gremlin and RDF/SPARQL graphs.
Why do developers like Apache TinkerPop?
Most of the work for an application developer using TinkerPop to process graph data involves writing and testing Gremlin queries. The TinkerPop client libraries for Gremlin make it easier for programmers using Java or Python, for example, to include Gremlin steps as part of their code. This helps with coding, debugging, and testing, as Gremlin integrates seamlessly with a developer’s preferred IDE, such as VS Code or PyCharm.
Although queries also can be submitted as text strings, being able to write queries as part of their code makes the experience more familiar for developers. The IDE is able to provide syntax highlighting, code completion, and other help. For example, consider the following social network graph that includes links between people and the movies they like.
Using Python, finding the number of people living in Texas who like the Troop Zero movie might be written as:
The Gremlin query first finds the vertex in the graph that represents Troop Zero and then looks for incoming edges with a label of likes. Having found all the people who like Troop Zero, only the ones living in Texas are counted, yielding an answer of 1 for fans. For a programmer, the Gremlin steps needed to express the query and the rest of their code coexist seamlessly.
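As a rough illustration of what such a traversal does, the following standard-library Python sketch mimics each step on a tiny in-memory graph. The vertex identifiers, people, property keys (title, state), and the TX value are assumptions made up for this example, and the gremlin-python line in the comment is one plausible spelling of the query rather than the post's exact listing.

```python
# Hypothetical sample data: ids, names, and property keys are illustrative only.
vertices = {
    "m1": {"label": "movie", "title": "Troop Zero"},
    "m2": {"label": "movie", "title": "Another Movie"},
    "p1": {"label": "person", "name": "Ann", "state": "TX"},
    "p2": {"label": "person", "name": "Raj", "state": "WA"},
    "p3": {"label": "person", "name": "Lee", "state": "TX"},
}
# Directed edges stored as (from_vertex, edge_label, to_vertex).
edges = [
    ("p1", "likes", "m1"),
    ("p2", "likes", "m1"),
    ("p3", "likes", "m2"),
]

# With gremlin-python and a traversal source g, the same logic might read
# (assumed property names):
#   fans = (g.V().has('movie', 'title', 'Troop Zero')
#            .in_('likes').has('state', 'TX').count().next())


def count_fans(title, state):
    # Step 1: find the movie vertex, like V().has('movie', 'title', title).
    movies = [vid for vid, props in vertices.items()
              if props["label"] == "movie" and props["title"] == title]
    # Step 2: follow incoming 'likes' edges back to people, like in_('likes').
    likers = [src for (src, label, dst) in edges
              if label == "likes" and dst in movies]
    # Step 3: keep only people living in the given state, like has('state', state).
    residents = [vid for vid in likers if vertices[vid].get("state") == state]
    # Step 4: count what is left, like count().
    return len(residents)


fans = count_fans("Troop Zero", "TX")
print(fans)
```

Each list comprehension corresponds to one Gremlin step, which is why the traversal reads as a walk from the movie, across its incoming edges, to the people who remain after filtering.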
Gremlin users also tend to enjoy the language itself, whose navigation-like syntax generates a feeling of movement over the structure of the graph, which helps when thinking about a particular query. This sense of motion is so natural to graphs and Gremlin that queries are referred to as graph traversals, because you can envision traversing from a vertex to an edge to another vertex and so on, collecting data along this path to form the result. The Gremlin language itself has a functional, stream-oriented syntax that not only allows graph navigation, but also enables filtering, branching, and data transformation functions, which make it possible to develop complex traversal algorithms.
The following image demonstrates Gremlin traversing the preceding example graph with the sample query, where Gremlin is moving about the graph by taking instruction from the traversal. Pay attention to the highlighted portions of the traversal and the corresponding actions Gremlin is taking on the graph.
The TinkerPop framework and protocols also provide a high level of code portability. An application written using one of the Gremlin language client drivers is likely to work—mostly or completely unchanged—with any graph database that implements the Gremlin Server protocols and the TinkerPop Graph API.
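One way to take advantage of this portability is to keep the endpoint out of the application code. The sketch below builds a Gremlin Server WebSocket URL from configuration; the environment variable names (GRAPH_HOST, GRAPH_PORT, GRAPH_TLS) and the example host name are hypothetical, while port 8182 and the /gremlin path are the usual Gremlin Server defaults.

```python
import os


def gremlin_endpoint(env=None):
    """Build a Gremlin Server WebSocket URL from configuration.

    The variable names below are hypothetical, chosen for this sketch.
    """
    env = os.environ if env is None else env
    host = env.get("GRAPH_HOST", "localhost")
    port = int(env.get("GRAPH_PORT", "8182"))
    use_tls = env.get("GRAPH_TLS", "false").lower() == "true"
    scheme = "wss" if use_tls else "ws"
    return f"{scheme}://{host}:{port}/gremlin"


# The resulting URL would typically be handed to a driver, for example
# gremlin-python's DriverRemoteConnection(url, 'g'); the traversal code
# itself does not need to change when the endpoint does.
local_url = gremlin_endpoint({})
remote_url = gremlin_endpoint({"GRAPH_HOST": "graph.example.internal",
                               "GRAPH_TLS": "true"})
print(local_url)
print(remote_url)
```

Switching between a local TinkerGraph-backed Gremlin Server and a hosted graph database then becomes a configuration change rather than a code change, subject to any provider-specific differences.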
Gremlin programming language clients
The Apache TinkerPop project provides client drivers for several popular programming languages and frameworks, including Java, Python, .NET, JavaScript, and Groovy. The community has created additional clients and frameworks to use TinkerPop with other languages.
How does AWS contribute to Apache TinkerPop?
AWS employees have contributed code, documentation, mailing list support, and other collateral to the Apache TinkerPop project, and AWS employees are among the list of official committers to the project.
In addition to directly contributing to the project, AWS has also contributed to the wider TinkerPop community by introducing the open source Graph Notebook, which is a graph-oriented extension to Jupyter Notebook that supports exploring graphs with Gremlin as well as SPARQL and openCypher. The Graph Notebook gives graph analysts the ability to use Jupyter with any TinkerPop-enabled graph system, further expanding the list of tools available to TinkerPop users.
A common way to work with Apache TinkerPop in an on-premises environment is to configure an architecture in which long-running applications, perhaps written in Java, interact with a Gremlin server that in turn communicates with a graph database engine and a storage system. That architecture could also be moved to the cloud as is, running on Amazon Elastic Compute Cloud (Amazon EC2) instances (or their equivalent in other clouds).
Another way to work with TinkerPop is to use Amazon Neptune, a fully managed graph database service from AWS. When we launched Neptune, we made a decision to support leading graph models and frameworks. Including the Apache TinkerPop framework and protocols was a natural choice for Neptune. AWS customers use the same open source client libraries and protocols to connect to Neptune as they would to connect to any other TinkerPop-enabled implementation. This has several advantages for Neptune users. For example, they do not need to learn a proprietary query language, and they do not need special software to connect to and use the service.
In an environment such as the AWS Cloud, using a database like Neptune, these architectures often change to take advantage of the elastic nature of new features and services. For example, long-running applications may be replaced by serverless ones managed by AWS Lambda. Lambda functions are typically short running and can have high concurrency. The original Gremlin client libraries work well in a fairly static environment, but in a dynamic environment in which everything is more elastic and resources (and even IP addresses) can change at any time, application code must also be able to adapt dynamically.
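To make the idea of adapting dynamically more concrete, here is a minimal sketch of a retry helper that opens a fresh connection on every attempt and backs off exponentially between failures. It is not taken from any TinkerPop or AWS library: the function name run_with_reconnect, its parameters, and the FakeConnection stand-in are all hypothetical, and a real application would pass in a factory that creates an actual driver connection.

```python
import time


def run_with_reconnect(connect, query, attempts=3, base_delay=0.1,
                       sleep=time.sleep):
    """Run query(connection), reconnecting with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        connection = None
        try:
            # Connect afresh each time so a changed endpoint is picked up.
            connection = connect()
            return query(connection)
        except ConnectionError as error:
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
        finally:
            if connection is not None:
                connection.close()
    raise last_error


class FakeConnection:
    """Stand-in for a real driver connection; only the third one works."""
    opened = 0

    def __init__(self):
        FakeConnection.opened += 1
        self.number = FakeConnection.opened

    def submit(self, text):
        if self.number < 3:
            raise ConnectionError("endpoint moved")
        return f"ok: {text}"

    def close(self):
        pass


delays = []  # record the backoff delays instead of actually sleeping
result = run_with_reconnect(FakeConnection,
                            lambda conn: conn.submit("g.V().count()"),
                            sleep=delays.append)
print(result, delays)
```

Creating the connection inside the loop, rather than once at startup, is the design choice that matters here: a short-lived function cannot assume that an endpoint resolved earlier is still valid.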
To make building graph applications easier, we’ve released additional open source tools and libraries as part of the AWSLabs GitHub repo. An example is a Maven library that load balances Gremlin clients across Neptune’s read replicas, making more efficient use of them to horizontally scale read throughput.
Neptune team members are also advocating in the graph community to consider how different graph languages and models can work together. For example, the recently released support for openCypher in Neptune lets customers use Gremlin and openCypher over the same graph data, giving openCypher users access to Gremlin’s imperative traversal capabilities for specific, programmatic graph operations.
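The general idea behind spreading read traffic over replicas can be sketched with simple round-robin selection. This is an illustration of the technique only, not the AWSLabs library or its API; the class name RoundRobinEndpoints and the replica host names are hypothetical.

```python
import itertools
import threading


class RoundRobinEndpoints:
    """Hand out endpoints in rotation so reads are spread evenly."""

    def __init__(self, endpoints):
        endpoints = list(endpoints)
        if not endpoints:
            raise ValueError("at least one endpoint is required")
        self._lock = threading.Lock()  # keep rotation safe across threads
        self._cycle = itertools.cycle(endpoints)

    def next_endpoint(self):
        with self._lock:
            return next(self._cycle)


# Hypothetical replica host names, for illustration only.
replicas = RoundRobinEndpoints([
    "replica-1.example",
    "replica-2.example",
    "replica-3.example",
])
picks = [replicas.next_endpoint() for _ in range(5)]
print(picks)
```

A production-grade balancer would also need to refresh the endpoint list as replicas are added or removed, which this sketch leaves out.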
Conclusion
In this post, we walked through the open source evolution of the Apache TinkerPop project and the broad adoption of its technologies by both commercial vendors and other open source projects. If you would like to help make the project even better, the TinkerPop committers welcome contributions, including documentation suggestions and updates, feature requests, bug reports, and code.
To learn more, visit the Apache TinkerPop site and find the project on GitHub.