AWS Database Blog

Amazon Neptune now supports TinkerPop 3.4 features

Amazon Neptune now supports the Apache TinkerPop 3.4.1 release. In this post, you will find examples of new features in the Gremlin query and traversal language such as text predicates, changes to valueMap, nested repeat steps, named repeat steps, non-numerical comparisons, and changes to the order step. It is worth pointing out that TinkerPop 3.4 has a few important differences from TinkerPop 3.3. Be sure to review the compatibility notes in the engine releases documentation.

All of the latest features and improvements in the engine are documented on the Amazon Neptune Releases page.

Setting up a test cluster

You can try out the examples in this post by following the steps below. This post builds upon two prior posts: Analyze Amazon Neptune Graphs using Amazon SageMaker Jupyter Notebooks and Let Me Graph That For You – Part 1 – Air Routes, and again takes advantage of the air-routes dataset.

The air-routes data used in this example is available on GitHub.

The examples shown below require that Gremlin Python be at the 3.4 level or higher. If you used the AWS CloudFormation templates from our previous posts to generate a set of notebooks and an Amazon SageMaker instance, you must update the level of Gremlin Python by running this command from a Terminal window (inside the notebook) or from a notebook cell prefixed with %%bash.

/home/ec2-user/anaconda3/bin/python3 -m pip install --target \
/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/ gremlinpython

You only need to do this if you kept an instance from before. If you re-run the AWS CloudFormation script, it now installs the latest Gremlin Python libraries for you. Next, let’s import the required classes and establish a connection to the Neptune cluster.

Import the key GremlinPython classes

We must now import some classes from those libraries before we can connect to our Neptune instance from our Python code. Python has a number of reserved words that have the same name as Gremlin query steps, so when needed, those steps must be suffixed with an underscore character when using Gremlin Python. For example, the Gremlin in() step is written as in_().

In [34]:from gremlin_python import statics
		from gremlin_python.structure.graph import Graph
		from gremlin_python.process.graph_traversal import __
		from gremlin_python.process.strategies import *
		from gremlin_python.process.traversal import *
		from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
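
The underscore rule described above can be approximated with Python's standard keyword module. The sketch below is plain Python and does not use Gremlin Python itself. The helper python_step_name is hypothetical, and the list of step names is a small sample chosen for illustration. The Gremlin Python documentation remains the authority on exactly which steps carry a trailing underscore.

```python
import keyword

def python_step_name(gremlin_step):
    """Hypothetical helper: append '_' when a Gremlin step name is a Python keyword."""
    return gremlin_step + '_' if keyword.iskeyword(gremlin_step) else gremlin_step

# A sample of Gremlin step names; some of them collide with Python reserved words.
sample_steps = ['in', 'out', 'as', 'from', 'and', 'or', 'not', 'is', 'with', 'has']
python_names = {step: python_step_name(step) for step in sample_steps}

for step, name in python_names.items():
    print('{:5} -> {}'.format(step, name))
```

This explains why later examples in this post use in_(), or_(), not_(), is_(), and with_() while steps such as out() and has() are written unchanged.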

Establish access to our Neptune instance

Before we can work with our graph, we must establish a connection to it. This is done using the DriverRemoteConnection capability as defined by Apache TinkerPop and supported by GremlinPython. Once this cell has been run, we are able to use the variable g to refer to our graph in Gremlin queries in subsequent cells. By default, Neptune uses port 8182, and that is what we connect to below. When you configure your own Neptune instance, you can choose a different port number. In that case, you would replace port 8182 in the example below with the port you configured.

In [73]:# Create a string containing the full Web Socket path to the endpoint
		# Replace <your-neptune-instance-name> with the name of your Neptune instance.
		# It will be of the form "myinstance.us-east-1.neptune.amazonaws.com"

		neptune_endpoint = '<your-neptune-instance-name>'
		neptune_gremlin_endpoint = 'wss://' + neptune_endpoint + ':8182/gremlin'

		# Obtain a graph traversal source object for the remote endpoint.
		graph = Graph()
		g = graph.traversal()\
		.withRemote(DriverRemoteConnection(neptune_gremlin_endpoint,'g'))
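
If you configured a port other than 8182, it can help to build the endpoint string in one place. The following is a minimal sketch in plain Python. The function build_gremlin_endpoint is a hypothetical helper, the host name is a placeholder in the form shown in the comments above, and port 8183 is an arbitrary example value.

```python
def build_gremlin_endpoint(host, port=8182):
    """Hypothetical helper: build the WebSocket URL for a Gremlin endpoint."""
    return 'wss://{}:{}/gremlin'.format(host, port)

# Placeholder host name; replace it with the name of your own Neptune instance.
example_host = 'myinstance.us-east-1.neptune.amazonaws.com'

default_url = build_gremlin_endpoint(example_host)            # default port 8182
custom_url = build_gremlin_endpoint(example_host, port=8183)  # arbitrary example port
print(default_url)
print(custom_url)
```

The resulting string can be passed to DriverRemoteConnection in place of neptune_gremlin_endpoint.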

Testing our connection

Let’s start off with a simple query just to make sure our connection to Neptune is working. The queries below look at all of the vertices and edges in the graph and create two maps that show an aggregation of the graph. As we are using the air-routes dataset, the values returned are related to airports and routes. You can use these values to help verify the results of other examples contained in this post.

In [73]:vertices = g.V().groupCount().by(T.label).toList()
		edges = g.E().groupCount().by(T.label).toList()
		print(vertices)
		print(edges)

		[{'continent': 7, 'country': 237, 'test999': 1, 'version': 1, 'airport': 3442}]
		[{'contains': 6884, 'route': 49879}]
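
To turn the two maps above into overall totals, you can sum their values. The sketch below copies the maps from the output shown above instead of querying the graph, so the numbers only apply to that version of the dataset. In the notebook, toList() returns each map wrapped in a list, so you would use vertices[0] and edges[0].

```python
# Maps copied from the output above (one particular version of the air-routes data).
vertex_counts = {'continent': 7, 'country': 237, 'test999': 1, 'version': 1, 'airport': 3442}
edge_counts = {'contains': 6884, 'route': 49879}

total_vertices = sum(vertex_counts.values())
total_edges = sum(edge_counts.values())
print('vertices:', total_vertices)
print('edges:', total_edges)
```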

Text predicates

Many customers are excited about the new text predicates in TinkerPop 3.4, which enable more focused text searches. They are useful when you are building an application and want your traversals to filter on the text values of your properties, for example, to find all the cities whose names start with “Dal”. In total, six new predicates have been added.

  • startingWith
  • endingWith
  • containing
  • notStartingWith
  • notEndingWith
  • notContaining

All of these new predicates are case-sensitive.
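
As a rough mental model, each predicate behaves like a familiar Python string test. The sketch below is plain Python written for illustration only. It does not call Gremlin Python, and the dictionary text_predicates is a hypothetical stand-in for the real TextP predicates.

```python
# Plain-Python stand-ins for the six text predicates (illustration only).
text_predicates = {
    'startingWith':    lambda value, s: value.startswith(s),
    'endingWith':      lambda value, s: value.endswith(s),
    'containing':      lambda value, s: s in value,
    'notStartingWith': lambda value, s: not value.startswith(s),
    'notEndingWith':   lambda value, s: not value.endswith(s),
    'notContaining':   lambda value, s: s not in value,
}

city = 'Dallas'
matches_upper = text_predicates['startingWith'](city, 'Dal')  # case matches
matches_lower = text_predicates['startingWith'](city, 'dal')  # case differs
print(matches_upper, matches_lower)
```

Like the Gremlin predicates, these tests are case-sensitive, so “Dal” and “dal” are treated as different prefixes.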

startingWith

The example below looks for any cities with names starting with “Dal”. A dedup step is used to get rid of any duplicate names.

In [4]:g.V().hasLabel('airport')\
		.has('city',TextP.startingWith('Dal'))\
		.values('city').dedup().toList()
Out[4]:['Dalat', 'Dallas', 'Dalcahue', 'Dalaman', 'Dalian', 'Dalanzadgad']

As the text predicates are case-sensitive, if we look for cities that have names starting with “dal” we will not find any.

In [5]:g.V().hasLabel('airport').has('city',startingWith('dal')).count().next()
Out[5]:0

If you want to check for either ‘Dal’ or ‘dal’, you can do that using an or step and two has steps as shown below.

In [16]:g.V().hasLabel('airport')\
		.or_(__.has('city',startingWith('dal')),\
		__.has('city',startingWith('Dal')))\
		.dedup().by('city')\
		.count()\
		.next()
Out[16]:6

notStartingWith

Each of the text predicates has an inverse predicate. We can use the notStartingWith predicate to look for city names that do not start with “Dal”.

In [7]:g.V().hasLabel('airport').has('city',notStartingWith('Dal')).count().next()
Out[7]:3434

The example above returns the same results we would get if we were to negate a startingWith step as shown below.

In [8]:g.V().hasLabel('airport').not_(__.has('city',startingWith('Dal'))).count().next()
Out[8]:3434
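
The two counts agree here because every airport vertex has a city property. As we understand the has step, it only passes elements that actually carry the property, so on data where the property is sometimes missing the two forms can differ. The sketch below models that behavior in plain Python with toy records. The function has_city and the records are hypothetical and do not come from the air-routes data.

```python
def has_city(vertex, predicate):
    """Models has('city', predicate): the property must exist and pass the test."""
    return 'city' in vertex and predicate(vertex['city'])

def starts_with_dal(value):
    return value.startswith('Dal')

def not_starts_with_dal(value):
    return not value.startswith('Dal')

# Toy records; the last one deliberately has no 'city' property.
vertices = [{'city': 'Dallas'}, {'city': 'Austin'}, {'code': 'XXX'}]

# Models has('city', notStartingWith('Dal'))
with_inverse_predicate = [v for v in vertices if has_city(v, not_starts_with_dal)]
# Models not_(has('city', startingWith('Dal')))
with_negated_has = [v for v in vertices if not has_city(v, starts_with_dal)]

print(len(with_inverse_predicate), len(with_negated_has))
```

In this model the record without a city is dropped by the inverse predicate but kept by the negated has, which is worth keeping in mind when choosing between the two forms.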

endingWith

The example below looks for any city names ending with the characters “zhi”.

In [9]:g.V().hasLabel('airport').has('city',endingWith('zhi')).values('city').toList()
Out[9]:['Changzhi']

notEndingWith

Using notEndingWith we can easily find cities whose names do not end with “zhi”.

In [10]:g.V().hasLabel('airport').has('city',notEndingWith('zhi')).count().next()
Out[10]:3440

containing

We can also look for cities whose names contain a certain string. The example below looks for any cities with the string “gzh” in their name.

In [11]:g.V().hasLabel('airport').has('city',containing('gzh'))\
		.values('city').toList()
Out[11]:['Zhengzhou',
		'Guangzhou',
		'Yongzhou',
		'Hangzhou',
		'Changzhou',
		'Yangzho',
		'Changzhi']

notContaining

The example below chains together a number of has steps using notContaining predicates to find cities whose names contain none of the basic lowercase vowels commonly used in English.

In [11]:g.V().hasLabel('airport')\
		.has('city',notContaining('e'))\
		.has('city',notContaining('a'))\
		.has('city',notContaining('i'))\
		.has('city',notContaining('u'))\
		.has('city',notContaining('o'))\
		.values('city')\
		.dedup()\
		.toList()
Out[11]:['Orsk',
		'Łódź',
		'Røst',
		'Iğdır',
		'Osh',
		'Kyzyl',
		'Omsk',
		'Växjö',
		'Brønnøy',
		'Mörön',
		'Årø']
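
Names such as “Orsk” and “Växjö” appear in the results because an uppercase “O” and accented letters such as “ö” are different characters from the lowercase vowels being tested. The plain-Python sketch below illustrates this with a few names from the output above plus one counterexample. The function has_no_basic_vowels is a hypothetical helper, and the normalization step is an assumption made so that the sketch treats each accented letter as a single character.

```python
import unicodedata

VOWELS = 'aeiou'

def has_no_basic_vowels(name):
    """True when none of the lowercase ASCII vowels appears in the name."""
    # Compose accented letters into single code points before testing.
    composed = unicodedata.normalize('NFC', name)
    return all(vowel not in composed for vowel in VOWELS)

# A few names from the output above, plus 'Dallas' as a counterexample.
names = ['Orsk', 'Røst', 'Växjö', 'Kyzyl', 'Dallas']
results = {name: has_no_basic_vowels(name) for name in names}

for name, passed in results.items():
    print(name, passed)
```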

Changes to valueMap

Apache TinkerPop 3.4 introduced changes and new capabilities to the way that a valueMap step is used. In general, a valueMap step returns a set of key-value pairs as shown below. By default all values are shown as members of lists. This is the same behavior found in earlier versions of TinkerPop.

In [13]:g.V().has('code','AUS').valueMap().next()
Out[13]:{'country': ['US'],
		'code': ['AUS'],
		'longest': [12250],
		'city': ['Austin'],
		'elev': [542],
		'icao': ['KAUS'],
		'lon': [-97.6698989868164],
		'runways': [2],
		'type': ['airport'],
		'region': ['US-TX'],
		'lat': [30.1944999694824],
		'desc': ['Austin Bergstrom International Airport']}

TinkerPop 3.4 added the ability to have the results of a valueMap step returned without the values presented in lists, by adding a by modulator that unfolds each value.

In [14]:g.V().has('code','AUS').valueMap().by(__.unfold()).next()
Out[14]:{'code': 'AUS',
		'type': 'airport',
		'desc': 'Austin Bergstrom International Airport',
		'country': 'US',
		'longest': 12250,
		'city': 'Austin',
		'lon': -97.6698989868164,
		'elev': 542,
		'icao': 'KAUS',
		'region': 'US-TX',
		'runways': 2,
		'lat': 30.1944999694824}
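
The effect of unfolding can be pictured as taking the first entry of each value list. The sketch below is a plain-Python model, not Gremlin Python code. The function unfold_value_map is hypothetical, and the 'alias' property is invented to show what we would expect for a property that holds more than one value.

```python
def unfold_value_map(value_map):
    """Keep the first entry of each value list, modeling valueMap().by(unfold())."""
    return {key: values[0] for key, values in value_map.items()}

# A few entries copied from the list-style output shown earlier.
sample = {'code': ['AUS'], 'city': ['Austin'], 'runways': [2]}
flattened = unfold_value_map(sample)
print(flattened)

# Hypothetical multi-valued property: only the first value survives in this model.
multi = unfold_value_map({'alias': ['first', 'second']})
print(multi)
```

If your graph uses multi-valued properties, this is a reason to check whether the unfolded form still returns everything you need.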

As in prior releases, you can be more specific about the property keys you are interested in and unfold the results. Requesting only the keys you need is a best practice for traversal performance.

In [15]:g.V().has('code','AUS').valueMap('icao').by(__.unfold()).next()
Out[15]:{'icao': 'KAUS'}

You can also select specific keys to return just the values without their associated key names.

In [16]:g.V().has('code','AUS').valueMap().by(__.unfold()).select('icao').next()
Out[16]:'KAUS'

Before Apache TinkerPop 3.4, in order to have the ID and label of a vertex or edge included in valueMap results, you would use the valueMap(true) construction as shown below.

In [17]:g.V().has('code','AUS').valueMap(True).toList()
Out[17]:[{'country': ['US'],
		'code': ['AUS'],
		'longest': [12250],
		'city': ['Austin'],
		<T.label: 3>: 'airport',
		<T.id: 1>: '3',
		'lon': [-97.6698989868164],
		'type': ['airport'],
		'elev': [542],
		'icao': ['KAUS'],
		'runways': [2],
		'region': ['US-TX'],
		'lat': [30.1944999694824],
		'desc': ['Austin Bergstrom International Airport']}]

The use of valueMap(true) is now deprecated. Instead, the new with step allows us to specify what we want returned using the WithOptions enumeration.

In [18]:g.V().has('code','AUS').valueMap().with_(WithOptions.tokens).toList()
Out[18]:[{'country': ['US'],
		'code': ['AUS'],
		'longest': [12250],
		'city': ['Austin'],
		<T.label: 3>: 'airport',
		<T.id: 1>: '3',
		'lon': [-97.6698989868164],
		'type': ['airport'],
		'elev': [542],
		'icao': ['KAUS'],
		'runways': [2],
		'region': ['US-TX'],
		'lat': [30.1944999694824],
		'desc': ['Austin Bergstrom International Airport']}]

The results can be unfolded as in the prior examples.

In [19]:g.V().has('code','AUS').valueMap().by(__.unfold())\
		.with_(WithOptions.tokens).toList()
Out[19]:[{<T.id: 1>: '3',
		<T.label: 3>: 'airport',
		'code': 'AUS',
		'type': 'airport',
		'desc': 'Austin Bergstrom International Airport',
		'country': 'US',
		'longest': 12250,
		'city': 'Austin',
		'lon': -97.6698989868164,
		'elev': 542,
		'icao': 'KAUS',
		'region': 'US-TX',
		'runways': 2,
		'lat': 30.1944999694824}]

Adding a numerical index to a collection

The new index step allows anything that is a collection, such as the results of a fold step, to have a numerical index value associated with each entry in the collection. The first index value is always zero and the increment is always 1.

In [18]:g.V().has('airport','region','US-TX').limit(5)\
		.values('code')\
		.fold().index()\
		.next()
Out[18]:[['AFW', 0], ['DRT', 1], ['AUS', 2], ['DFW', 3], ['IAH', 4]]

A with step can be used to control the type of index that is created. The default is list, but you can also ask for the indexed values to be returned as a map. The index is the map’s key and the original values are mapped against those keys.

In [19]:g.V().has('airport','region','US-TX').limit(5)\
		.values('code')\
		.fold().index().with_(WithOptions.indexer,WithOptions.map)\
		.next()
Out[19]:{0: 'AFW', 1: 'DRT', 2: 'AUS', 3: 'DFW', 4: 'IAH'}

While list is the default indexing mode, you can explicitly request it using a with step.

In [20]:g.V().has('airport','region','US-TX').limit(5)\
		.values('code')\
		.fold().index().with_(WithOptions.indexer,WithOptions.list)\
		.next()
Out[20]:[['AFW', 0], ['DRT', 1], ['AUS', 2], ['DFW', 3], ['IAH', 4]]
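
For readers more familiar with Python than Gremlin, the two indexing modes resemble what enumerate produces. The sketch below is plain Python and uses the airport codes from the output above. It is an analogy for the shape of the results, not a call to the index step.

```python
codes = ['AFW', 'DRT', 'AUS', 'DFW', 'IAH']

# List mode: each entry becomes a [value, index] pair.
as_list = [[code, i] for i, code in enumerate(codes)]

# Map mode: the index is the key and the original value is the map value.
as_map = dict(enumerate(codes))

print(as_list)
print(as_map)
```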

The index values can be accessed from a query. The example below uses the index value to return the results in reverse order.

In [24]:g.V().has('airport','region','US-TX').limit(5)\
		.values('code')\
		.fold().index()\
		.unfold().order().by(__.tail(Scope.local,1),Order.desc)\
		.toList()
Out[24]:[['IAH', 4], ['DFW', 3], ['AUS', 2], ['DRT', 1], ['AFW', 0]]
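
In the query above, tail(Scope.local,1) picks the last element of each [value, index] pair, which is the index, and the pairs are then sorted on it in descending order. A plain-Python equivalent of that sort, using the pairs from the earlier output, might look like this:

```python
indexed = [['AFW', 0], ['DRT', 1], ['AUS', 2], ['DFW', 3], ['IAH', 4]]

# pair[-1] plays the role of tail(Scope.local, 1): the last item of each pair.
reversed_pairs = sorted(indexed, key=lambda pair: pair[-1], reverse=True)
print(reversed_pairs)
```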

The example below applies an index step to the results generated by a group step.

In [23]:g.V().has('region',P.within('US-NM','US-OK','US-AR'))\
		.group().by('region').by('city')\
		.index()\
		.next()
Out[23]:[[{'US-AR': ['Harrison',
		'Little Rock',
		'Texarkana',
		'Fort Smith',
		'El Dorado',
		'Hot Springs',
		'Jonesboro',
		'Fayetteville/Springdale/']},
		0],
		[{'US-NM': ['Clovis',
		'Santa Fe',
		'Albuquerque',
		'Carlsbad',
		'Los Alamos',
		'Roswell',
		'Farmington',
		'Hobbs',
		'Silver City']},
		1],
		[{'US-OK': ['Oaklahoma City', 'Tulsa', 'Stillwater', 'Lawton']}, 2]]

Nested repeat steps

Gremlin repeat steps can now be nested inside other repeat steps or inside emit and until steps. The example below starts at the Austin airport, traverses out one time, and for each airport found looks at the incoming routes to a depth of two.

In [59]:paths = (g.V().has('code','AUS').
		repeat(__.out('route').
		repeat(__.in_('route')).times(2)).
		times(1).
		path().by('code').
		limit(10).
		toList())
		for p in paths:
		    line = '{} -> {} <- {} <- {}'.format(p[0],p[1],p[2],p[3])
		    print(line)
	
		AUS -> ATL <- SLC <- LAS
		AUS -> ATL <- SLC <- DEN
		AUS -> ATL <- SLC <- SAT
		AUS -> ATL <- SLC <- MSY
		AUS -> ATL <- SLC <- CDG
		AUS -> ATL <- SLC <- RSW
		AUS -> ATL <- SLC <- MKE
		AUS -> ATL <- SLC <- MDW
		AUS -> ATL <- SLC <- OMA
		AUS -> ATL <- SLC <- TUL
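
The nesting can be pictured as one loop inside another: the outer repeat takes a single outgoing hop, and the inner repeat takes two incoming hops from wherever the outer hop landed. The sketch below walks a tiny, hypothetical route list in plain Python. The routes and helper functions are invented for illustration and are not the real air-routes data.

```python
# Hypothetical toy route list of (source, destination) pairs.
routes = [('AUS', 'ATL'), ('SLC', 'ATL'), ('LAS', 'SLC'), ('DEN', 'SLC')]

def out_neighbors(code):
    """Airports reachable from code by one outgoing route."""
    return [dst for src, dst in routes if src == code]

def in_neighbors(code):
    """Airports that have a route into code."""
    return [src for src, dst in routes if dst == code]

paths = []
for first in out_neighbors('AUS'):            # outer repeat, times(1)
    for second in in_neighbors(first):        # inner repeat, first pass
        for third in in_neighbors(second):    # inner repeat, second pass
            paths.append(['AUS', first, second, third])

for p in paths:
    print('{} -> {} <- {} <- {}'.format(*p))
```

Note that AUS itself is found as an incoming neighbor of ATL, but that branch ends because the toy data has no routes into AUS.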

Named repeat steps

As well as being nested, each repeat step can now be given an optional name. This allows it to be referred to later inside of a loops step. The example below shows named repeat steps being used. In this particular case that naming could have been omitted, but this demonstrates the capability. The ability to name repeat steps is intended for cases where those steps are also being nested.

In [59]:paths = (g.V('3').repeat('r1',__.out().simplePath()).
		until(__.loops('r1').is_(3)).
		path().by('code').
		limit(3).
		toList())
		for p in paths:
		    line = '{} -> {} -> {}'.format(p[0],p[1],p[2])
		    print(line)

		AUS -> ATL -> BNA
		AUS -> ATL -> BNA
		AUS -> ATL -> BNA

Non-numeric comparisons

Before TinkerPop 3.4, the min and max steps could only be applied to numeric values. They can now be applied to anything that is considered “comparable” such as text strings. This is a little simpler than having to order a result set and select the first or last value.

In [67]:g.V().hasLabel('continent').values('desc').order().toList()
Out[67]:['Africa',
		'Antarctica',
		'Asia',
		'Europe',
		'North America',
		'Oceania',
		'South America']
In [69]:g.V().hasLabel('continent').values('desc').order().limit(1).next()
Out[69]:'Africa'
In [70]:g.V().hasLabel('continent').values('desc').order().tail(1).next()
Out[70]:'South America'
In [71]:g.V().hasLabel('continent').values('desc').min().next()
Out[71]:'Africa'
In [72]:g.V().hasLabel('continent').values('desc').max().next()
Out[72]:'South America'
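
Python's built-in min and max behave in a similar way on strings, which may make the idea more familiar. The sketch below uses the continent names from the output above. It is plain Python, and the exact ordering rules for text in Gremlin may differ from Python's in edge cases, so treat it as an analogy only.

```python
continents = ['Africa', 'Antarctica', 'Asia', 'Europe',
              'North America', 'Oceania', 'South America']

# Ordering the whole list and picking an end...
first_by_sort = sorted(continents)[0]
last_by_sort = sorted(continents)[-1]

# ...gives the same answers as asking for min and max directly.
first_by_min = min(continents)
last_by_max = max(continents)
print(first_by_min, '|', last_by_max)
```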

Changes to order

The previous Order.incr and Order.decr enumerations are now deprecated in favor of Order.asc and Order.desc. This change makes Gremlin’s terminology more consistent with other database query languages. These changes were released before TinkerPop 3.4, but are now also supported by Amazon Neptune.

In [90]:g.V().has('region','US-NM').values('city').order().by(Order.asc).toList()
Out[90]:['Albuquerque',
		'Carlsbad',
		'Clovis',
		'Farmington',
		'Hobbs',
		'Los Alamos',
		'Roswell',
		'Santa Fe',
		'Silver City']
In [68]:g.V().has('region','US-NM').values('city').order().by(Order.desc).toList()
Out[68]:['Silver City',
		'Santa Fe',
		'Roswell',
		'Los Alamos',
		'Hobbs',
		'Farmington',
		'Clovis',
		'Carlsbad',
		'Albuquerque']
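
If you keep older queries as text, the rename is mechanical. The sketch below is a hypothetical plain-Python helper that rewrites the deprecated names in a query string. It is a simple text substitution written for illustration and is not part of Gremlin Python.

```python
# Deprecated token -> replacement, as described above.
RENAMED_ORDER_TOKENS = {'Order.incr': 'Order.asc', 'Order.decr': 'Order.desc'}

def modernize_order_tokens(query_text):
    """Hypothetical helper: swap deprecated Order names for the current ones."""
    for old, new in RENAMED_ORDER_TOKENS.items():
        query_text = query_text.replace(old, new)
    return query_text

old_query = "g.V().values('city').order().by(Order.decr)"
new_query = modernize_order_tokens(old_query)
print(new_query)
```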

Changes to bulkSet

TinkerPop 3.4 adds bulkSet as a GraphSON type instead of coercing it to a List type. Before TinkerPop 3.4, the query results were serialized as flattened lists. Older Gremlin clients may not be able to recognize the change. The TinkerPop 3.4 BulkSet documentation also calls out the details.
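
One way to picture the difference is to compare a flattened list with a structure that stores each distinct value once alongside a count. The sketch below uses Python's collections.Counter as a rough analogy. It does not show how any particular Gremlin client represents a BulkSet, and the sample values are invented.

```python
from collections import Counter

# Flattened form: repeated values appear once per occurrence.
flattened = ['US', 'US', 'UK', 'US']

# Bulked form: each distinct value is stored once with its count.
bulked = Counter(flattened)

# The flattened form can be recovered from the counts (order is not preserved).
expanded = sorted(bulked.elements())
print(dict(bulked))
print(expanded)
```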

Conclusion

We are excited to support the Apache TinkerPop 3.4 release in Amazon Neptune and highly encourage you to create a cluster, as mentioned in the steps above, and run through the examples. Let us know your feedback through the comments in this post or through our Amazon Neptune Discussion Forum.

 

 


About the Author

Kelvin Lawrence is a Principal Data Architect in the Database Services Customer Advisory Team focused on Amazon Neptune and many other related services. He has been working with graph databases for many years, is the author of the book “Practical Gremlin” and is a committer on the Apache TinkerPop project.