Exploring the feature-packed 1.2.1.0 release for Amazon Neptune

In this post, we describe all the features that have been released as part of the recent 1.2.1.0 engine update to Amazon Neptune.

Amazon Neptune is a fast, reliable, and fully managed graph database service for building and running applications with highly connected datasets, such as knowledge graphs, fraud graphs, identity graphs, and security graphs. Neptune provides developers the most choice for building graph applications with three open graph query languages: openCypher, Apache TinkerPop Gremlin, and the World Wide Web Consortium’s (W3C) SPARQL 1.1.

Neptune announced the general availability of engine release 1.2.1.0 on June 13, 2023. With this release, you can benefit from a variety of new features and improvements, including support for Apache TinkerPop 3.6.2, access to R6i instances, a graph summary API, slow query logging capabilities, additional openCypher language functions, and improved performance.

Apache TinkerPop 3.6.2

Neptune support for Apache TinkerPop 3.6.x introduces new Gremlin steps, as well as updates to existing options and modulators.

mergeV() and mergeE()

For workloads that use upsert-like functionality for vertices and edges, prior to TinkerPop 3.6.x you would use the fold().coalesce(unfold(), ...) pattern to determine whether an object exists, create it if it doesn’t, and then update it as necessary. The mergeV() and mergeE() steps simplify this process for create-if-not-exists queries.

As an example of mergeV(), consider the fold-coalesce-unfold pattern for upserting a Vertex. The following query checks the existence of a vertex labelled with ‘airport’ with a ‘code’ property value of ATL, and creates it if it doesn’t exist:

g.V().has('airport','code','ATL').
fold().
coalesce(unfold(),
addV('airport').property(T.id, '215').property('code','ATL')).
valueMap(true)
==>{code=[ATL], id=215, label=airport}

Consider the same example using mergeV() in 3.6.x:

g.mergeV([(T.label): 'airport', (T.id): '215', code: 'ATL']).valueMap(true)
==>{code=[ATL], id=215, label=airport}

This makes your code easier to both read and write, and allows Neptune to better optimize for mutation performance. Multi-label support for the mergeV() step has also been added, simplifying the process of applying multiple labels to graph objects during object creation and updates.

g.mergeV([(T.label): ['airport', 'city', 'venue'], (T.id): '215', code: 'ATL']).valueMap(true)
==>{code=[ATL], id=215, label=[airport, city, venue]}

If you have a requirement to set different properties during an object’s lifecycle, for example, when it is created, or when it is updated, you can also use the mergeV() and mergeE() steps combined with the onCreate and onMatch options.

The following is an example using the mergeV() step to set the create_date property when the object is first created, and the update_date property when it is updated.

g.mergeV([(T.label): ['airport'], (T.id): '215']).
option(onCreate, [create_date: '2023-07-20']).
option(onMatch, [update_date: '2023-07-20']).
valueMap()

The following is an example of using the mergeE() step to set the distance property to one value when the edge is created and to another value when it is matched:

g.mergeE([(T.id): 'e-1']).
option(onCreate, [(from): '215', (to): '216', distance: 100]).
option(onMatch, [distance: 200]).
valueMap()
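
To make the create-or-match behavior concrete, the following is a minimal plain-Python model of the upsert semantics described above. It is not Gremlin and does not talk to Neptune; the in-memory graph dictionary and the merge_vertex helper are hypothetical names used only for illustration.

```python
# Plain-Python model of mergeV()-style upsert semantics (not Gremlin itself).
# The in-memory "graph" dict and merge_vertex() are hypothetical illustrations.

def merge_vertex(graph, vertex_id, label, on_create=None, on_match=None):
    """Create the vertex if it is missing, otherwise update the existing one."""
    if vertex_id in graph:
        # Match branch: only the onMatch properties are applied.
        graph[vertex_id].update(on_match or {})
    else:
        # Create branch: the search criteria plus the onCreate properties.
        graph[vertex_id] = {"label": label, **(on_create or {})}
    return graph[vertex_id]


graph = {}

# First call: the vertex does not exist yet, so the create branch runs.
first = dict(merge_vertex(graph, "215", "airport",
                          on_create={"create_date": "2023-07-20"},
                          on_match={"update_date": "2023-07-21"}))

# Second call: the vertex now exists, so the match branch runs.
second = dict(merge_vertex(graph, "215", "airport",
                           on_create={"create_date": "2023-07-20"},
                           on_match={"update_date": "2023-07-21"}))

print(first)
print(second)
```

The first call sets only create_date, and the second call adds update_date to the same vertex, which mirrors the onCreate and onMatch options in the mergeV() example.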

element()

The element() step allows you to traverse from a property back to its parent element, be that a Vertex or Edge. For more information, refer to the reference documentation on element. The following is an example of using the element() step:

g.V('215').properties('city').element()

The advantage of using the element() step is that it provides a convenient way of traversing back to the parent object rather than having to manually track it, resulting in more readable queries.

fail()

You can use the new fail() step if you need to halt your query if a specific condition is met. fail() immediately stops the traversal and throws an exception with a provided message. This is useful when debugging queries, or for providing better exception reporting when an unknown scenario has been reached. The following is an example of using the fail() step to throw an exception when the vertex for ‘Kevin’ doesn’t exist:

g.mergeV([(T.id): 'Kevin', (T.label): 'person']).
option(onCreate, fail('Person Kevin should exist')).
option(onMatch, [age: 21])

TextP.regex()

The regex predicate was added to TextP to provide a mechanism enabling you to build predicates that filter on string values using regular expressions:

g.V().values('code')
==>AUS
==>ATL
==>JFK
==>LAX
==>LHR

The following is an example using the TextP.regex predicate to find airport vertices with a code property starting with the letter L:

g.V().has('airport', 'code', TextP.regex('^L')).values('code')
==>LAX
==>LHR

In addition to supporting the regex() predicate, Neptune also supports notRegex(). This determines if a string value has no match with the specified regular expression pattern:

g.V().has('airport', 'code', TextP.notRegex('^L')).values('code')
==>AUS
==>ATL
==>JFK

For more examples of using regular expressions, refer to the online documentation.
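
If you want to check a pattern before sending it to the database, the filtering above can be approximated locally. The following sketch uses Python's re module on the same airport codes. It assumes partial-match behavior, which the '^L' example implies, and the helper names are hypothetical. Regular expression syntax differs in some details between engines, so treat this as a rough check only.

```python
import re

# Airport codes from the example above.
codes = ["AUS", "ATL", "JFK", "LAX", "LHR"]


def regex_filter(values, pattern):
    """Keep values where the pattern matches somewhere in the string."""
    compiled = re.compile(pattern)
    return [v for v in values if compiled.search(v)]


def not_regex_filter(values, pattern):
    """Keep values where the pattern matches nowhere in the string."""
    compiled = re.compile(pattern)
    return [v for v in values if not compiled.search(v)]


print(regex_filter(codes, "^L"))
print(not_regex_filter(codes, "^L"))
```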

property(Map)

Prior to the Apache TinkerPop 3.6 release, updating properties in Gremlin required chaining multiple steps together. The following is an example of this pattern:

g.addV('airport')
.property('code', 'LHR')
.property('name', 'London Heathrow Airport')

In many cases, applications send updates to the database as a collection of property updates, or a Map. Previously, to update an object with all the properties in the collection, you had to iterate over it, either in application code or within Gremlin itself. Now that the property step accepts a Map, you can write more readable code without manually iterating over the collection before updating:

g.addV('airport')
.property([code: 'LHR',
name: 'London Heathrow Airport'])

For more information on these updates and upgrade considerations, refer to the Exploring new features in Apache TinkerPop 3.6.x in Amazon Neptune blog post by Stephen Mallette, a long-time contributor to the Apache TinkerPop project and member of the Amazon Neptune service team.

R6i instances

Continuing our goal of delivering better price performance for customers, engine version 1.2.x.x now supports R6i instances. R6i instances are powered by 3rd generation Intel Xeon Scalable processors, and are the 6th generation of Amazon Elastic Compute Cloud (Amazon EC2) memory optimized instances, designed for memory-intensive workloads.

R6i instances provide up to 50 Gbps of networking speed, twice that of existing R5 instances, and up to 20% higher memory bandwidth per vCPU compared to R5 instances. The R6i instances also deliver up to 15% better price-performance when compared to previous generation R5 instances, to help power your graph use cases. They are also priced at parity with the R5 instances.

Graph summary API

Customers asked us for a quick and simple way to retrieve the metadata about their Neptune graphs, such as a list of distinct vertex labels and distinct edge labels for property graphs or a count of subjects and predicates for their RDF graphs. This information is useful for providing a high-level view of the domain information to users, estimating the size of a graph, efficiently indexing data when running ETL (extract, transform, and load) jobs, or plugging into visualization and business intelligence (BI) applications powered by Neptune. With the graph summary API, you can send HTTP GET requests to get a report from the following endpoints:

https://your-neptune-host:port/rdf/statistics/summary

https://your-neptune-host:port/pg/statistics/summary

In addition, Neptune Notebooks now support the new %summary and %statistics magics that you can use to retrieve graph summary information. Neptune automatically generates this API response as part of the statistics feature of the Neptune alternative query engine (DFE), which uses instance resources such as CPU cores, memory, and I/O more efficiently than the original Neptune engine. Just like DFE statistics, summary API responses are recomputed whenever more than 10% of the data in your graph has changed or when the latest statistics are more than 10 days old. In addition, you can use the API to manually trigger a statistics update right before retrieving the summary. For more information, refer to Getting a quick summary report about your graph.
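
As a rough illustration of calling the summary endpoints shown earlier, the following Python sketch builds the URL and sends the GET request using only the standard library. The helper names are hypothetical. The sketch assumes the cluster is reachable from where the code runs, that IAM database authentication is not enabled (otherwise the request would need to be signed), and that the cluster listens on port 8182.

```python
import json
import urllib.request


def summary_url(host, port, graph_type):
    """Build the graph summary URL; graph_type is 'pg' or 'rdf'."""
    if graph_type not in ("pg", "rdf"):
        raise ValueError("graph_type must be 'pg' or 'rdf'")
    return f"https://{host}:{port}/{graph_type}/statistics/summary"


def fetch_summary(host, port, graph_type, timeout=10):
    """Send the HTTP GET request and decode the JSON response body."""
    url = summary_url(host, port, graph_type)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Only the URL is built here; call fetch_summary() from inside your VPC.
print(summary_url("your-neptune-host", 8182, "pg"))
```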

Slow query logging

Neptune customers run millions of queries every day to derive insights from connections in their data. Many customers have applications that generate queries dynamically in response to user interactions. In such cases, customers asked us for increased visibility into the performance of queries that take longer than expected. To meet this requirement, we added support for slow query logs. You can now identify slow-running queries and log their key performance indicators, such as query runtime, waiting time in queue, index scan details, memory stats, and response codes, to Amazon CloudWatch Logs.

Slow query logs are disabled by default, so to enable this functionality you must update the neptune_enable_slow_query_log database cluster parameter. To do so, change this setting to info or debug. The info setting logs a few useful attributes of each slow-running query, whereas the debug setting logs all available attributes.

To set the threshold that is used to identify slow running queries, you must set the neptune_slow_query_log_threshold database cluster parameter. This is the number of milliseconds after which a running query is considered slow and is then logged. The default value is 5000 milliseconds (5 seconds).

The neptune_enable_slow_query_log and neptune_slow_query_log_threshold database cluster parameters are both dynamic parameters, so changes are applied to your Neptune database almost immediately after they’re made, without requiring a reboot.

You can use the AWS Management Console, the modify-db-cluster-parameter-group AWS CLI command, or the ModifyDBClusterParameterGroup API management function to make changes to the database parameters.

The following is an example of how to update a custom database cluster parameter group using the AWS CLI:

aws neptune modify-db-cluster-parameter-group \
    --db-cluster-parameter-group-name my_custom_parameter_group \
    --parameters '[
        {
            "ParameterName": "neptune_enable_slow_query_log",
            "ParameterValue": "debug",
            "ApplyMethod": "immediate"
        },
        {
            "ParameterName": "neptune_slow_query_log_threshold",
            "ParameterValue": "5000",
            "ApplyMethod": "immediate"
        }
    ]'
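
The same change can also be sketched through the ModifyDBClusterParameterGroup API using the AWS SDK for Python (Boto3). In the following sketch, the helper names and the parameter group name are hypothetical. It assumes Boto3 is installed and AWS credentials are configured. Boto3 is imported lazily so that the parameter list can be built and inspected without the SDK.

```python
def slow_query_log_parameters(mode="debug", threshold_ms=5000):
    """Build the parameter list that enables slow query logging."""
    if mode not in ("info", "debug"):
        raise ValueError("mode must be 'info' or 'debug'")
    return [
        {
            "ParameterName": "neptune_enable_slow_query_log",
            "ParameterValue": mode,
            "ApplyMethod": "immediate",
        },
        {
            "ParameterName": "neptune_slow_query_log_threshold",
            "ParameterValue": str(threshold_ms),  # milliseconds
            "ApplyMethod": "immediate",
        },
    ]


def apply_parameters(group_name, parameters):
    """Apply the parameters to a custom DB cluster parameter group."""
    import boto3  # imported here so building the list needs no AWS SDK

    client = boto3.client("neptune")
    return client.modify_db_cluster_parameter_group(
        DBClusterParameterGroupName=group_name,
        Parameters=parameters,
    )


params = slow_query_log_parameters(mode="debug", threshold_ms=5000)
print(params)
# apply_parameters("my_custom_parameter_group", params)  # requires AWS access
```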

The following is an example of how to modify an existing database cluster to enable publishing of the slow query logs to CloudWatch using the AWS CLI:

aws neptune modify-db-cluster --region your_region \
    --db-cluster-identifier my_db_cluster_id \
    --cloudwatch-logs-export-configuration '{"EnableLogTypes":["slowquery"]}'

After enabling publishing of slow query logs to CloudWatch, the database cluster will be in a pending maintenance state. The following AWS CLI command applies the changes immediately:

aws neptune apply-pending-maintenance-action \
    --resource-identifier arn:aws:rds:your_region:123456789012:cluster:your_cluster_id \
    --apply-action system-update --opt-in-type immediate

The following is an example of a slow query log using the debug mode:

{
  "requestResponseMetadata": {
    ...
  },
  "queryStats": {
    "query": "gremlin=g.V().has('code','AUS').repeat(out().simplePath()).until(has('code','AGR')).path().by('code').limit(20).fold()",
    "queryFingerprint": ...,
    "queryLanguage": "Gremlin"
  },
  "memoryStats": {
    "allocatedPermits": 20,
    "approximateUsedMemoryBytes": 14838
  },
  "queryTimeStats": {
    "startTime": "23/02/2023 11:42:52.657",
    "overallRunTimeMs": 2249,
    "executionTimeMs": 2229,
    "serializationTimeMs": 13
  },
  "statementCounters": {
    "read": 69979
  },
  "transactionCounters": {
    "committed": 1
  },
  "concurrentExecutionStats": {
    "acceptedQueryCountAtStart": 1
  },
  "queryBatchStats": {
    "queryProcessingBatchSize": 1000,
    "querySerialisationBatchSize": 1000
  },
  "storageCounters": {
    "statementsScannedInAllIndexes": 69979,
    "statementsScannedSPOGIndex": 44936,
    "statementsScannedPOGSIndex": 4,
    "statementsScannedGPSOIndex": 25039,
    "statementsReadInAllIndexes": 68566,
    "statementsReadSPOGIndex": 43544,
    "statementsReadPOGSIndex": 2,
    "statementsReadGPSOIndex": 25020,
    "accessPathSearches": 27,
    "fullyBoundedAccessPathSearches": 27,
    "dictionaryReadsFromValueToIdTable": 10,
    "dictionaryReadsFromIdToValueTable": 17,
    "rangeCountsInAllIndexes": 4
  }
}

For more information on each of the attributes included within the query log report, refer to Query attributes logged in debug mode.
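
Once the logs are in CloudWatch, you may want to pull a few headline numbers out of each entry. The following Python sketch does this for a trimmed copy of the debug entry above, using field names and values taken from that example. The summarize helper is hypothetical, and treating the remainder of the overall runtime as time spent outside execution and serialization is an interpretation rather than something the log states.

```python
import json

# A trimmed copy of the debug-mode entry shown above.
log_entry = json.loads("""
{
  "queryTimeStats": {
    "overallRunTimeMs": 2249,
    "executionTimeMs": 2229,
    "serializationTimeMs": 13
  },
  "storageCounters": {
    "statementsScannedInAllIndexes": 69979,
    "statementsReadInAllIndexes": 68566
  }
}
""")


def summarize(entry):
    """Extract a few headline metrics from a slow query log entry."""
    times = entry["queryTimeStats"]
    storage = entry["storageCounters"]
    # Assumption: whatever is left after execution and serialization
    # is time spent elsewhere in the request.
    other_ms = (times["overallRunTimeMs"]
                - times["executionTimeMs"]
                - times["serializationTimeMs"])
    read_ratio = (storage["statementsReadInAllIndexes"]
                  / storage["statementsScannedInAllIndexes"])
    return {
        "overall_ms": times["overallRunTimeMs"],
        "other_ms": other_ms,
        "read_ratio": read_ratio,
    }


summary = summarize(log_entry)
print(summary)
```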

Platform improvements

As part of this release, a new enableInterContainerTrafficEncryption parameter was added to all Neptune ML APIs. You can use it to enable or disable inter-container traffic encryption in training and hyperparameter tuning jobs.

openCypher improvements

Further improvements and bug fixes focusing primarily on openCypher have also been made, improving language parity with the openCypher v9 specification. In addition, we have added new functions and improved performance. Support was added for aggregation functions such as percentileDisc() (discrete percentile) and stDev() (standard deviation), as well as the trigonometric functions acos(), atan(), asin(), sin(), tan(), pi(), degrees(), radians(), cos(), cot(), and atan2(). The randomUUID() function, used to generate random UUIDs, and the epochMillis() function, used to convert a datetime to epoch milliseconds, were also added.

The following is an example of how to use the new epochMillis function, converting a property value stored as datetime:

MATCH (n)-->(e)-->(m)
RETURN epochMillis(n.last_update_date_time)
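
For readers unfamiliar with epoch milliseconds, the following Python sketch shows the equivalent conversion outside the database. It illustrates the general concept of milliseconds elapsed since 1970-01-01T00:00:00Z and is not Neptune's implementation; the epoch_millis helper is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_millis(dt):
    """Whole milliseconds since the Unix epoch for a timezone-aware datetime."""
    if dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    # Integer division by a timedelta avoids floating-point rounding.
    return (dt - EPOCH) // timedelta(milliseconds=1)


release_day = datetime(2023, 6, 13, tzinfo=timezone.utc)
print(epoch_millis(release_day))  # 1686614400000
```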

The following is an example of using the randomUUID function to create a random id for a vertex:

WITH [
({ name: 'Kevin', id: randomUUID() }),
({ name: 'John', id: randomUUID() })
] AS p
UNWIND p as p2
RETURN p2

Using randomUUID() to generate the object ID, rather than relying on Neptune to generate it, means you can use the property value in subsequent parts of your query.

Further improvements have been made to how Neptune processes openCypher queries, as well as how it optimizes CPU usage during query execution. Specific query patterns have also seen performance improvements, such as:

Queries containing multiple update clauses
Queries that use parameterization for map or list properties
Queries containing the WITH clause
Queries with filters

The following is an example of a query containing multiple update clauses:

MERGE (n {name: 'John'})
MERGE (m {name: 'Jim'})
MERGE (n)-[:knows {since: 2023}]->(m)

The following is an example of a query with a filter using multi-hop patterns containing cycles:

MATCH (n)-->()-->()-->(m) RETURN n, m

The following is an example of a query with list/map injection using parameterization:

idList = ['person-1', 'person-2', 'person-3', 'person-4']

UNWIND $idList as id 
MATCH (n {`~id`: id}) 
RETURN n.name
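
One way to send a parameterized query like this from an application is to pass the query text and the parameter values separately. The following Python sketch builds a form-encoded request body. It assumes an openCypher HTTPS endpoint that accepts form fields named query and parameters, with the parameters encoded as JSON. The helper name is hypothetical, and the sketch only builds the body without sending it.

```python
import json
from urllib.parse import parse_qs, urlencode

QUERY = "UNWIND $idList AS id MATCH (n {`~id`: id}) RETURN n.name"


def opencypher_form_body(query, parameters):
    """Form-encode the query text and its JSON-encoded parameters."""
    return urlencode({"query": query, "parameters": json.dumps(parameters)})


body = opencypher_form_body(
    QUERY,
    {"idList": ["person-1", "person-2", "person-3", "person-4"]},
)
print(body)
```

Keeping the values out of the query text means the same query string is reused for every list of IDs, and the values never need to be escaped by hand.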

This update also includes SPARQL performance improvements for Concise Bounded Description (CBD) queries, and for queries containing numerous static inputs provided in the VALUES clause. For example:

SELECT ?n 
WHERE { 
VALUES (?name) { ("John") ("Jim") 
... many values ... } 
?n a ?n_type . 
?n ?name . }

Conclusion

At AWS, we always work backwards from the customer, and this latest release from Amazon Neptune delivers numerous customer-requested enhancements for building graph applications. Beyond the features listed, you can find a complete list of improvements and fixes in the release notes. Here are a few ways to get started with this release:

Create your first Neptune cluster as part of the AWS Free Tier
Upgrade your existing Neptune cluster to take advantage of the latest features
Use the open source graph-explorer application to quickly visualize and explore graphs on Neptune
Run the open source graph-notebook library on Jupyter or JupyterLab notebooks to interactively query and build graph applications on Neptune

Leave your questions in the comments section.

About the authors

Joy Wang has been a Senior Product Manager on the Amazon Neptune team since 2020. She is passionate about making graph databases easy to learn and use, and empowering users to get the most insights out of their highly connected data.

Navtanay Sinha is a Senior Product Manager on the Amazon Neptune team. He works with graph technologies to help Amazon Neptune customers fully realize the potential of their graph database.

Andrea Nassisi is a Principal Product Manager on the Amazon Neptune team. His passion is democratizing technologies. His focus in the team is the openCypher language implementation, and enabling any developer to use machine learning on graphs.

Kevin Phillips is a Neptune Specialist Solutions Architect working in the UK at Amazon Web Services. He has 18 years of development and solutions architectural experience, which he uses to help support and guide customers.

AWS Database Blog