Find and link similar entities in a knowledge graph using Amazon Neptune, Part 2: Vector similarity search

A knowledge graph combines data from many sources and links related entities. Because a knowledge graph is a gathering place for connected data, we expect many of its entities to be similar. When we find that two entities are similar to each other, we can materialize that fact as a relationship between them.

In this two-part series, we demonstrate how to find and link similar entities in Amazon Neptune, a managed graph database service. In part 1, we used lexical search to find entities with similar text. In this post, we use semantic search to find entities with similar meaning. In both cases, we demonstrate a convention to link similar entities using an edge.

Semantic search

A semantic search is the ability to find entities with similar meaning. This approach uses a machine learning (ML) model to create a numeric vector embedding of the entity. A vector is a list of numbers, each of which encodes some aspect of the entity’s meaning. The model creates similar vectors for entities with similar meaning. If we plot vectors on a graph, the closer those vectors are to each other, the more similar they are. Vector databases enable us to store vectors and run distance or similarity queries on them.

To demonstrate semantic search, we use a publication dataset, specifically the DBLP ACM dataset, which contains citation data from computer science publication sources such as DBLP and Association for Computing Machinery (ACM). Our graph data model is shown in the next diagram.

We maintain Publication nodes and Author nodes. We link a publication to its authors using the hasAuthor edge. Publication nodes have several properties, including title, venue, year, and source (whose value is either DBLP or ACM). We associate an embedding with each publication to facilitate comparison of publications. There are many duplicate publications, most commonly publications with similar titles and authors from different sources. The preceding figure shows publications with similar titles but different sources. They share the same three authors.

In the next figure, we see two publications with titles that are not lexically similar, but have similar subject matter. Tetanus and lockjaw refer to the same disease; and emerge and ER are synonymous. For this reason, we expect their embeddings to be similar. To link the publications, we perform additional verification, including a comparison of authors.

DBLP/ACM dataset with semantically similar publications

Note that the overall dataset is public, but these two publications are mock publications. We add them to the dataset for testing.

To build this, we use Neptune Analytics, a memory-optimized graph engine for graph analytics. We first generate the embeddings using a pre-trained Bidirectional Encoder Representations (BERT) sentence transformer model. Then we populate a Neptune Analytics graph with publication data plus embeddings. We use an Amazon SageMaker notebook instance as a client to query the Neptune graph to resolve entities.

Our design

To run the example, you need a Neptune Analytics graph and a notebook instance. Follow along in FindAndLink-VSS.ipynb to explore the semantic similarity of publications. It includes setup instructions.

Creating embeddings

Let’s look at some highlights. First, we use the following code to create embeddings:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# sentences - one or list of text sentences
def get_embeddings(sentences):
   embeddings = model.encode(sentences)
   return embeddings

# convert embedding to semi-colon sep list for Neptune load format
def get_str_embedding(embedding):
   return ';'.join([str(x) for x in embedding])

The get_embeddings function takes as input a snippet (or a list of snippets) of text and returns a vector (or, if a list of snippets was passed, a list of vectors). It uses the Sentence Transformers library. A snippet doesn’t need to be a grammatical sentence. We use it for publication titles. The get_str_embedding function converts the vector to the stringified format required to load into Neptune Analytics: a semicolon-separated list of numeric items in the vector.

We create three CSV files to load to the Neptune graph: publications, authors, and hasAuthor edges. The following is an excerpt of the publication CSV.

CSV file with embeddings

Each row in the publication CSV represents a single publication. Notice the rightmost column is a stringified embedding, created using a sentence transformer, of the publication title.

topKByNode

Now we can compare publication nodes to each other using the topKByNode function. Neptune Analytics supports OpenCypher query language. The following is an OpenCypher query that calls topKByNode:

// 1. Find candidate publication by its ID. Also find related authors.
// We’ll refer to the candidate as p and the related author as pa.
MATCH (p:Publication)-[:hasAuthor]->(pa:Author)
WHERE id(p)=$candidate

// 2. Find publication nodes whose embedding is similar to that of p.
// Get back the matching node and score
CALL neptune.algo.vectors.topKByNode(p)
YIELD node, score

// 3. Pass along candidate ID, title, author names, matching node, and score
WITH id(p) as candidate, p.title as candidateTitle, collect(pa.name) as candidateAuthors, node, score

// 4. Get related authors, na, of the matching node
MATCH (node)-[:hasAuthor]->(na:Author)

// 5. Return candidate and matching data for comparison
RETURN candidate, id(node) as matched, candidateTitle, node.title as matchedTitle, score,
candidateAuthors, collect(na.name) as matchedAuthors

// 6. Order by score. Lower score means closer match.
// Ascending order by default.
ORDER by score

As the comments indicate, there are six steps:

Find the candidate publication (the publication we want to compare with others) and its related authors.
Call topKByNode to find publications with embeddings similar to that of the candidate.
Keep track of candidate and matching publications, as well as the similarity score.
Find related authors for the matching nodes.
Return candidate publications, candidate authors, matching publications, matching authors, and score.
Order the results by score.

Vector similarity alone isn’t enough to compare publications. We also need to check for common relationships in the graph. For example, if we look for publications similar to Design and findings and next steps for Java implementation of Telegraph Dataflow, we find Java support for data-intensive systems: experiences building the telegraph dataflow system. But these two publications have different authors, so they are unlikely matches.

VSS search - semantically similar but different authors

However, if we look for publications similar to Tetanus cases increase Emerg units in September, we find ER staff reports lockjaw diagnoses rose 5 percent after Labor Day with the same author. This doesn’t ensure they are the same publication, but that inference has greater probability.

VSS results with similar title and authors

topKByEmbedding

We can use the topKByEmbedding function to find publications similar to a search term. For example, to find publications like Tetanus cases September, we first generate the embedding for that term. We use the same sentence transform as we did for the node embeddings:

embparams={'emb': get_embeddings('Tetanus cases September').tolist()}

Then we search for publications with similar embeddings using topKByEmbedding. For those matching nodes, we also find related authors:

WITH $emb as emb

CALL neptune.algo.vectors.topKByEmbedding(emb)
YIELD embedding, node, score
WITH node, score, embedding

MATCH (node)-[:hasAuthor]->(a:Author)
RETURN id(node) as matched, node.title as matchedTitle, score, collect(a.name) as authors
ORDER by score

The top matches are, as we discussed earlier, semantically similar but lexically dissimilar.

Close semantic match even if lexically different

The notebook also demonstrates how to create a match relationship between publications that we decide are duplicates because they are semantically similar and have common authors. To manually link nodes http://example.org/pubgraph/ACM/604282 and http://example.org/pubgraph/DBLP/journals/sigmod/ShahMFH01 using a matches edge, use the following OpenCypher query:

MATCH(p1)
WHERE id(p1)='http://example.org/pubgraph/ACM/604282'

MATCH(p2)
WHERE id(p2)='http://example.org/pubgraph/DBLP/journals/sigmod/ShahMFH01'

CREATE (p1)-[e:matches {matchSource: "user", matchAlg: "vss"}]->(p2)
RETURN e

This follows the same linking convention as patients.

Semantic search in OpenSearch Service

Amazon OpenSearch Service also provides semantic search. You can load embeddings into an k-nearest neighbors index in an OpenSearch domain and perform vector similarity queries against that index. In an upcoming post, we use this capability in conjunction with Neptune.

Additional approaches

Alternatives to search include the following:

Deduplication groups similar entities in batches, typically as part of an extract, transform, and load (ETL) process. Deduplication is often used in Master Data Management (MDM) to help find the golden record. It could be used to discover and link duplicate entities prior to loading them into the Neptune database. Amazon Entity Resolution is an AWS service providing deduplication.
Graph similarity algorithms, available in Neptune Analytics, find similar graph nodes based on intersecting relationships. The following sample notebook demonstrates algorithms such as Jaccard Similarity in an air routes graph.
Neptune ML uses data currently in the graph to train an ML model in SageMaker to make further predictions about the graph. Notably, Neptune ML offers link prediction, or the prediction of an edge between nodes. You could use this capability to find similar entities, for example, by predicting a link between them.
Because the Neptune ML model generates embeddings, you can perform a vector similarity search, in a Neptune Analytics graph, on those embeddings! Significantly, the embeddings are graph aware. The embedding for a node represents not just the node’s properties but also the node’s neighborhood in the graph. Nodes with similar GNN embeddings are graphically similar.

Clean up

If you set up a Neptune Analytics graph or notebook instance to follow along, delete those resources to avoid further costs. The notebooks have cleanup instructions.

Conclusion

A knowledge graph gathers entities from numerous sources. In this post, we demonstrated the use of vector similarity search in Neptune Analytics to find semantically similar entities. We also showed a conventional way to link similar entities.

In In part 1, we showed how to use lexical search to find similar entities in a knowledge graph running on Neptune. We combined Neptune with OpenSearch Service using a feature called full text search.

Experiment with this method using the notebooks provided. Think about how to more effectively query and draw relationships in your knowledge graph using the approaches discussed, and share your thoughts in the comments section.

About the Author

Mike Havey is a Senior Solutions Architect for AWS with over 25 years of experience building enterprise applications. Mike is the author of two books and numerous articles. His Amazon author page is https://www.amazon.com/Michael-Havey/e/B001IO9JBI.