AWS Database Blog

Building and querying the AWS COVID-19 knowledge graph

This blog post details how to recreate the AWS COVID-19 knowledge graph (CKG) using AWS CloudFormation and Amazon Neptune, and query the graph using Jupyter notebooks hosted on Amazon SageMaker in your AWS account. The CKG aids in the exploration and analysis of the COVID-19 Open Research Dataset (CORD-19), hosted in the AWS COVID-19 data lake. The strength of the graph comes from the connections between scholarly articles, authors, scientific concepts, and institutions. The CKG also helps power the CORD-19 search page.

The AWS COVID-19 data lake is a publicly available, centralized repository of up-to-date and curated datasets on or related to the spread and characteristics of the novel coronavirus (SARS-CoV-2) and its associated illness, COVID-19. For more information, see A public data lake for analysis of COVID-19 data and Exploring the public AWS COVID-19 data lake.

The CKG is built using Neptune, the CORD-19 dataset, and the annotations from Amazon Comprehend Medical. As of April 17, 2020, the CORD-19 dataset consists of over 52,000 scholarly articles, of which 41,000 are full text. The data is sourced from several channels, such as PubMed, bioRxiv, and medRxiv. The dataset continues to grow, and the Allen Institute for AI is diligently working with the wider research community to normalize and improve the quality of the data. The dataset is multidisciplinary, with topics including virology, translational medicine, and epidemiology.

Graph structure

The CKG is a directed graph. The following table summarizes the relationship between nodes and edges. The directed edge goes from the source node to the destination node, with the specified edge weight.

Edge Name | Source Node | Destination Node | Edge Weight
affiliated_with | Author | Institution | None
associated_concept | Paper | Concept | Amazon Comprehend Medical confidence scores
authored_by | Paper | Author | None
cites | Paper | Paper | None
associated_topic | Paper | Topic | Latent Dirichlet Allocation scores
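
To make the schema concrete, the following hypothetical Gremlin-Python snippet (not part of the actual ingestion pipeline) shows how one weighted paper-concept edge could be created, assuming a traversal source g like the one set up later in this post. The SHA code, title, concept, and score values are invented for illustration:

# Hypothetical example: create a Paper node and a Concept node, then
# connect them with an associated_concept edge weighted by a confidence score.
g.addV('Paper').property('SHA_code', 'abc123').property('title', 'Example paper').as_('p')\
 .addV('Concept').property('concept', 'rhinovirus').as_('c')\
 .addE('associated_concept').from_('p').to('c').property('score', 0.87).next()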

The following is an example graph of the CKG. Notice the outgoing connections from Paper nodes to Author and Concept nodes, and from Author nodes to Institution nodes.

A paper (blue nodes) is written by authors (yellow nodes), who are affiliated with specific institutions (green nodes). A paper can have multiple authors, and authors can belong to multiple institutions.

Concept nodes (red nodes) are generated by using Amazon Comprehend Medical Detect Entities V2 to extract medical information such as medical conditions, medicine dosage, anatomical information, treatment procedures, and drugs.
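
For illustration, the following sketch calls the Detect Entities V2 API through boto3 on an invented sentence; each returned entity carries the category and confidence score that back the Concept nodes and their edge weights:

import boto3

# Sketch: extract medical entities with Amazon Comprehend Medical.
# The input text is invented for illustration.
client = boto3.client('comprehendmedical')
response = client.detect_entities_v2(
    Text='The patient was treated with oseltamivir for influenza.'
)
for entity in response['Entities']:
    print(entity['Text'], entity['Category'], round(entity['Score'], 3))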

Topic nodes (not shown in the preceding diagram) are generated by using an extension of the Latent Dirichlet Allocation (LDA) model. This generative model groups documents by their observed content, assigning each document a mixture over topics. For each paper, the model uses the plain-text title, abstract, and body, and ignores tables, figures, and bibliographies.
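
The exact topic model behind the CKG isn't published with this post, but the following toy sketch using scikit-learn's LatentDirichletAllocation shows the general idea: each document receives a mixture over topics.

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for paper titles, abstracts, and bodies.
docs = [
    'viral replication and antiviral drug resistance',
    'hospital transmission and epidemiological modeling',
    'vaccine candidates and immune response in clinical trials',
]
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Each row is one document's mixture over the two topics.
print(lda.transform(counts))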

These nodes enrich the graph by allowing connections between papers through shared authors and institutions, as well as through the themes of the papers themselves.

Solution overview

In the AWS data lake, you can use a provided CloudFormation template to create a CloudFormation stack. A CloudFormation stack is a collection of AWS resources managed as a single unit. For more information, see Working with Stacks.

To build the CKG, you must create a CloudFormation stack from the provided template. The template creates the necessary AWS resources and ingests the data into the graph.

After building the CKG in your AWS account, you can run basic and advanced queries using Gremlin-Python.

Basic queries perform graph exploration and search, with the aim of building intuition and gaining familiarity with querying the graph using Gremlin-Python.

For this post, you perform advanced queries to do the following:

  • Query the CKG to get papers related to a specific concept
  • Determine which paper to read first by ranking the papers by author expertise
  • Determine what paper to read next by creating a related paper recommendation engine using graph queries

Prerequisites

Before you get started, make sure you have the following:

  • Access to an AWS account
  • Permissions to create a CloudFormation stack

Using the CloudFormation template

To configure the graph and sample query notebook in your AWS account, you create a CloudFormation stack using a CloudFormation template.

Launch the following one-click template and use the following parameter when prompted during the stack creation process (leave all other parameters at their default values):

AutoIngestData: True
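
If you prefer creating the stack programmatically rather than through the one-click launch, a boto3 sketch follows; the stack name and template URL here are placeholders, not the actual published values:

import boto3

# Sketch: create the CloudFormation stack from the template.
# StackName and TemplateURL below are placeholders.
cfn = boto3.client('cloudformation')
cfn.create_stack(
    StackName='covid19-knowledge-graph',
    TemplateURL='https://example-bucket.s3.amazonaws.com/ckg-template.yaml',
    Parameters=[{'ParameterKey': 'AutoIngestData', 'ParameterValue': 'True'}],
    Capabilities=['CAPABILITY_IAM'],
)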

The template does the following:

  • Creates a Neptune database cluster.
  • Creates a Virtual Private Cloud (VPC).
    • All Neptune clusters must run inside a VPC, so this template also sets up a VPC with private and public subnets, which gives Neptune access to the internet while protecting it from unauthorized access.
  • Creates an Amazon SageMaker notebook instance and gives it permission to access the Neptune cluster within the VPC.
    • The stack also loads a number of Python libraries into the notebook instance, which allows you to interact with the graph.
    • The notebook instance also contains demo Jupyter notebooks, which demonstrate how to query the graph.
  • Ingests the data into the graph. All the data for this graph is stored in the public AWS COVID-19 data lake.

When the template is complete, you can access the Amazon SageMaker notebook instance and the sample Jupyter notebook. For instructions, see Access Notebook Instances.

Neptune and Gremlin

Neptune is compatible with Apache TinkerPop3 and Gremlin 3.4.1, which means you can use the Gremlin graph traversal language to query a Neptune database instance. Gremlin-Python implements Gremlin within the Python language. For more information about Gremlin, see PRACTICAL GREMLIN: An Apache TinkerPop Tutorial.

Querying Neptune is possible through gremlin-python. For more information, see Accessing the Neptune Graph with Gremlin.
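
For example, the following minimal setup creates the traversal source g used throughout the rest of this post; the endpoint shown is a placeholder for the Neptune cluster endpoint that the CloudFormation stack creates:

from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

# Replace the placeholder with your Neptune cluster endpoint.
endpoint = 'wss://your-neptune-endpoint:8182/gremlin'
conn = DriverRemoteConnection(endpoint, 'g')
g = traversal().withRemote(conn)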

Basic queries

These queries demonstrate how to use gremlin-python in an Amazon SageMaker Jupyter notebook to do basic graph exploration and search. Jupyter notebooks with these queries and more are available in the Amazon SageMaker notebook instance that you created earlier. Because the dataset is constantly evolving, your outputs may vary from the outputs in this post.

Graph exploration

The following queries get the number of nodes and the number of each type of node. To extract all the vertices of the graph, you use g.V() and hasLabel(NODE_NAME) to filter the graph for specific nodes and count() to get the node count. The terminal step next() returns the result.

nodes = g.V().count().next()
papers = g.V().hasLabel('Paper').count().next()
authors = g.V().hasLabel('Author').count().next()
institutions = g.V().hasLabel('Institution').count().next()
topics = g.V().hasLabel('Topic').count().next()
concepts = g.V().hasLabel('Concept').count().next()

print(f"papers: {papers}, authors: {authors}, institutions: {institutions}")
print(f"topics: {topics}, concepts: {concepts}")
print(f"Total number of nodes: {nodes}")

The following code is the output:

papers: 42220, authors: 162928, institutions: 21979
topics: 10, concepts: 109750
Total number of nodes: 336887

The following queries return the number of edges and the number of each type of edge. To extract all the edges of the graph, you use g.E() and hasLabel(EDGE_NAME) to filter the graph for the specific edges and count() to get the edge count. The terminal step next() returns the result.

paper_author = g.E().hasLabel('authored_by').count().next()
author_institution = g.E().hasLabel('affiliated_with').count().next()
paper_concept = g.E().hasLabel('associated_concept').count().next()
paper_topic = g.E().hasLabel('associated_topic').count().next()
paper_reference = g.E().hasLabel('cites').count().next()
edges = g.E().count().next()

print(f"paper-author: {paper_author}, author-institution: {author_institution}, paper-concept: {paper_concept}")
print(f"paper-topic: {paper_topic}, paper-reference: {paper_reference}")
print(f"Total number of edges: {edges}")

The following code is the output:

paper-author: 240624, author-institution: 121257, paper-concept: 2739666
paper-topic: 95659, paper-reference: 134945
Total number of edges: 3332151

Graph search

The following query filters the graph for all the author nodes and uses valueMap(), limit(), and toList() to sample five authors from the CKG and return a list of dictionaries:

g.V().hasLabel('Author').valueMap('full_name').limit(5).toList()

The following code is the output:

[{'full_name': ['· J Wallinga']},
 {'full_name': ['· W Van Der Hoek']},
 {'full_name': ['· M Van Der Lubben']},
 {'full_name': ['Jeffrey Shantz']},
 {'full_name': ['Zhi-Bang Zhang']}]

The following query filters the graph for topic nodes. The query uses hasLabel() and has() to filter for a specific topic, followed by both() to get paper nodes from both incoming and outgoing edges from this topic node, limit() to limit the results, and values() to get a specific property from the topic node. The terminal step toList() returns the results as a list.

g.V().hasLabel('Topic').has('topic', 'virology').both()\
.limit(3).values('title').toList()

The following code is the output:

['Safety and Immunogenicity of Recombinant Rift Valley Fever MP-12 Vaccine Candidates in Sheep',
 'Ebola Virus Neutralizing Antibodies Detectable in Survivors of the Yambuku, Zaire Outbreak 40 Years after Infection',
 'Respiratory Viruses and Mycoplasma Pneumoniae Infections at the Time of the Acute Exacerbation of Chronic Otitis Media']

Advanced queries

The following image illustrates the type of information that you can obtain from the CKG. You can use the CKG to rank a list of papers and to get information related to a single paper (see the following sections for more details).

Ranking papers related to a concept

For this use case, you query the CKG to get papers related to viral resistance and rank the papers by leading authors. To find all the papers on this topic, enter the following code:

concept = 'viral resistance'
papers = g.V().has('Concept', 'concept', concept).both().values('SHA_code').toList()

print(f"Number of papers related to {concept}: {len(papers)}")

The following code is the output:

Number of papers related to viral resistance: 74

To rank the papers, enter the following code:

import pandas as pd
from gremlin_python.process.graph_traversal import __
from gremlin_python.process.traversal import P, Order, Scope, Column

# Aliases for the Gremlin predicates and enums used in the traversal below.
within, local, values, desc = P.within, Scope.local, Column.values, Order.desc

def author_prolific_rank(graph_results, groupby='SHA_code'):
    # Group the candidate papers by SHA code, scoring each paper by the
    # mean number of papers written by its authors.
    ranked = g.V()\
        .has('Paper', 'SHA_code', within(*graph_results))\
        .group().by(__.values(groupby))\
        .by(
            __.out('authored_by').local(__.in_('authored_by').simplePath().count()).mean()
        ).order(local).by(values, desc).toList()[0]

    return ranked

def sha_to_title(sha):
    # Look up a paper's title from its SHA code.
    title_list = g.V().has('Paper', 'SHA_code', sha)\
    .values('title').toList()

    return ''.join(title_list)

author_ranked = author_prolific_rank(papers)
author_ranked_df = pd.DataFrame(
    [author_ranked], index=['Score']
).T.sort_values('Score', ascending=False).head()
author_ranked_df['title'] = author_ranked_df.index.map(sha_to_title)

for row in author_ranked_df.drop_duplicates(subset='title').reset_index().iterrows():
    print(f"Score: {round(row[1]['Score'], 2)}\tTitle: {row[1]['title']}")

The following code is the output:

Score: 27.5    Title: Surveillance for emerging respiratory viruses
Score: 11.45    Title: Third Tofo Advanced Study Week on Emerging and Re-emerging Viruses, 2018
Score: 10.5    Title: The Collaborative Cross Resource for Systems Genetics Research of Infectious Diseases
Score: 10.0    Title: RNA interference against viruses: strike and counterstrike
Score: 9.0    Title: Emerging Antiviral Strategies to Interfere with Influenza Virus Entry

Recommending related papers

For this use case, you want to know what related papers to read next based on the paper you just read. To develop a recommendation engine, you can use graph queries. Papers and concepts are nodes in the graph connected through edges. Concepts are derived through machine learning (ML) by Amazon Comprehend Medical, and each paper-concept edge includes a confidence score conf that indicates how confident the ML system is that concept c appears in paper p.

To determine whether two papers are related to each other, you first define a similarity score S. Paper nodes are connected in the CKG through intermediate concept nodes.

To generate a list of papers related to a particular paper P, you generate the similarity score between P and every other paper in the CKG and rank the results. A paper with a higher similarity score is more related to paper P.

The similarity score between two papers is the weighted sum over all the paths between them. In matrix terms, the score for a candidate paper is the dot product $S = P^{T} P_{c}$, where the paper vector $P$ and the candidate paper vector $P_{c}$ each have size $[N_{\mathrm{concepts}}, 1]$. Each element $e_{i}$ is the edge weight from the paper to concept $i$, or 0 if there is no edge.

The following diagram illustrates an example recommendation framework.

In this diagram, the path between the two papers is through the term rhinovirus. Each paper-concept edge has a confidence score conf.

For every path between the two papers, you multiply the confidence scores. To calculate the similarity score between the two papers, you add up all the confidence score products.
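
As a concrete toy example with invented confidence scores, the calculation reduces to the following:

# Invented confidence scores: concept -> conf for each paper.
paper_p = {'rhinovirus': 0.92, 'pneumonia': 0.81}
paper_c = {'rhinovirus': 0.88, 'influenza': 0.75}

# Multiply the confidence scores along each shared-concept path, then sum.
shared = paper_p.keys() & paper_c.keys()
similarity = sum(paper_p[c] * paper_c[c] for c in shared)
print(round(similarity, 4))  # one path through rhinovirus: 0.92 * 0.88 = 0.8096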

First, get the related papers for the following paper. See the following code:

sha = "f1716a3e74e4e8139588b134b737a7e573406b32"
title = "Title: Comparison of Hospitalized Patients With ARDS Caused by COVID-19 and H1N1"
print(f"Title: {title}")
print(f"Unique ID: {sha}") 

The following code is the output:

Title: Comparison of Hospitalized Patients With ARDS Caused by COVID-19 and H1N1
Unique ID: f1716a3e74e4e8139588b134b737a7e573406b32

Query the graph with the algorithm defined previously. See the following code:

from gremlin_python.process.traversal import Operator

# mult multiplies the sack value by each edge's confidence score along a path.
mult = Operator.mult

rankings = g.withSack(1).V().has('Paper', 'SHA_code', sha)\
.outE('associated_concept').sack(mult).by('score').inV().simplePath()\
.inE('associated_concept').sack(mult).by('score')\
.group().by(__.outV().values('title')).by(__.sack().sum()).toList()
pd.DataFrame(rankings, index=['Score']).T.sort_values('Score', ascending=False).head()

The following table summarizes the output.

Title | Score
Infectious Diseases | 160.095712
Abstracts cont. | 144.739026
Oral presentations | 93.617558
Respiratory Viruses | 75.077273
SIOP ABSTRACTS | 72.091667

Cleaning up

The AWS COVID-19 data lake is hosted for free in Amazon Simple Storage Service (Amazon S3), and standard request charges are waived on the public data lake bucket, so you don't incur any costs for accessing it. However, when you build the CKG in your AWS account using the provided CloudFormation stack, you use Neptune and Jupyter notebooks hosted on Amazon SageMaker, which may incur charges. For more information, see Amazon Neptune pricing and Amazon SageMaker pricing.

To avoid recurring costs, delete the CloudFormation stack when you’re finished. For instructions, see Deleting a Stack on the AWS CloudFormation Console. Be sure to delete the root stack (the stack you created earlier). Deleting the root stack deletes any nested stacks.
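
Programmatically, deleting the root stack is a single call; the stack name below matches the placeholder used earlier in this post:

import boto3

# Deleting the root stack also removes any nested stacks.
cfn = boto3.client('cloudformation')
cfn.delete_stack(StackName='covid19-knowledge-graph')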

Conclusion

You can also build a knowledge graph with CloudFormation templates and visualize the graph using the third-party graph visualization tool Tom Sawyer Graph Database Browser. For more information, see Exploring scientific research on COVID-19 with Amazon Neptune, Amazon Comprehend Medical, and the Tom Sawyer Graph Database Browser.

Combining our efforts across organizations and scientific disciplines can help us win the fight against the COVID-19 pandemic. With the CKG, you can ask and answer questions regarding the scientific literature around COVID-19. We believe that through an open and collaborative effort that combines data, technology, and science, we can inspire insights and foster breakthroughs necessary to contain, curtail, and ultimately cure COVID-19.

About the Authors

Ninad Kulkarni is a Data Scientist at the Amazon Machine Learning Solutions Lab. He helps customers adopt ML and AI solutions by building solutions to address their business problems. Most recently, he has built predictive models for sports customers for on-screen consumption to improve fan engagement.

Miguel Romero is a Data Scientist at the Amazon Machine Learning Solutions Lab, where he helps AWS customers address business problems with AI and cloud capabilities. Most recently, he has built CV and NLP solutions for sports and healthcare.

Colby Wise is a Data Scientist and manager at the Amazon Machine Learning Solutions Lab, where he helps AWS customers across different industries accelerate their AI and cloud adoption.

George Price is a Deep Learning Architect at the Amazon Machine Learning Solutions Lab, where he helps build models and architectures for AWS customers. Previously, he was a software engineer working on Amazon Alexa.