AWS for Industries

Accelerating drug discovery through knowledge graph

Biopharma companies have a growing need to accelerate drug discovery insights by drawing on molecular, manufacturing, lab, and other data sets. A knowledge graph built on AWS and applied across the drug discovery value chain could deliver that value.

Overview of the current situation

In 2019, the cost of R&D in pharma was estimated at $83 billion (Congressional Budget Office estimate). The number of new drugs approved for sale has also gone up significantly, an increase of 60% compared to the previous decade. Even so, accelerating drug discovery through digital transformation remains a game-changing opportunity as biopharmaceutical companies race to find therapies in under-served therapeutic areas.

Modernizing research sites to automate drug discovery workflows will allow organizations to realize a Pharma 4.0 vision. This new vision can be focused on outcomes specific to clinical, process development, and manufacturing data aggregation.

Over the last several years, biopharmaceutical companies have collected troves of data, in either siloed or connected formats, residing in research, discovery, and manufacturing process-specific workflows. These data sources include:

  • In-silico analysis of molecular structures
  • Wet lab analysis during lead identification
  • Data generated during molecule creation in labs or in cell lines
    • depending on whether they are small or large molecules
  • Clinical trial data
    • where adverse events specific to a drug can be associated to different factors
  • Systematic documents
    • used in bio-medicine to contextualize experimental data

When a discovery team embarks on new research for a therapy by evaluating candidate molecules, it is very likely that the organization already holds information about many of these molecules in multiple internal data systems.

However, this data is largely unavailable to the discovery team because it is distributed across systems and departments. Data from manufacturing inputs may also not be visible. It can take weeks or months to connect this data and get a full view of all the information available on a particular candidate molecule or any other entity of interest.

The solution

This problem of siloed, disconnected data sources can be solved by creating a knowledge graph. A knowledge graph makes it possible to quickly connect and find data. This helps in lead optimization, because factors drawn from previous experiments can inform decisions on the probability of success and on whether further research is worth pursuing.

A solution design architecture leveraging multiple AWS services can effectively connect data across various silos and source systems. The consolidated data can then be applied to different downstream tasks.

Architecture of knowledge graph construction and downstream tasks

The architecture shown above addresses two primary needs:

  1. user experience for the front-end and
  2. data centralization setup associated with the back-end.

In the process of developing a new drug product, access to research efforts from different domains of past drug design and development is crucial. This research is scattered across different source systems, so a back-end pipeline is required to bring it together.

The back-end provides the infrastructure to ingest siloed data into Amazon Neptune. Scheduled AWS Glue jobs perform the necessary Extract, Transform, and Load (ETL) processes, converting the data into a format that fits the knowledge graph and staging it in an Amazon Simple Storage Service (Amazon S3) bucket. AWS Lambda then pushes the data from Amazon S3 into Amazon Neptune. The status of the ETL jobs is reported through Amazon Simple Notification Service (Amazon SNS) notifications and monitored using Amazon CloudWatch. This type of pipeline connects disparate data, which would otherwise take days to access source by source, into an ever-growing knowledge graph.
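As a rough illustration of the staging and load steps, the Python sketch below renders records in the CSV layout used by the Neptune bulk loader for Gremlin data (`~id`, `~label`, `~from`, `~to` system columns) and builds the JSON body for a bulk-load request from an S3 event. The function names, the `name` property, the edge ID scheme, and the sample values are assumptions made for this example and do not come from the architecture described here. Sending the request to the cluster's loader endpoint, and signing it when IAM database authentication is enabled, is left out.

```python
import csv
import io
from urllib.parse import unquote_plus


def vertices_to_csv(vertices):
    """Render vertices in the Neptune Gremlin bulk-load CSV layout."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # ~id and ~label are system columns; name:String is a typed property column.
    writer.writerow(["~id", "~label", "name:String"])
    for v in vertices:
        writer.writerow([v["id"], v["label"], v["name"]])
    return out.getvalue()


def edges_to_csv(edges):
    """Render edges in the Neptune Gremlin bulk-load CSV layout."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["~id", "~from", "~to", "~label"])
    for e in edges:
        # Hypothetical ID scheme: deterministic, so reloading a file reuses IDs.
        edge_id = f'{e["from"]}-{e["label"]}-{e["to"]}'
        writer.writerow([edge_id, e["from"], e["to"], e["label"]])
    return out.getvalue()


def build_loader_request(s3_event, iam_role_arn, region):
    """Build the JSON body for a Neptune bulk-load request from an S3 event."""
    record = s3_event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = unquote_plus(record["object"]["key"])
    return {
        "source": f"s3://{bucket}/{key}",
        "format": "csv",
        "iamRoleArn": iam_role_arn,
        "region": region,
        "failOnError": "FALSE",
    }


# Illustrative data only: one molecule tested in one experiment.
sample_vertices = [
    {"id": "mol-1", "label": "molecule", "name": "Compound A"},
    {"id": "exp-7", "label": "experiment", "name": "Solubility screen"},
]
sample_edges = [{"from": "mol-1", "to": "exp-7", "label": "tested_in"}]
print(vertices_to_csv(sample_vertices))
print(edges_to_csv(sample_edges))
```

In a Lambda function, the dictionary returned by `build_loader_request` would be posted as JSON to the Neptune cluster's `/loader` endpoint. Using deterministic IDs is one possible design choice that keeps repeated ETL runs from creating duplicate edges.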

Furthermore, a predictive model can optionally be developed in Amazon SageMaker, using libraries like Deep Graph Library (DGL), to extend the knowledge graph to different downstream tasks. These include, but are not limited to, node classification for toxicity classification and edge inference for drug similarity recommendation. The outputs of such a model can further enrich the knowledge graph.
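To make the idea of edge inference for drug similarity concrete, here is a deliberately simple Python baseline. It is not a DGL or graph neural network model: it scores each pair of drugs by the Jaccard overlap of their graph neighbors, which is the kind of signal a learned model would refine. The drug and target names are hypothetical.

```python
from itertools import combinations


def build_neighbors(edges):
    """Map each drug to the set of nodes (for example, targets) it links to."""
    neighbors = {}
    for drug, other in edges:
        neighbors.setdefault(drug, set()).add(other)
    return neighbors


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_similar_pairs(edges):
    """Score every drug pair by neighbor overlap, highest score first."""
    neighbors = build_neighbors(edges)
    scored = [
        (jaccard(neighbors[x], neighbors[y]), x, y)
        for x, y in combinations(sorted(neighbors), 2)
    ]
    # Sort by descending score, then by name so ties are deterministic.
    return sorted(scored, key=lambda t: (-t[0], t[1], t[2]))


# Illustrative drug-to-target edges only.
sample_edges = [
    ("drug_a", "target_1"),
    ("drug_a", "target_2"),
    ("drug_b", "target_2"),
    ("drug_b", "target_3"),
    ("drug_c", "target_4"),
]
for score, x, y in rank_similar_pairs(sample_edges):
    print(f"{x} ~ {y}: {score:.2f}")
```

Pairs that score highly but are not yet connected in the graph are candidates for a new similarity edge, which is one way model output could be written back to enrich the graph.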

Another downstream use of the knowledge graph is a search app, which can be deployed using AWS Amplify. The deployed app gives users access to the centralized knowledge graph so they can search the constructed graph’s data. End users can then use what they find to make better decisions on the next steps in drug development. For example, a search could surface a past experiment on a similar drug and lead to adjusting an experiment’s parameter to increase drug product yield.
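The kind of lookup behind such a search can be sketched as a two-hop traversal: from a drug to similar drugs, then to the experiments run on them. The Python example below uses a small in-memory dictionary as a stand-in for the graph. The relationship names, property names, and values are all hypothetical, and a deployed app would query Amazon Neptune instead.

```python
# Hypothetical in-memory stand-in for the knowledge graph.
GRAPH = {
    "similar_to": {"drug_x": ["drug_a", "drug_b"]},
    "tested_in": {"drug_a": ["exp_1", "exp_2"], "drug_b": ["exp_3"]},
    "experiments": {
        "exp_1": {"temperature_c": 30, "yield_pct": 62.0},
        "exp_2": {"temperature_c": 34, "yield_pct": 71.5},
        "exp_3": {"temperature_c": 37, "yield_pct": 55.0},
    },
}


def find_related_experiments(graph, drug):
    """Return past experiments on drugs similar to `drug`, best yield first."""
    hits = []
    # Hop 1: drug -> similar drugs. Hop 2: similar drug -> experiments.
    for similar in graph["similar_to"].get(drug, []):
        for exp_id in graph["tested_in"].get(similar, []):
            hit = {"similar_drug": similar, "experiment": exp_id}
            hit.update(graph["experiments"][exp_id])
            hits.append(hit)
    return sorted(hits, key=lambda h: h["yield_pct"], reverse=True)


for hit in find_related_experiments(GRAPH, "drug_x"):
    print(hit)
```

Sorting by yield puts the most promising parameter settings first, which mirrors the yield-tuning scenario described above.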


Siloed information, spread across an organization, is a challenge that can be addressed by implementing a knowledge graph. A knowledge graph can serve different end users, whether they are comparing the characteristics of different historic drugs or using historic data to build predictions for a new drug design. Instituting a knowledge graph can help provide cost savings while enabling a faster time-to-market through a sharp reduction in drug discovery cycle time and improved visibility into molecule dossier project tracking. Additionally, knowledge graph constructs that incorporate dossier-level data inputs enable insights and timely reporting across the organization, which help accelerate drug discovery and reduce attrition of drug candidates.

It can be challenging for biopharma and life sciences customers to understand where to start, which is why AWS offers workshops specifically designed to support and facilitate the development of knowledge graph programs. Reach out to your account executive or contact the AWS sales team to understand how you can get started with AWS to initiate or accelerate drug discovery using a knowledge graph.

Kannan Raman

Kannan Raman leads the global sales team for the ProServe Healthcare and Life Sciences practice at AWS. He has 22+ years of life sciences experience and provides thought leadership in digital transformation. He works with C-level client executives to help them with their digital transformation agenda.

Dashiell Flynn

Dashiell Flynn is a Machine Learning Strategist with 20+ years of experience across artificial intelligence (AI), machine learning (ML), and quantitative analysis. He works with customers to translate their business objectives into actionable ML use cases; then collaborates with Data Science Teams to develop the associated models and realize the business outcome. Dashiell holds a BA in Economics from Vassar College and an MPA in International Development from Harvard Kennedy School.

Misha St. Lorant

Misha St. Lorant has over 15 years in the Life Science and Healthcare industry with specific focus in the biopharmaceutical, immunotherapy and clinical trials domains. Prior to joining AWS, Misha was an engineering solutions director focused on Biopharma and Cell Therapy manufacturing, pharmaceutical clinical trials and imaging patient care pathway workflow optimization at GE. Misha also spent over 11 years at Microsoft running program, product and engineering teams focused on healthcare IT, platforming and patient facing applications. Misha received a BA from Oregon State University in economics and finance, an MBA in technology management from UOP and certificates in statistical analysis from Northwestern University – Kellogg School of Management.

Prithiviraj Jothikumar

Prithiviraj Jothikumar, PhD, is a Senior Data Scientist with AWS Professional Services, where he helps customers build solutions using machine learning. He enjoys watching movies and sports and also spends time meditating.