AWS Startups Blog
Accelerating Drug Development with Amazon Comprehend at Sumitovant
Guest post by Justin S. Lee, Digital Innovator, Sumitovant Biopharma, Inc.
At Sumitovant Biopharma, we seek to discover the drugs of the future and rapidly get them to the patients who need them. Scientific research is key to our endeavor. To help us bring medicines to market faster, we need to pick out specific insights from the ever-growing body of literature on chemistry, biology, and disease. Connecting in-house scientists and clinicians to these insights can alter the approach to developing a drug, potentially revealing new diseases we might treat, better ways to give the drug to patients, or ways to reduce serious side effects such as drug-drug interactions. Developing a drug is a years-long and highly interdependent set of scientific, industrial, and regulatory processes; acting on any insights sooner enables these processes to complete as quickly as possible.
Literature Searches for Clinical Due Diligence
At Sumitovant, the Advanced Comprehensive Execution and Strategy (ACES) team is responsible for monitoring, synthesizing, and interpreting the scientific literature and using this information to make better drug development decisions. The ACES team provides guidance and due diligence to our drug development teams on medicine and pharmacology, CMC (chemistry, manufacturing, and control), clinical trial design and operations, and regulatory approval processes. As a Digital Innovator embedded with the ACES team, I use software development and machine learning practices to build tools that accelerate their work, flag relevant research, and extract meaningful insights from literature to make the drug development process more effective.
As a clinical due diligence team, ACES needs to be aware of the latest clinical research in a wide variety of areas. But staying on top of advances in drug development can be an overwhelming task, even for experienced scientists. Manual searches typically involve looking through online resources such as PubMed or ClinicalTrials.gov – centralized repositories of research in the life sciences maintained by the U.S. National Institutes of Health. ClinicalTrials.gov lists key details and results of clinical trials having any sites in the United States. Unfortunately, clinical trial results are sometimes posted late to ClinicalTrials.gov, or not at all. In these cases, we need to turn directly to publications describing the clinical trials. PubMed is a searchable repository of life sciences publications, and features publication metadata such as authors and full-text abstracts. The key results of a clinical study publication are almost always stated in its abstract. However, PubMed is not specifically tailored toward clinical trials, so its interface does not surface the clinically meaningful components of a paper or study. This means that valuable insights that could shape our drug discovery and development efforts are either discovered more slowly or, much worse, not discovered at all. So, the problem is twofold: 1) finding relevant research, and 2) extracting relevant features from this research so that they can quickly be evaluated for impact.
The Solution – Amazon Comprehend
To help advance these goals, we developed an internal web app called the Study Summarizer that searches PubMed and labels key results within clinical study publications. The Study Summarizer uses the PubMed API to present search results with relevant clinical trial data. Then, it calls a model trained using Amazon Comprehend to identify sentences containing key results and points those sentences out to the user.
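The search step can be illustrated with a minimal sketch. This is not the Study Summarizer's actual code: it assumes the "PubMed API" refers to the public NCBI E-utilities ESearch endpoint, and the function names are hypothetical. It only builds the request URL and parses the JSON response, so fetching the URL is left to whichever HTTP client you prefer.

```python
import json
from typing import List
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (assumed here to be the "PubMed API").
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_url(term: str, max_results: int = 20) -> str:
    """Build an ESearch URL that returns matching PubMed IDs as JSON."""
    params = {
        "db": "pubmed",         # search the PubMed database
        "term": term,           # free-text query, same syntax as the website
        "retmode": "json",      # ask for a JSON response
        "retmax": max_results,  # cap the number of IDs returned
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


def parse_pmids(response_body: str) -> List[str]:
    """Extract the list of PubMed IDs from an ESearch JSON response body."""
    payload = json.loads(response_body)
    return payload.get("esearchresult", {}).get("idlist", [])
```

The returned IDs can then be passed to a second E-utilities call to retrieve each record's abstract, which is the text the classification step works on.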
Comprehend made it simple to train and deploy a custom text classification model. The first step was to find an appropriate dataset, which was not difficult. The application of natural language processing (NLP) to scientific text continues to be an area of intense research interest, and the research community has provided curated, open-source datasets for use by other researchers and practitioners to train their own models. I was able to find a large dataset of sentences from scientific abstracts labeled with discrete categories describing the contents of those sentences. The labels are based on a logical device common in the life sciences called the “PICO process.” PICO is an acronym that is used to identify the information needed to characterize a clinical trial. While there are multiple variants of the acronym, one of the most common is “Population, Intervention, Comparison, Outcome.” In the case of this dataset, one of the labels specifically tagged study results. Using this category, the Study Summarizer is able to send abstracts sentence-by-sentence to the trained Comprehend model and use the model’s predicted probability distributions to filter out sentences with a low probability of containing results.
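The sentence-by-sentence filtering described above can be sketched as follows. This is an illustration under stated assumptions, not the production code: the label name `RESULTS`, the 0.5 threshold, the naive sentence splitter, and all function names are hypothetical, and the Comprehend wrapper assumes the custom model is served from a real-time endpoint. The `classify_document` call and its `Classes` response field are part of the boto3 Comprehend client.

```python
import re
from typing import Callable, Dict, List, Tuple

# A classifier maps one sentence to a {label: probability} dictionary.
Classifier = Callable[[str], Dict[str, float]]


def split_sentences(abstract: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace.

    Abbreviations such as "vs. placebo" will be split incorrectly; a real
    application would likely use a proper sentence tokenizer.
    """
    parts = re.split(r"(?<=[.!?])\s+", abstract.strip())
    return [p for p in parts if p]


def key_result_sentences(
    abstract: str,
    classify: Classifier,
    label: str = "RESULTS",   # hypothetical label name
    threshold: float = 0.5,   # hypothetical cutoff
) -> List[Tuple[str, float]]:
    """Keep sentences whose predicted probability for `label` meets the cutoff."""
    kept = []
    for sentence in split_sentences(abstract):
        score = classify(sentence).get(label, 0.0)
        if score >= threshold:
            kept.append((sentence, score))
    return kept


def make_comprehend_classifier(endpoint_arn: str) -> Classifier:
    """Wrap a Comprehend custom classifier endpoint as a Classifier."""
    import boto3  # imported lazily so the functions above run without AWS

    client = boto3.client("comprehend")

    def classify(sentence: str) -> Dict[str, float]:
        response = client.classify_document(Text=sentence, EndpointArn=endpoint_arn)
        return {c["Name"]: c["Score"] for c in response["Classes"]}

    return classify
```

Passing the classifier in as a plain function keeps the filtering logic independent of AWS, so it can be exercised with a stub before pointing it at a deployed model.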
Comprehend handled all the typical to-dos of training a custom machine learning model. We didn’t need to spin up a dedicated training machine with GPUs or experiment with instance types to ensure that training finished within a reasonable amount of time. We didn’t need to write custom data processing pipelines, write code to configure any models, tune hyperparameters, or pick tokenizers. Most important of all, the trained model was immediately available in production. The Study Summarizer also lives within Sumitovant’s Digital Innovation Platform, which is built on AWS and lets us rapidly create, debug, and deploy internal applications throughout our organization and its affiliates. As a result, our DevOps team was able to seamlessly integrate Comprehend into the app.
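To give a sense of how little configuration such a training job needs, here is a hedged sketch of starting one with boto3. It is not taken from our deployment: the function names are hypothetical, the classifier name, S3 URI, and role ARN are placeholders the caller supplies, and it assumes the labeled sentences have already been uploaded to S3 as a two-column CSV (label, then text). The `create_document_classifier` call and the parameters shown are part of the boto3 Comprehend client.

```python
from typing import Any, Dict


def build_training_request(
    name: str, training_s3_uri: str, role_arn: str
) -> Dict[str, Any]:
    """Assemble the arguments for Comprehend's create_document_classifier."""
    return {
        "DocumentClassifierName": name,
        "DataAccessRoleArn": role_arn,                  # role Comprehend uses to read S3
        "InputDataConfig": {"S3Uri": training_s3_uri},  # labeled CSV in S3
        "LanguageCode": "en",
        "Mode": "MULTI_CLASS",                          # one label per sentence
    }


def start_training(request: Dict[str, Any]) -> str:
    """Submit the training job and return the new classifier's ARN."""
    import boto3  # imported lazily so the request builder runs without AWS

    client = boto3.client("comprehend")
    response = client.create_document_classifier(**request)
    return response["DocumentClassifierArn"]
```

Nothing in the request mentions instance types, tokenizers, or hyperparameters; those choices are handled by the service.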
Going Forward
The Study Summarizer is already accelerating the path by which research informs drug development practice at Sumitovant. Our next steps are to continue to build and refine feature extraction capabilities for increasingly specialized research components of interest. With the Comprehend model in production and in use by the app, the stage has been set to achieve more granular information extraction on each abstract. Each iteration of the Study Summarizer will enable ACES to be more efficient than before; every hour saved on searching for papers can be dedicated to delivering scientific due diligence to support our drug pipeline. Ultimately, this means that our new drug candidates are filed with regulatory agencies more quickly, and, if approved, end up in your local pharmacy that much faster.
To read more about the exciting and innovative pipeline of drug candidates we are developing at Sumitovant, please visit our website at https://www.sumitovant.com/.