AstraZeneca fine-tunes genomics foundation models with Amazon SageMaker
Understanding the human genome is a massive undertaking. The compute power and tools required are staggering, but the effort could yield untold discoveries. AstraZeneca’s Centre for Genomics Research (CGR) leads the company’s Genomic Initiative on a global scale, with the ambitious goal of analyzing up to two million genomes by 2026.
Through this integrated analysis of enriched genomic and clinical data, they strive to:
- Enable a deeper understanding of the biology of disease
- Identify novel genetic targets for medicines
- Support patient selection in clinical trials
- Enable patients to be matched with treatments more likely to benefit them
To support these goals and enable robust clinical insights, AstraZeneca’s CGR is delivering a scalable analysis platform with an arsenal of advanced artificial intelligence and machine learning (AI/ML) tools, such as JARVIS, MILTON, and Mantis-ML, that enable the discovery of novel genetic targets.
One crucial objective of some of these tools is to assist in understanding which amongst the billions of genetic variations in humans are likely driving disease mechanisms. These variations are called pathogenic. Undertaking this task on a large scale, involving whole genome sequencing data encompassing around three billion nucleotides, is a daunting challenge and a well-suited use case for machine learning.
We’ll demonstrate one of the recent pathogenicity prediction tools developed by AstraZeneca to analyze human genetic variation. Specifically, we are interested in regions of DNA that are not translated into proteins, namely the “non-coding” genome. The non-coding genome represents 98 percent of human DNA and is where the majority of human variation resides.
We’ll showcase how Amazon SageMaker was used to fine-tune HyenaDNA, a powerful genomic foundation model, allowing AstraZeneca to analyze sequences of up to one million tokens at single-nucleotide resolution. This is a significant improvement over prior models for pathogenicity prediction, which could only process a context of up to a few thousand nucleotides around each variant of interest. The fine-tuned HyenaDNA embeddings outperformed a well-known baseline score (CADD) in four out of five test datasets for pathogenicity prediction, by 20.9 percent on average.
The model required no additional prior knowledge of genomic functionality and little development effort, with potential for further gains through knowledge augmentation. Establishing better models for pathogenicity prediction enables accurate prioritization of findings from large-scale genomic association studies and will greatly shorten the path to novel target discovery.
Opportunity overview
Predicting pathogenicity of genomic variants
To understand genetic variation, geneticists have collected the most commonly occurring nucleotides in each position of the human genome by observing large populations.
They used these observations to construct the reference genome, which is a template genome incorporating the most up-to-date information we have on human genomics. Each individual genome can then be represented as a collection of deviations from the reference—which we call variants.
The significance of each variant is unknown. What AstraZeneca really wants to know is which of them are likely to cause disease, specifically, which are pathogenic. This can help them understand what drives disease and reveal opportunities for novel drug targets. By harnessing large datasets of well-known annotated pathogenic and benign variants, it becomes possible to build ML models capable of predicting pathogenicity for new variants of unknown significance across the whole human genome.
For variants that fall into protein-coding regions, AstraZeneca CGR has already published several tools to predict whether they are likely pathogenic or not, such as RVIS, MTR and OncMTR.
On the other hand, predicting the effect of variants in the non-coding genome is an arduous task. As this part of the genome is not translated into proteins, variants in this region have no clear effect, such as the loss of function of a protein, and often have unknown functional importance. Prior work by the team led to the development of JARVIS, a deep learning-based predictor of pathogenicity in non-coding variants, which amongst other features used the raw DNA sequence context around a variant within a window of three thousand nucleotides.
Genomic foundation models
Genomic foundation models represent a new approach in the field of genomics, offering a way to understand the language of DNA. DNA sequences are extremely long (up to billions of nucleotides), and the sensitivity required to fully understand the effects of evolutionary, environmental or other pressures makes them a particularly challenging domain for large-scale pretraining.
Two recent genomic foundation models that can integrate information over long genomic sequences, while retaining sensitivity to single-nucleotide changes, are:
1. HyenaDNA uses the transformer architecture, like other genomic models, except that it replaces each self-attention layer with a Hyena operator. This widens the context window to allow processing of up to one million tokens, substantially more than prior models, allowing it to learn longer-range interactions in DNA.
2. Evo is trained at a single-nucleotide (byte) resolution, on a large corpus of prokaryotic and phage genomic sequences. Evo is a seven billion parameter model trained to generate DNA sequences using a context length of 131 kilobases at single-nucleotide resolution. It is based on StripedHyena, a deep signal processing architecture designed to improve efficiency and quality over the prevailing Transformer architecture.
Using genomic foundation models for pathogenicity prediction
Genomic foundation models such as HyenaDNA allow extracting very long context windows around each variant of interest. This makes it possible to capture interactions with elements that lie much farther away, even tens of thousands of nucleotides from the variant. Such long-range interactions are relevant and occur often in the human genome. As an example, the rare genetic condition aniridia (lack of an iris in the eye) can be caused by a variant located 133 kilobases downstream of the relevant PAX6 gene.
The team at AstraZeneca CGR was able to harness this property of large genomic foundation models to produce a novel pathogenicity predictor. It examines a long sequence context of 32,000 nucleotides around each variant to determine whether the variant is likely pathogenic or not. The model had no additional prior knowledge of genomic functionality and relied solely on pretrained foundation model embeddings and the raw genomic sequence context around each variant.
By fine-tuning a lightweight classifier on top of HyenaDNA, this large foundation model was repurposed and achieved remarkable performance for variants in the non-coding genome.
Solution overview
Amazon SageMaker Model Training reduces the time and cost to train and tune machine learning (ML) models at scale without the need to manage infrastructure. You can take advantage of the highest-performing ML compute infrastructure currently available. SageMaker can automatically scale infrastructure up or down, from one to thousands of GPUs.
SageMaker provides distributed training libraries that can run highly scalable, cost-effective custom data parallel and model parallel deep learning training jobs. Using the Amazon SageMaker Python SDK, you can start an ephemeral, on-demand job that runs a program with a container image, without self-managing any compute infrastructure. Specifically, Amazon SageMaker provides flexibility in the choice of container image, run script, and instance configuration, and supports a wide variety of storage options.
The team opted for a JupyterLab space setup within Amazon SageMaker Studio, which was a familiar environment to transition into.
The training data were stored on an Amazon Elastic File System (Amazon EFS) file system, shared across all SageMaker users and training job instances. Using Amazon EFS facilitated collaborative workloads and distributed training, and helped minimize start-up time for new training jobs: the data were instantly mounted and available on each new training job, eliminating the need to wait for data transfers from Amazon Simple Storage Service (Amazon S3). Amazon EFS provided a managed, scalable file storage service and allowed the team to define infrequent access policies that automatically move infrequently accessed files to low-cost cold storage.
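As an illustrative sketch, an EFS file system can be passed to a SageMaker training job as a data channel; the file system ID and directory path below are placeholders, not the team’s actual configuration:

```python
from sagemaker.inputs import FileSystemInput

# Hypothetical EFS file system containing the training data
train_input = FileSystemInput(
    file_system_id="fs-0123456789abcdef0",    # placeholder EFS ID
    file_system_type="EFS",
    directory_path="/genomics/training-data",  # placeholder path on the file system
    file_system_access_mode="ro",              # read-only access for training jobs
)

# train_input is later passed to estimator.fit({"training": train_input});
# the job must run in private subnets and security groups that can reach the EFS mount targets.
```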
Figure 2. Using JupyterLab on Amazon SageMaker
The adapted HyenaDNA model was fine-tuned using an ml.g4dn.2xlarge training instance, which offered a good balance of cost and GPU performance, with a prebuilt SageMaker PyTorch container. Additional dependencies were installed with pip from a requirements.txt file (as described in using third-party libraries) or provided to the training jobs as pre-built binaries, directly mounted from Amazon EFS.
Amazon SageMaker offers a fully managed MLflow capability. You can compare model performance, parameters, and metrics across experiments in the MLflow UI, keep track of your best models in the MLflow Model Registry, automatically register them as SageMaker AI models, and deploy registered models to SageMaker AI endpoints.
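As a minimal sketch, a training script can log runs to a SageMaker managed MLflow tracking server using the standard MLflow API; the tracking server ARN, experiment name, and metric values below are placeholders, and the sagemaker-mlflow plugin is assumed to be installed:

```python
import mlflow

# Placeholder ARN of a SageMaker managed MLflow tracking server
mlflow.set_tracking_uri(
    "arn:aws:sagemaker:eu-west-1:111122223333:mlflow-tracking-server/genomics-experiments"
)
mlflow.set_experiment("hyenadna-pathogenicity")  # hypothetical experiment name

with mlflow.start_run():
    mlflow.log_params({"learning_rate": 1e-4, "batch_size": 4})
    mlflow.log_metric("validation_roc_auc", 0.85, step=1)  # illustrative value
```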
Model architecture and training
HyenaDNA builds on the transformer architecture but replaces attention with an implicit global convolution (the Hyena operator), which allows it to scale to very long input token sequences. It was pre-trained on the reference genome with a next-nucleotide objective: given a nucleotide sequence as input, predict the nucleotide that follows it.
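For illustration, a pretrained HyenaDNA checkpoint and its character-level tokenizer can be loaded from the Hugging Face Hub; the checkpoint name below is one of the publicly released HyenaDNA variants and is an assumption, not necessarily the model used by the team:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed public checkpoint with a 32k-token context window
checkpoint = "LongSafari/hyenadna-small-32k-seqlen-hf"

tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
backbone = AutoModel.from_pretrained(checkpoint, trust_remote_code=True)

sequence = "ACGT" * 1000  # toy nucleotide sequence standing in for a variant's context
input_ids = tokenizer(sequence, return_tensors="pt")["input_ids"]

with torch.no_grad():
    # Assumes the checkpoint's custom model class exposes per-token hidden states
    hidden_states = backbone(input_ids).last_hidden_state  # (1, seq_len, hidden_dim)

# Mean-pool over the sequence to obtain one fixed-size embedding per input sequence
embedding = hidden_states.mean(dim=1)
```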
Fine-tuning HyenaDNA for pathogenicity prediction involved adding a couple of fully connected (FC) layers on top of pooled, frozen HyenaDNA embeddings and training them to predict whether each variant is pathogenic or benign. As input to the model, the sequence context around each variant was extracted from the reference genome.
Figure 3. Model architecture for fine-tuning HyenaDNA
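A minimal PyTorch sketch of such a classification head is shown below; the layer sizes, dropout, and mean pooling are illustrative assumptions, not the team’s exact configuration:

```python
import torch
import torch.nn as nn

class PathogenicityHead(nn.Module):
    """Small fully connected classifier trained on top of frozen HyenaDNA embeddings."""

    def __init__(self, embedding_dim: int = 256, hidden_dim: int = 128):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, 1),  # single logit: pathogenic vs. benign
        )

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, embedding_dim) from the frozen backbone
        pooled = token_embeddings.mean(dim=1)       # average pooling over the sequence
        return self.classifier(pooled).squeeze(-1)  # (batch,) logits
```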
The training script used the Adam optimizer with a binary cross-entropy loss. Group k-fold cross-validation, with chromosome as the grouping variable, ensured that variants from the same chromosome were never split across the training and validation sets. Early stopping was applied to halt training if the validation loss plateaued.
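A simplified sketch of such a training loop follows; the tensors, layer sizes, and early-stopping patience are illustrative placeholders rather than the team’s actual training script:

```python
import numpy as np
import torch
from sklearn.model_selection import GroupKFold
from torch import nn

# Illustrative placeholders: pooled HyenaDNA embeddings per variant, binary labels
# (1 = pathogenic, 0 = benign), and the chromosome each variant belongs to.
rng = np.random.default_rng(0)
X = torch.from_numpy(rng.standard_normal((1000, 256)).astype(np.float32))
y = torch.from_numpy(rng.integers(0, 2, 1000).astype(np.float32))
chromosomes = rng.choice([f"chr{i}" for i in range(1, 23)], size=1000)

loss_fn = nn.BCEWithLogitsLoss()
# Grouping by chromosome keeps all variants of a chromosome in the same fold,
# so the training and validation splits never share a chromosome.
splitter = GroupKFold(n_splits=5)

for train_idx, val_idx in splitter.split(X.numpy(), y.numpy(), groups=chromosomes):
    head = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 1))
    optimizer = torch.optim.Adam(head.parameters(), lr=1e-4, weight_decay=1e-5)

    best_val_loss, patience, stale_epochs = float("inf"), 5, 0
    for epoch in range(100):
        head.train()
        optimizer.zero_grad()
        loss = loss_fn(head(X[train_idx]).squeeze(-1), y[train_idx])
        loss.backward()
        optimizer.step()

        head.eval()
        with torch.no_grad():
            val_loss = loss_fn(head(X[val_idx]).squeeze(-1), y[val_idx]).item()

        # Early stopping: halt when the validation loss stops improving
        if val_loss < best_val_loss:
            best_val_loss, stale_epochs = val_loss, 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break
```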
After defining the training script in Python, a SageMaker training job was configured and submitted. First, the hyperparameters were specified, for example:
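The dictionary below is an illustrative example only; the specific keys and values are assumptions, not the team’s actual settings:

```python
hyperparameters = {
    "epochs": 50,
    "batch_size": 4,
    "learning_rate": 1e-4,
    "weight_decay": 1e-5,
    "context_length": 32000,        # nucleotides of sequence context around each variant
    "early_stopping_patience": 5,
}
```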
Next, metric definitions were specified so that training metrics would be captured in Amazon CloudWatch Logs, for example:
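Metric definitions match patterns that the training script prints to its logs; the metric names and regular expressions below assume hypothetical log lines such as `validation_roc_auc: 0.85`:

```python
metric_definitions = [
    {"Name": "train:loss", "Regex": r"train_loss: ([0-9\.]+)"},
    {"Name": "validation:loss", "Regex": r"validation_loss: ([0-9\.]+)"},
    {"Name": "validation:roc_auc", "Regex": r"validation_roc_auc: ([0-9\.]+)"},
]
```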
Finally, a SageMaker PyTorch estimator was defined, along with the type of instance required for training:
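The following is an illustrative estimator definition using the prebuilt SageMaker PyTorch container; the entry point, source directory, framework version, subnets, and security groups are placeholders, and `hyperparameters`, `metric_definitions`, and `train_input` refer to the earlier sketches:

```python
import sagemaker
from sagemaker.pytorch import PyTorch

role = sagemaker.get_execution_role()  # SageMaker execution role

estimator = PyTorch(
    entry_point="train.py",                 # training script (placeholder name)
    source_dir="src",                       # directory containing train.py and requirements.txt
    role=role,
    instance_type="ml.g4dn.2xlarge",
    instance_count=1,
    framework_version="2.1",                # prebuilt SageMaker PyTorch container (assumed version)
    py_version="py310",
    hyperparameters=hyperparameters,
    metric_definitions=metric_definitions,
    subnets=["subnet-0123456789abcdef0"],          # placeholder private subnet
    security_group_ids=["sg-0123456789abcdef0"],   # placeholder security group
)

# train_input is the EFS FileSystemInput sketched earlier
estimator.fit({"training": train_input})
```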
In this configuration, appropriate private subnets and security groups were specified for the training job instances to enable mounting the Amazon EFS file system.
Data, experiments, and results
The dataset for training comprises examples of known pathogenic variants from ClinVar and benign variants from denovoDB, sampled at a 1:1 ratio. Variants are represented by their position in the reference genome. Testing used an independent set of benchmarks with pathogenic non-coding variants and benign controls from denovoDB.
The following table provides information about training and testing datasets.
Various hyperparameters of the training script, such as the batch size, learning rate (lr), and optimizer weight decay, were optimized using built-in SageMaker hyperparameter tuning, with the validation ROC AUC as the objective metric.
Amazon SageMaker AI automatic model tuning (AMT), also known as hyperparameter tuning, finds the best version of a model by running many training jobs on your dataset. To do this, AMT uses the algorithm and the ranges of hyperparameters that you specify, then chooses the hyperparameter values that produce the best-performing model, as measured by a metric you choose.
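As a sketch, such a tuning job can be set up with the SageMaker Python SDK HyperparameterTuner; the parameter ranges and job counts below are illustrative, and `estimator`, `metric_definitions`, and `train_input` come from the earlier sketches:

```python
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:roc_auc",   # maximize validation ROC AUC
    objective_type="Maximize",
    metric_definitions=metric_definitions,
    hyperparameter_ranges={
        "learning_rate": ContinuousParameter(1e-5, 1e-3),
        "weight_decay": ContinuousParameter(1e-6, 1e-3),
        # batch size could also be tuned, for example with a CategoricalParameter
    },
    max_jobs=20,
    max_parallel_jobs=2,
)

tuner.fit({"training": train_input})
```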
Results are presented in Figure 5 for the best set of hyperparameters across the five testing benchmarks. For comparison, the performance of the well-established CADD score is provided as a baseline. The fine-tuned HyenaDNA embeddings achieved competitive performance across the tested benchmarks and outperformed CADD in four out of five datasets.
Conclusion
With the help of Amazon SageMaker, the CGR team at AstraZeneca was able to fine-tune a large genomic foundation model (HyenaDNA) for a downstream task of interest efficiently and with minimal effort. This experiment demonstrated that HyenaDNA embeddings alone can achieve good performance in identifying which new variants in the human genome are likely to cause disease.
Adding these powerful embeddings can help further improve on existing variant effect predictors, providing improved evidence to help prioritize which variants to investigate further.
Contact an AWS Representative to learn how we can help accelerate your business.
Further Reading
- SageMaker training with examples
- More examples of using genomics language models on AWS
- A tutorial to obtain HyenaDNA embeddings and perform fine-tuning
- Pre-training genomic language models using AWS HealthOmics and Amazon SageMaker describes how to pre-train a HyenaDNA model with your own proprietary data on AWS
About AstraZeneca:
AstraZeneca is a global, science-led, patient-focused pharmaceutical company, dedicated to transforming the future of healthcare by seeking to unlock the power of what science can do.