AWS for Industries

AstraZeneca fine-tunes genomics foundation models with Amazon SageMaker

Understanding the human genome is a massive undertaking. The compute power and tools required are staggering, but the potential discoveries make the effort worthwhile. AstraZeneca’s Centre for Genomics Research (CGR) leads the company’s Genomic Initiative on a global scale, with the ambitious goal of analyzing up to two million genomes by 2026.

Through this integrated analysis of enriched genomic and clinical data, they strive to:

  • Enable a deeper understanding of the biology of disease
  • Identify novel genetic targets for medicines
  • Support patient selection in clinical trials
  • Enable patients to be matched with treatments more likely to benefit them

To support these goals and enable robust clinical insights, AstraZeneca’s CGR is delivering a scalable analysis platform with an arsenal of advanced artificial intelligence and machine learning (AI/ML) tools, such as JARVIS, MILTON, and Mantis-ML, to enable the discovery of novel genetic targets.

One crucial objective of some of these tools is to help determine which of the billions of genetic variations in humans are likely to drive disease mechanisms. Such disease-driving variants are called pathogenic. Undertaking this task at scale, across whole genome sequencing data encompassing around three billion nucleotides, is a daunting challenge and a well-suited use case for machine learning.

We’ll demonstrate one of the recent pathogenicity prediction tools developed by AstraZeneca to analyze human genetic variation. Specifically, we are interested in the regions of DNA that are not translated into proteins, namely the “non-coding” genome. The non-coding genome represents 98 percent of human DNA, and it is where the majority of human variation resides.

We’ll showcase how Amazon SageMaker was used to fine-tune HyenaDNA, a powerful genomic foundation model, allowing AstraZeneca to analyze sequences of up to one million tokens at the single-nucleotide level. This is a significant improvement over prior models for pathogenicity prediction, which could only process context of up to a few thousand nucleotides around each variant of interest. The fine-tuned HyenaDNA embeddings outperformed a well-known baseline score (CADD) in four out of five test datasets for pathogenicity prediction, by 20.9 percent on average.

The model had no additional prior knowledge of genomic functionality and required low development effort, with potential for further gains through knowledge augmentation. Establishing better models for pathogenicity prediction enables accurate prioritization of findings from large-scale genomic association studies and will greatly shorten the path to novel target discovery.

Opportunity overview

Predicting pathogenicity of genomic variants

To understand genetic variation, geneticists have catalogued the most commonly occurring nucleotides at each position of the human genome by observing large populations.

They used these observations to construct the reference genome, which is a template genome incorporating the most up-to-date information we have on human genomics. Each individual genome can then be represented as a collection of deviations from the reference—which we call variants.
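As a simple illustration, a variant can be captured minimally by its chromosome, position, reference allele, and observed alternate allele, following the common VCF-style layout. The sketch below is hypothetical, not AstraZeneca’s internal representation:

```
from typing import NamedTuple

class Variant(NamedTuple):
    """A single-nucleotide variant, expressed as a deviation from the reference."""
    chrom: str  # chromosome name, for example "chr11"
    pos: int    # 1-based position in the reference genome
    ref: str    # nucleotide in the reference genome at this position
    alt: str    # nucleotide observed in the individual's genome

# An A-to-G substitution at position 1,234,567 of chromosome 11
v = Variant(chrom="chr11", pos=1_234_567, ref="A", alt="G")
```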

Figure 1. What are variants?

The significance of most variants is unknown. What AstraZeneca really wants to know is which of them are likely to cause disease; specifically, which are pathogenic. This can help them understand what drives disease and reveal opportunities for novel drug targets. By harnessing large datasets of well-known annotated pathogenic and benign variants, it becomes possible to build ML models capable of predicting pathogenicity for new variants of unknown significance across the whole human genome.

For variants that fall into protein-coding regions, AstraZeneca CGR has already published several tools to predict whether they are likely pathogenic or not, such as RVIS, MTR and OncMTR.

On the other hand, predicting the effect of variants in the non-coding genome is an arduous task. Because this part of the genome is not translated into proteins, variants in this region lack a clearly interpretable effect, such as the loss of function of a protein, and often have unknown functional importance. Prior work by the team led to the development of JARVIS, a deep learning-based predictor of pathogenicity in non-coding variants, which, amongst other features, used the raw DNA sequence context around a variant within a window of three thousand nucleotides.

Genomic foundation models

Genomic foundation models represent a new approach in the field of genomics, offering a way to understand the language of DNA. DNA sequences are extremely long (up to billions of nucleotides), and the sensitivity required to fully understand the effects of evolutionary, environmental or other pressures makes them a particularly challenging domain for large-scale pretraining.

Two recent genomic foundation models that can integrate information over long genomic sequences, while retaining sensitivity to single-nucleotide changes, are:

1. HyenaDNA uses the transformer architecture, like other genomic models, except that it replaces each self-attention layer with a Hyena operator. This widens the context window to allow processing of up to one million tokens, substantially more than prior models, allowing it to learn longer-range interactions in DNA.

2. Evo is a seven-billion-parameter model trained at single-nucleotide (byte) resolution on a large corpus of prokaryotic and phage genomic sequences, generating DNA sequences with a context length of 131 kilobases. It is based on StripedHyena, a deep signal processing architecture designed to improve efficiency and quality over the prevailing transformer architecture.

Using genomic foundation models for pathogenicity prediction

Genomic foundation models such as HyenaDNA allow extracting very long context windows around each variant of interest. This enables capturing interactions with elements that lie further away, even tens of thousands of nucleotides from the variant. Such interactions are relevant and occur often in the human genome. As an example, the rare genetic condition aniridia (lack of an iris in the eye) can be caused by a variant located 133 kilobases downstream of the relevant PAX6 gene.

The team at AstraZeneca CGR harnessed this property of large genomic foundation models to produce a novel pathogenicity predictor, able to examine a long sequence context of 32,000 nucleotides around each variant to determine whether it is likely pathogenic or not. The model had no additional prior knowledge of genomic functionality and relied solely on pretrained foundation model embeddings and the raw genomic sequence context around each variant.
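As a minimal sketch of this context-extraction step, assuming each chromosome’s reference sequence is available as a plain string (the function name and windowing details are illustrative, not the team’s actual code):

```
def extract_context(chrom_seq: str, pos: int, window: int = 32_000) -> str:
    """Return up to `window` nucleotides of reference sequence centred on a
    1-based variant position, truncated at chromosome boundaries."""
    half = window // 2
    start = max(0, pos - 1 - half)            # convert to 0-based indexing
    end = min(len(chrom_seq), pos - 1 + half)
    return chrom_seq[start:end]

# For example, a 32 kb window around position 1,234,567
context = extract_context(chrom_seq="ACGT" * 1_000_000, pos=1_234_567)
```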

By fine-tuning a lightweight classifier on top of HyenaDNA, this large foundation model was repurposed and achieved remarkable performance on variants in the non-coding genome.

Solution overview

Amazon SageMaker Model Training reduces the time and cost to train and tune machine learning (ML) models at scale without the need to manage infrastructure. You can take advantage of the highest-performing ML compute infrastructure currently available. SageMaker can automatically scale infrastructure up or down, from one to thousands of GPUs.

SageMaker provides distributed training libraries that can run highly scalable and cost-effective custom data-parallel and model-parallel deep learning training jobs. You can start an ephemeral, on-demand job that runs a program with a container image using the Amazon SageMaker Python SDK, without self-managing any compute infrastructure. Specifically, Amazon SageMaker provides flexibility in the choice of container image, run script, and instance configuration, and supports a wide variety of storage options.

The team opted for a JupyterLab space setup within Amazon SageMaker Studio, which was a familiar environment to transition into.

The training data were stored on an Amazon Elastic File System (Amazon EFS), shared across all SageMaker users and training job instances. Using Amazon EFS was beneficial to facilitate collaborative workloads, distribute training, and assist in minimizing start-up time for new training jobs. The data were instantly mounted and available on each new training job, eliminating the need to wait for data transfers from Amazon Simple Storage Service (Amazon S3). Amazon EFS provided a managed, scalable file storage service and allowed the team to define infrequent access policies and automatically move infrequently accessed files to low-cost cold storage.
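As an illustration, such a lifecycle policy can be set with the AWS SDK for Python (boto3); the file system ID below is a placeholder:

```
import boto3

# Illustrative lifecycle policy: transition files not accessed for 30 days
# to the lower-cost EFS Infrequent Access storage class.
efs = boto3.client("efs")
efs.put_lifecycle_configuration(
    FileSystemId="<EFS ID>",
    LifecyclePolicies=[{"TransitionToIA": "AFTER_30_DAYS"}],
)
```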

Figure 2. Using JupyterLab on Amazon SageMaker

The adapted HyenaDNA was fine-tuned using an ml.g4dn.2xlarge training instance, which offered a good balance of cost and GPU performance, with a prebuilt SageMaker PyTorch container. Additional dependencies were installed with pip from a requirements.txt file (as described in using third-party libraries) or provided as prebuilt binaries to the training jobs, directly mounted from Amazon EFS.

Amazon SageMaker offers a fully managed MLflow capability. You can compare model performance, parameters, and metrics across experiments in the MLflow UI, keep track of your best models in the MLflow Model Registry, automatically register them as a SageMaker AI model, and deploy registered models to SageMaker AI endpoints.
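As a sketch of how such experiment tracking might look from the training environment (the tracking server ARN, experiment name, and logged values below are placeholders, and the sagemaker-mlflow plugin is assumed to be installed):

```
import mlflow

# With SageMaker managed MLflow, the tracking server's ARN is used directly
# as the tracking URI (placeholder values shown).
mlflow.set_tracking_uri(
    "arn:aws:sagemaker:<region>:<account>:mlflow-tracking-server/<name>"
)
mlflow.set_experiment("hyenadna-pathogenicity")

with mlflow.start_run():
    mlflow.log_params({"batch": 15, "lr": 6e-4, "weight_decay": 0.1})
    mlflow.log_metric("val_auc", 0.84)  # in practice, logged once per epoch
```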

Model architecture and training

HyenaDNA follows the transformer architecture but replaces self-attention with the Hyena operator, an implicit global convolution, which allows scaling to very long input token sequences. It was pre-trained on the reference genome to receive a nucleotide sequence as input and predict the nucleotide that follows it.

Fine-tuning HyenaDNA for pathogenicity prediction involved adding a couple of fully connected (FC) layers on top of the pooled, frozen HyenaDNA embeddings and training them to predict whether each variant was pathogenic or benign. As input to the model, the sequence context around each variant was extracted from the reference genome.

Figure 3. Model architecture for fine-tuning HyenaDNA
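A minimal PyTorch sketch of this architecture follows; the stand-in backbone, embedding dimension, and mean pooling are assumptions for illustration, not the team’s exact implementation:

```
import torch
import torch.nn as nn

class PathogenicityHead(nn.Module):
    """Two fully connected layers over pooled, frozen backbone embeddings."""

    def __init__(self, backbone: nn.Module, d_model: int, d_hidden: int = 256):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False  # keep the foundation model frozen
        self.fc = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, 1),  # single logit: pathogenic vs. benign
        )

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            emb = self.backbone(input_ids)  # (batch, seq_len, d_model)
        pooled = emb.mean(dim=1)            # mean-pool across the sequence
        return self.fc(pooled).squeeze(-1)  # (batch,) logits

# Stand-in backbone: an embedding table over the 4-letter DNA alphabet plus
# padding; in practice this would be the pretrained HyenaDNA encoder.
backbone = nn.Embedding(num_embeddings=5, embedding_dim=128)
model = PathogenicityHead(backbone, d_model=128)
logits = model(torch.randint(0, 5, (2, 1024)))  # two sequences of 1,024 tokens
```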

The training script used the Adam optimizer with a binary cross-entropy loss. Group k-fold cross-validation, grouped by chromosome, made sure that variants from different chromosomes were sampled for training and validation. Early stopping was applied to halt training if the validation loss reached a plateau.
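Putting these pieces together, a condensed sketch of such a training loop might look like the following; the dummy data, classifier head, and patience settings are illustrative assumptions:

```
import numpy as np
import torch
import torch.nn as nn
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold

# Dummy stand-ins: in practice `features` would be pooled HyenaDNA embeddings
# and `chromosomes` the chromosome on which each variant resides.
rng = np.random.default_rng(0)
features = torch.tensor(rng.standard_normal((600, 128)), dtype=torch.float32)
labels = torch.tensor(rng.integers(0, 2, 600), dtype=torch.float32)
chromosomes = rng.integers(1, 23, 600)

loss_fn = nn.BCEWithLogitsLoss()
for fold, (tr, va) in enumerate(
    GroupKFold(n_splits=5).split(features, labels, groups=chromosomes)
):
    head = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.Adam(head.parameters(), lr=6e-4, weight_decay=0.1)
    best_loss, patience, stalled = float("inf"), 5, 0
    for epoch in range(100):
        head.train()
        opt.zero_grad()
        loss = loss_fn(head(features[tr]).squeeze(-1), labels[tr])
        loss.backward()
        opt.step()

        head.eval()
        with torch.no_grad():
            val_logits = head(features[va]).squeeze(-1)
            val_loss = loss_fn(val_logits, labels[va]).item()
        val_auc = roc_auc_score(labels[va].numpy(), val_logits.numpy())
        print(f"fold={fold} val_auc={val_auc:.4f};")  # matched by the metric regex below
        if val_loss < best_loss - 1e-4:
            best_loss, stalled = val_loss, 0
        else:
            stalled += 1
            if stalled >= patience:  # stop when the validation loss plateaus
                break
```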

After defining the training script in Python, a SageMaker training job was configured and submitted. First, the hyperparameters were specified, for example:

```
hyperparameters = {
    "batch": 15,         # mini-batch size
    "lr": 6e-4,          # learning rate for the Adam optimizer
    "weight_decay": 0.1  # Adam weight decay (L2 regularization)
}
```

Then, some metrics were defined to be captured in Amazon CloudWatch logs, such as:

```
metric_definitions = [
    # Captures lines such as "val_auc=0.8405;" printed by the training script
    {"Name": "val:auc", "Regex": "val_auc=(.*?);"}
]
```

Finally, a SageMaker PyTorch estimator was defined, along with the type of instance required for training:

```
from sagemaker.pytorch import PyTorch
from sagemaker import Session, get_execution_role
from sagemaker.inputs import FileSystemInput

estimator = PyTorch(
    base_job_name="tune_hyenaDNA",
    entry_point="train.py",
    source_dir="/efs/path/to/source",  # directory containing train.py and requirements.txt
    instance_type="ml.g4dn.2xlarge",
    instance_count=1,
    image_uri="<PYTORCH_IMAGE_URI>",  # prebuilt SageMaker PyTorch container
    role=get_execution_role(),
    hyperparameters=hyperparameters,
    metric_definitions=metric_definitions,
    sagemaker_session=Session(),
    subnets=["<subnet ID to launch training instances>"],
    security_group_ids=["<security group IDs to attach>"],
    max_run=48 * 60 * 60,  # cap the training job at 48 hours
)

file_system_input = FileSystemInput(
    file_system_id="<EFS ID>",
    file_system_type="EFS",
    directory_path="/efs/relative/path/to/dataset",  # absolute path within the EFS file system
    file_system_access_mode="rw",
)

estimator.fit(file_system_input)
```

In this configuration, appropriate private subnets and security groups were configured for the training job instances to enable mounting the Amazon EFS file system.

Data, experiments, and results

The dataset for training comprised examples of known pathogenic variants from ClinVar and benign variants from denovoDB, sampled at a 1:1 ratio. Variants are represented by their position in the reference genome. Testing used an independent set of benchmarks with pathogenic non-coding variants and benign controls from denovoDB.
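A sketch of such 1:1 sampling with pandas, using hypothetical variant tables standing in for the ClinVar and denovoDB extracts:

```
import pandas as pd

# Hypothetical variant tables: `pathogenic` drawn from ClinVar annotations,
# `benign` from denovoDB (dummy contents shown).
pathogenic = pd.DataFrame({"chrom": ["chr1"] * 3000, "label": 1})
benign = pd.DataFrame({"chrom": ["chr2"] * 5000, "label": 0})

n = min(len(pathogenic), len(benign))
balanced = (
    pd.concat([pathogenic.sample(n, random_state=0),
               benign.sample(n, random_state=0)])
    .sample(frac=1, random_state=0)  # shuffle after balancing to a 1:1 ratio
    .reset_index(drop=True)
)
```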

The following figure describes the training and testing datasets.

Figure 4. Dataset description: the training dataset comprised approximately 6 K variants from ClinVar and denovoDB; the five testing benchmarks ranged from 30 to 13 K variants, from Wang et al. and denovoDB.

Various hyperparameters of the training script, such as the batch size, learning rate (lr), and optimizer weight decay, were optimized by observing the validation ROC AUC and using built-in SageMaker hyperparameter tuning.

Amazon SageMaker AI automatic model tuning (AMT), also known as hyperparameter tuning, finds the best version of a model by running many training jobs on your dataset. To do this, AMT uses the algorithm and the ranges of hyperparameters that you specify, then chooses the hyperparameter values that produce the best-performing model, as measured by a metric you choose.
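The following is a sketch of how such a tuning job can be configured with the SageMaker Python SDK, reusing the estimator, metric definitions, and file system input from earlier; the search ranges mirror those shown in the next figure, while the job counts are illustrative:

```
from sagemaker.tuner import (
    ContinuousParameter,
    HyperparameterTuner,
    IntegerParameter,
)

# Search ranges matching those explored by the team (see the figure below)
hyperparameter_ranges = {
    "batch": IntegerParameter(8, 15),
    "lr": ContinuousParameter(0.0006, 0.00403),
    "weight_decay": ContinuousParameter(0.00109, 0.1),
}

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="val:auc",
    objective_type="Maximize",
    hyperparameter_ranges=hyperparameter_ranges,
    metric_definitions=metric_definitions,
    max_jobs=20,          # illustrative budget of training jobs
    max_parallel_jobs=2,  # run two trials at a time
)
tuner.fit(file_system_input)
```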

Selecting training hyperparameters: Amazon SageMaker AMT identified the set of hyperparameter values with the highest validation AUC score (scores ranged from 0.7743 to 0.8405) across the given ranges for batch size (8 to 15), learning rate (0.0006 to 0.00403), and weight decay (0.00109 to 0.1).

Results are presented in Figure 5 for the best set of hyperparameters across the five testing benchmarks. For comparison, the performance of the well-established CADD score is provided as a baseline. The fine-tuned HyenaDNA embeddings achieved competitive performance in the tested benchmarks and outperformed CADD in four out of five datasets.

| Benchmark (Wang et al.) | HyenaDNA | CADD |
| --- | --- | --- |
| EQTL Common | 0.752 | 0.511 |
| GWAS Common | 0.52 | 0.447 |
| Somatic Cosmic Rare | 0.753 | 0.685 |
| Germline ClinVar Rare | 0.758 | 0.926 |
| Denovo ASD | 0.861 | 0.781 |

Figure 5. Performance of the fine-tuned HyenaDNA embeddings compared with CADD, a well-known baseline score, across the five testing benchmarks

Conclusion

With the help of Amazon SageMaker, the CGR team at AstraZeneca was able to efficiently fine-tune a large genomic foundation model (HyenaDNA) for a downstream task of interest, with minimal effort. This experiment demonstrated that HyenaDNA embeddings alone can achieve good performance in identifying which new variants in the human genome are likely to cause disease.

Adding these powerful embeddings can help further improve existing variant effect predictors, providing better evidence for prioritizing which variants to investigate further.

Contact an AWS Representative to learn how we can help accelerate your business.

Further Reading

About AstraZeneca:

AstraZeneca is a global, science-led, patient-focused pharmaceutical company, dedicated to transforming the future of healthcare by seeking to unlock the power of what science can do.

Hasan Poonawala

Hasan Poonawala is a Senior AI/ML Specialist Solutions Architect at AWS, working with Healthcare and Life Sciences customers. Hasan helps design, deploy and scale generative AI and machine learning applications on AWS. He has 15+ years of combined work experience in machine learning, software development and data science on the cloud. In his spare time, Hasan loves to explore nature and spend time with friends and family.

Anna Maria Tsakiroglou

Anna Maria Tsakiroglou, PhD, is a Senior Data Scientist at the Centre for Genomics Research at AstraZeneca. She has 8+ years of experience in AI for healthcare, with a broad background in generative AI, multi-omics and computer vision. Before joining AZ, she was a co-founder and Chief Scientific Officer at Spotlight Pathology. She enjoys building high performing AI models for clinical applications and is a co-organizer of PyData (Cambridge).

Dimitrios Vitsios

Dimitrios Vitsios is the Director of Data Science at the Centre for Genomics Research at AstraZeneca. He completed his PhD in Computational Biology at the University of Cambridge and the EMBL-EBI, focusing on AI methods development for functional genomics and non-coding RNAs. Since 2020, he’s been leading a team of machine learning researchers/engineers focusing on deep learning methods development for target identification and validation, population genetics and multi-omics.