AWS HPC Blog
Intel Open Omics Acceleration Framework on AWS: fast, cost-efficient, and seamless
This post was contributed by Olivia Choudhury, PhD and Aniket Deshpande from AWS; Sanchit Misra, PhD, Vasimuddin Md., PhD, Narendra Chaudhary, PhD, Saurabh Kalikar, PhD, and Manasi Tiwari, PhD, Research Scientists at Intel Labs India; Ashish Kumar Patel, contingent worker with Intel Technology India Pvt. Ltd.
Omics, which spans genomics, proteomics, transcriptomics, and metabolomics, is a rapidly growing field. Our ability to measure omics data is increasing at a dramatic pace, and new data science (AI and data management) pipelines are being developed and quickly standardized. The cloud plays a key role in this growth by providing the massive compute and storage needed for public data repositories, large collaborative projects, and consortia.
To drive and realize the promise of omics, Intel Labs is building the Open Omics Acceleration Framework as a highly productive platform for biologists and data scientists, enabling them to harness computing and data at unprecedented scale and speed at lower costs.
In this post, we benchmark three standard omics pipelines – AlphaFold2-based protein folding, DeepVariant-based variant calling, and Scanpy-based single-cell RNA-Seq analysis – available in the latest version of Open Omics Acceleration Framework (v. 2.1) against prior CPU baseline on Xeon-based Amazon Elastic Compute Cloud (Amazon EC2) instances to showcase its performance and cost benefits.
Open Omics Acceleration Framework
The Open Omics Acceleration Framework is a one-click, containerized, customizable, open-source framework for accelerating the analysis of omics datasets. Its modular design reflects the different ways users want to interact with it. As shown in Figure 1, it consists of three layers:
- Pipeline layer: for users who are looking for a one-click solution to run standard pipelines. The latest version (v. 2.1) supports the following pipelines:
  - fq2sortedbam: Given gzipped fastq files from a sample, this workflow performs bwa-mem2-based sequence mapping and sorting to output the sorted BAM file.
  - DeepVariant-based germline pipeline for variant calling (fq2vcf): Given paired end gzipped fastq files from a sample, this workflow performs sequence mapping, sorting and variant calling (using DeepVariant) to output a vcf file.
  - AlphaFold2-based protein folding: Given one or more protein sequences, this workflow performs preprocessing (database search and multiple sequence alignment) and structure prediction (using AlphaFold2) to output the structure(s) of the protein sequences.
  - Single-cell RNA-Seq analysis: Given a cell by gene matrix, this Scanpy workflow performs data load and preprocessing (filter, linear regression and normalization), dimensionality reduction (PCA), clustering (Louvain / Leiden / kmeans) to cluster the cells into different cell types and visualize those clusters (UMAP / t-SNE).
- Toolkit (applications) layer: for users who want to use individual tools or to create their own custom pipelines by combining various tools.
- Building blocks layer: for tool developers, this layer consists of key building blocks – biology-specific and generic-AI algorithms and data structures – that can replace ones used in existing tools to accelerate them or can be used as ingredients to build new efficient tools.
Benchmarking the Open Omics Acceleration Framework on Amazon EC2 instances
Amazon EC2 instances used for benchmarking
The five types of Amazon EC2 instances used in this benchmarking study are detailed in Table 1. All of them are powered by 4th Generation Intel® Xeon® Scalable processors. We ran all the experiments on Ubuntu 22.04. For all three pipelines, the execution time covers every step performed, including reading input files from and writing output files to local disk, so it includes both compute and file I/O. The cloud cost is computed by multiplying the total execution time in hours by the per-hour instance cost.
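The cost calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the script used for the benchmarks: the function name, the `instance_count` parameter, and the example price and run time are all hypothetical.

```python
def run_cost_usd(execution_seconds, hourly_price_usd, instance_count=1):
    """Cost of one run: wall-clock hours x per-hour price x number of instances."""
    hours = execution_seconds / 3600.0
    return hours * hourly_price_usd * instance_count


# Hypothetical example: a 90-minute run on an instance billed at $4.00 per hour.
cost = run_cost_usd(execution_seconds=5400, hourly_price_usd=4.0)
print(f"${cost:.2f}")  # prints "$6.00"
```

For runs spread across several instances, the same wall-clock time is billed once per instance, which is why `instance_count` multiplies the result.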
Prerequisites
To run these pipelines, you need an AWS account with permissions to provision Amazon Simple Storage Service (S3) buckets for input and output data storage, and sufficient permissions/limits to provision Amazon EC2 c7i, m7i and r7i instances.
Steps for benchmarking
The configuration details and steps used for benchmarking baseline and Open Omics versions of all three pipelines on Amazon EC2 instances are detailed on this GitHub page of the Open Omics Acceleration Framework. The typical process involves launching the corresponding EC2 Instances, connecting to the instances, installing the software, downloading the datasets, and executing the baseline and Open Omics versions. In the following subsections, we will share an overview of the pipelines and report the benchmarking results.
AlphaFold2-based protein folding pipeline
The protein folding problem has significant implications for drug discovery, biotechnology, and understanding the mechanisms of diseases. The task entails predicting the 3D structure of a protein from its amino acid sequence. A protein’s structure governs its function. Therefore, accurate protein structure prediction is vital in biology and drug discovery, and has long been considered a holy grail problem.
Our AlphaFold2-based protein folding pipeline takes as input a set of amino acid sequences and outputs the corresponding predicted structures in PDB file format. The pipeline has two stages: 1) preprocessing, which includes database search and multiple sequence alignment (MSA) over protein sequences, and 2) model inference, which predicts the structure of the protein using Evoformer, a Transformer-based deep learning (DL) model.
For the baseline, we use OpenFold v. 1.0.1, a faithful reproduction of DeepMind's AlphaFold2 model, for model inference, and hh-suite v. 3.3.0 and hmmer v. 3.3.2 for preprocessing. The Open Omics Acceleration Framework contains faster versions of all the steps of this pipeline, accelerated on 4th Gen Intel Xeon Scalable processors using Intel AMX with bfloat16 precision for DL compute, Intel AVX2 and AVX-512 for non-DL compute, and cache optimizations. It also provides a parallel execution framework that folds multiple proteins in parallel in a load-balanced fashion.
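The post does not describe how the parallel execution framework balances load, so the following is only a sketch of one common heuristic (greedy longest-first assignment), not the framework's actual scheduler. It assumes, as a simplification, that sequence length is a usable proxy for folding cost; the function name and the example lengths are hypothetical.

```python
import heapq


def assign_longest_first(lengths, n_workers):
    """Greedy balancing: give the longest remaining sequence to the least-loaded worker."""
    # Min-heap of (current load, worker index); the least-loaded worker pops first.
    heap = [(0, worker) for worker in range(n_workers)]
    heapq.heapify(heap)
    buckets = [[] for _ in range(n_workers)]
    for length in sorted(lengths, reverse=True):
        load, worker = heapq.heappop(heap)
        buckets[worker].append(length)
        heapq.heappush(heap, (load + length, worker))
    return buckets


# Hypothetical protein lengths (amino acid residues) split across two workers.
buckets = assign_longest_first([900, 120, 450, 300, 700, 80], n_workers=2)
print([sum(bucket) for bucket in buckets])  # prints [1280, 1270]
```

Sorting by descending length first keeps one long protein from landing on an already busy worker at the end of the queue, which is the main source of imbalance when proteins vary widely in size.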
OpenFold is a PyTorch-based reproduction of AlphaFold2 that can run on CPU. We compare the OpenFold CPU baseline with Open Omics on two sets of proteins sampled from the C. elegans proteome. Since a large majority of the proteins found in nature are shorter than 1000 amino acid residues, the first set contains proteins under that length. The second set consists of proteins up to the maximum length that OpenFold can handle.
On the same m7i.24xlarge instance as the baseline, Open Omics achieves significant cost and execution time improvements. Moreover, because it can fold multiple proteins in parallel, it also scales well to the larger m7i.48xlarge instance. For the first set, Open Omics achieves a 10.1x speedup on preprocessing and a 37.1x speedup on model inference, resulting in an end-to-end speedup of 17x and a cost reduction of 8.5x. For the second set, the speedups for the individual stages are 5.2x and 33.2x, respectively, the overall speedup is 20.8x, and the cost reduction is 10.4x. For longer sequences, OpenFold did not finish even after 3 days. Open Omics, on the other hand, can comfortably handle sequences up to length 7500 on m7i.48xlarge instances.
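To see how per-stage speedups combine into an end-to-end figure, and how a speedup turns into a cost reduction, here is a small sketch. It rests on two assumptions that the post does not state: the split of baseline time between stages (the 50/50 split below is hypothetical, so the result will not match the reported 17x), and a price ratio of 2 between the instance used by Open Omics and the baseline instance, which is consistent with the reported numbers but is an inference.

```python
def end_to_end_speedup(baseline_fractions, stage_speedups):
    """Overall speedup when each stage's share of baseline time shrinks by its own factor."""
    new_time = sum(f / s for f, s in zip(baseline_fractions, stage_speedups))
    return sum(baseline_fractions) / new_time


def cost_reduction(speedup, price_ratio):
    """Cost reduction when the faster run uses an instance price_ratio times as expensive."""
    return speedup / price_ratio


# Hypothetical 50/50 split of baseline time between preprocessing and inference,
# combined with the reported first-set stage speedups.
print(round(end_to_end_speedup([0.5, 0.5], [10.1, 37.1]), 1))

# Reported end-to-end speedups with an assumed price ratio of 2.
print(cost_reduction(17.0, 2.0), cost_reduction(20.8, 2.0))
```

The slower-accelerated stage dominates the combined result, which is why the end-to-end speedup sits much closer to the preprocessing speedup than to the inference speedup.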
DeepVariant-based variant calling pipeline (fq2vcf)
Variant calling is a fundamental task in DNA sequence analysis. Given the sequencing reads from an individual’s genome, variant calling identifies the variations in the reads against a reference genome. DeepVariant, a deep learning-based germline variant caller, is a highly accurate and widely used tool across many genomic studies.
The variant calling pipeline using DeepVariant, as shown in Figure 4, consists of the following steps: 1) mapping the input reads to the reference genome, 2) sorting the mapped reads by reference coordinate, and 3) calling variants by first creating pileups of the regions that are expected to contain variants and then classifying the variant types using the Inception V3 deep learning model. For the baseline, we use BWA-MEM v. 0.7.17 for mapping, samtools v. 1.16.1 sort for sorting, and DeepVariant v. 1.5 for variant calling. The Open Omics version of the pipeline is accelerated through the use of:
- Intel Advanced Matrix Extensions (AMX) with bfloat16 precision for DL compute, Intel AVX-512 for non-DL compute,
- Cache optimizations, and
- Scaling across multiple CPU instances.
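As a toy illustration of the sorting and pileup steps described above, the sketch below orders a few mapped reads by reference coordinate and counts the bases observed at each position. It is not how samtools or DeepVariant work internally (DeepVariant builds pileup images and classifies them with a neural network); the reads, the helper names, and the `min_depth` threshold are hypothetical, and indels and clipping are ignored.

```python
from collections import Counter, defaultdict

# Hypothetical mapped reads: (chromosome, 1-based start position, read bases).
mapped_reads = [
    ("chr1", 5, "GATTACA"),
    ("chr1", 2, "ACGGATT"),
    ("chr1", 4, "GGCTTAC"),
]


def sort_by_coordinate(reads):
    """Order reads by chromosome and then by start position."""
    return sorted(reads, key=lambda read: (read[0], read[1]))


def pileup(reads):
    """Count the bases observed at every covered reference position."""
    columns = defaultdict(Counter)
    for chrom, start, bases in reads:
        for offset, base in enumerate(bases):
            columns[(chrom, start + offset)][base] += 1
    return columns


def candidate_sites(columns, min_depth=2):
    """Positions with enough coverage where the reads disagree on the base."""
    return sorted(
        pos for pos, counts in columns.items()
        if sum(counts.values()) >= min_depth and len(counts) > 1
    )


sorted_reads = sort_by_coordinate(mapped_reads)
print([start for _, start, _ in sorted_reads])  # prints [2, 4, 5]
print(candidate_sites(pileup(sorted_reads)))    # prints [('chr1', 6)]
```

Only position 6 is flagged because it is the one column where the overlapping reads carry different bases; in the real pipeline, such regions are what the deep learning model goes on to classify.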
We compare baseline with Open Omics on the standard 30x WGS paired-end short read dataset: HG001 (R1 and R2). On the same instance type (c7i.24xlarge), Open Omics achieves 2.21x speedup compared to baseline resulting in a cost of just $8.8 per sample. For use-cases that require a quick turn-around time, Open Omics scales well to 4x c7i.48xlarge instances reducing execution time to nearly 21 mins at just $12.1 per sample – 1.62 times lower cost than baseline. It also scales further to 8x c7i.48xlarge instances, reducing execution time to just 16.8 mins while still keeping the cost lower than the baseline.
Single-cell RNA-Seq analysis pipeline
Single-cell analysis involves studying various omics (genomics, transcriptomics, proteomics, metabolomics) data and cell-cell interactions at the individual cell level. Single-cell RNA-seq (scRNA-seq) is an advanced technique that measures gene expression of individual cells, which is analyzed to study the differences in gene expression profiles across cells.
A typical workflow to analyze scRNA-seq data begins with a matrix that contains the expression levels of the genes in each cell. The data preprocessing steps filter out noise and use linear regression and data normalization to correct artifacts from data collection. Dimensionality reduction is then performed, followed by clustering of the cells to group them by similarity in genetic activity, and visualization of the clusters. With over 2 million downloads, Scanpy is one of the most widely used toolkits for this analysis.
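The workflow above maps onto a handful of standard Scanpy calls. The sketch below is a generic illustration and not the benchmarked pipeline or its accelerated Open Omics counterpart: the function name, the filtering thresholds, the choice of `total_counts` as the regression covariate, and the choice of Leiden and UMAP among the available options are all assumptions. It expects an AnnData object holding raw counts, and the Leiden step needs the `leidenalg` package.

```python
def run_scrnaseq_workflow(adata):
    """Run a basic Scanpy workflow in place on an AnnData cell-by-gene matrix."""
    # Imported here so that defining the function does not require Scanpy.
    import scanpy as sc

    # Filtering: drop low-quality cells and rarely seen genes (illustrative thresholds).
    sc.pp.filter_cells(adata, min_genes=200)
    sc.pp.filter_genes(adata, min_cells=3)

    # Per-cell totals of raw counts, used below as the regression covariate.
    sc.pp.calculate_qc_metrics(adata, percent_top=None, log1p=False, inplace=True)

    # Normalization, then linear regression to remove the effect of sequencing depth.
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    sc.pp.regress_out(adata, ["total_counts"])
    sc.pp.scale(adata, max_value=10)

    # Dimensionality reduction, neighborhood graph, clustering, and 2D embedding.
    sc.tl.pca(adata, n_comps=50)
    sc.pp.neighbors(adata)
    sc.tl.leiden(adata)
    sc.tl.umap(adata)
    return adata
```

After the call, cluster labels are stored in `adata.obs["leiden"]` and the embedding in `adata.obsm["X_umap"]`, which can be plotted with `sc.pl.umap(adata, color="leiden")`.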
Figure 6 illustrates the pipeline we used for this benchmarking. For clustering and visualization, we run all of the options shown in the figure to demonstrate the benefit of the Open Omics Acceleration Framework on each of them. For the baseline, we use scanpy v. 1.9.1. The Open Omics version is accelerated through: 1) parallelization and fusing of operations in data preprocessing, and 2) use of efficient implementations from Intel® Extension for Scikit-learn, Katana Graph, and Intel Labs' Trans-Omics Acceleration Library, some of which were accelerated as part of this effort.
We compare the baseline with Open Omics on the standard dataset of 1.3 million mouse brain cells. The baseline requires the larger-memory r7i.24xlarge instance. On the same instance type, Open Omics achieves a 29.3x speedup over the baseline, resulting in a cost of nearly $0.6 per sample. Optimizations that reduce memory requirements lower this cost further: they allow us to use the less expensive, compute-optimized c7i.24xlarge instance, which brings the cost down to just $0.4 per sample.
Conclusion
To accelerate the ongoing revolution in omics, Intel has built the Open Omics Acceleration Framework. The current version of Open Omics, running on AWS instances based on 4th Generation Intel Xeon processors, provides significant performance and cost benefits for key pipelines in protein folding, variant calling, and single-cell RNA-Seq analysis.
To learn more about performance of Open Omics and comparison with other solutions, please refer to the following research blogs – July’23, Aug’22, June’22, and the GitHub repository – Open Omics Acceleration Framework.
The content and opinions in this blog are those of the third-party author and AWS is not responsible for the content or accuracy of this blog.