AWS Partner Network (APN) Blog
Running Bioinformatics Pipelines Cost Effectively Using MemVerge on AWS
By Jing Xie and Charlie Yu – MemVerge
By Gokhul Srinivasan and Sujaya Srinivasan – AWS
The MemVerge Memory Machine Cloud (MMCloud) solution on Amazon Web Services (AWS) makes it easy for researchers to migrate their bioinformatics pipelines from on-premises high-performance computing (HPC) to AWS. It also enables executing long-running bioinformatics pipelines cost-effectively on Amazon EC2 Spot instances through automatic checkpoint and restore.
Over the last decade, as the cost of sequencing has dropped, researchers have been generating larger volumes of raw sequencing data. This raw data is processed through complex multi-step pipelines that require significant compute resources before the data is usable in any research or analysis.
Traditionally, most researchers in academic institutions have relied on HPC clusters managed by a central resource within their institutions. However, the increasing compute demands and aging infrastructure make it challenging to get the compute resources they need in a timely manner.
The cloud is an attractive way to get the elastic compute that research groups need, but these groups are often blocked by a lack of the IT support and skills needed to make their pipelines cloud-ready and run them cost-effectively.
In this post, we will explore how MemVerge Memory Machine Cloud is built to run computational workflows and interactive computing applications on AWS. Designed for use by genomic researchers and bioinformaticians, MMCloud enables you to run your Nextflow pipelines, next-generation sequencing (NGS), and other genomic analyses safely and reliably on EC2 Spot instances.
MemVerge is an AWS Specialization Partner and AWS Marketplace Seller with the EC2 Spot service ready designation.
MemVerge Memory Machine Cloud
Some of the key features of the MMCloud solution are:
- Cost optimization through SpotSurfer: SpotSurfer is a feature that unlocks the full potential of EC2 Spot instances. With checkpointing enabled, workflows resume from the last saved state even if an instance is interrupted. This means you can get the cost benefits of EC2 Spot instances without losing work when an interruption occurs, combining reliability with affordability.
- Observability through WaveWatcher: Providing real-time resource utilization and observability for each EC2 instance, WaveWatcher ensures you have all the necessary insights at your fingertips. With its detailed and actionable metrics, you can monitor computational workloads and make informed decisions to optimize performance.
- Dynamic right-sizing through WaveRider: WaveRider introduces real-time EC2 right-sizing, allowing dynamic migration to larger or smaller instances based on the resource needs of each job. This helps each job run on a well-matched EC2 instance, balancing performance and cost and avoiding under- or over-provisioning. Users can set thresholds in their configuration, which WaveRider uses to automatically resize and move the job to a larger or smaller instance.
- Nextflow integration: For Nextflow users, the nf-float plugin is a seamless integration that enhances workflow execution on AWS. This plugin allows Nextflow to utilize MMCloud as a computing environment. The integration simplifies the complexities of cloud infrastructure management, making it easier for bioinformaticians and data scientists to run on EC2 Spot instances without worrying about spot interruptions.
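The threshold-driven behavior described for WaveRider can be pictured with a small sketch. The Python below is purely illustrative and is not MMCloud's API or implementation: the function name `resize_decision`, the threshold values, and the utilization samples are all assumptions introduced here to show the general idea of comparing observed utilization against user-set limits.

```python
from statistics import mean


def resize_decision(samples, upper=0.90, lower=0.30):
    """Return 'scale_up', 'scale_down', or 'stay' for a window of samples.

    samples: utilization fractions (0.0 to 1.0) observed over a recent window.
    upper, lower: hypothetical thresholds a user might configure.
    """
    if not samples:
        return "stay"  # no data yet, so make no change
    avg = mean(samples)
    if avg > upper:
        return "scale_up"    # sustained pressure: move to a larger instance
    if avg < lower:
        return "scale_down"  # sustained idle capacity: move to a smaller one
    return "stay"


# Made-up utilization windows, for illustration only.
print(resize_decision([0.95, 0.97, 0.93]))  # scale_up
print(resize_decision([0.10, 0.15, 0.12]))  # scale_down
print(resize_decision([0.50, 0.60, 0.55]))  # stay
```

Averaging over a window rather than reacting to a single sample is one simple way to avoid resizing on short spikes; a real policy may well use different signals.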
Use Cases
Dynamic Resizing and Optimization of Resources
Most bioinformatics pipelines have multiple steps, with each step having different resource requirements. For example, in Sentieon’s implementation of the GATK best practices workflow, there are three phases:
- Data localization (from Amazon S3 to local storage).
- Sequence alignment (FASTQ to BAM).
- De-duplication, BQSR, and variant calling (BAM to VCF).
To establish a baseline, the whole genome sequencing (WGS) pipeline is first run on a single r5.8xlarge instance to get CPU and memory benchmarks, as shown in Figure 1.
Figure 1 – CPU and memory usage for a WGS pipeline.
In the following table, you can see CPU and memory metrics for the Sentieon WGS pipeline phases on r5.8xlarge instances.
Metric | Phase 1 | Phase 2 | Phase 3 | Complete Run
EC2 instance | r5.8xlarge | r5.8xlarge | r5.8xlarge |
vCPUs | 32 | 32 | 32 |
Memory (GB) | 256 | 256 | 256 |
Wall clock time (minutes) | 15 | 143 | 49 | 207
Cost ($ USD)* | 0.51 | 4.87 | 1.67 | 7.06
* Costs are based on us-east-1 on-demand EC2 instance prices as of the publishing date.
Based on the real-time profiling above, we can resize the instance for each phase to optimize cost and runtime. The next table shows the instance configuration, runtime, and cost for the Sentieon WGS pipeline after right-sizing each phase based on the profiling metrics from the table above.
Metric | Phase 1 | Phase 2 | Phase 3 | Complete Run
EC2 instance | r5.xlarge | c5.18xlarge | c5.9xlarge |
vCPUs | 4 | 72 | 36 |
Memory (GB) | 32 | 144 | 72 |
Wall clock time (minutes) | 15 | 64 | 44 | 123
Runtime difference from baseline (%) | 0 | -55.2 | -10.2 | -40.5
Cost ($ USD)* | 0.09 | 3.41 | 1.19 | 4.69
Cost difference from baseline (%) | -82.4 | -30.0 | -28.7 | -33.6
* Costs are based on us-east-1 on-demand EC2 instance prices as of the publishing date. Note that costs could be lower by leveraging Spot instances.
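The percentage rows in the right-sized table follow directly from the wall clock times and costs in the two tables. As a rough check, the short Python sketch below recomputes the per-phase differences from the published (rounded) values; the variable and function names are introduced here for illustration only.

```python
# Values copied from the two tables above: (wall clock minutes, cost in USD).
baseline = {"Phase 1": (15, 0.51), "Phase 2": (143, 4.87), "Phase 3": (49, 1.67)}
resized = {"Phase 1": (15, 0.09), "Phase 2": (64, 3.41), "Phase 3": (44, 1.19)}


def pct_change(new, old):
    """Percentage change from old to new; negative means a reduction."""
    return (new - old) / old * 100


for phase in baseline:
    (t0, c0), (t1, c1) = baseline[phase], resized[phase]
    print(f"{phase}: runtime {pct_change(t1, t0):+.1f}%, cost {pct_change(c1, c0):+.1f}%")

# Totals recomputed from the per-phase wall clock times.
total_t0 = sum(t for t, _ in baseline.values())
total_t1 = sum(t for t, _ in resized.values())
print(f"Total wall clock: {total_t0} -> {total_t1} minutes")
```

Because the table entries are rounded, totals recomputed from them can differ from the published totals in the last digit (for example, the baseline per-phase costs sum to 7.05 rather than 7.06).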
Hybrid HPC Solution
For hybrid cloud organizations that have an on-premises HPC footprint but whose researchers want to burst scientific computing workloads into AWS, MMCloud offers a simple and highly cost-effective solution. It enables you to run multiple jobs at once on AWS within your own virtual private cloud (VPC) using the command line interface (CLI), the MMCloud graphical user interface (GUI), or HPC scheduler interfaces like Slurm, PBS/QSUB, and LSF/OpenLava.
Resource provisioning and deprovisioning are automated, with quotas that help IT admins manage costs. MMCloud provides cost-effective execution via EC2 Spot instance checkpoint and recovery, combined with real-time resource management to scale up jobs that need more CPU or memory and scale down jobs that do not.
Figure 2 below shows a hybrid HPC cloud architecture, where researchers can submit jobs either to their on-premises HPC cluster or to an MMCloud-managed Spot instances queue in their own VPC.
Figure 2 – Hybrid HPC cloud architecture.
Running Long-Running Nextflow Pipelines on EC2 Spot Instances
MMCloud’s ability to checkpoint and recover from Spot instance interruptions is a critical component enabling Nextflow pipelines to run more cost efficiently on AWS. You can easily integrate MMCloud as a computing environment for Nextflow on AWS by using the nf-float plugin.
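To make the checkpoint-and-recover idea concrete, here is a minimal application-level analogy in Python. It is not how MMCloud works internally (the post describes its checkpointing as automatic, with no changes to the job); the function `run_with_checkpoints`, the JSON checkpoint file, and the simulated interruption are all assumptions made for this sketch. It shows only the general pattern: save state as you go, and on restart continue from the last saved state instead of from the beginning.

```python
import json
import os
import tempfile


def run_with_checkpoints(total_steps, checkpoint_path, interrupt_at=None):
    """Sum the integers 0..total_steps-1, saving progress after every step.

    interrupt_at simulates a Spot interruption by stopping before that step.
    Returns the final sum, or None if the run was interrupted.
    """
    # Resume from the last saved state if a checkpoint exists.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    else:
        state = {"next_step": 0, "total": 0}

    for step in range(state["next_step"], total_steps):
        if interrupt_at is not None and step == interrupt_at:
            return None  # simulated interruption; the checkpoint stays on disk
        state["total"] += step
        state["next_step"] = step + 1
        # Write to a temp file, then rename, so a checkpoint is never half-written.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
    return state["total"]


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "job.ckpt.json")
    first = run_with_checkpoints(10, ckpt, interrupt_at=6)  # interrupted run
    second = run_with_checkpoints(10, ckpt)                 # resumes at step 6
    print(first, second)  # None 45
```

The second call repeats none of the first six steps, which is the property that makes long-running jobs practical on interruptible capacity.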
Using MMCloud, Nextflow users can easily deploy JuiceFS as a high-performance, cloud-native, POSIX-compatible distributed file system on Amazon Simple Storage Service (Amazon S3) to serve as a shared working directory.
For details on how to set up MMCloud with Nextflow, see the documentation.
Figure 3 shows the architecture of MMCloud running Nextflow pipelines through the nf-float plugin using JuiceFS.
Figure 3 – MMCloud architecture running Nextflow pipelines.
Customer Stories
Statistical Functional Genomics Lab at Columbia University
At Columbia University, Dr. Gao Wang (Assistant Professor of Neurological Sciences) heads the Lab of Statistical Functional Genomics, which focuses on understanding the genetic regulation of molecular mechanisms behind complex biological traits.
A key initiative led by Dr. Wang is the FunGen-xQTL Project, a collaborative effort that involves over a dozen research institutes across the United States and focuses on studying molecular quantitative trait loci in aging brains. Understanding genetic regulation plays a crucial role in providing the Alzheimer’s disease scientific communities with valuable functional genomics data from aging cohorts, curated and processed through comprehensive multi-omics analysis.
This project had two key requirements that were challenging with on-premises infrastructure:
- Computationally intensive tasks at genome scale. The researchers needed to apply a Bayesian model to tens of thousands of genetic variables, under hundreds of cellular, tissue, ancestry, and disease combinations, evaluated at each of the ~30,000 genes across the human genome.
- Large-scale sharing of results. The analysis produces a large number of output files that need to be shared with multiple institutional collaborators without duplicating or copying data.
MMCloud on AWS enabled Dr. Wang’s lab to submit hundreds of thousands of jobs to AWS and run them cost effectively on EC2 Spot instances. This reduced the processing time from several weeks to a few days, at 50-80% lower cost than using On-Demand instances. MMCloud also simplified provisioning and management of the Jupyter and RStudio apps used by multiple institutional collaborators and made it more cost-efficient.
MDI Biological Laboratory
At MDI Biological Laboratory, Joel H. Graber, Ph.D., a senior staff scientist and director of the comparative genomics and data science core, leads a team focused on collaboration, analysis, and education in the computational analysis of genome-scale data. Dr. Graber leads the development of Axobase, an online resource that’s being built to provide data and tools in support of an international group of researchers across more than 40 labs who study axolotls and other salamanders.
Axolotls are interesting to study because of their amazing tissue regeneration abilities. Because of the axolotl genome's very large size (roughly 10 times as large as the human genome) and structure, analyzing axolotl sequence data can be very computationally intensive, with individual analysis runs lasting up to several days using very large computing resources.
To standardize and automate the pipelines used for genomic analysis, Dr. Graber’s team began writing their workflows using Nextflow. However, they ran into challenges with right-sizing instances and managing Spot instance terminations, resulting in high costs. By leveraging MMCloud on AWS, Dr. Graber’s team deployed a solution to both right-size and better utilize EC2 Spot instances when running Nextflow pipelines, saving 50-80% compared with On-Demand instances and reducing CPU hours per pipeline by up to 60%.
Conclusion
The MemVerge Memory Machine Cloud (MMCloud) solution on AWS can be a cost-effective way of running bioinformatics pipelines on Amazon EC2 Spot instances. It gives researchers an easy way to lift-and-shift their on-premises workloads to AWS.
By dynamically resizing EC2 instances based on actual CPU and memory usage, MMCloud makes it easy to right-size compute. It also simplifies executing long-running Nextflow pipelines on Spot instances, keeping costs under control.
MMCloud installs as a single Amazon Machine Image (AMI) with an AWS CloudFormation template in your VPC, and AWS users can easily find the solution on AWS Marketplace. Installation can take as little as 15-20 minutes, and a 30-day free trial is offered to all new customers.
For more information, see the MemVerge MMCloud offering on AWS Marketplace or contact your AWS team.
MemVerge – AWS Partner Spotlight
MemVerge is an AWS Specialization Partner whose cloud automation platform (MMCloud) is designed for bioinformaticians and data scientists to easily run computational workflows on AWS.