AWS for Industries

Running GATK workflows on AWS: a user-friendly solution

This post was co-authored by Michael DeRan, Scientific Consultant at Diamond Age Data Science; Chris Friedline, Scientific Consultant at Diamond Age Data Science; Netsanet Gebremedhin, Scientific Consultant at Diamond Age Data Science (Computational Biologist at Parexel at time of publication); Jenna Lang, Specialist Solutions Architect at AWS; and Lee Pang, Principal Bioinformatics Architect at AWS. 


The Genome Analysis Toolkit (GATK), developed by the Data Sciences Platform team at the Broad Institute, offers a wide variety of industry-standard tools for genomic variant discovery and genotyping. GATK and AWS are both widely used by the genomics community, but until now, there has not been a user-friendly method for getting GATK up and running on AWS using both GATK and AWS best practices. Diamond Age Data Science, with support from the AWS Industry Solutions Team, designed a deployable solution that facilitates running GATK best practices workflows on AWS infrastructure, leveraging existing tools so researchers can deploy compute resources and run GATK best practices workflows with ease.

To explore tradeoffs between cost and time to run, we ran this workflow on a test dataset using different Amazon EC2 instance types, storage options, and pricing structures. We chose two common use cases to inform our design: germline short variant per-sample calling and short variant joint genotyping. In this post, we cover key considerations like workflow orchestrator and architectural designs for specific performance modes.

Methods

Workflow Orchestrator

To enable robust and reproducible performance of GATK best practices workflows with a range of AWS services, we needed to use a workflow orchestrator. Cromwell on AWS is the default orchestrator for Broad-developed GATK workflows, but we chose Nextflow for this use case, because it allows for greater flexibility in leveraging AWS storage options.

Our new Nextflow pipelines, based on GATK v4 best practices for per-sample germline short variant discovery and joint genotyping, are available in our codebase. Seqera Labs, an AWS Partner, has also refactored the pipelines and published them on their team's GitHub.

Compute & Storage Architecture

We investigated the effects of both instance type and storage type on the cost and performance of the two GATK workflows. To test the storage type, we used two architectures that varied only with respect to the storage service used. In both cases, the input and reference data originate in Amazon S3. In Architecture A, data are transferred to Amazon Elastic Block Store (EBS) for temporary storage while the workflow is running. In Architecture B, data are presented using FSx for Lustre, with Amazon S3 serving as the data repository. To test the effect of instance type, we used the two architectures above with three different instance type families (c5, r5, and m5), chosen to represent compute-optimized, memory-optimized, and general-purpose Amazon EC2 instances, respectively. We also ran each workflow using both On-Demand and Spot pricing. In all scenarios, AWS Batch was used to schedule the tasks defined by the Nextflow scripts, and the workflow steps were executed using Docker containers hosted in Amazon Elastic Container Registry (ECR). The infrastructure deployed using Architecture A leveraged the AWS Cloud Development Kit (CDK), while Architecture B was deployed from within Nextflow Tower. All resources used for this work were deployed in the us-east-2 Region.

Architecture A: Workflow infrastructure using EBS storage, deployed with CDK.

In this architecture, each worker instance has attached EBS storage where the Nextflow work directory resides. At the beginning of each job, Nextflow stages input files from S3 to the EBS volume. At the end of each job, output files are copied from EBS back to S3. The Nextflow head node always runs in an On-Demand compute environment, to avoid interruptions due to Spot instance reclamation. Because Nextflow manages workflow restart in the case of interruption, we have the choice to run worker nodes in either an On-Demand or a Spot compute environment.


Architecture B: Workflow infrastructure using FSx for Lustre storage, deployed with Nextflow Tower

In this environment, worker instances mount an FSx for Lustre file system where the Nextflow work directory is located. This file system was configured for scratch storage with throughput of 200 MB/s/TiB. Access to the shared file system obviates the need to stage common input files multiple times or to move files to S3 at the end of each job. This is a "best of both worlds" scenario: a fast shared file system backed by the increased durability and redundancy of S3.
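Because FSx for Lustre throughput scales linearly with provisioned capacity, the aggregate bandwidth available to the workflow can be estimated directly from the file system size. A back-of-the-envelope sketch (the function name is ours, and the 200 MB/s/TiB rate is the scratch-tier figure quoted above):

```python
def fsx_scratch_throughput_mbps(capacity_tib, per_tib_rate=200):
    """Estimate aggregate baseline throughput (MB/s) for an FSx for
    Lustre scratch file system at `per_tib_rate` MB/s per TiB."""
    return capacity_tib * per_tib_rate

# e.g. a 2.4 TiB scratch file system
print(fsx_scratch_throughput_mbps(2.4))  # 480.0
```

Larger file systems therefore give every worker instance more headroom when many jobs read the same reference data concurrently.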


Using both of these architectures, we tested the performance of the c5, m5, and r5 instance families. In all cases, 1024 cores were made available to the compute environment. For all instance families, we observed job failures when many jobs were launched simultaneously on a single instance, a problem that carried a significant run-time cost. To prevent this issue, we limited the size of the instances used in the compute environment, thus limiting the number of jobs run on any single instance. We did not observe any job failures after this adjustment. The instance types used were:

  • c5: c5.large, c5.xlarge, c5.2xlarge, c5.4xlarge, c5.9xlarge
  • m5: m5.large, m5.xlarge, m5.2xlarge, m5.4xlarge, m5.8xlarge
  • r5: r5.large, r5.xlarge, r5.2xlarge, r5.4xlarge, r5.8xlarge
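Capping instance sizes like this can be expressed in the computeResources section of an AWS Batch compute environment definition. The sketch below is a minimal, hypothetical example in boto3-style form (the subnet and role values are placeholders, not values from this deployment):

```python
# Hypothetical computeResources fragment for an AWS Batch compute
# environment that lists only small-to-medium c5 sizes, so no single
# instance can host enough simultaneous jobs to trigger failures.
compute_resources = {
    "type": "EC2",
    "minvCpus": 0,
    "maxvCpus": 1024,  # total cores made available to the environment
    "instanceTypes": [
        "c5.large", "c5.xlarge", "c5.2xlarge", "c5.4xlarge", "c5.9xlarge",
    ],
    "subnets": ["subnet-EXAMPLE"],        # placeholder
    "instanceRole": "ecsInstanceRole",    # placeholder
}
```

With the largest sizes omitted from instanceTypes, Batch simply never launches them, which is how the job-packing limit described above is enforced.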

Data Provenance

We chose 50 low-coverage whole-genome sequencing runs from the 1000 Genomes Project for our inputs. These samples are available in FASTQ format (the format used by most researchers for WGS data) in the Registry of Open Data on AWS.

We ran each of these pipelines at least three times for each test. Nextflow includes support for pipeline monitoring with Nextflow Tower, which allowed us to easily track compute costs and run time for each run. Run times reported here are wall time: the actual amount of time that passed from the initial request for resources until the run completed. Storage costs incurred during each run were estimated based on the maximum storage capacity for each job. Costs for storage were calculated using AWS public pricing for the us-east-2 Region, at base prices of $0.000138 per GB-hour for EBS and $0.0001944 per GB-hour for FSx for Lustre. All data collected and the scripts for calculating cost can be found in the project repository.
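The storage estimate above reduces to maximum capacity held, multiplied by wall time and the per-GB-hour rate. A small sketch of that calculation (the function and dictionary names are ours; the rates are the us-east-2 prices quoted above):

```python
# Approximate per-run storage cost: maximum storage capacity (GB)
# held for the run's wall time (hours), at us-east-2 public pricing.
STORAGE_RATES = {          # $ per GB-hour
    "ebs": 0.000138,
    "fsx_lustre": 0.0001944,
}

def storage_cost(max_capacity_gb, wall_hours, storage_type):
    """Estimate storage cost in dollars for a single workflow run."""
    return max_capacity_gb * wall_hours * STORAGE_RATES[storage_type]

# e.g. 500 GB held for 10 hours on EBS
print(round(storage_cost(500, 10, "ebs"), 3))  # 0.69
```

At these rates FSx for Lustre is roughly 40% more expensive per GB-hour than EBS, which is why the run-time savings discussed below matter so much for the total bill.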

Results

Per-Sample Short Variant Discovery

On-Demand Instances


Figures 1A and 1B: Per-sample cost and total workflow run time for the Per-Sample Short Variant Discovery workflow with On-Demand instances

Spot Instances


Figures 1C and 1D: Per-sample cost and total workflow run time for the Per-Sample Short Variant Discovery workflow with Spot Instances

Figure 1: Per-sample cost and total run time for the Per-Sample Short Variant Discovery workflow.

Figures 1A and 1B: boxplots of the relative cost per sample and relative run time, respectively, for each run of the pipeline in the indicated compute environments using On-Demand instances for task execution. Instance families (c5, m5, r5) are shown as facets and storage type categories are plotted on the x-axis. Figures 1C and 1D: the relative cost per sample and relative run time, respectively, for environments with Spot instances, plotted as in A and B. Cost and run time were calculated relative to the mean cost and run time for On-Demand c5 instances with EBS storage.

In all cases, environments using FSx for Lustre were faster than those using only EBS storage (Figures 1B and 1D). This faster run time also translated into lower prices despite the higher base price for FSx for Lustre storage (Figure 1A); more than 95% of the costs were from compute.

For per-sample short variant discovery, the most cost-effective option was to use Spot purchasing of instances: the combination of FSx for Lustre and c5 instances cost 71% less than the On-Demand purchased c5 instances with EBS storage. (Figures 1C, 1D).

The On-Demand purchased instances were faster than Spot-purchased instances (Figures 1B and 1D). Again, it was FSx for Lustre in combination with c5 instances that took the top spot.

Purchasing Strategy | Instance Family | Storage | Relative Cost/Sample | Relative Run Time | Relative CPU Time
--- | --- | --- | --- | --- | ---
On-Demand | c5 | EBS | 1 | 1 | 1
On-Demand | c5 | FSx | 0.95 | 0.93 | 0.95
On-Demand | m5 | EBS | 1.6 | 1.22 | 1.42
On-Demand | m5 | FSx | 1.18 | 1.03 | 1.04
On-Demand | r5 | EBS | 2.11 | 1.2 | 1.44
On-Demand | r5 | FSx | 1.54 | 0.99 | 1.04
Spot | c5 | EBS | 0.29 | 4.11 | 1.119
Spot | c5 | FSx | 0.29 | 2.05 | 0.98
Spot | m5 | EBS | 0.34 | 6.52 | 1.18
Spot | m5 | FSx | 0.3 | 1.76 | 1.03
Spot | r5 | EBS | 0.36 | 4.42 | 1.25
Spot | r5 | FSx | 0.36 | 2.67 | 1.07

Note that the total workflow run time on Spot can be 4-6x more than On-Demand in the configurations tested. One shouldn't draw too many conclusions from this result. In this study, we are isolating the effect of instance type on performance. This artificially limits the available Spot pool size and has the knock-on effect of additional wait times for a Spot Instance to become available. The longer workflow run times are due to jobs waiting in queue, and you're not paying for that time. Instead, note that the actual CPU time is about the same as On-Demand, while the cost is roughly a third. The easiest way to optimize this is to diversify the pool of Spot Instances that Batch can launch. In a real-world scenario, you would configure your compute environment with a mixture of C, M, and R instance types and use the SPOT_CAPACITY_OPTIMIZED allocation strategy. Doing so would give you both run times similar to using On-Demand (i.e., much less waiting for an instance) and cheaper instance costs.
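A diversified Spot compute environment of the kind just described might be sketched as follows, again in boto3-style form (subnet value is a placeholder; whole instance families are listed rather than single sizes):

```python
# Hypothetical computeResources fragment for a diversified Spot
# compute environment: mixing C, M, and R families and letting the
# SPOT_CAPACITY_OPTIMIZED allocation strategy draw from the deepest
# Spot pools, which reduces time spent waiting for capacity.
spot_compute_resources = {
    "type": "SPOT",
    "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
    "minvCpus": 0,
    "maxvCpus": 1024,
    "instanceTypes": ["c5", "m5", "r5"],  # whole families, not single sizes
    "subnets": ["subnet-EXAMPLE"],        # placeholder
}
```

Listing families instead of specific sizes lets Batch choose whichever pool currently has the most spare capacity, trading strict instance-type control for far fewer interruptions and queue delays.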

Joint Genotyping

On-Demand Instances


Figures 2A and 2B: Per-sample cost and workflow run time for the Joint Genotyping workflow with On-Demand instances.

Spot Instances


Figures 2C and 2D: Per-sample cost and workflow run time for the Joint Genotyping workflow with Spot Instances

Figures 2A and 2B: boxplots of the relative cost per sample and relative run time, respectively, for each run of the pipeline in the indicated compute environments using On-Demand instances for task execution. Instance families (c5, m5, r5) are shown as facets and storage type categories are plotted on the x-axis. Figures 2C and 2D: the relative cost per sample and relative run time, respectively, for environments with Spot instances, plotted as in A and B. Cost and run time were calculated relative to the mean cost and run time for On-Demand c5 instances with EBS storage.

For this workflow, FSx for Lustre was again faster than the EBS environments (Figures 2B and 2D). The fastest condition was On-Demand c5 instances paired with FSx for Lustre. The least expensive On-Demand option was r5 instances with FSx for Lustre.

For Spot Instances, the combination of r5 instances with EBS storage was the least expensive. Perhaps the biggest surprise in these results is that purchasing Spot instances was, in most cases, only marginally slower than using On-Demand Instances. This represents a tremendous cost savings with very little compromise on speed.

Purchasing Strategy | Instance Family | Storage | Relative Cost/Sample | Relative Run Time | Relative CPU Time
--- | --- | --- | --- | --- | ---
On-Demand | c5 | EBS | 1 | 1 | 1
On-Demand | c5 | FSx | 0.94 | 0.85 | 0.9
On-Demand | m5 | EBS | 0.68 | 1.07 | 1.1
On-Demand | m5 | FSx | 0.7 | 0.97 | 1.05
On-Demand | r5 | EBS | 0.71 | 1.17 | 1.17
On-Demand | r5 | FSx | 0.67 | 0.97 | 1.02
Spot | c5 | EBS | 0.2 | 1.03 | 0.99
Spot | c5 | FSx | 0.38 | 0.99 | 0.96
Spot | m5 | EBS | 0.14 | 1.44 | 1.09
Spot | m5 | FSx | 0.32 | 1.08 | 1.05
Spot | r5 | EBS | 0.32 | 1.08 | 1.05
Spot | r5 | FSx | 0.28 | 1.01 | 1.04

Conclusions

For per-sample short variant discovery, the best configuration for both speed and cost used c5 instances for compute and FSx for Lustre for storage. The least expensive runs used these options with Spot Instances, and the fastest runs used On-Demand Instances.

For joint genotyping, c5 compute instances were somewhat higher in cost than m5 or r5. FSx for Lustre provided a small edge in speed only when using On-Demand Instances, at similar costs to EBS. However, when using Spot Instances, EBS provided the lowest-cost option with no penalty to speed.

Lastly, in a real-world scenario, it is recommended to diversify the instance types available in your Batch compute environments. This lets Batch do what it was designed for: automatically scheduling tasks efficiently based on individual task computing requirements, like CPU and memory.

Stay tuned for more improvements, including an attempt to make the system even faster.

In the meantime, we hope this recipe will be helpful to a wide variety of users who need a simple way to run and manage variant calling on AWS. If you are one of those users, we would love to hear from you. To get started, see the code for the AWS GATK Stack and the AWS Nextflow Stack.

Michael DeRan


Michael brings two perspectives to Diamond Age: the skills of a computational biologist and the expertise of a bench biochemist. His work ranges from transcriptional analysis (including both bulk and single-cell RNA-seq data) to selection of promising candidates from high-throughput small-molecule screens. Prior to joining Diamond Age, he worked on early drug discovery efforts in type 2 diabetes at the Broad Institute of MIT and Harvard, and on chemical biology approaches to understanding the roles of signaling pathways in cancer at Massachusetts General Hospital. Michael holds a Ph.D. in biochemistry from the University of Rochester School of Medicine and Dentistry, and a B.S. from Saint Bonaventure University, also in biochemistry.

Chris Friedline


Chris offers Diamond Age clients a unique combination of expertise in IT and bioinformatics. Previously, he was Director of Bioinformatics and IT at Granger Genetics, which used metagenomics and other molecular approaches to inform infection management and evaluate the effectiveness of cancer treatments. He also held leadership positions in IT and systems engineering for a range of organizations including the Virginia Commonwealth University (VCU) Health System. Chris has an M.S. in bioinformatics and a Ph.D. in integrative life sciences from VCU, where he also completed an NSF-funded postdoc in ecological genomics.

Jenna Lang


Jenna is a Specialist Solutions Architect, specializing in Bioinformatics and AI/ML for Life Sciences. She has a PhD in Microbiology and more than 20 years of experience in bioinformatics, initially working on the Human Genome Project, and focusing on Microbiome analytics for most of her career. She has built multi-omics pipelines on AWS for multiple companies in Agriculture and Precision Health. At AWS, she helps Life Sciences researchers and companies accelerate discovery by removing bottlenecks in large-scale data analysis.

Netsanet Gebremedhin


Net is a "T-shaped" scientist, with both wide experience and deep expertise in building complex computational pipelines for large genomic datasets. Most recently, he worked for Agenus Inc. on the development of individualized cancer vaccines, overseeing identification of neoantigens and analysis of whole exome and bulk RNA sequencing data. He has held computational roles in CLIA-accredited diagnostics labs at Partners HealthCare Personalized Medicine and Courtagen Life Sciences. His sequencing analysis experience was instrumental at New England BioLabs, where he helped develop solutions for optimizing sample preparation workflows. He was also a bench scientist at Progenika Biopharma and Wyeth Research, where he worked on a wide range of tasks such as Sanger Sequencing-based assay development for blood group genotyping, oligo synthesis, and primer design. Net has an M.S. in Bioinformatics from Brandeis University and a B.S. in Biology and Chemistry from Addis Ababa University. As of this publication, Net is a computational biologist at Parexel.

Lee Pang


Lee is a Principal Bioinformatics Architect with the Health AI services team at AWS. He has a PhD in Bioengineering and over a decade of hands-on experience as a practicing research scientist and software engineer in bioinformatics, computational systems biology, and data science, developing tools ranging from high-throughput pipelines for *omics data processing to compliant software for clinical data capture and analysis.