Running GATK workflows on AWS: a user-friendly solution
This post was co-authored by Michael DeRan, Scientific Consultant at Diamond Age Data Science; Chris Friedline, Scientific Consultant at Diamond Age Data Science; Netsanet Gebremedhin, Scientific Consultant at Diamond Age Data Science (Computational Biologist at Parexel at time of publication); Jenna Lang, Specialist Solutions Architect at AWS; and Lee Pang, Principal Bioinformatics Architect at AWS.
The Genome Analysis Toolkit (GATK), developed by the Data Sciences Platform team at the Broad Institute, offers a wide variety of industry-standard tools for genomic variant discovery and genotyping. GATK and AWS are both widely used by the genomics community, but until now there has not been a user-friendly method for getting GATK up and running on AWS that follows both GATK and AWS best practices. Diamond Age Data Science, with support from the AWS Industry Solutions Team, designed a deployable solution that facilitates running GATK best practices workflows on AWS infrastructure, leveraging existing tools so researchers can deploy compute resources and run these workflows with ease.
To explore tradeoffs between cost and time to run, we ran this workflow on a test dataset using different Amazon EC2 instance types, storage options, and pricing structures. We chose two common use cases to inform our design: germline short variant per-sample calling and short variant joint genotyping. In this post, we cover key considerations like workflow orchestrator and architectural designs for specific performance modes.
To enable robust and reproducible performance of GATK best practices workflows with a range of AWS services, we needed to use a workflow orchestrator. Cromwell on AWS is the default orchestrator for Broad-developed GATK workflows, but we chose Nextflow for this use case, because it allows for greater flexibility in leveraging AWS storage options.
Our new Nextflow pipelines, based on GATK v4 best practices for per-sample germline short variant discovery and joint genotyping, are available in our codebase. Seqera Labs, an AWS Partner, has also refactored and published them on their team's GitHub.
Compute & Storage Architecture
We investigated the effects of both instance type and storage type on the cost and performance of the two GATK workflows. To test storage type, we used two architectures that varied only in the storage service used. In both cases, the input and reference data originate in Amazon S3. In Architecture A, data are transferred to Amazon Elastic Block Store (Amazon EBS) for temporary storage while the workflow is running. In Architecture B, data are presented using Amazon FSx for Lustre, with Amazon S3 serving as the data repository. To test the effect of instance type, we used the two architectures above with three different instance type families (c5, r5, and m5), chosen to represent compute-optimized, memory-optimized, and general-purpose Amazon EC2 instances, respectively. We also ran each workflow using both On-Demand and Spot pricing. In all scenarios, AWS Batch was used to schedule the tasks defined by the Nextflow scripts, and the workflow steps were executed using Docker containers hosted in Amazon Elastic Container Registry (Amazon ECR). The infrastructure for Architecture A was deployed using the AWS Cloud Development Kit (AWS CDK), while Architecture B was deployed from within Nextflow Tower. All resources used for this work were deployed in the us-east-2 Region.
Architecture A: Workflow infrastructure using EBS storage, deployed with CDK.
In this architecture, each worker instance has attached EBS storage where the Nextflow work directory resides. At the beginning of each job, Nextflow stages input files from S3 to the EBS volume; at the end of each job, output files are copied from EBS back to S3. The Nextflow head node always runs in an On-Demand compute environment to avoid interruptions due to Spot Instance reclamation. Because Nextflow manages workflow restarts after an interruption, worker nodes can run in either an On-Demand or a Spot compute environment.
Architecture B: Workflow infrastructure using FSx for Lustre storage, deployed with Nextflow Tower.
In this environment, worker instances mount an FSx for Lustre file system where the Nextflow work directory is located. The file system was configured for scratch storage with throughput of 200 MB/s per TiB of provisioned capacity. Access to a shared file system obviates the need to stage common input files multiple times or to move files to S3 at the end of each job. This is a "best of both worlds" scenario: a fast shared file system backed by the durability and redundancy of S3.
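To make the "200 MB/s/TiB" figure concrete, aggregate throughput for FSx for Lustre scratch storage scales with provisioned capacity. The sketch below is illustrative arithmetic only; the example capacity is hypothetical, and actual throughput also depends on workload and file layout.

```python
# Illustrative only: FSx for Lustre scratch throughput scales linearly
# with provisioned capacity at a fixed per-TiB rate.
PER_TIB_THROUGHPUT_MBPS = 200  # MB/s per TiB, the tier used in this study

def aggregate_throughput_mbps(capacity_tib: float) -> float:
    """Approximate aggregate throughput for a file system of the given size."""
    return capacity_tib * PER_TIB_THROUGHPUT_MBPS

# A hypothetical 2.4 TiB scratch file system would provide roughly:
print(aggregate_throughput_mbps(2.4))  # 480.0 (MB/s)
```

In other words, sizing the file system up buys more aggregate bandwidth for concurrently running jobs, which is part of why the shared-filesystem architecture scales well as task counts grow.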
Using both of these architectures, we tested the performance of the c5, m5, and r5 instance families. In all cases, 1024 cores were made available to the compute environment. For all instance families, we observed job failures when many jobs were launched simultaneously on a single instance, a problem that carried a significant run-time cost. To prevent this issue, we limited the size of the instances used in the compute environment, thus limiting the number of jobs run on any single instance. We did not observe any job failures after this adjustment. The instance types used were:
- c5: c5.large, c5.xlarge, c5.2xlarge, c5.4xlarge, c5.9xlarge
- m5: m5.large, m5.xlarge, m5.2xlarge, m5.4xlarge, m5.8xlarge
- r5: r5.large, r5.xlarge, r5.2xlarge, r5.4xlarge, r5.8xlarge
We chose 50 low coverage whole genome sequencing runs from the 1000 Genomes Project for our inputs. These samples are available in FASTQ format (used by most researchers for WGS data) in the Registry of Open Data on AWS.
We ran each of these pipelines at least three times for each test. Nextflow includes support for pipeline monitoring with Nextflow Tower, which allowed us to easily track compute costs and run time for each run. Run times reported here are wall time: the actual amount of time that passed from the initial request for resources until the run completed. Storage costs incurred during each run were estimated from the maximum storage capacity used by each job, using AWS public pricing for the us-east-2 Region at a base price of $0.000138/GB-hour for EBS and $0.0001944/GB-hour for FSx for Lustre. All data collected and the scripts for calculating cost can be found in the project repository.
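The storage-cost estimate described above can be sketched as follows. This is not the project's actual cost script; the function and the example capacity and duration are hypothetical, but the rates are the us-east-2 prices quoted above.

```python
# Illustrative sketch of the storage-cost estimate described above:
# cost = peak storage capacity used during a run x wall time x hourly GB rate.
# Rates are AWS public pricing for us-east-2 at the time of the study.
EBS_RATE = 0.000138    # $ per GB-hour
FSX_RATE = 0.0001944   # $ per GB-hour

def storage_cost(max_capacity_gb: float, wall_hours: float, rate: float) -> float:
    """Estimate storage cost from peak capacity held over the run's duration."""
    return max_capacity_gb * wall_hours * rate

# Hypothetical example: 500 GB peak capacity held for a 10-hour run.
print(round(storage_cost(500, 10, EBS_RATE), 2))  # 0.69 (EBS, $)
print(round(storage_cost(500, 10, FSX_RATE), 2))  # 0.97 (FSx for Lustre, $)
```

The per-GB-hour price difference is small in absolute terms, which is why a faster run on FSx for Lustre can end up cheaper overall: compute dominates the bill.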
Per-Sample Short Variant Discovery
Figure 1: Per-sample cost and total run time for the Per-Sample Short Variant Discovery workflow.
Figures 1A and 1B: Boxplots of the relative cost per sample and relative run time, respectively, for each run of the pipeline in the indicated compute environments using On-Demand instances for task execution. Instance families (c5, m5, r5) are shown as facets and storage type categories are plotted on the x-axis. Figures 1C and 1D: The relative cost per sample and relative run time, respectively, for environments using Spot Instances, plotted as in A and B. Cost and run time were calculated relative to the mean cost and run time for On-Demand c5 instances with EBS storage.
In all cases, environments using FSx for Lustre were faster than those using only EBS storage (Figures 1B and 1D). The faster run time also translated into lower cost despite the higher base price of FSx for Lustre storage (Figure 1A); more than 95% of the cost was for compute.
For per-sample short variant discovery, the most cost-effective option was to use Spot Instances: the combination of FSx for Lustre and c5 instances cost 71% less than On-Demand c5 instances with EBS storage (Figures 1C, 1D).
The On-Demand purchased instances were faster than Spot-purchased instances (Figures 1B and 1D). Again, it was FSx for Lustre in combination with c5 instances that took the top spot.
| Purchasing Strategy | Instance Family | Storage | Relative Cost/Sample | Relative Run Time | Relative CPU Time |
| --- | --- | --- | --- | --- | --- |
Note that the total workflow run time on Spot can be 4-6x longer than On-Demand in the configuration tested. One shouldn't draw too many conclusions from this result. In this study, we isolated the effect of instance type on performance, which artificially limits the available Spot pool size and has the knock-on effect of additional wait time for a Spot Instance to become available. The longer workflow run times are due to jobs waiting in queue, and you are not paying for that time. Instead, note that the actual CPU time is about the same as On-Demand, while the cost is roughly a third. The easiest way to optimize this is to diversify the pool of Spot Instances that Batch can launch. In a real-world scenario, you would configure your compute environment with a mixture of C, M, and R instance types and use the SPOT_CAPACITY_OPTIMIZED allocation strategy. Doing so would give you both run times similar to On-Demand (i.e., much less waiting for an instance) and cheaper instance costs.
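The billing arithmetic behind that observation can be sketched as follows. The hourly rates and CPU-hour figure below are hypothetical, chosen only to mirror the relative numbers discussed (Spot at roughly a third of On-Demand, with similar billed CPU time).

```python
# Illustrative: with Spot, billed CPU time stays about the same as
# On-Demand while the hourly rate drops; the extra wall time is unbilled
# queue wait while Batch waits for Spot capacity.
def compute_cost(cpu_hours: float, hourly_rate: float) -> float:
    """Cost is driven by billed instance time, not workflow wall time."""
    return cpu_hours * hourly_rate

ON_DEMAND_RATE = 0.68  # hypothetical On-Demand $/hour
SPOT_RATE = 0.23       # hypothetical Spot $/hour, roughly a third

cpu_hours = 100        # billed instance time, similar under both strategies
print(round(compute_cost(cpu_hours, ON_DEMAND_RATE), 2))  # 68.0
print(round(compute_cost(cpu_hours, SPOT_RATE), 2))       # 23.0
# Wall time on Spot may still be several times longer because jobs sit in
# queue waiting for capacity, but that waiting time is not billed.
```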
Short Variant Joint Genotyping
Figure 2: Per-sample cost and total run time for the Short Variant Joint Genotyping workflow.
Figures 2A and 2B: Boxplots of the relative cost per sample and relative run time, respectively, for each run of the pipeline in the indicated compute environments using On-Demand instances for task execution. Instance families (c5, m5, r5) are shown as facets and storage type categories are plotted on the x-axis. Figures 2C and 2D: The relative cost per sample and relative run time, respectively, for environments using Spot Instances, plotted as in A and B. Cost and run time were calculated relative to the mean cost and run time for On-Demand c5 instances with EBS storage.
For this workflow, FSx for Lustre was again faster than the EBS environments (Figures 2B and 2D). The fastest condition was On-Demand c5 instances paired with FSx for Lustre. The least expensive On-Demand option was r5 instances with FSx for Lustre.
For Spot Instances, the combination of r5 instances with EBS storage was the least expensive. Perhaps the biggest surprise in these results is that, in most cases, Spot Instances were only marginally slower than On-Demand Instances, representing a tremendous cost savings with very little compromise on speed.
| Purchasing Strategy | Instance Family | Storage | Relative Cost/Sample | Relative Run Time | Relative CPU Time |
| --- | --- | --- | --- | --- | --- |
Conclusions
For per-sample short variant discovery, the best configuration for both speed and cost used c5 instances for compute and FSx for Lustre for storage. The least expensive runs used these options on Spot Instances, and the fastest runs used On-Demand Instances.
For joint genotyping, c5 compute instances were somewhat more expensive than m5 or r5. FSx for Lustre provided a small edge in speed only when using On-Demand Instances, at a cost similar to EBS. When using Spot Instances, however, EBS provided the lowest-cost option with no penalty to speed.
Lastly, in a real-world scenario, we recommend diversifying the instance types available in your Batch compute environments. This lets Batch do what it was designed for: automatically scheduling tasks efficiently based on each task's computing requirements, such as CPU and memory.
Stay tuned for more improvements, including an attempt to make the system even faster.
In the meantime, we hope this recipe will be helpful to a wide variety of users who need a simple way to run and manage variant calling on AWS. If you are one of those users, we would love to hear from you. To get started, see the code for the AWS GATK Stack and the AWS Nextflow Stack.