A generalized approach to benchmarking genomics workloads in the cloud: Running the BWA read aligner on Graviton2

Amazon Web Services (AWS) announced the custom-built, Arm-based Graviton series of instances for Amazon Elastic Compute Cloud (Amazon EC2) in 2018. The second series, Graviton2, uses 64-bit Arm Neoverse cores and delivers up to 40 percent better price performance than comparable current-generation x86-based instances. At re:Invent 2020, AWS added to the Arm portfolio with Graviton2-based C6g compute and R6g database instances.

The AWS Cloud gives genomics researchers access to a wide variety of instance types and chip architectures, and this elasticity allows us to rethink genomics workflows when running workloads in the cloud. This post highlights how to benchmark several different configurations rapidly and relatively inexpensively. Many genomics applications are written to run on the x86 architecture; however, where source code is available, some can be recompiled to make use of the Arm architecture. Given the increased performance of the Graviton2 instances, we wanted to explore whether they can be used for cost-effective and performant genomics workloads. Sequence read alignment is often one of the more compute-intensive parts of a genomics workflow, so we recompiled the Burrows-Wheeler Aligner (BWA) application for Arm-based chips and evaluated its cost effectiveness on them. Read on to learn about our generalized approach for determining the most effective instance type for running genomics workloads in the cloud.

Compiling BWA for the Arm64 architecture

As a first step, to get BWA running on the Graviton2 instances, we compiled BWA from source code:

    sudo yum groupinstall "Development Tools" -y
    git clone https://github.com/lh3/bwa.git
    cd bwa/
    sed -i -e 's/<emmintrin.h>/"sse2neon.h"/' ksw.c
    wget https://gitlab.com/arm-hpc/packages/uploads/ca862a40906a0012de90ef7b3a98e49d/sse2neon.h
    make clean all

The most critical step in building BWA for Arm is using sse2neon.h in place of emmintrin.h. The emmintrin.h header declares x86 SSE2 intrinsics, which are not available on Arm chips; sse2neon.h maps them to Arm NEON equivalents. With the executable compiled, we ran a BWA-MEM command to align some example reads (ftp://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/NIST7035_TAAGGCGA_L001_R1_001.fastq.gz) from the “Genome in a Bottle” project to version 37 of the reference human genome. To avoid having to stage the reads and the reference genome from Amazon Simple Storage Service (Amazon S3) to the instance, we used an Amazon FSx for Lustre file system backed by the S3 bucket containing the relevant files. Amazon FSx for Lustre can cache files from Amazon S3 and present them as a POSIX-compliant Lustre file system, so the BWA executable does not need to interact with S3 paths. Using all eight threads on an m6g.2xlarge, an alignment was successfully produced, validating the approach.
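As a rough illustration of the alignment step, the following sketch builds a single-end BWA-MEM command line that uses every available vCPU. The `/fsx/...` paths and the `bwa_mem_command` helper are hypothetical names chosen for this example, not taken from our scripts; the sketch also assumes the reference has already been indexed with `bwa index`.

```python
import os
import shlex


def bwa_mem_command(reference, reads, threads=None):
    """Build the argument list for a single-end BWA-MEM alignment.

    threads defaults to the number of vCPUs visible to the OS.
    """
    if threads is None:
        threads = os.cpu_count() or 1
    return ["bwa", "mem", "-t", str(threads), reference, reads]


# Hypothetical locations on the FSx for Lustre mount (assumptions).
cmd = bwa_mem_command(
    "/fsx/reference/genome.fasta",
    "/fsx/reads/NIST7035_TAAGGCGA_L001_R1_001.fastq.gz",
    threads=8,
)
print(shlex.join(cmd))
```

BWA-MEM writes SAM output to standard output, so in practice the printed command would be redirected to a file or piped into a downstream tool.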

What is the best instance type for my analysis?

Benchmarking is subjective as it is highly dependent on a user’s specific needs and what they want to prioritize, such as compute time versus cost. Amazon EC2 costs and availability can vary so the actual Region in which you run your analysis can have an impact. Given the variability, we wanted to show how the elastic properties of the cloud could be used to determine the best machine for your needs. This example focuses on how it could be done. We explicitly do not want to suggest that the optimal setup for this demonstration will fit your individual use case. One size does not fit all. Adapting it to your needs should help to identify a configuration that works for you.

To run the benchmarks, we created an Amazon EC2 LaunchTemplate for each of the x86_64 and arm64 architectures. As well as setting up the instances with a common infrastructure, the “User Data” section of the LaunchTemplate, which is run at launch time, was used to download the BWA source and compile it appropriately for the architecture. The LaunchTemplate also mounted an FSx for Lustre file system that contained the reference genome and sequence reads. When backed with files in S3, FSx for Lustre gives us a high-performance POSIX filesystem for genomics work. All instances were running the latest version of Amazon Linux 2 and used an encrypted eight GB GP2 SSD as the root drive (either instance storage or Amazon Elastic Block Store (EBS) depending on the instance type).
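The architecture-specific build can be sketched as a small helper that emits the shell steps shown earlier, adding the sse2neon substitution only on arm64. This is a minimal sketch rather than the actual User Data we used; the `bwa_build_steps` function is a hypothetical name, and it assumes `platform.machine()` reports `aarch64` (or `arm64`) on Graviton2 and `x86_64` on the Intel-based instances.

```python
import platform

SSE2NEON_URL = (
    "https://gitlab.com/arm-hpc/packages/uploads/"
    "ca862a40906a0012de90ef7b3a98e49d/sse2neon.h"
)


def bwa_build_steps(machine=None):
    """Return the shell commands that build BWA for the given architecture."""
    machine = machine or platform.machine()
    steps = [
        'sudo yum groupinstall "Development Tools" -y',
        "git clone https://github.com/lh3/bwa.git",
        "cd bwa/",
    ]
    if machine in ("aarch64", "arm64"):
        # Arm only: swap the x86 SSE2 header for the NEON translation header.
        steps += [
            """sed -i -e 's/<emmintrin.h>/"sse2neon.h"/' ksw.c""",
            f"wget {SSE2NEON_URL}",
        ]
    elif machine != "x86_64":
        raise ValueError(f"unsupported architecture: {machine}")
    steps.append("make clean all")
    return steps


print("\n".join(bwa_build_steps("aarch64")))
```

Keeping the two build recipes in one function makes it easy to confirm that the only difference between the x86_64 and arm64 templates is the header substitution.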

The LaunchTemplate User Data also contained commands to run the BWA-MEM command using all available virtual CPUs (vCPUs) and writing logs to the FSx for Lustre file system, terminating the machines when complete. A shell script used the AWS CLI to launch Amazon EC2 instances with this LaunchTemplate for each of the c5, m5, r5, c6g, m6g, and r6g types including their disk and network optimized variants. Each Amazon EC2 type was run at least three times to allow for an average of individual run times by type.
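The launch loop might look something like the sketch below, which only builds the `aws ec2 run-instances` command lines and does not execute them. The template names and the short instance list are placeholders for illustration (assumptions, not the values from our shell script), and `launch_commands` is a hypothetical helper.

```python
import shlex


def launch_commands(template_by_arch, types_by_arch, repeats=3):
    """Build one AWS CLI launch command per (instance type, repeat)."""
    commands = []
    for arch, instance_types in types_by_arch.items():
        template = template_by_arch[arch]
        for instance_type in instance_types:
            for _ in range(repeats):
                commands.append(shlex.join([
                    "aws", "ec2", "run-instances",
                    "--launch-template",
                    f"LaunchTemplateName={template},Version=$Latest",
                    "--instance-type", instance_type,
                ]))
    return commands


# Placeholder template names and a deliberately short instance list.
templates = {"x86_64": "bwa-benchmark-x86_64", "arm64": "bwa-benchmark-arm64"}
instance_types = {"x86_64": ["m5.2xlarge"], "arm64": ["m6g.2xlarge"]}

commands = launch_commands(templates, instance_types)
for command in commands:
    print(command)
```

Because `shlex.join` quotes the `$Latest` version specifier, the printed commands can be pasted into a shell without the variable being expanded.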

We used a Python script to compile the relevant log information and to calculate the average compute cost and time for each configuration, and a Jupyter Notebook to produce plots of “cost efficiency.” Here we define the most cost-efficient machine as the one that runs the read alignment in the shortest time for the lowest cost. In the following plots, the most cost-efficient instances are those in the bottom left corner. Users who give a higher weighting to speed, or who want to use a different metric to determine the optimal configuration, should ensure the relevant metrics are collected during their runs and modify these scripts accordingly. The visualizations produced with Matplotlib helped us understand which parameters influenced our metric of choice.
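The per-configuration averages can be computed along the following lines. This is a simplified sketch, not our actual script: it assumes EC2 cost scales linearly with runtime (per-second billing) and ignores storage and FSx for Lustre charges. The runtimes and hourly price below are made-up placeholder values, and `summarize_runs` is a hypothetical helper.

```python
from statistics import mean


def summarize_runs(runtimes_s, hourly_price_usd):
    """Return (mean runtime in seconds, mean EC2 cost in USD) for one type."""
    mean_runtime = mean(runtimes_s)
    # Cost of each run = hourly price * fraction of an hour used.
    mean_cost = mean(hourly_price_usd * t / 3600 for t in runtimes_s)
    return mean_runtime, mean_cost


# Placeholder inputs: three runs of one instance type at 1.80 USD per hour.
mean_runtime, mean_cost = summarize_runs([100.0, 110.0, 120.0], 1.80)
print(f"mean runtime: {mean_runtime:.0f} s, mean cost: ${mean_cost:.3f}")
```

Plotting mean cost against mean runtime for every instance type then gives the kind of scatter shown in the figures, with the most cost-efficient types toward the bottom left.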

See the accompanying Git repository for the scripts that were used to set up the environment and run the benchmarks, and for the notebook used to produce the analysis presented below.

Which architecture?

In all of the following figures, each point represents the average of at least three runs per instance type. Markers and colors denote certain characteristics. In the chart below, each blue circle represents one of the arm64 instance types, which include the eight-vCPU c6g.2xlarge, m6g.2xlarge, and r6g.2xlarge and their various network (n) and local disk (d) optimized variants. Overall, the analysis (Figure 1) shows that the arm64 architecture is generally more cost efficient than the x86_64 architecture.

Figure 1: Cost efficiency by architecture. Blue dots represent arm64 instances and orange dots represent x86_64 instances.

How many vCPUs?

Increasing the number of available vCPUs generally increases the speed of an alignment, with the optimum number being 16 or, in some instances, 32 (Figure 2). Beyond 32 vCPUs, there is no performance advantage, while the run cost increases.

Figure 2: Cost efficiency by vCPU. Dots represent arm64 instances and crosses represent x86_64 instances.

How much memory?

A variety of memory configurations are cost efficient (Figure 3). The most cost-efficient memory configuration for the alignment is that of the m6g.8xlarge, with 131,072 MB of RAM.

Figure 3: Cost efficiency by RAM available on the instance used (in megabytes).

Does local storage in an instance help?

Using local instance storage made no real difference when compared with EBS storage (Figure 4). It may be that all the required data structures fit in memory, so disk I/O matters less than compute power and memory access.

Figure 4: Cost efficiency by availability of local storage on the instance. Blue dots represent instances with no local storage and orange dots represent instances with local storage.

Which instance type?

Instance types are divided into families that are classified as general purpose (m), compute optimized (c), and memory optimized (r). By coloring the plot by optimization family as well as the series, where five is an x86_64 architecture and six is an arm64 architecture, we can see that while there is differentiation by optimization for the five series, there does not seem to be much difference between the six series members. This might suggest that further differentiation could be achieved in the series six instance types by optimizing the compiler options.

Figure 5: Cost efficiency by instance class (c5, c6, m5, m6, r5, r6).

What is the most cost-effective instance type?

The most cost-effective instance type turns out to be the m6g.8xlarge, with a mean runtime of 258 seconds and a mean run cost of $0.088* (EC2 charges only). The older m5.8xlarge ran for an average of 700 seconds at an average run cost of $0.298*. The most cost-effective x86_64 instance type was the r5dn.8xlarge, with a mean runtime of 237 seconds and a mean run cost of $0.176*; it is slightly faster than the m6g.8xlarge but costs twice as much per run. In this case, the arm64 architecture provides the best cost efficiency. Our analysis provides a systematic approach that lets us gather data on the optimal instance type for our workloads and stop guessing at capacity. The findings demonstrate the power of the Arm architecture for genomics workloads and how to determine the best EC2 instance type for your specific workload.

*NOTE: These prices were calculated on November 18th 2020 for us-east-1. Costs may vary based on timing, use case, and region.
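“Shortest time for the lowest cost” combines two objectives, so one way to make the comparison concrete is to look for instance types that no other type beats on both runtime and cost (Pareto dominance). This is an illustrative technique rather than the method used in our notebook. The sketch below applies it to the three mean results quoted above; `pareto_front` is a hypothetical helper.

```python
def pareto_front(results):
    """Return the names whose (runtime, cost) no other entry dominates.

    results maps instance type -> (mean runtime in s, mean cost in USD).
    An entry is dominated if another is no worse on both metrics and
    strictly better on at least one.
    """
    front = []
    for name, (t, c) in results.items():
        dominated = any(
            (t2 <= t and c2 <= c) and (t2 < t or c2 < c)
            for other, (t2, c2) in results.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


# Mean runtime (seconds) and mean run cost (USD) from the results above.
results = {
    "m6g.8xlarge": (258, 0.088),
    "m5.8xlarge": (700, 0.298),
    "r5dn.8xlarge": (237, 0.176),
}
front = pareto_front(results)
print(front)
```

Here the m5.8xlarge drops out because the m6g.8xlarge is both faster and cheaper, while the m6g.8xlarge and r5dn.8xlarge both remain: choosing between them depends on how much weight you give to speed versus cost.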

Learn more about how AWS is supporting biomedical research

Once you are comfortable with how to benchmark your workloads, it is time to dive deeper into the science. Visit the Genomics on AWS webpage for secondary and tertiary genomic analysis solutions.

For more information on how AWS helps solve complex research workloads and enables scientific research, see the AWS Research and Technical Computing webpage and access our library of AWS Education: Research Seminars. For NIH-funded research projects, request more information about the NIH STRIDES Initiative.