Building a resilient and scalable clinical genomics analysis pipeline with AWS

This is a guest post from Noora Siddiqui, a life scientist turned cloud engineer at the Baylor College of Medicine Human Genome Sequencing Center.

Healthcare is not one-size-fits-all. At the Baylor College of Medicine Human Genome Sequencing Center (BCM HGSC), we aim to advance precision medicine and research in genomics. In that effort, we joined the ambitious All of Us Research Program funded by the National Institutes of Health (NIH) to help deliver genomic data to over one million individuals across the United States. Researchers use the data to learn how our biology, lifestyle, and environment affect health. These insights will help them find ways to treat and prevent disease. Supporting this research required the rapid analysis of petabytes of patient genomic information.

In early 2019, we estimated that processing whole genome samples for this megaproject would require scaling to more than four times our center's production workload. We used Amazon Web Services (AWS) to support our new pipeline demands, which saved time, reduced costs, and created new opportunities for future development.

The big data challenge of genomic sequencing

When genomic DNA is collected in the lab via a blood or saliva sample, it goes through a variety of laboratory procedures before it is loaded onto a sequencer and produces hundreds of gigabytes of raw data. A multitude of algorithms transform this raw data and weave together a picture that physicians can use to determine if there are any clinically significant findings. A single sample will produce hundreds of files and over 150 gigabytes of transformed data.

Clinicians need rapid turn-around of clinical samples to intervene quickly with precision care, but our software required several days for a single sample to run through the data analysis steps we refer to as the pipeline. To process clinical samples at the rate and scale needed, our team migrated to a hybrid-cloud solution and engineered a resilient, automated clinical genomics pipeline with the flexibility to handle burst workloads securely.

Developing our bioinformatic analysis pipeline to save time and more

To decrease turn-around time, we tested different hardware- and software-based pipeline accelerators for sequencing analysis. We used Illumina’s DRAGEN (Dynamic Read Analysis for GENomics) platform and reduced the time for bioinformatic analysis from 80 hours to two hours. To support the accuracy, precision, and sensitivity of these results, we then began a meticulous process of bioinformatic validation involving thousands of pipeline runs. Through this rigorous analysis of alignments, variant calls, and metrics, we optimized Illumina’s technology for use in our unique clinical setting. By running Illumina’s DRAGEN platform in AWS, we also scaled our solution and quickly determined that we could address our problems of time, storage, scalability, and cost by migrating to the cloud.

AWS services and event-driven workflows help cut costs

Samples come into the lab and off sequencers in an unpredictable manner. Different pipeline components can take anywhere from 10 minutes to four hours to complete analyses. AWS Batch helped us to handle these bursts and provision thousands of instances in parallel, automatically scaling down during times of low usage. Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances helped us to drive the efficiency of our solution further in the face of varying compute times and instance costs.
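As a rough illustration of the scale-down behavior described above, the sketch below builds a request for a managed AWS Batch compute environment backed by Spot Instances. It is not the center's actual configuration: the function names, the vCPU ceiling, the allocation strategy, and every identifier passed in are assumptions chosen for the example. Setting `minvCpus` to 0 is what lets the environment shrink to nothing when no jobs are queued.

```python
def spot_compute_environment(name, subnets, security_group_ids,
                             instance_role, max_vcpus=4096):
    """Build a create_compute_environment request for a Spot-backed
    managed environment. All argument values are hypothetical."""
    return {
        "computeEnvironmentName": name,
        "type": "MANAGED",
        "state": "ENABLED",
        "computeResources": {
            "type": "SPOT",
            # Assumed strategy; the source does not say which one was used.
            "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
            "minvCpus": 0,           # scale to zero during quiet periods
            "maxvCpus": max_vcpus,   # assumed ceiling for burst workloads
            "instanceTypes": ["optimal"],
            "subnets": list(subnets),
            "securityGroupIds": list(security_group_ids),
            "instanceRole": instance_role,
        },
    }


def create_environment(request):
    """Send the request to AWS Batch (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works offline
    return boto3.client("batch").create_compute_environment(**request)
```

Keeping the request builder separate from the API call makes the configuration easy to review and test before anything is provisioned.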

To orchestrate workflows, we engineered an event-driven architecture that responds to the arrival of new sample data in Amazon Simple Storage Service (Amazon S3). This triggers an Amazon Simple Notification Service (Amazon SNS) topic that leads to a cascade of AWS Lambda functions that identify the file type and submit relevant AWS Batch jobs.
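The flow described above can be sketched as a single AWS Lambda handler subscribed to the Amazon SNS topic. This is a minimal sketch under stated assumptions, not the center's code: the suffix-to-job mapping, the job definition names, the queue name `genomics-queue`, and the `input` job parameter are all hypothetical. The event shapes (an S3 notification serialized inside `Sns.Message`, with URL-encoded object keys) follow the standard notification formats.

```python
import json
import re
from pathlib import PurePosixPath
from urllib.parse import unquote_plus

# Hypothetical mapping from file suffix to an AWS Batch job definition name.
JOB_DEFINITIONS = {
    ".fastq.gz": "alignment-job",
    ".bam": "variant-calling-job",
    ".vcf.gz": "annotation-job",
}


def job_definition_for(key):
    """Return the job definition for an object key, or None if unrecognized."""
    for suffix, job_definition in JOB_DEFINITIONS.items():
        if key.endswith(suffix):
            return job_definition
    return None


def s3_objects_from_sns(event):
    """Yield (bucket, key) pairs from an SNS event wrapping S3 notifications."""
    for sns_record in event.get("Records", []):
        s3_event = json.loads(sns_record["Sns"]["Message"])
        for s3_record in s3_event.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            # Object keys arrive URL-encoded in S3 notifications.
            key = unquote_plus(s3_record["s3"]["object"]["key"])
            yield bucket, key


def build_job_requests(event, job_queue="genomics-queue"):
    """Translate new-object notifications into submit_job keyword arguments."""
    requests = []
    for bucket, key in s3_objects_from_sns(event):
        job_definition = job_definition_for(key)
        if job_definition is None:
            continue  # not a file type this sketch knows how to process
        sample = PurePosixPath(key).name.split(".")[0]
        # Batch job names allow only letters, digits, hyphens, underscores.
        safe_sample = re.sub(r"[^A-Za-z0-9_-]", "-", sample)
        requests.append({
            "jobName": f"{job_definition}-{safe_sample}"[:128],
            "jobQueue": job_queue,
            "jobDefinition": job_definition,
            "parameters": {"input": f"s3://{bucket}/{key}"},
        })
    return requests


def handler(event, context):
    """Lambda entry point: submit one Batch job per recognized file."""
    import boto3  # imported lazily so the helpers above can run offline
    batch = boto3.client("batch")
    job_ids = [batch.submit_job(**request)["jobId"]
               for request in build_job_requests(event)]
    return {"submitted": job_ids}
```

The source describes a cascade of functions; collapsing them into one handler here is a simplification for readability.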

The flexibility of this decoupled, microservices architecture made it simpler to develop and integrate further software components into our existing workflow. We were also able to use AWS Lambda to archive Amazon S3 data by tag, which cut storage costs by over 95% and automated the archival of clinical data in accordance with compliance protocols. This workflow processes thousands of clinical samples at an automation cost of less than a cup of coffee every month.
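The post does not say how tag-based archival was implemented, so the following is only one plausible arrangement: a function (which a Lambda could call) tags finished objects, and an S3 lifecycle rule filtered on that tag moves them to an archival storage class. The tag key and value, the 30-day delay, and the `DEEP_ARCHIVE` storage class are assumptions made for this sketch.

```python
def archive_rule(tag_key="archive", tag_value="true", days=30,
                 storage_class="DEEP_ARCHIVE"):
    """Build a lifecycle rule that transitions objects carrying a given tag.
    Tag, delay, and storage class are hypothetical defaults."""
    return {
        "ID": f"archive-{tag_key}-{tag_value}",
        "Status": "Enabled",
        "Filter": {"Tag": {"Key": tag_key, "Value": tag_value}},
        "Transitions": [{"Days": days, "StorageClass": storage_class}],
    }


def apply_archive_rule(bucket, rule):
    """Install the rule on a bucket (needs boto3 and AWS credentials).
    Note: this call replaces the bucket's existing lifecycle configuration."""
    import boto3
    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={"Rules": [rule]},
    )


def tag_for_archive(bucket, key, tag_key="archive", tag_value="true"):
    """Mark one object for archival, for example from a Lambda function.
    Note: put_object_tagging overwrites the object's existing tag set."""
    import boto3
    boto3.client("s3").put_object_tagging(
        Bucket=bucket,
        Key=key,
        Tagging={"TagSet": [{"Key": tag_key, "Value": tag_value}]},
    )
```

In a real deployment, existing lifecycle rules and object tags would need to be read and merged before calling either function.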

Designing for scalability, cost observability, and more

When samples first began to pour in, we processed over 1,000 a month. In early 2021, this number quadrupled to over 4,000 samples per month and continued to grow until the All of Us production workload was four times greater than our production workload at any other point in the center’s history. The dynamic scalability of our solution handled the larger bursts without any additional development effort on our part.

With Amazon CloudWatch, we track logs and metrics from almost half a million jobs per month and set appropriate alarms to investigate disruptions. We also utilize Amazon EventBridge rules in conjunction with AWS Lambda to automatically handle job failures and resubmissions.
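A hedged sketch of the failure handling described above: an Amazon EventBridge rule matches AWS Batch job state changes with status `FAILED`, and a Lambda target decides whether to resubmit. The event pattern uses the documented `aws.batch` source and `Batch Job State Change` detail type. The retry budget and the `resubmission` job parameter used to count attempts are conventions invented for this example, not something the source describes.

```python
# Event pattern for an EventBridge rule that matches failed Batch jobs.
FAILED_JOB_PATTERN = {
    "source": ["aws.batch"],
    "detail-type": ["Batch Job State Change"],
    "detail": {"status": ["FAILED"]},
}

MAX_RESUBMISSIONS = 2  # hypothetical retry budget


def resubmission_request(event, max_resubmissions=MAX_RESUBMISSIONS):
    """Return submit_job keyword arguments for a failed job, or None when
    the job should not be resubmitted."""
    detail = event["detail"]
    if detail.get("status") != "FAILED":
        return None
    # Copy so the incoming event is left untouched.
    parameters = dict(detail.get("parameters", {}))
    # "resubmission" is an assumed bookkeeping parameter for this sketch.
    count = int(parameters.get("resubmission", "0"))
    if count >= max_resubmissions:
        return None  # budget exhausted; leave it for an alarm or a human
    parameters["resubmission"] = str(count + 1)
    return {
        "jobName": detail["jobName"],
        "jobQueue": detail["jobQueue"],
        "jobDefinition": detail["jobDefinition"],
        "parameters": parameters,
    }


def handler(event, context):
    """Lambda target of the EventBridge rule."""
    request = resubmission_request(event)
    if request is None:
        return {"resubmitted": False}
    import boto3  # imported lazily so the logic above can be tested offline
    job_id = boto3.client("batch").submit_job(**request)["jobId"]
    return {"resubmitted": True, "jobId": job_id}
```

Capping resubmissions keeps a persistently failing job from looping forever; jobs that exhaust the budget can then surface through an alarm.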

In terms of cost observability, our center designed an extensive tagging strategy that gave us the power to scrutinize the cost fingerprint of different jobs, projects, and users. We now improve our work processes and prioritize development based on these granular cost breakdowns.
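To make the idea of a per-tag cost breakdown concrete, here is a small sketch that totals costs by the value of one cost allocation tag using the AWS Cost Explorer `get_cost_and_usage` API. The tag key `project` and the function names are assumptions, and the source does not state which tooling the center uses for its breakdowns.

```python
from collections import defaultdict
from decimal import Decimal


def cost_by_tag(response, metric="UnblendedCost"):
    """Sum a Cost Explorer response grouped by one tag into {value: cost}."""
    totals = defaultdict(Decimal)
    for period in response.get("ResultsByTime", []):
        for group in period.get("Groups", []):
            # Tag groups are keyed as "<tag key>$<tag value>"; an empty
            # value means the resource did not carry the tag.
            tag_value = group["Keys"][0].split("$", 1)[-1] or "untagged"
            totals[tag_value] += Decimal(group["Metrics"][metric]["Amount"])
    return dict(totals)


def fetch_costs(start, end, tag_key="project"):
    """Query Cost Explorer for monthly costs grouped by a tag.
    start and end are 'YYYY-MM-DD' strings; tag_key is hypothetical.
    Pagination (NextPageToken) is ignored to keep the sketch short."""
    import boto3  # imported lazily so cost_by_tag can be tested offline
    response = boto3.client("ce").get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "TAG", "Key": tag_key}],
    )
    return cost_by_tag(response)
```

The same aggregation could be repeated with a job or user tag to compare cost fingerprints along different dimensions.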

AWS CodePipeline, AWS CodeBuild, GitHub, and AWS CloudFormation allow us to more simply build, version, and release our cloud infrastructure along with all software components. We also utilize TaskCat to build our cloud infrastructure in a test Amazon Virtual Private Cloud (Amazon VPC) before merging successful changes to our development environment. Releases to our staging and production environments happen in much the same way.

Beyond testing the validity and security of our environment, we conduct rigorous bioinformatic validation with every minor software or infrastructure change. Consistency is essential and these tests support the unimpaired, safe, and consistent delivery of quality results to patients and clinicians.

Looking to the future of clinical genomics sequencing

Our experience encouraged us to take more clinical projects directly to the cloud. Now, we design workflows using AWS Step Functions, with service integrations that allow for the creation of complex workflows in healthcare and life sciences. We also track samples via Amazon Athena and Amazon QuickSight and use a wider range of Amazon EC2 instance types and technologies within our multiple accounts.

In the context of genomics workflows, the combination of AWS Step Functions with AWS Batch and AWS Lambda constitutes a robust, scalable, and serverless task orchestration solution. Though the journey to migrate to the cloud was initially daunting, the resulting solution sparked even more opportunity and creativity for future development.
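As an illustration of that combination, the sketch below assembles an Amazon States Language definition in which two AWS Batch jobs run in sequence and an AWS Lambda function runs at the end. The state names, the retry settings, and the `sample` input field are hypothetical. The `batch:submitJob.sync` and `lambda:invoke` resource ARNs are the standard Step Functions service integrations.

```python
import json

BATCH_SYNC = "arn:aws:states:::batch:submitJob.sync"  # wait for job to finish
LAMBDA_INVOKE = "arn:aws:states:::lambda:invoke"


def batch_state(job_name, job_queue, job_definition, next_state):
    """Build a Task state that runs one Batch job to completion."""
    return {
        "Type": "Task",
        "Resource": BATCH_SYNC,
        "Parameters": {
            "JobName": job_name,
            "JobQueue": job_queue,
            "JobDefinition": job_definition,
            # Pass the sample identifier from the execution input.
            "Parameters": {"sample.$": "$.sample"},
        },
        "ResultPath": f"$.{job_name}",  # keep the original input alongside
        "Retry": [{  # assumed retry policy for this sketch
            "ErrorEquals": ["States.TaskFailed"],
            "IntervalSeconds": 60,
            "MaxAttempts": 2,
            "BackoffRate": 2.0,
        }],
        "Next": next_state,
    }


def pipeline_definition(job_queue, align_job_def, call_job_def, notify_fn):
    """Assemble a three-step workflow: align, call variants, notify."""
    return {
        "Comment": "Illustrative genomics workflow (hypothetical names)",
        "StartAt": "Align",
        "States": {
            "Align": batch_state("align", job_queue, align_job_def,
                                 "CallVariants"),
            "CallVariants": batch_state("call", job_queue, call_job_def,
                                        "Notify"),
            "Notify": {
                "Type": "Task",
                "Resource": LAMBDA_INVOKE,
                "Parameters": {"FunctionName": notify_fn, "Payload.$": "$"},
                "End": True,
            },
        },
    }


def to_asl_json(definition):
    """Serialize the definition for use when creating a state machine."""
    return json.dumps(definition, indent=2)
```

Because the `.sync` integration waits for each Batch job to complete, the workflow needs no polling code of its own.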

For more information about how AWS can enable your genomics workloads, be sure to check out the AWS Genomics page. Do you have any questions about how AWS can help support genomics data analysis? Reach out to the AWS Public Sector Team for more.

AWS Public Sector Blog
