
How Caris Life Sciences processed 400,000 RNAseq samples in 2.5 days with AWS Batch

This post was contributed by Greg Desmarais (Caris Life Sciences), Christian Frech (Caris Life Sciences), Anuj Patel, Mark Azadpour, and Yusong Wang.

When you’re processing genomic data for cancer patients, you can’t waste time. Caris Life Sciences, a leading next-generation AI TechBio company and precision medicine pioneer, faced this reality head-on when they needed to process whole transcriptome sequencing data from more than 400,000 cases to fuel research projects.

What would have taken months on traditional infrastructure, Caris accomplished in just 2.5 days using AWS. This breakthrough enabled them to analyze 23,000 RNA genes per sample while managing their extensive 40+ petabyte multimodal database.

In this post, we’ll take you behind the scenes of this remarkable achievement. You’ll learn how Caris used a range of AWS services, like AWS Batch, Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances, and AWS HealthOmics Sequence Store to build a highly scalable solution that processed hundreds of thousands of samples while maintaining cost efficiency through strategic use of Spot Instances.

The opportunity

Caris calculated that running RNA sequencing analysis for 400,000 samples on their conventional on-premises infrastructure – with its finite capacity – would have taken roughly 3 months.

For a company at the forefront of precision medicine, this timeline wasn’t just a number – it represented delayed insights that could impact cancer research and business opportunities. Caris needed to dramatically accelerate their processing capabilities without losing cost efficiency.

The solution

Rather than using an off-the-shelf RNAseq analysis pipeline based on nf-core, Caris crafted a custom solution that married the power of Nextflow with AWS Batch and Amazon EC2. Their infrastructure scaled to approximately 200,000 concurrent Amazon EC2 Spot vCPUs distributed across multiple Availability Zones. The solution relied primarily on EC2 instances from several families: general-purpose (M-type), compute-optimized (C-type), and memory-optimized (R-type). At peak, the infrastructure ran 4,000 instances while keeping costs low by using the Spot capacity-optimized allocation strategy in AWS Batch.
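
The post doesn't share Caris's exact configuration, but a managed Spot compute environment along these lines illustrates the key ingredients: the capacity-optimized allocation strategy, a mix of M-, C-, and R-family instances, and subnets spanning multiple Availability Zones. This is a minimal sketch, not Caris's actual setup; all names, subnets, and ARNs are placeholders.

```python
import boto3

batch = boto3.client("batch")

# Managed compute environment backed by Spot capacity across several
# instance families and Availability Zones. Placeholders throughout.
batch.create_compute_environment(
    computeEnvironmentName="rnaseq-spot-ce",
    type="MANAGED",
    state="ENABLED",
    computeResources={
        "type": "SPOT",
        # Picks Spot pools with the deepest spare capacity, which keeps
        # interruption rates manageable at very large scale.
        "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
        "minvCpus": 0,
        "maxvCpus": 200_000,
        "instanceTypes": ["m5", "c5", "r5"],  # M-, C-, R-family instances
        "subnets": ["subnet-aaaa", "subnet-bbbb", "subnet-cccc"],  # multiple AZs
        "securityGroupIds": ["sg-0123456789abcdef0"],
        "instanceRole": "arn:aws:iam::111122223333:instance-profile/ecsInstanceRole",
    },
    serviceRole="arn:aws:iam::111122223333:role/AWSBatchServiceRole",
)
```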

To ensure optimal performance, Caris also implemented a gradual-scaling strategy. They began with manageable batches of 100 samples running in parallel, gradually increasing to 500 and then 1,000. This methodical approach proved its worth during their initial test run, in which they successfully processed 10,000 samples using 30,000 vCPUs in just 10 hours.

A dedicated Nextflow submission managed each genomic flow cell, which contained approximately 100 RNA samples. Every sample required between 10 and 20 tasks, executed across 5-10 Docker containers with specialized bioinformatics tools. The computational requirements varied significantly – some tasks completed in minutes while others ran for up to 4 hours.

Resource requirements were equally diverse, with vCPU needs ranging from 1 to 64 cores (averaging around 24) and memory requirements spanning 4 GB to 64 GB. For example, the STAR alignment component demanded roughly 50 GB of memory, while the Quality Control (QC) components operated efficiently with lower memory and CPU allocations.
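
To make that spread concrete, here is how two such task profiles might be registered as AWS Batch job definitions. The image URIs, names, and exact values below are hypothetical:

```python
import boto3

batch = boto3.client("batch")

# Memory-hungry alignment step: ~24 vCPUs and ~50 GB of memory.
batch.register_job_definition(
    jobDefinitionName="star-align",  # hypothetical name
    type="container",
    containerProperties={
        "image": "111122223333.dkr.ecr.us-east-1.amazonaws.com/star:2.7",
        "resourceRequirements": [
            {"type": "VCPU", "value": "24"},
            {"type": "MEMORY", "value": "51200"},  # in MiB, ~50 GB
        ],
    },
)

# Lightweight QC step: a single vCPU and 4 GB of memory.
batch.register_job_definition(
    jobDefinitionName="qc-metrics",  # hypothetical name
    type="container",
    containerProperties={
        "image": "111122223333.dkr.ecr.us-east-1.amazonaws.com/qc:1.0",
        "resourceRequirements": [
            {"type": "VCPU", "value": "1"},
            {"type": "MEMORY", "value": "4096"},  # in MiB, 4 GB
        ],
    },
)
```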

One crucial element that accelerated their processing was switching from individual AWS Batch job submissions to array jobs, which helped them overcome the transaction-per-second (TPS) limits they had encountered. This significantly improved job submission throughput and task execution efficiency. Another key factor in achieving this scale and speed was storing their FASTQ files in AWS HealthOmics Sequence Store, which provided a solid foundation for their processing pipeline.
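
A single array job replaces what would otherwise be one API call per sample, so a flow cell of roughly 100 samples costs one SubmitJob transaction instead of 100. A minimal sketch, with hypothetical queue, job definition, and manifest names:

```python
import boto3

batch = boto3.client("batch")

# One parent job fans out into 100 child jobs; each child reads the
# AWS_BATCH_JOB_ARRAY_INDEX environment variable to pick its sample.
batch.submit_job(
    jobName="rnaseq-flowcell-0001",
    jobQueue="rnaseq-spot-queue",
    jobDefinition="star-align",
    arrayProperties={"size": 100},  # one child per sample on the flow cell
    containerOverrides={
        "environment": [
            {"name": "SAMPLE_MANIFEST",
             "value": "s3://example-bucket/manifests/flowcell-0001.csv"},
        ],
    },
)
```

Each child job can then resolve its own inputs from the manifest row matching its array index, so the submission side stays a single API call regardless of flow-cell size.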

Figure 1 – Architecture diagram showing how Caris used AWS Batch orchestration. The workflow stores FASTQ files in AWS HealthOmics Sequence Store, distributes processing across Amazon EC2 Spot vCPUs, keeps bioinformatics container images in Amazon Elastic Container Registry (Amazon ECR), and writes output files to Amazon Simple Storage Service (Amazon S3). Nextflow coordinates pipeline execution while AWS Batch optimizes job submission and scaling with array jobs. Batch runtime monitoring via a custom Amazon CloudWatch dashboard enables resource optimization across this massive parallel processing environment.

Caris also implemented the AWS Batch Runtime Monitoring solution, an open-source framework that provides crucial metrics and insights about job execution patterns and resource utilization. It became essential for managing their large-scale workload, helping them track job states, identify bottlenecks, and optimize resource allocation across their extensive processing pipeline.
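
The monitoring solution itself streams AWS Batch events into Amazon CloudWatch dashboards, but even a simple poller conveys the idea of tracking job states per queue. A minimal sketch, assuming a hypothetical queue name:

```python
import boto3
from collections import Counter

batch = boto3.client("batch")

def job_state_counts(queue: str) -> Counter:
    """Count jobs in each active state for one queue."""
    counts = Counter()
    for status in ("SUBMITTED", "PENDING", "RUNNABLE", "STARTING", "RUNNING"):
        for page in batch.get_paginator("list_jobs").paginate(
            jobQueue=queue, jobStatus=status
        ):
            counts[status] += len(page["jobSummaryList"])
    return counts

# A sustained pile-up in RUNNABLE usually points at a capacity or quota limit.
print(job_state_counts("rnaseq-spot-queue"))
```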

Challenges

Scaling to this magnitude required careful attention to various technical limits and potential bottlenecks. The team worked closely with AWS to increase their Amazon EC2 Spot vCPU limit and expand their Amazon Elastic Block Store (Amazon EBS) capacity to 800 TiB. They encountered and solved several interesting challenges along the way.

For instance, when they hit API rate limits while querying Spot Instance requests with DescribeSpotInstanceRequests calls, they implemented a solution using instance tagging to track costs without overwhelming the EC2 API.
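
The post doesn't detail the exact tagging scheme, but the general pattern is to have tags applied to the instances at launch (AWS Batch can propagate tags from the compute environment) and then filter on those tags, rather than polling DescribeSpotInstanceRequests. A sketch using a hypothetical cost allocation tag:

```python
import boto3

ec2 = boto3.client("ec2")

# Query instances by tag instead of calling DescribeSpotInstanceRequests;
# the tag key and value are hypothetical.
pages = ec2.get_paginator("describe_instances").paginate(
    Filters=[
        {"Name": "tag:project", "Values": ["rnaseq-reprocessing"]},
        {"Name": "instance-lifecycle", "Values": ["spot"]},
    ]
)
spot_instances = [
    inst["InstanceId"]
    for page in pages
    for res in page["Reservations"]
    for inst in res["Instances"]
]
print(f"{len(spot_instances)} tagged Spot Instances running")
```

Since the same tags flow through to Cost Explorer as cost allocation tags, spend can be tracked there without any extra EC2 API traffic.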

Storage management became critical as the project consumed a staggering 18 petabytes of Amazon Simple Storage Service (Amazon S3) storage. Following our best practices guide, they optimized their S3 access patterns by spreading objects across different top-level prefixes to avoid potential request-rate bottlenecks.
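
Because Amazon S3 request rates scale per prefix, fanning objects out across many top-level prefixes removes a single hot prefix as a bottleneck. A minimal sketch of one such key scheme; the layout and names are illustrative, not Caris's actual scheme:

```python
import hashlib

def output_key(sample_id: str, filename: str, fanout: int = 16) -> str:
    """Spread objects across `fanout` top-level prefixes via a stable hash."""
    shard = int(hashlib.md5(sample_id.encode()).hexdigest(), 16) % fanout
    return f"{shard:02d}/{sample_id}/{filename}"

# Different samples land on different top-level prefixes, each with its
# own S3 request-rate allowance.
print(output_key("CASE-000123", "aligned.bam"))  # -> "<shard>/CASE-000123/aligned.bam"
```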

The team also faced interesting challenges with Docker container cleanup during high-throughput operations. They resolved this by fine-tuning their Amazon ECS configuration parameters and upgrading from gp2 to gp3 volumes for better I/O performance.
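
The specific parameters aren't listed in the post, but the ECS container agent exposes cleanup settings that can be baked into a launch template's user data, alongside a gp3 root volume. A sketch under those assumptions; the template name and values are illustrative:

```python
import base64
import boto3

ec2 = boto3.client("ec2")

# AWS Batch launch templates expect MIME multi-part user data.
user_data = """MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="==BOUNDARY=="

--==BOUNDARY==
Content-Type: text/cloud-boothook; charset="us-ascii"

# Reclaim stopped containers and unused images more aggressively.
echo "ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=15m" >> /etc/ecs/ecs.config
echo "ECS_IMAGE_CLEANUP_INTERVAL=10m" >> /etc/ecs/ecs.config
echo "ECS_NUM_IMAGES_DELETE_PER_CYCLE=10" >> /etc/ecs/ecs.config
--==BOUNDARY==--
"""

ec2.create_launch_template(
    LaunchTemplateName="rnaseq-ecs-tuning",  # hypothetical name
    LaunchTemplateData={
        "UserData": base64.b64encode(user_data.encode()).decode(),
        "BlockDeviceMappings": [
            # gp3 root volume for better baseline IOPS/throughput than gp2.
            {"DeviceName": "/dev/xvda",
             "Ebs": {"VolumeType": "gp3", "VolumeSize": 200}},
        ],
    },
)
```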

The AWS HealthOmics Sequence Store played a vital role but required an increase in the GetReadSetMetadata API throughput limit to 100 TPS. The system successfully handled peak throughput of 60 GB/s, maintaining an average of 10-15 GB/s through access points.
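
Even with the raised quota, bursty metadata lookups benefit from client-side backoff. A minimal sketch of fetching read set metadata with boto3's adaptive retry mode; the store and read set IDs are placeholders:

```python
import boto3
from botocore.config import Config

# Adaptive retry mode backs off automatically if calls brush against
# the GetReadSetMetadata TPS limit.
omics = boto3.client(
    "omics",
    config=Config(retries={"max_attempts": 10, "mode": "adaptive"}),
)

metadata = omics.get_read_set_metadata(
    sequenceStoreId="1234567890",  # placeholder store ID
    id="9876543210",               # placeholder read set ID
)
print(metadata.get("name"), metadata["fileType"])
```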

Finally, the team addressed job-level error handling and reliability by configuring automatic retries for AWS Batch jobs.
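
The post doesn't show the exact retry policy, but a common pattern for Spot-heavy Batch workloads is to retry infrastructure failures (such as Spot reclaims) while failing fast on application errors. A sketch of such a policy applied at submission time, with hypothetical names:

```python
import boto3

batch = boto3.client("batch")

batch.submit_job(
    jobName="rnaseq-sample-42",
    jobQueue="rnaseq-spot-queue",
    jobDefinition="star-align",
    retryStrategy={
        "attempts": 3,
        "evaluateOnExit": [
            # Retry when the host itself went away, e.g. a Spot interruption.
            {"onStatusReason": "Host EC2*", "action": "RETRY"},
            # Any other failure (a non-zero tool exit) fails immediately.
            {"onReason": "*", "action": "EXIT"},
        ],
    },
)
```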

Conclusion

By using AWS, Caris Life Sciences transformed a months-long computational challenge into a matter of days. In doing so, they dramatically accelerated their ability to derive insights that can fuel clinical cancer research.

This achievement demonstrates the immense potential of AWS cloud computing in life sciences, particularly for organizations dealing with large-scale genomics workloads. The success of this project opens new possibilities for accelerated research and enhanced patient care through efficient data processing.

Are you facing similar challenges with large-scale genomic processing? The AWS Healthcare and Life Sciences team can help you explore solutions tailored to your needs. Reach out to your AWS account team to start a conversation about accelerating your genomic workflows.

Greg Desmarais

Greg is the Senior Director for the Data and Software Engineering group at Caris Life Sciences. He oversees data and infrastructure design and development for Data Sciences, as well as custom software engineering and application development.

Mark Azadpour

Mark Azadpour is a Senior GTM Specialist for AWS Batch. He is focused on driving the AWS Batch go-to-market strategy and initiatives, and advocates the use of containers in HPC on behalf of customers. He has worked in the Enterprise space for more than 18 years.

Anuj Patel

Anuj Patel is a Senior Solutions Architect at AWS. He has an M.S. in Computer Science and over two decades of experience in software design and development. He helps customers in the Life Sciences industry through their AWS journey. His passion lies in simplifying complex problems, enabling customers to unlock the full potential of AWS solutions.

Christian Frech

Christian Frech is a Principal Software Engineer at Caris Life Sciences. He is responsible for cloud data and software architecture and works closely with data scientists to ensure their pipelines operate at terabyte scale.

Yusong Wang

Dr. Yusong Wang is a Senior HPC Solutions Architect at AWS. He has worked in the HPC industry for more than 18 years, with extensive experience gained at national research institutes in the United States and from leading cloud migration efforts at a large financial enterprise.