Skip to main content
2025

Deploying an analysis pipeline using AWS Batch with Caris Life Sciences

Learn how biotech company Caris Life Sciences built an RNA-sequencing analysis pipeline for cancer research using AWS Batch.

Overview

Caris Life Sciences (Caris) was built on the notion that precision medicine is the future of cancer treatment. Performing molecular profiling on thousands of patient samples a month, the company helps doctors provide the best possible treatments for patients.

To fuel research for the next wave of therapies, Caris wanted to reprocess over 400,000 samples of patient RNA-sequencing data. However, the company’s existing computational pipeline was not optimized to analyze large datasets. The company had been using Amazon Web Services (AWS) for analyzing genomic data for several years, so it turned to AWS to develop a new analytics pipeline. Using high performance computing (HPC)—which helps accelerate innovation with managed HPC services and virtually unlimited infrastructure on AWS—Caris developed a scalable, cost-efficient pipeline optimized for research.
 

About Caris Life Sciences

Caris Life Sciences was founded in 2008 with the goal of using data-driven insights to realize the potential of precision medicine. It has built a large portfolio of precision medicine tools that have helped over half a million cancer patients worldwide.

Opportunity | Creating an RNA-sequencing pipeline for Caris Life Sciences

Caris was founded in 2008 with the goal of using data-driven insights to realize the potential of precision medicine. For years, the company has been developing precision diagnostic tests and targeted cancer therapies. Caris now wanted to reprocess its data for research use, both internal and external, to help develop the next generation of cancer treatments. The company chose to use RNA data for the reanalysis. “RNA data is the most quantitative data that we use,” says Noah Spies, director of computational biology at Caris. “We want to obtain as subtle a distinction between biological signals as possible.”

Caris had an existing clinical computational pipeline that collected and processed genomic data for hundreds of samples daily. However, the pipeline was optimized for clinical use, not for reanalyzing the large volume of the company’s data. Additionally, Caris needed a single version of the pipeline to yield usable results for research, but this pipeline changed every few weeks.

“We wanted to use the really rich capabilities of AWS to record the provenance of each piece of data that came through the pipeline and was an output from it,” says Spies.

To generate the best possible results, Caris needed a pipeline that was optimized to process a large amount of data. The company created a Nextflow pipeline, optimized for the reanalysis of its RNA-sequencing data, using industry best practices and publicly available tools. “We wanted to make sure that the pipeline would be understood and accepted by the community with whom we would share the data,” says Nico Stransky, corporate vice president at Caris.

Solution | Processing 400,000 samples using AWS Batch

For the pipeline framework, the company implemented AWS HealthOmics, which helps transform genomic, transcriptomic, and other omics data into insights. “Because AWS HealthOmics uses standard industry technologies like Nextflow, it was a simple transition for us to use our own Nextflow pipeline with that service,” says Greg Desmarais, senior director of software and data engineering at Caris. The pipeline was deployed on AWS Batch, which offers batch processing for machine learning model training, simulation, and analysis at any scale.

Running seamlessly under Nextflow, AWS Batch pulls the raw data in from AWS HealthOmics and processes it in multiple steps. After each step, the data is shifted to Amazon Simple Storage Service (Amazon S3)—which offers object storage built to retrieve any amount of data—and AWS Batch uses each step’s data to continue processing. The intermediate-step files are flushed from storage and the final output data is stored in Amazon S3. “Using AWS Batch offers a tremendous amount of stability and scalability,” says Desmarais. “We rely on the AWS Batch environment to process over 400,000 samples in a single run, with room to grow to millions.”

Implementing AWS Batch made it possible for the company to process its huge dataset, scaling the RNA-sequencing pipeline as needed. Traditional virtual machines linger until the end of a sequence of jobs are completed and result in suboptimal consumption charges. However, AWS Batch can provision clusters and scale up and down as needed, providing a dynamic, flexible container-based infrastructure. The scalability of the solution also helps with cost management: The company is not paying for compute resources when they are not needed—a critical need, because analyzing samples costs tens of thousands of dollars. AWS Batch’s intelligent allocation strategy facilitates rightsizing of instances for each job, which is a differentiating factor of the service. This rightsizing of instances makes it possible for Caris to use Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances to run fault-tolerant workloads for up to 90 percent off compared with On-Demand prices.

To optimize performance, Caris implemented a gradual-scaling strategy. It began with batches of 100 samples, incrementally increasing to 1,000 samples running in parallel. That methodical approach proved to be effective: During the company’s initial test run, it processed over 10,000 samples in 10 hours.

As new samples come in, they are added to the dataset. “As you continue to run pipelines, it’s important to be able to compare and understand what was done previously,” says Spies. “And that was what was missing from the clinical pipeline.”

Outcome | Improving diagnostics and therapeutic insights

Caris plans to continue refining its analysis algorithms so that it can analyze additional datasets from DNA and pathology images. The RNA-sequencing pipeline serves as a test run for the company’s goal of creating a larger pipeline to process DNA data. “We wanted to kick the tires as hard as we could with this reprocessing effort on the RNA side to gain as much understanding of the scalability issues and solutions for those additional reprocessing needs,” says Spies.

The company is dedicated to improving diagnostics and therapeutic insights for cancer patients and remains focused on the big picture. “We feel an urgency to improve the diagnostics themselves to see if there might be something that patients can use to treat their cancers now, but also to develop the next set of treatments,” says Spies. “This project is accelerating our ability to improve diagnostics as well as the therapeutic insights that will go back to the patients.”

Missing alt text value
Using AWS Batch offers a tremendous amount of stability and scalability. We rely on the AWS Batch environment to process over 400,000 samples in a single run

Greg Desmarais

Senior Director of Software and Data Engineering, Caris Life Sciences

Did you find what you were looking for today?

Let us know so we can improve the quality of the content on our pages