Researchers at the University of California Santa Cruz Genomics Institute are in a race against time. Every day, children and adults suffering from cancer desperately wait for potentially lifesaving treatments. Many of these treatments are guided by genomic analysis, as conducted at the Genomics Institute.

For example, the Institute’s Treehouse Childhood Cancer Initiative analyzes pediatric cancer genomic data and provides analysis to clinical partners who treat young patients desperate for help. “We take molecular pathways indicated by our analysis and try to match this information with clinical trials and new drugs that could help patients,” says Isabel Bjork, director of pediatric cancer precision medicine at the Treehouse Childhood Cancer Initiative. “Because these patients are out of options, it’s critical that we react quickly.”

But reacting quickly was challenging for the Treehouse Initiative, and for the UC Santa Cruz Genomics Institute as a whole. That’s because the Institute’s Computational Genomics Lab needed better, faster technology for conducting analysis of RNA sequencing (RNA-seq) data to show the presence of particular RNA molecules in biological samples. RNA—ribonucleic acid—is one of the three major molecules essential for all known forms of life. “We wanted to analyze individual cancer samples in as much detail as possible, which involved going back and studying previously available data,” says Benedict Paten, director of the Computational Genomics Lab at the UC Santa Cruz Genomics Institute. “One of the problems was that much of the previous data had been computed by different institutions using different genomic-processing pipelines. We really wanted to look at the entire data set as a whole, but that required a level of processing capacity and performance we didn’t have.”

Although the Institute considered using its own big-data processing solution to tackle the problem, it soon realized the limitations of that approach. “It would have taken at least three months to process all the data with our in-house solution, and that’s just not fast enough,” Paten says. For instance, the Treehouse Initiative specifically needs to quickly analyze cancer data in its reference compendium. “The compendium is constantly growing, and we’re always trying to find ways to gather data faster,” Bjork says. “However, genomic sequencing and data transfer have always been the holdup. We often just have a few days to get analysis to a clinical partner.”

To meet its challenges, the Institute’s Computational Genomics Lab created Toil, a portable, open-source software solution for running scientific workflows efficiently, securely, reproducibly, and at large scale. The lab chose to run Toil in the cloud, using the Amazon Web Services (AWS) Cloud platform. “Over time, the commercial cloud has become very stable and reliable, and AWS is no exception,” says Brian O’Connor, technical director of the analysis core at the UC Santa Cruz Genomics Institute. “We were very confident that AWS offered the scalability and cost efficiency we were looking for.” Storage is one area where that scalability matters. “The scalability of AWS is incredible, compared to the SFTP server we used before, which was a major bottleneck,” says O’Connor. “Instead of the limited storage environment we used to have, we get massive storage scalability now.”

Toil runs on Amazon Elastic Compute Cloud (Amazon EC2) instances for compute, and the Institute uses Amazon Simple Storage Service (Amazon S3) to store multiple petabytes of genomic data. The lab is also taking advantage of Amazon EC2 Spot Instances, which provide the opportunity to bid on spare, often less expensive Amazon EC2 capacity. Additionally, the lab is utilizing the Amazon EC2 Container Service (Amazon ECS) to manage its Docker container environment.

To demonstrate the Toil solution, the Institute processed more than 20,000 RNA-seq samples—primarily adult samples—to create a consistent meta-analysis of five data sets, free of computational batch effects. “We created a system that matches tasks by their compute needs, and we saw a lot of efficiency by sharing and distributing those tasks across 32,000 AWS cores,” says Paten.
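The idea of matching tasks to machines by their compute needs can be illustrated with a small sketch. The Python below is not Toil's scheduler; it is a generic first-fit-decreasing packing heuristic, and the task names, core counts, and 32-core node size are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str   # hypothetical task label, not a real pipeline step
    cores: int  # cores the task asks for


@dataclass
class Node:
    total_cores: int
    tasks: list = field(default_factory=list)

    @property
    def free_cores(self):
        # Cores still unclaimed on this node.
        return self.total_cores - sum(t.cores for t in self.tasks)


def pack_tasks(tasks, cores_per_node):
    """First-fit decreasing: place the largest tasks first and open a
    new node only when no existing node has enough free cores."""
    nodes = []
    for task in sorted(tasks, key=lambda t: t.cores, reverse=True):
        if task.cores > cores_per_node:
            raise ValueError(f"{task.name} needs more cores than one node has")
        for node in nodes:
            if node.free_cores >= task.cores:
                node.tasks.append(task)
                break
        else:
            # No existing node could hold the task, so start a new one.
            nodes.append(Node(cores_per_node, [task]))
    return nodes


if __name__ == "__main__":
    # Assumed workload: two large tasks, two medium tasks, one small task.
    workload = [
        Task("align_a", 16),
        Task("align_b", 16),
        Task("quant_a", 8),
        Task("quant_b", 8),
        Task("qc_a", 2),
    ]
    for i, node in enumerate(pack_tasks(workload, cores_per_node=32)):
        names = ", ".join(t.name for t in node.tasks)
        print(f"node {i}: {names} ({node.free_cores} cores free)")
```

Packing by declared requirements keeps nodes busy instead of reserving a whole machine per task, which is one plausible source of the efficiency Paten describes. A production scheduler would also weigh memory, disk, and the risk of Spot Instance interruptions.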

The Institute was able to analyze nearly all the samples in less than four days, using the AWS cluster. “Doing this on our in-house compute cluster would have taken at least three months to complete, and we would have been contending for resources with other researchers at the Institute,” says Paten. “By collocating compute and storage, we eliminated Internet-transfer and private-server bottlenecks. That’s what made this project possible.”

By performing RNA-seq analysis on AWS, the Institute was able to get results more quickly to collaborators such as the Treehouse Childhood Cancer Initiative. “Our collaborators are asking us for the data to be processed as quickly as possible, so they can analyze cancer samples against other cancer samples in the database,” Paten says. “Using AWS, we can get the results to them in days instead of months, which could contribute to faster disease diagnoses.”

The UC Santa Cruz Genomics Institute also saw significant cost savings by using AWS for its RNA-seq analysis. For example, if the Institute had used its original RNA-seq pipeline to analyze the constructed data set, it would have cost about $800,000. By running Toil on AWS and employing algorithmic efficiencies, however, the Institute spent only $26,000 to analyze the data. “There is no way we could have afforded to run the original pipeline on a system where we paid for compute costs,” Paten says. “We would have spent around $800,000, which we obviously didn’t want to do. Using AWS to support Toil enabled us to do this analysis in a short amount of time, at the lowest possible cost, which is game-changing for our organization.”
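The figures quoted in this case study can be turned into rough ratios. The short Python sketch below does only that arithmetic. It treats "more than 20,000 samples" as exactly 20,000, "at least three months" as 90 days, and "less than four days" as 4 days, so the results are approximations, not reported measurements.

```python
# Figures quoted in the case study, rounded as stated in the lead-in.
SAMPLES = 20_000             # "more than 20,000 RNA-seq samples"
ORIGINAL_COST_USD = 800_000  # estimated cost with the original pipeline
TOIL_COST_USD = 26_000       # reported spend with Toil on AWS
IN_HOUSE_DAYS = 90           # assumption: "three months" taken as 90 days
AWS_DAYS = 4                 # "less than four days"


def summarize():
    """Return rough per-sample cost and improvement ratios."""
    return {
        "cost_per_sample_usd": TOIL_COST_USD / SAMPLES,
        "cost_reduction_factor": ORIGINAL_COST_USD / TOIL_COST_USD,
        "speedup_factor": IN_HOUSE_DAYS / AWS_DAYS,
    }


if __name__ == "__main__":
    s = summarize()
    print(f"cost per sample: ${s['cost_per_sample_usd']:.2f}")
    print(f"cost reduction:  {s['cost_reduction_factor']:.1f}x")
    print(f"speedup:         {s['speedup_factor']:.1f}x")
```

Under these assumptions the run works out to about $1.30 per sample, roughly a 31-fold cost reduction and a 22-fold speedup. Because the sample count and the in-house time are lower bounds and the AWS time is an upper bound, the real per-sample cost was probably somewhat lower and the real speedup somewhat higher.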

For the Institute’s Treehouse Childhood Cancer Initiative, using the AWS Cloud will enable more than just faster analysis; it could eventually affect patients’ lives by helping researchers deliver analyses to clinical partners sooner. Bjork cites the example of a pediatric cancer patient named Kelvin, who was waiting for a treatment that could improve his life. “Clinicians believe that Kelvin got an extra two years of life based on the analysis we provided,” says Bjork. “In fact, Kelvin’s story is the one that first made us believe that what we’re doing here can be successful.” But Bjork believes that the organization could make an even bigger difference if it had faster access to cancer data. “We see enormous potential for the future using AWS,” she says.

As its researchers work to analyze massive genomic data sets in the cloud, the UC Santa Cruz Genomics Institute will continue driving innovative research. “We can better understand how a cancer sample is reacting to a particular pathology, for example, by using Toil on AWS,” Paten says. “We also look forward to doing more genomic characterizations to understand mutational characteristics in individual patients, as well as optimizing the platform for DNA sequencing. We will be able to do all these things faster and more cost effectively by using the AWS Cloud.” 
