AWS Government, Education, & Nonprofits Blog

Time to Science, Time to Results: Transforming Research in the Cloud

Scientists, developers, and many other technologists are taking advantage of AWS to perform big data analytics and meet the challenges of the increasing volume, variety, and velocity of digital information. We sat down with Angel Pizarro, a member of the Scientific Computing team at AWS, to talk about how the cloud is transforming genomic research.

Prior to joining AWS, Angel was a bioinformatics researcher (a data scientist focused on biological models and systems). In addition to his own research, Angel ran infrastructure for other researchers at a university. Back in 2006, he had an idea for an experiment, but at first glance it would take more RAM than was available on the university’s compute cluster. Upgrading the RAM would have cost more than $40,000 for just this one experiment. Instead, his team turned to Amazon Simple Storage Service (Amazon S3) and Amazon Elastic Compute Cloud (Amazon EC2), where they could access enough RAM for the short period of time required to test the idea.

It is a good thing they decided to use AWS, because the experiment didn’t work as expected, but AWS did. The moral of the story is: they tried! AWS allowed them to experiment at a reasonable cost. The ability to experiment became a great driver for change in Angel’s own research, and across the entire genomics field.

“When we calculated the compute and storage that we needed for the sequencers on campus, we found that we only had about 20% of the needed capacity. Second, even if we had unlimited funds to expand the compute and storage infrastructure, we didn’t have the real estate to put the equipment into,” Angel said. “We asked ourselves, ‘what is our real need?’ and the answer was a reactive compute resource that scales based on just-in-time data production. By moving workloads to AWS during peak times, we were able to service our researchers and not slow down the science.”

A lot has changed in the last decade within cloud computing and genomics. Sequencing instruments kept improving, producing more data at much lower price points. The combination of more sequence data at lower prices resulted in a virtuous cycle: prices fell, driving more people to use genomics in their research, which drove further price drops as economies of scale kicked in.

Reducing time to science

The types of questions researchers could ask depended largely on the amount of compute they could get their hands on. Prior to the cloud, researchers were limited to three choices when it came to compute-intensive research:

  1. If you had the money to buy big compute clusters, you often had unused infrastructure, which is a waste of money.
  2. If you had no money, you would request access on a shared cluster, often waiting in a large line for the resources to become available.
  3. Or you would just forgo the initial question and ask another question.

The cloud breaks this mold by giving you immediate, temporary access to virtually unlimited compute power, which lets you ask questions that might not otherwise have been possible. And having more computation allows you to ask even better questions about data.

Reducing time to science is something every researcher should experience. Nothing is sweeter than that first moment when you launch an HPC cluster in ten minutes. Once that light-bulb moment happens, you quickly start to realize that you can launch many clusters and perform parallel analyses of the same data set.

The second part of accelerating science is sharing results. In the cloud, everyone is able to use the same tools, language, and security that you did. More than just sharing a manuscript or a script to go along with your data, virtual infrastructure allows you to share the code that created your entire environment. If you have ever tried to install and use someone else’s badly documented code, you know what a big deal this is.
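As a rough illustration of what "sharing the code that created your environment" can look like, the sketch below builds a small environment description in Python and prints it as JSON. It assumes AWS CloudFormation as the templating tool, which this post does not name; the resource name `AnalysisNode`, the instance type, and the image ID are placeholders chosen for the example, not values from Angel's work.

```python
import json


def build_template(instance_type, image_id):
    """Return a minimal CloudFormation-style template describing one EC2 instance.

    The resource name "AnalysisNode" is a hypothetical label for this sketch.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Example analysis environment (illustrative only)",
        "Resources": {
            "AnalysisNode": {
                "Type": "AWS::EC2::Instance",
                "Properties": {
                    "InstanceType": instance_type,
                    "ImageId": image_id,
                },
            }
        },
    }


# Placeholder values: substitute a real instance type and machine image ID.
template = build_template("t2.micro", "ami-00000000")

# The JSON text is what you would share alongside your data and scripts.
template_json = json.dumps(template, indent=2, sort_keys=True)
print(template_json)
```

Because the environment is plain text, a collaborator can read it, version it, and recreate the same setup instead of reverse-engineering an installation from a manuscript.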

Another goal of this approach is to democratize science by putting petabyte-scale data sets and 10,000-core clusters within the reach of researchers at institutions that may not be able to afford to buy something for local installation. When you can temporarily use massive amounts of compute, you lower the barrier to entry for researchers.

Science shared securely

Within the scientific community, security is usually discussed in terms of data security: who has access to the data, and when. With AWS, you are able to write standard operating procedures (SOPs) and share them with other researchers. There’s also a template so you and other researchers can meet the security controls those procedures describe. That’s a powerful model: you have guidance, and you provide the steps.

Sharing findings allows researchers to rely on more data to help get to where they want to be in their own research. For example, researchers at Johns Hopkins University are developing a new algorithm on top of Amazon Elastic MapReduce (Amazon EMR) to analyze all of the RNA-Seq data in public repositories. The system actually gets cheaper the more data you give it. It reads input data directly from Amazon S3, stores its results there, and takes advantage of Amazon EC2 Spot pricing. Amazon EC2 Spot instances allow you to bid on spare Amazon EC2 computing capacity, which can significantly reduce the cost of running your applications, grow your application’s compute capacity and throughput for the same budget, and enable new types of cloud computing applications. By being able to analyze all of the public data at a reasonable cost, the Johns Hopkins researchers found new insights into how genes are spliced together, resulting in the formation of proteins and cells. They discovered evidence for over 58,000 new pieces to that enormous jigsaw puzzle, our body, all without ever having to worry about the size of the infrastructure. They just needed to ask their big research question and access pay-as-you-go infrastructure to answer it.
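To make the "more compute capacity for the same budget" point concrete, here is a small arithmetic sketch. The budget and both hourly prices are hypothetical numbers chosen for easy math; real On-Demand and Spot prices vary by instance type, region, and time, and nothing here reflects the Johns Hopkins project's actual costs.

```python
from decimal import Decimal


def instance_hours(budget, hourly_price):
    """Return the whole number of instance-hours a budget covers at an hourly price."""
    return int(budget // hourly_price)


# All three values below are hypothetical, for illustration only.
budget = Decimal("1000")            # total spend available, in dollars
on_demand_price = Decimal("1.00")   # assumed On-Demand price per instance-hour
spot_price = Decimal("0.25")        # assumed Spot price per instance-hour

on_demand_hours = instance_hours(budget, on_demand_price)
spot_hours = instance_hours(budget, spot_price)

print("On-Demand instance-hours:", on_demand_hours)
print("Spot instance-hours:", spot_hours)
print("Capacity multiplier:", spot_hours / on_demand_hours)
```

Under these assumed prices, the same budget covers four times as many instance-hours on Spot. The trade-off is that Spot capacity can be reclaimed, so it suits workloads that can tolerate interruption.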

“There is a large consortium of data, because the human body is complex. But what we know about human biology is low hanging fruit. There is so much more out there, if we can share the data we have across different groups. The hope of the cloud model is to really start understanding human biology and make strides in research that impacts the world,” Angel said.

Learn more about AWS and genomics in this post by Angel and Jessica Beegle on How the Healthcare of Tomorrow is Being Delivered Today and visit the AWS Genomics in the Cloud page.