Preventing the next pandemic: How researchers analyze millions of genomic datasets with AWS
How do we avoid the next global pandemic? For researchers collaborating with the University of British Columbia Cloud Innovation Center (UBC CIC), the answer to that question lies in a massive library of genetic sequencing data called the Sequence Read Archive (SRA).
Via the SRA, researchers have access to millions of gigabytes of genetic sequencing data, including the DNA and RNA of hundreds of thousands of unknown viruses. But there is a problem: the data library is so massive that traditional computing can’t comprehensively analyze or process it. Driven by the urgent need to prevent another global pandemic, the team at the UBC CIC collaborated with computational virologists to create Serratus, an open-science viral discovery platform that aims to transform the field of genomics. Serratus is built on the massive computational power of the Amazon Web Services (AWS) Cloud.
Discovering the next novel coronavirus with big data
In the months after the pandemic began, scientists realized that if genomics researchers had seen COVID-19 coming, the world might be a fundamentally different place today. In response, the UBC CIC launched the Open Virome project, a collaborative global initiative that seeks to avoid future pandemics by identifying hundreds of thousands of previously undiscovered viruses. Computational biologist Artem Babaian, who heads the Open Virome project, believes that the key to preventing the next pandemic is knowledge—and you can’t acquire that knowledge without the ability to compute big data. “The amount of genomics data is growing exponentially every day,” Babaian says. “But our data is rapidly outpacing our processing power. Basically, we have all the information we need, but we don’t have the tools to use it.”
With that goal in mind, the researchers in the Open Virome project developed Serratus, an AWS Cloud-based tool that rapidly processes existing DNA and RNA sequencing data from the SRA. With Serratus, the researchers believe they can both identify potentially harmful new viruses and alert scientists to potential mutations in SARS-CoV-2 that could nullify herd immunity. “If SARS-CoV-2 infects a deer,” Babaian explains, “that deer and the SARS-CoV-2 virus swap spike proteins. That exchange creates a new hybrid virus which can reinfect humans. This virus wouldn’t be a variant of Covid, it would be entirely new—and potentially very dangerous.” The goal, Babaian says, is to use Serratus to discover these mutations in advance, so doctors can stop the new virus variant in its tracks.
The power of bare-bones architecture in AWS
The biggest problem for the SRA is that the dataset is so massive, and growing every single day, that it’s almost impossible to systematically analyze. That’s where Serratus comes in. Babaian determined that, by drawing on the elasticity of the AWS Cloud, the team could process millions of gigabytes of data both quickly and cost-efficiently. The key to their success was keeping the cloud infrastructure as simple as possible.
“Essentially, we built the solution using foundational AWS components, which isn’t typical,” says Babaian. “Usually, people go for fancy instances. But we went for the smallest component we could reasonably work with per instance, because our goal was to process as much data as we could, as quickly as we could, for as little money as possible.”
To build Serratus, the team mirrored the SRA database in Amazon Simple Storage Service (Amazon S3), and then used Amazon Elastic Compute Cloud (Amazon EC2) instances to analyze the dataset. To make sure their findings were reliable, they leveraged parallel processing of very small amounts of data. This approach supported both accuracy and scalability as they increased the amount of data processed per minute. The team then worked to optimize the Amazon EC2 instances so they were as cost-effective as possible. Babaian aimed to pay less than one cent to process each sequencing dataset. By the time they were finished, he had surpassed that goal, with the team paying less than half a cent per dataset while processing one million sequencing datasets per day.
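The idea of processing very small amounts of data in parallel can be illustrated with a minimal Python sketch. This is not the Serratus code: the function names, the chunk size, the worker count, and the motif-counting stand-in for the real analysis are all assumptions made for illustration, and a local thread pool stands in for a fleet of small Amazon EC2 instances.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, List


def split_into_chunks(reads: List[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most chunk_size reads."""
    for start in range(0, len(reads), chunk_size):
        yield reads[start:start + chunk_size]


def count_matches(chunk: List[str], motif: str = "ACGT") -> int:
    """Hypothetical stand-in for the real analysis: count reads containing a motif."""
    return sum(1 for read in chunk if motif in read)


def process_dataset(reads: List[str], chunk_size: int = 2, workers: int = 4) -> int:
    """Fan small chunks out to a worker pool and combine the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = pool.map(count_matches, split_into_chunks(reads, chunk_size))
        return sum(partial_counts)


if __name__ == "__main__":
    sample_reads = ["TTACGTAA", "GGGGCCCC", "ACGTACGT", "TTTTTTTT", "CCACGTCC"]
    print(process_dataset(sample_reads))
```

Because each chunk is analyzed independently, adding workers raises throughput without changing the combined result, which is the property that lets this style of pipeline scale out on many small machines.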
Once the solution was optimized and ready for action, the team put it to the test. In only 11 days, Serratus processed a staggering 5.7 million sequencing datasets for only $24,000. From that data, the team discovered 130,000 new RNA viruses. Compared with traditional processes, the results are astounding. Previously, scientists had discovered only 15,000 viruses after decades of data analysis, and it was common to spend hundreds of millions of dollars on studies to find a few thousand new viruses. Using bare-bones AWS architecture, the Open Virome team saves the scientific community millions of dollars and years of time in discovering new viruses.
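The cost figures can be checked with simple arithmetic. The short Python sketch below uses only the numbers quoted in this article (5.7 million datasets, $24,000, 11 days); the function and constant names are illustrative. Note that it computes the average rate over the whole run, which is a different quantity from the one-million-per-day figure mentioned earlier and is not derived from it.

```python
DATASETS_PROCESSED = 5_700_000   # sequencing datasets, as quoted in the article
TOTAL_COST_USD = 24_000          # total spend, as quoted in the article
RUN_DAYS = 11                    # duration of the run, as quoted in the article


def cost_per_dataset_cents(total_cost_usd: float, datasets: int) -> float:
    """Average cost of processing one dataset, in US cents."""
    return total_cost_usd / datasets * 100


def average_datasets_per_day(datasets: int, days: int) -> float:
    """Average number of datasets processed per day over the whole run."""
    return datasets / days


if __name__ == "__main__":
    cents = cost_per_dataset_cents(TOTAL_COST_USD, DATASETS_PROCESSED)
    per_day = average_datasets_per_day(DATASETS_PROCESSED, RUN_DAYS)
    print(f"Average cost per dataset: {cents:.2f} cents")
    print(f"Average datasets per day: {per_day:,.0f}")
```

The average works out to roughly 0.42 cents per dataset, consistent with the stated result of less than half a cent per dataset.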
Preventing pandemics with ultra-optimized data processing
With the tools in place to rapidly process and analyze sequencing data, the Open Virome project is now turning its attention toward real-time pandemic prevention. “We are now looking into automating annotation of the datasets, so we can give meaning to these unknown viruses,” Babaian says. “Our goal is to create a quick analysis tool that can link a patient with an unknown virus to its epidemiology using SRA data. We want the epidemiology to write itself.”
Ultimately, however, the Open Virome is about more than just preventing pandemics. “These databases are becoming a historic record of our biodiversity across the planet,” Babaian notes. “We are trying to capture the whole arc of genetic history—and that research potential is enormous.” And this information can help scientists across the world. All Open Virome data is immediately made available at serratus.io and AWS Open Data repositories. The AWS open-source tooling means that any organization can take advantage of the Open Virome dataset. “This is truly a community project,” Babaian says. “And community is the key to our success.”
Read more about the Open Virome project and Serratus’ results in this paper published in Nature, co-authored by Babaian.
To learn more about how AWS is supporting public sector projects that are transforming the world through AWS Cloud Innovation Centers, visit the AWS Cloud Innovation Center hub.