Behind the Nature Publication: University of British Columbia Identifies 130,000 New Viruses in 11 Days

2022

Biotechnology data is expanding at a rate exceeding Moore’s law, and experts are increasingly grappling with how to share it effectively to accelerate scientific breakthroughs. Genomic data sharing leads to more accurate and deeper research insights, but its massive volume presents logistical challenges in security, storage capacity, and global accessibility. Amazon Web Services (AWS) can help address these challenges.

To facilitate this move towards international genomic data sharing for research purposes, the National Center for Biotechnology Information (NCBI) mirrored the Sequence Read Archive (SRA) into AWS in February 2020 using Amazon Simple Storage Service (S3), an object storage service. The SRA is the world’s largest repository of high-throughput genetic sequencing information. It contains more than 50 petabases (5x10¹⁶ DNA letters) of raw data from thousands of species from all corners of the Earth, ranging from Antarctic penguins to peat bogs in British Columbia.

Artem Babaian, Ph.D., a researcher at the University of British Columbia, decided to take advantage of this open-access data to understand how the COVID-19 pandemic emerged. While billions of dollars have been invested in understanding the genome of SARS-CoV-2 – the coronavirus that causes COVID-19 – the scientific community still has a lot to learn about coronaviruses in general, such as their evolutionary history and how these viruses can undergo genetic recombination between different virus species.

The resulting research, published in the scientific journal Nature, should help doctors make connections faster when treating sick patients, improve diagnostic testing and vaccine development, and help policymakers direct research and monitoring more effectively.



“Using AWS, Serratus can process over one million libraries of next-generation sequencing data per day for an overall cost of less than half a cent per library.”

Artem Babaian, Ph.D.
Serratus Project Lead, UBC

Babaian realized the SRA likely contained RNA sequences from many different species of coronavirus worldwide, including some that had never been characterized. Insights from these sequences could help the global research community combat the COVID-19 pandemic and form the basis of a surveillance network to mitigate or prevent future pandemics. To unlock these insights, he founded the Serratus project, which began as a hackathon idea and soon became an official research project housed at the University of British Columbia (UBC) Cloud Innovation Centre.

Babaian was soon joined by leading scientists from around the world, who could all collaborate on the same datasets using AWS tools. They performed a comprehensive search of ~20 petabytes of S3-mirrored SRA data using Amazon Elastic Compute Cloud (Amazon EC2), a web service that provides secure, resizable compute capacity in the cloud. Given the massive efficiency gains achieved on AWS compared to on-premises systems, they could search not only for novel coronaviruses but for all RNA viruses.

“In just 11 days, we were able to identify 130,000 novel RNA viruses on AWS – that’s 10 times more than were identified over the past century of virology research,” said Babaian. “This was only possible because we could combine the open access to the SRA on AWS’ Registry of Open Data with the scalability and cost savings of the AWS Cloud.”

Optimize for Cost Efficiency

To conserve research dollars and maximize accessibility, the research team optimized the Serratus architecture for CPU efficiency and cost savings. Each analysis uses separate clusters for downloading, aligning, and merging data, and each genomic library is broken into smaller components that separate nodes process in parallel. Because losing a node costs only the small component it was working on, the analyses are fault-tolerant, which lets the researchers use Amazon EC2 Spot Instances rather than relying primarily on On-Demand Instances. Spot Instances are available at up to a 90% discount compared to On-Demand prices.
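The split, align, and merge pattern described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the Serratus implementation: every function name, the substring matching that stands in for sequence alignment, and the SpotInterruption exception are hypothetical, and threads stand in for separate cluster nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


class SpotInterruption(Exception):
    """Hypothetical stand-in for a worker node being reclaimed mid-task."""


def split_library(reads, chunk_size):
    """Break one sequencing library (a list of reads) into fixed-size chunks."""
    return [reads[i:i + chunk_size] for i in range(0, len(reads), chunk_size)]


def align_chunk(chunk, references):
    """Toy 'alignment': count the reads that contain each reference string."""
    hits = Counter()
    for read in chunk:
        for name, sequence in references.items():
            if sequence in read:
                hits[name] += 1
    return hits


def run_with_retry(task, chunk, references, max_attempts=3):
    """Re-run a chunk if its worker is interrupted; only that chunk is redone."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(chunk, references)
        except SpotInterruption:
            if attempt == max_attempts:
                raise


def process_library(reads, references, chunk_size=2, workers=4, align=align_chunk):
    """Split a library, align the chunks in parallel, then merge the counts."""
    chunks = split_library(reads, chunk_size)
    merged = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda c: run_with_retry(align, c, references), chunks)
        for partial in partials:
            merged.update(partial)  # Counter.update adds counts together
    return merged


if __name__ == "__main__":
    reads = ["AAGT", "CCGT", "AAGG", "TTTT", "CCAA"]
    references = {"virus_a": "AA", "virus_b": "CC"}
    print(process_library(reads, references))
```

Because each chunk is small and independent, an interrupted worker forces a retry of one chunk rather than the whole library, which is the property that makes interruptible capacity a reasonable fit for this kind of workload.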

“Using AWS, Serratus can process in excess of one million libraries of next-generation genomic sequence data per day for an overall cost of less than half a cent per library,” says Babaian.
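As a rough check on what these figures imply, the arithmetic can be written out directly. The sketch below uses only the two numbers in the quote and the "up to 90%" Spot discount mentioned earlier. The function names and the example On-Demand price of 1.00 USD per hour are illustrative assumptions, not actual AWS pricing or Serratus billing data.

```python
# Figures from the article; both are stated as bounds, not measurements.
LIBRARIES_PER_DAY = 1_000_000      # "in excess of one million libraries ... per day"
MAX_COST_PER_LIBRARY_USD = 0.005   # "less than half a cent per library"
MAX_SPOT_DISCOUNT = 0.90           # "up to a 90% discount"


def daily_cost_at(libraries_per_day, cost_per_library_usd):
    """Daily spend at a given throughput and per-library cost."""
    return libraries_per_day * cost_per_library_usd


def discounted_price(on_demand_price, discount):
    """Price after a fractional discount (0.90 means 90% off)."""
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be between 0 and 1")
    return on_demand_price * (1.0 - discount)


if __name__ == "__main__":
    ceiling = daily_cost_at(LIBRARIES_PER_DAY, MAX_COST_PER_LIBRARY_USD)
    print(f"At one million libraries per day: under about {ceiling:,.0f} USD per day")

    # Hypothetical On-Demand price, used only to show the effect of the discount.
    example_on_demand = 1.00
    best_case = discounted_price(example_on_demand, MAX_SPOT_DISCOUNT)
    print(f"Best-case Spot price for a {example_on_demand:.2f} USD/hour instance: "
          f"about {best_case:.2f} USD/hour")
```

At exactly one million libraries per day, a per-library cost below half a cent works out to less than roughly 5,000 USD per day.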

In total, UBC’s collaboration with AWS enables the Serratus project to dynamically access over 22,250 CPUs at once with an overall CPU efficiency of 75 percent. Processing library components in parallel with this extreme efficiency, the researchers performed over 2,000 years of computing in just a couple of weeks. The Serratus project is an example of research architecture designed from the ground up to take advantage of AWS tools in ways that accelerate scientific discoveries while keeping costs low.

New Discoveries with Publicly Available Data on AWS

The Serratus project has continued analyzing genomic data and is preparing to move into phase II, where the researchers will keep building and refining the phylogenetic tree of coronaviruses and corona-like viruses to expand the field of virology through computational biology techniques.

“Having the SRA mirrored onto S3 natively allows for an unprecedented level of access to data that can be exploited with a cloud computing cluster,” says Babaian. “The internal networking on AWS is quite impressive and enables us to do things that would simply be impossible on a conventional cluster.”

All Serratus project data is freely available on the Registry of Open Data on AWS. Anyone who accesses these datasets can perform sophisticated analyses as if the information were sitting on an on-premises hard drive.

“The SRA, which itself is growing exponentially, will be the most important database in biology in the next few years,” predicts Babaian. “People don’t yet realize the planetary-scale consequences of this potential because the questions we're going to be tackling in four or five years are going to be so fundamentally advanced relative to what we even imagine is possible today.”


About the University of British Columbia

The University of British Columbia’s Cloud Innovation Centre is leading the Serratus project, a collaborative open science project for ultra-rapid discovery of known and unknown coronaviruses through re-analysis of publicly available genomic data.

Benefits of AWS

● Enabled free cloud access to over 50 petabases of genomic data
● Performed 2,000 years of computing in two weeks
● Analyzed over one million genomic libraries per day at a cost of less than half a cent per library
● Expanded number of known RNA viruses by 130,000

AWS Services Used

Amazon EC2

Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides secure, resizable compute capacity in the cloud. It is designed to make web-scale cloud computing easier for developers.

Amazon S3

Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance.

Amazon EC2 Spot Instances

Amazon EC2 Spot Instances let you take advantage of unused EC2 capacity in the AWS cloud. Spot Instances are available at up to a 90% discount compared to On-Demand prices.

Registry of Open Data on AWS

The Registry of Open Data on AWS lists publicly available datasets that can be accessed from AWS resources.
