AWS Public Sector Blog

Taking COVID in STRIDES: The National Center for Biotechnology Information makes coronavirus genomic data available on AWS

Amazon Web Services (AWS) and the National Institutes of Health’s (NIH) National Center for Biotechnology Information (NCBI) announced the creation of the Coronavirus Genome Sequence Dataset to support COVID-19 research. The dataset is hosted by the AWS Open Data Sponsorship Program and accessible on the Registry of Open Data on AWS, giving researchers quick and easy access to coronavirus sequence data at no cost.

Centralizing coronavirus data in the cloud

The Coronavirus Genomic Sequence Dataset is a focused set of researcher-submitted next-generation sequence data (original file format) as well as SRA-processed sequences (ETL file format) hosted by NCBI at the National Library of Medicine (NLM). This dataset is part of the NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative—a collaboration between AWS and the NIH to explore the cloud as a sustainable and scalable solution for researchers’ storage and compute needs. By leveraging the STRIDES Initiative, NIH and NIH-funded institutions can begin to create a robust, interconnected community that breaks down silos related to generating, analyzing, and sharing research data. NIH-funded researchers with an active NIH award can take advantage of the STRIDES Initiative for their NIH-funded research projects.

Why every coronavirus genome matters

Coronavirus genomic sequence data is important to understanding and responding to the current pandemic and future pandemics. For example, genetic sequence differences between SARS-CoV-2 strains isolated from different individuals might shed light on how quickly the virus is evolving and what impact that evolution might have on symptom severity and disease progression (although recent studies indicate that an individual’s own genetics also play a role in how they react to a COVID-19 infection).

Comparing the viral sequences isolated from patients in different geographic regions could also help make diagnostic testing for COVID-19 more accurate. In addition, identifying genetic differences between SARS-CoV-2 and other betacoronaviruses provides insights into how SARS-CoV-2 affects host biology. For example, key differences in the SARS-CoV-2 genome likely contribute to the virus’s unique affinity for a specific cell surface receptor, ACE2, which it uses as its gateway into lung cells.

“Containing COVID-19 outbreaks and preparing for future pandemics will require a deep understanding of the SARS-CoV-2 genome in the context of other COVID-19 patients and the broader Coronaviridae family,” said Ryan Layer, assistant professor at the University of Colorado Boulder’s BioFrontiers Institute. “The NCBI Coronavirus Genome Sequence Dataset makes over a decade of viral genome data publicly accessible for researchers, empowering anyone in the research community to participate in the pandemic response.”

Explore the NCBI Coronavirus Genomic Sequence Dataset

The dataset is publicly accessible and divided into two buckets. The first bucket (s3://sra-pub-sars-cov2) contains raw and normalized files categorized by SRA accession code. Not sure what accessions you are looking for? A second bucket containing accession metadata (s3://sra-pub-sars-cov2-metadata-us-east-1) is in progress and will soon be queryable with Amazon Athena.

Below, we provide steps to access the dataset directly from Amazon Simple Storage Service (Amazon S3) using the AWS Command Line Interface (AWS CLI). If you do not have the AWS CLI set up yet, follow the installation instructions in the AWS CLI documentation.
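If you are unsure whether the AWS CLI is already available in your shell, a quick check like the one below can save some confusion. This is a minimal sketch in POSIX shell; `require_cmd` is a hypothetical helper name of ours, not part of the AWS CLI.

```shell
# Hypothetical helper (not part of the AWS CLI): succeed only if a command is on PATH.
require_cmd() {
  command -v "$1" >/dev/null 2>&1
}

if require_cmd aws; then
  aws --version   # prints the installed AWS CLI version
else
  echo "AWS CLI not found; install it before continuing." >&2
fi
```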

Once you have the AWS CLI installed, you can list bucket contents using the ls command. The --no-sign-request option lets you query the public bucket without AWS credentials.

aws s3 ls s3://sra-pub-sars-cov2 --no-sign-request
README.txt
PRE run/ #these are researcher-submitted accessions
PRE sra-src/ #these are SRA-normalized accessions

At the highest level, this bucket is organized into researcher-submitted (run/) and normalized (sra-src/) data. If you dive into the sra-src/ folder, you’ll see additional folders organized by accession code.

aws s3 ls s3://sra-pub-sars-cov2/sra-src/ --no-sign-request
PRE SRR9967741/
PRE SRR9967743/
PRE SRR9967744/
PRE SRR9968565/
PRE SRR9968569/
PRE SRR9971528/
PRE SRR9972576/
PRE SRR9982828/
.
.
.
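Because the listing is long, you may prefer a plain list of accession codes that you can search or loop over. The sketch below assumes the listing format shown above, where each folder appears on a line beginning with PRE; `extract_accessions` is a hypothetical helper name, and lines that describe ordinary objects are skipped.

```shell
# Hypothetical helper: read `aws s3 ls` output on stdin and print bare
# accession codes, one per line (e.g. "PRE SRR9967741/" becomes "SRR9967741").
extract_accessions() {
  # Keep only folder (PRE) lines, then drop the trailing slash from the name.
  awk '$1 == "PRE" { sub(/\/$/, "", $2); print $2 }'
}

# Usage (needs network access and the AWS CLI):
# aws s3 ls s3://sra-pub-sars-cov2/sra-src/ --no-sign-request | extract_accessions > accessions.txt
```

Writing the result to a file (accessions.txt here is just an example name) gives you something you can grep for a specific accession without listing the bucket again.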

Listing the contents of each accession folder reveals the raw data available.

aws s3 ls s3://sra-pub-sars-cov2/sra-src/SRR9967744/ --human-readable --no-sign-request
2020-05-29 15:19:13   20.3 MiB cs062.R1.fastq.gz
2020-05-29 15:19:10   21.6 MiB cs062.R2.fastq.gz
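To work with these files locally, you can copy an accession folder with aws s3 cp. The following is a minimal sketch rather than an official workflow: `accession_uri` and `fetch_accession` are hypothetical helper names, the check that an accession looks like SRR, ERR, or DRR followed by digits is our own assumption about the codes you will pass in, and the bucket path is the one from the listings above.

```shell
# Bucket and prefix are taken from the listings above.
SRA_SRC_PREFIX="s3://sra-pub-sars-cov2/sra-src"

# Hypothetical helper: print the S3 URI for an accession folder, or fail on
# anything that does not look like an SRA run accession (SRR/ERR/DRR + digits).
accession_uri() {
  if printf '%s\n' "$1" | grep -Eq '^[SED]RR[0-9]+$'; then
    printf '%s/%s/\n' "$SRA_SRC_PREFIX" "$1"
  else
    echo "not an SRA run accession: $1" >&2
    return 1
  fi
}

# Hypothetical helper: copy every file for one accession into ./<accession>/.
fetch_accession() {
  uri="$(accession_uri "$1")" || return 1
  aws s3 cp "$uri" "./$1/" --recursive --no-sign-request
}

# Usage (needs network access and the AWS CLI):
# fetch_accession SRR9967744
```

Adding --dryrun to the aws s3 cp call prints the planned copy operations without transferring any data, which is a useful check before downloading larger accessions.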

Learn more about how AWS is supporting research

Once you are comfortable navigating this dataset, it’s time to dive deeper into the science. Visit Genomics on AWS for secondary and tertiary genomic analysis solutions.

Visit the AWS Diagnostic Development Initiative and the COVID-19 HPC Consortium for COVID-19 research support.

For more information on how AWS helps solve complex research workloads and enables scientific research, see the AWS Research and Technical Computing webpage.

Request more information about the NIH STRIDES Initiative.

Erin Chu

Erin Chu is the life sciences lead on the Amazon Web Services (AWS) open data team. Trained to bridge the gap between the clinic and the lab, Erin is a veterinarian and a molecular geneticist, and spent the last four years in the companion animal genomics space. She is dedicated to helping speed time to science through interdisciplinary collaboration, communication, and learning.

Ankit Malhotra

Ankit Malhotra is the biomedical research lead on the Amazon Web Services (AWS) research team. At AWS, Ankit helps lower the barrier for biomedical researchers to build solutions and do their research using cloud computing. With cross-training in computer science, molecular biology, and genetics, he has over 10 years of experience as an NIH-funded computational genomic scientist.