NIH’s Sequence Read Archive, the world’s largest genome sequence repository: Openly accessible on AWS

AWS and the National Library of Medicine’s (NLM) National Center for Biotechnology Information (NCBI) are happy to announce that the Sequence Read Archive (SRA), one of the world’s largest repositories of raw next-generation sequencing data, will be freely accessible from Amazon S3 via the Open Data Sponsorship Program (ODP). The SRA is currently hosted by NLM at the National Institutes of Health (NIH). As we publish this blog, the transition to the ODP is under way.

What is the SRA?

Established in 2009 as part of the International Nucleotide Sequence Database Collaboration (INSDC), the SRA is the NIH’s primary repository for raw next-generation sequencing data. Currently, the SRA hosts over 36 petabytes of controlled- and public-access sequence data dating back to 2007, drawn from over 9 million experiments. It is commonly the first stop for scientists looking to validate a research discovery, expand their effective sample population, or test out a new pipeline. In fact, the SRA website saw 1.2 million visitors in 2019 alone. These visitors reflect the changing landscape of genomics; the SRA working group reported that 20% of IP addresses come from cloud-based virtual machines such as those available with Amazon EC2.

With the power of cloud compute, bioinformaticists now have the capacity to analyze the SRA at a comprehensive scale. For example, Serratus, an open science project for rapid discovery of novel and existing coronaviruses, used AWS Batch to call and align over 4 million SRA accessions in parallel for coronavirus sequence.

At its current rate of growth, the SRA is expected to double in size every 12-18 months, presenting new challenges for efficient storage and accessibility. To that end, NIH released a Request for Information (RFI) inviting the biomedical research community to provide input on next steps for the future of the SRA. A major theme of this RFI is reducing the data footprint of the SRA by eliminating base quality scores (BQS). Results from this RFI are expected to be published in early 2021. In the meantime, steps have been taken to make the SRA even easier for researchers everywhere to access and use.

The new normal

Moving the SRA to the Open Data Sponsorship Program (ODP) provides an avenue to retain BQS while reducing the complexity of locating and retrieving SRA data. AWS users will be able to use Amazon Athena to query the publicly accessible SRA metadata bucket s3://sra-pub-sars-cov2-metadata-us-east-1 for accessions of interest, or directly interrogate the SRA bucket for a specific SRA submission or set of submissions, and then pull that data directly into a cloud-based genomics workflow.
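As a rough illustration of the Athena step described above, here is a minimal Python sketch. It assumes the metadata has already been registered as an Athena table; the database name, the table name `metadata`, and the column names `acc`, `organism`, and `platform` are assumptions made for illustration, not details confirmed in this post. The `submit_query` helper uses the boto3 Athena client and needs AWS credentials plus an S3 location you own for query results.

```python
def build_accession_query(organism, table="metadata", limit=10):
    """Build a SQL string that selects accessions for one organism.

    The table and column names are assumptions for illustration only.
    """
    if not isinstance(limit, int) or limit < 1:
        raise ValueError("limit must be a positive integer")
    # Double any single quotes so the value stays inside the SQL literal.
    safe_organism = organism.replace("'", "''")
    return (
        f"SELECT acc, organism, platform FROM {table} "
        f"WHERE organism = '{safe_organism}' LIMIT {limit}"
    )


def submit_query(sql, database, output_location, region="us-east-1"):
    """Submit the query to Athena and return its execution ID.

    Requires boto3 and AWS credentials; output_location is an S3 URI
    in your own account where Athena writes the result files.
    """
    import boto3  # imported lazily so the query builder works without it

    client = boto3.client("athena", region_name=region)
    response = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


if __name__ == "__main__":
    print(build_accession_query("Severe acute respiratory syndrome coronavirus 2", limit=5))
```

The accessions returned by such a query could then be passed to a downstream workflow. The exact schema should be checked against NCBI's documentation before relying on any column name used here.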

Direct access to SRA data as S3 objects will enable more scalable and cloud native tooling for processing and analyzing genomics datasets. Rather than copying terabytes or even petabytes of data into their own environments, researchers can use SRA in S3 as a single source of truth for their analysis, subsequently making workflows more reproducible and amenable to global research collaborations.
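To make the idea of reading SRA data in place more concrete, the following sketch lists object keys in a public bucket through the S3 REST `ListObjectsV2` call, using only the Python standard library. It assumes the bucket permits anonymous listing and lives in `us-east-1`. The bucket name in the example is the metadata bucket mentioned above, and no particular key layout is assumed.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by S3 list responses.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def list_url(bucket, prefix="", max_keys=10, region="us-east-1"):
    """Build a virtual-hosted-style ListObjectsV2 URL for a bucket."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "max-keys": str(max_keys)}
    )
    return f"https://{bucket}.s3.{region}.amazonaws.com/?{query}"


def parse_keys(xml_payload):
    """Extract object keys from a ListObjectsV2 XML response."""
    root = ET.fromstring(xml_payload)
    return [element.text for element in root.iter(f"{S3_NS}Key")]


def list_keys(bucket, prefix="", max_keys=10):
    """Fetch one page of keys anonymously (assumes public listing is allowed)."""
    with urllib.request.urlopen(list_url(bucket, prefix, max_keys), timeout=30) as resp:
        return parse_keys(resp.read())


if __name__ == "__main__":
    for key in list_keys("sra-pub-sars-cov2-metadata-us-east-1"):
        print(key)
```

Because the objects are read where they sit, a workflow can reference the same keys on every run instead of keeping private copies, which is what makes the single-source-of-truth approach practical.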

“Having a single location of sequencing data with complete Base Quality Scores (BQS) is essential for continued development of new and novel methods for genomic analysis,” says Benedict Paten, Associate Professor and Associate Director of the UC Santa Cruz Genomics Institute. “We are very interested in the continued support of SRA, and I am glad that the data with full BQS would be available for use by the research community through the AWS Open Data Program.”

The NLM also maintains two additional S3 buckets hosted by the Open Data Program (allocated under the NIH STRIDES agreement) that will be used specifically for raw data for newer sequencing technologies such as Pacific Biosciences, Oxford Nanopore, and 10X Genomics, and newer submissions as space allows.

The SRA will join a growing list of key biomedical and genomics datasets, such as TCGA, ICGC, Gabriella Miller Kids First, ENCODE, gnomAD, Human Microbiome Project, and Human PanGenomics Project, that have been provided to the research community free of charge on the AWS platform. Currently, 61 biomedical and genomics datasets are listed in the Registry of Open Data, comprising over 9 PB of data.

Get ready for SRA data on AWS

While the SRA is just beginning its transition to the ODP, NLM has already made 250 TB of coronavirus genome sequence data available on AWS ODP; the overall structure of this bucket will echo that of the larger SRA dataset. To get started with this data on AWS, see our blog post.

The NLM has also recently held a webinar that demonstrates how to query SRA metadata with Amazon Athena using the same coronavirus genome sequence dataset. See NCBI Minute: SRA in AWS Athena for SARS-CoV-2 Research and More.

Finally, in collaboration with the AWS Educate Research Seminar Series, we will be digging deeper into the open data ecosystem on AWS, showcasing the SRA, this Thursday, Jan 28th at 9 a.m. PT / 12 p.m. ET. Register for the seminar here. And if you can’t make it, not to worry: this and all other research seminars are available on demand here.

We look forward to continuing our collaboration with NIH and NLM to increase the accessibility and utility of this invaluable resource. Stay tuned for additional resources and releases in the coming months.

Learn more about how AWS is supporting research

Ready to start incorporating SRA data into your cloud workflows? Explore AWS genomics solutions.

For more information on how AWS helps solve complex research workloads and enables scientific research, see the AWS Research and Technical Computing webpage.

Read more information about the NIH STRIDES Initiative.

Learn more about the AWS Open Data Sponsorship Program (ODP)

The AWS Open Data Sponsorship Program covers the cost of storage for publicly available high-value cloud-optimized datasets. We work with data providers who seek to:

  • Democratize access to data by making it available for analysis on AWS
  • Develop new cloud-native techniques, formats, and tools that lower the cost of working with data
  • Encourage the development of communities that benefit from access to shared datasets

Erin Chu, DVM, Ph.D.

Erin Chu is the life sciences lead on the Amazon Web Services (AWS) open data team. Trained to bridge the gap between the clinic and the lab, Erin is a veterinarian and a molecular geneticist, and spent the last four years in the companion animal genomics space. She is dedicated to helping speed time to science through interdisciplinary collaboration, communication, and learning.

Ankit Malhotra

Ankit Malhotra is the worldwide genomics lead on the Amazon Web Services (AWS) Public Sector healthcare team. At AWS, Ankit helps healthcare and biomedical research customers in the public sector integrate genomics into their workloads, helping them accelerate and innovate using the AWS Cloud. With cross-training in computer science, molecular biology, and genetics, he has over 10 years of experience as an NIH-funded computational genomic scientist.

Lee Pang

Lee is a Principal Bioinformatics Architect with the Health AI services team at AWS. He has a PhD in Bioengineering and over a decade of hands-on experience as a practicing research scientist and software engineer in bioinformatics, computational systems biology, and data science, developing tools ranging from high-throughput pipelines for *omics data processing to compliant software for clinical data capture and analysis.