AWS for Industries
Broad Institute gnomAD data now accessible on the Registry of Open Data on AWS
Co-authored by Grace Tiao, Associate Director of Computational Genomics at the Broad Institute, and Erin Chu, DVM, Ph.D., Life Sciences Lead, AWS Open Data Program
Today we announce that data from the Genome Aggregation Database (gnomAD) is available for the first time on Amazon Web Services (AWS) as part of the Registry of Open Data on AWS. gnomAD is the world’s largest public collection of human genetic variation and a near-ubiquitous resource for basic research and clinical variant interpretation. It is used in virtually all clinical genetic diagnostic pipelines worldwide, with over 20 million page views of the Broad Institute website to date.
The entire trove of gnomAD data, including data stretching back to the earliest release, is now accessible to AWS users at no cost via the AWS Open Data Sponsorship Program. AWS users will no longer need to pay transfer fees or long-term storage costs to access gnomAD data, or to maintain a personal copy of gnomAD data. By democratizing access to gnomAD data through this collaboration, the Broad Institute hopes to accelerate breakthrough genomic discoveries that enhance the scientific community’s understanding of human genetics and result in solutions that improve the lives of people all over the world.
The mission of the AWS Open Data Sponsorship Program aligns closely with the Broad’s commitment to make genomics tools available to the world. As the industry anticipates further exponential growth of human genomic datasets over the next few years, the Broad and AWS believe that the computational genomics community can benefit from free access to shared datasets. By reducing unnecessary duplication of terabyte- and petabyte-scale genomic datasets, we as a community save scarce environmental, capital, and human resources that would otherwise be spent maintaining many copies across separate institutions. With this collaboration the Broad Institute hopes to provide an avenue for more individuals and organizations to participate in creative research in human genomics, with potential downstream benefits to us all.
What’s included
- All official gnomAD release data, comprising summary statistics and annotations for over 241 million unique short human genetic variants and 335,000 structural variants observed in over 141,000 healthy adult individuals across a diverse range of genetic ancestry groups
- Standard “truth” sets used to assess and validate variant calls
- Interval lists and other resources used in the creation of gnomAD releases
- Data from the Broad’s latest collection of papers in Nature
How to access it
To browse the bucket, download the AWS Command Line Interface and type:
If you don’t yet have an AWS account set up, you’ll need to add a --no-sign-request flag before s3 to browse the bucket.
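As a minimal sketch of the two access modes described above, the Python snippet below assembles the AWS CLI listing command with and without the --no-sign-request flag. The bucket name gnomad-public-us-east-1 is an assumption taken from the gnomAD entry on the Registry of Open Data on AWS, not from this post, and build_ls_command and list_bucket are hypothetical helper names rather than part of any AWS library.

```python
import shutil
import subprocess

# Assumption: bucket URI as listed for gnomAD on the Registry of Open Data on AWS.
# Check the registry entry before relying on it.
GNOMAD_BUCKET_URI = "s3://gnomad-public-us-east-1/"


def build_ls_command(bucket_uri=GNOMAD_BUCKET_URI, signed=True):
    """Return the AWS CLI argument list that lists the top level of a bucket."""
    command = ["aws"]
    if not signed:
        # Without AWS credentials, skip request signing.
        # The flag is placed before `s3`, as the text above describes.
        command.append("--no-sign-request")
    command += ["s3", "ls", bucket_uri]
    return command


def list_bucket(signed=True):
    """Run the listing and return its output (needs the AWS CLI and network access)."""
    if shutil.which("aws") is None:
        raise RuntimeError("AWS CLI not found on PATH; install it first")
    result = subprocess.run(
        build_ls_command(signed=signed),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Print the anonymous-access form of the command without running it.
    print(" ".join(build_ls_command(signed=False)))
```

Calling list_bucket(signed=False) would run the anonymous form and return the listing as text. Building the argument list separately from running it keeps the command easy to inspect before anything touches the network.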
For a tutorial on using Hail to run computational pipelines on gnomAD data, see the Hail on AWS Quick Start.
Better together
Researchers are applying gnomAD data to diverse ends, such as:
- Annotating their own genomic sequencing data with gnomAD’s high-quality allele frequency data;
- Using gnomAD’s germline variant calls to separate driver and passenger mutations in cancer samples;
- Building statistical models to help determine the disease-causing potential of genetic variants;
- Comparing mutational tolerance of human genes to their orthologs in other species.
These examples are just the beginning of an innovative and wide-ranging catalogue of applications that AWS users are empowered to develop through this collaboration. Additionally, gnomAD joins a deep library of other human genomic and genomic-adjacent datasets on the Registry of Open Data on AWS, including:
- 1000 Genomes
- UK Biobank Pan-Ancestry Summary Statistics
- Encyclopedia of DNA Elements (ENCODE)
- Therapeutically Applicable Research to Generate Effective Treatments (TARGET)
- The Cancer Genome Atlas
- Gabriella Miller Kids First Pediatric Research Program
- iHART Whole Genome Sequencing Data Set
- Cancer Cell Line Encyclopedia
- Broad Genome References
- Genome in a Bottle
- Human PanGenomics Project
- Genome Ark
- International Cancer Genome Consortium PCAWG Study
- Cancer Genome Characterization Initiatives
- Open Targets
The Broad Institute and AWS are looking forward to seeing what new and interesting questions the global genomics community will be able to answer by bringing these, and other, datasets together in the AWS Cloud. Contact the team to let us know about your insights and breakthroughs using gnomAD data, and if the Registry of Open Data helps you along the way.
—
Grace Tiao, Associate Director, Computational Genomics – The Broad Institute
Grace leads a team of computational biologists developing efficient methods and pipelines to produce and analyze large-scale sequencing callsets, including the Genome Aggregation Database (gnomAD), the UK Biobank, and rare disease cohorts generated by the Broad’s Center for Mendelian Genomics. She is the product owner for gnomAD and directs the production of gnomAD releases. Grace studied statistics and mathematics at the University of Oxford and worked for several years in the Cancer Genome Analysis group at the Broad.