Google Brain Genomics Sequencing Dataset for Benchmarking and Development

To facilitate benchmarking and development, the Google Brain group has sequenced nine human samples covering the Genome in a Bottle truth sets on different sequencing instruments, with different sequencing modalities (Illumina short read and Pacific Biosciences long read) and sample preparation protocols, and for both whole genome and whole exome capture. The original source of these data is [gs://google-brain-genomics-public](https://console.cloud.google.com/storage/browser/brain-genomics-public/research/sequencing;tab=objects?prefix=&forceOnObjectsSortingFiltering=false).

Features and programs

Open Data Sponsorship Program

This dataset is part of the Open Data Sponsorship Program, an AWS program that covers the cost of storage for publicly available high-value cloud-optimized datasets.

Pricing

This is a publicly available dataset. No subscription is required.

Usage information

Delivery details

AWS Data Exchange (ADX)

AWS Data Exchange is a service that helps AWS customers easily share and manage data entitlements from other organizations at scale.

Open data resources

Available with or without an AWS account.

How to use: To access these resources, reference the Amazon Resource Name (ARN) using the AWS Command Line Interface (CLI).

Description: FASTQ files for nine human samples comprising three parent-child trios at 40X, 30X, and 20X coverage for whole genome sequencing and 100X, 75X, and 50X coverage for whole exome sequencing. Four samples are also sequenced with PacBio HiFi as described in Baid et al. 2020. Note that this S3 bucket contains only FASTQ files; BAM and VCF files are available at [gs://google-brain-genomics-public](https://console.cloud.google.com/storage/browser/brain-genomics-public/research/sequencing;tab=objects?prefix=&forceOnObjectsSortingFiltering=false).
Resource type: S3 bucket
Amazon Resource Name (ARN): arn:aws:s3:::genomics-benchmark-datasets/google-brain
AWS region: us-east-1
AWS CLI access (No AWS account required): aws s3 ls --no-sign-request s3://genomics-benchmark-datasets/google-brain/
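
The ARN and the CLI command above refer to the same bucket and prefix. As a minimal sketch of how the two relate, the Python below converts the listed ARN into an `s3://` URI and assembles unsigned `aws s3` commands from it. The helper names (`arn_to_s3_uri`, `list_command`, `download_command`) and the local destination path are hypothetical and used only for illustration; the sketch assumes the standard `arn:aws:s3:::` prefix and only builds command strings, it does not run them.

```python
import shlex

# ARN as listed on this page.
ARN = "arn:aws:s3:::genomics-benchmark-datasets/google-brain"

S3_ARN_PREFIX = "arn:aws:s3:::"  # assumption: standard "aws" partition


def arn_to_s3_uri(arn: str) -> str:
    """Turn an S3 ARN (arn:aws:s3:::bucket/prefix) into an s3:// URI ending in '/'."""
    if not arn.startswith(S3_ARN_PREFIX):
        raise ValueError(f"not an S3 ARN: {arn!r}")
    resource = arn[len(S3_ARN_PREFIX):].rstrip("/")
    return f"s3://{resource}/"


def list_command(arn: str) -> list[str]:
    """Build the unsigned listing command (no AWS account needed)."""
    return ["aws", "s3", "ls", "--no-sign-request", arn_to_s3_uri(arn)]


def download_command(arn: str, dest: str, dry_run: bool = True) -> list[str]:
    """Build an unsigned recursive copy command.

    dry_run=True adds --dryrun so the CLI only reports what it would copy;
    the FASTQ files are large, so check the listing before a real download.
    """
    cmd = ["aws", "s3", "cp", "--no-sign-request", "--recursive"]
    if dry_run:
        cmd.append("--dryrun")
    return cmd + [arn_to_s3_uri(arn), dest]


if __name__ == "__main__":
    print(shlex.join(list_command(ARN)))
    # "./google-brain" is a hypothetical local destination directory.
    print(shlex.join(download_command(ARN, "./google-brain")))
```

Building the commands as argument lists keeps them safe to pass to `subprocess.run` if you later choose to execute them; `shlex.join` is used here only to print a copy-pasteable form.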

Resources

Vendor resources

View this dataset on GitHub

Support

Contact

genomics-benchmark-datasets@amazon.com

Managed By

Amazon Web Services

How to cite

Google Brain Genomics Sequencing Dataset for Benchmarking and Development was accessed on DATE from https://registry.opendata.aws/google-brain-genomics-public.

License

CC0 1.0
