The Genome in a Bottle Consortium (GIAB) is a public-private-academic consortium hosted by the National Institute of Standards and Technology (NIST) to develop the technical infrastructure (reference standards, reference methods, and reference data) to enable translation of whole human genome sequencing to clinical practice. The GIAB Consortium has selected several genomes to produce and characterize as reference materials. NIST is developing NIST Reference Materials from these genomes, which are DNA extracted from a large homogenized growth of B lymphoblastoid cell lines from the Coriell Institute for Medical Research.

A mirror of the complete data set from the GIAB project is freely available on Amazon S3. Anyone can now use the data on demand without worrying about storage costs or download time.

For more information, please visit www.genomeinabottle.org. If you have any questions, please email admin@genomeinabottle.org. All data generated by GIAB for these genomes is described in a preprint at: http://biorxiv.org/content/early/2015/09/15/026468.

The latest data is publicly available in the GIAB Amazon S3 bucket in the US East (N. Virginia) region: http://giab.s3.amazonaws.com/ or s3://giab. The structure of the bucket is fully described in the file README.ftp_structure, and a manifest of all files is available in the current.tree file.

Please be aware that the prefix “ftp” should be removed from all paths within those files. For example, the following Unix commands download current.tree and write the list of all files it contains as S3 HTTP URLs:

curl -s -O http://giab.s3.amazonaws.com/current.tree
grep file current.tree | cut -f 1 | sed -e 's/^ftp//' | awk '{print "http://giab.s3.amazonaws.com" $1}' > giab_s3_urls

The “giab_s3_urls” file will now contain lines formatted as S3 URLs:

http://giab.s3.amazonaws.com/current.tree
http://giab.s3.amazonaws.com/README.ftp_structure
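
The same transformation can be sketched as a single awk filter that can be tried without downloading anything. This is a minimal sketch, not the consortium's own tooling. It assumes that current.tree is tab-separated, that the first field is a path beginning with “ftp”, and that the second field holds the entry type (“file” or “directory”). The function name tree_to_urls and the sample input are hypothetical.

```shell
#!/bin/sh
# Convert current.tree-style lines (read from stdin) into S3 HTTP URLs.
# Assumptions (not confirmed by the source): fields are tab-separated,
# field 1 is a path beginning with "ftp", and field 2 is the entry type.
tree_to_urls() {
    awk -F '\t' '$2 == "file" {
        sub(/^ftp/, "", $1)                      # drop the leading "ftp" prefix
        print "http://giab.s3.amazonaws.com" $1  # prepend the bucket endpoint
    }'
}

# Hypothetical sample input, for illustration only: one file, one directory.
printf 'ftp/current.tree\tfile\nftp/data\tdirectory\n' | tree_to_urls
```

Matching the second field exactly, rather than searching the whole line for “file”, avoids picking up directories whose names happen to contain that word. Against the real manifest the function would be used as: tree_to_urls < current.tree > giab_s3_urls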

You can access the data via simple HTTP requests, or take advantage of the AWS Command Line Interface or AWS SDKs in languages such as Ruby, Java, Python, .NET and PHP.
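
For use with the AWS Command Line Interface, the HTTP URLs can be rewritten as s3:// URIs. The following is a minimal sketch. The function name http_to_s3_uri is hypothetical, and the commented-out aws command assumes that the AWS CLI is installed and that the bucket accepts anonymous (unsigned) requests.

```shell
#!/bin/sh
# Rewrite a GIAB S3 HTTP URL as an s3:// URI for use with the AWS CLI.
http_to_s3_uri() {
    printf '%s\n' "$1" | sed -e 's|^http://giab\.s3\.amazonaws\.com/|s3://giab/|'
}

http_to_s3_uri "http://giab.s3.amazonaws.com/README.ftp_structure"

# The resulting URI could then be fetched with the AWS CLI, for example
# (not run here; --no-sign-request assumes anonymous access is permitted):
#   aws s3 cp --no-sign-request s3://giab/README.ftp_structure .
```

URLs that do not begin with the GIAB bucket endpoint pass through unchanged, so the function is safe to apply to every line of a mixed list.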

Source

The Genome in a Bottle Consortium

Category

Genomics, Life Sciences

Format

FASTQ, BAM, VCF, BED, TSV, HDF5 (PacBio and ONT)

License

There are no restrictions on the use of this data. More information on citation is available here.

Storage Service

Amazon S3

Location

s3://giab in US East (N. Virginia)

Update Frequency

New data are added as soon as they are available.

Educators, researchers and students can apply for free AWS credits to take advantage of the utility computing platform offered by AWS, along with Public Data Sets such as the Genome in a Bottle data. If you're running a genomics workshop or have a research project that could take advantage of this dataset, you can apply for AWS Cloud Credits for Research.