The Human Microbiome Project and Amazon Web Services
The NIH-funded Human Microbiome Project (HMP) is a collaborative effort of over 300 scientists from more than 80 organizations to comprehensively characterize the microbial communities inhabiting the human body and elucidate their role in human health and disease. To accomplish this task, microbial community samples were isolated from a cohort of 300 healthy adult human subjects at 18 specific sites within five regions of the body (oral cavity, airways, urogenital tract, skin, and gut). Targeted sequencing of the 16S bacterial marker gene and/or whole metagenome shotgun sequencing was performed for thousands of these samples. In addition, whole genome sequences were generated for isolate strains collected from human body sites to act as reference organisms for analysis. Finally, 16S marker and whole metagenome sequencing was also done on additional samples from people suffering from several disease conditions. More information about the HMP is available at the NIH Common Fund.
This HMP data is now available on Amazon S3 at https://s3-us-west-2.amazonaws.com/human-microbiome-project. The data is publicly available to the community free of charge. Researchers need to pay only for any additional AWS resources they use for further processing or analysis of the data. Educators, researchers and students can apply for free credits to take advantage of the utility computing platform offered by AWS, along with Public Datasets such as the Human Microbiome Project data. So, if you're running a genomics workshop or have a research project that could take advantage of the hosted HMP dataset, apply for an AWS Grant!
You can access the data using AWS services such as the Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Elastic MapReduce (Amazon EMR), which provide organizations with the highly scalable compute resources needed to take advantage of these large data collections. Retrievals may be performed using HTTP requests, or the AWS software development kits in languages such as Ruby, Java, Python, .NET and PHP. Because the data is stored in an Amazon S3 bucket, users can also process it with Hadoop via Amazon EMR.
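As a rough illustration of the HTTP retrieval route, the sketch below builds a listing request against the bucket URL given above and parses the XML that Amazon S3 returns. It uses only the Python standard library. It assumes the bucket still permits anonymous listing through the S3 REST interface; the function names and the prefix in the usage note are hypothetical, chosen for illustration only.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Bucket URL as given in the text above.
BUCKET_URL = "https://s3-us-west-2.amazonaws.com/human-microbiome-project"

# XML namespace used by Amazon S3 listing responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def build_list_url(prefix="", max_keys=100):
    """Return a listing URL for keys under `prefix` (S3 'list-type=2' request)."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "max-keys": str(max_keys)}
    )
    return f"{BUCKET_URL}?{query}"


def parse_listing(xml_text):
    """Extract (key, size_in_bytes) pairs from an S3 listing response."""
    root = ET.fromstring(xml_text)
    return [
        (
            item.findtext("s3:Key", namespaces=S3_NS),
            int(item.findtext("s3:Size", namespaces=S3_NS)),
        )
        for item in root.findall("s3:Contents", S3_NS)
    ]


def list_objects(prefix="", max_keys=100):
    """Fetch one page of the listing over HTTP.

    Assumption: anonymous listing is allowed on this bucket.
    """
    with urllib.request.urlopen(build_list_url(prefix, max_keys)) as response:
        return parse_listing(response.read())


def object_url(key):
    """Return the HTTP URL for downloading a single object by key."""
    return f"{BUCKET_URL}/{urllib.parse.quote(key)}"
```

Calling list_objects("some-prefix/", max_keys=10), where "some-prefix/" stands in for a real key prefix in the bucket, would return up to ten (key, size) pairs, and object_url(key) gives the address to download any one of them. A single listing call returns only one page of results; larger listings need the continuation token that S3 includes in the response, which this sketch does not handle.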
An overview of the HMP data
The HMP generated more than 14 terabytes of data, including more than 1,128 reference microbial genomes, 9,811 16S sequence datasets, and 1,260 whole metagenome sequence datasets from healthy subjects. In addition, there are analysis results derived from these data, including metagenomic assemblies, gene catalogs, and community profiles. An analysis of a subset of these data was published last year.
Analyzing Human Microbiome Project Data
Researchers may use the Amazon EC2 utility computing service to perform analysis on the HMP data without making the capital investment required to work with data at this scale. Users may take advantage of the growing collection of tools for running bioinformatics workflows, such as Cloud Virtual Resource (CloVR), Quantitative Insights into Microbial Ecology (QIIME), Galaxy, CloudBurst and Crossbow. In the coming months, a suite of metagenomics analysis tools will be made available for public use on AWS, along with additional documentation and tutorials.
Other forms of HMP access
The HMP was supported by the NIH Common Fund, which also established an HMP Data Analysis and Coordination Center (HMP-DACC) to serve as the central repository for all HMP data. All HMP data can be accessed at the HMP-DACC site or the NCBI Sequence Read Archive.