Public Data Sets

Public Data Sets on AWS provides a centralized repository of public data sets that can be integrated into AWS cloud-based applications. AWS hosts the public data sets at no charge to the community, and as with all AWS services, users pay only for the compute and storage they use for their own applications.

Showing 1-25 of 56 results.
A corpus of web crawl data composed of over 5 billion web pages. This data set is freely available on Amazon S3 and is released under the Common Crawl Terms of Use.
Last Modified: Mar 17, 2014 17:51 GMT
Three NASA NEX datasets are now available, including climate projections and satellite images of Earth.
Last Modified: Nov 12, 2013 13:27 GMT
The Ensembl project produces genome databases for human as well as over 50 other species, and makes this information freely available.
Last Modified: Oct 8, 2013 14:38 GMT
The Ensembl project produces genome databases for human as well as over 50 other species, and makes this information freely available.
Last Modified: Oct 8, 2013 14:37 GMT
Human Microbiome Project Data Set
Last Modified: Sep 26, 2013 17:58 GMT
The 1000 Genomes Project, initiated in 2008, is an international public-private consortium that aims to build the most detailed map of human genetic variation available.
Last Modified: Jul 18, 2012 16:34 GMT
A collection of data from the modENCODE project (http://www.modencode.org).
Last Modified: Apr 24, 2012 21:18 GMT
Multiple data sets including: (1) Population Census of Japan (1995, 2000, 2005, 2010), (2) Establishment and Enterprise Census of Japan (1999, 2001, 2004, 2006), and (3) Economic Census of Japan (2009).
Last Modified: Mar 4, 2012 3:22 AM GMT
Enron email data publicly released as part of FERC's Western Energy Markets investigation, converted to industry-standard formats by EDRM. The data set consists of 1,227,255 emails with 493,384 attachments covering 151 custodians. The email is provided in Microsoft PST, IETF MIME, and EDRM XML formats.
Last Modified: Feb 15, 2012 2:26 AM GMT
The high-coverage genome sequence of a Denisovan individual, sequenced to ~30x coverage on the Illumina platform. Together with their sister group, the Neandertals, Denisovans are the most closely related extinct relatives of currently living humans.
Last Modified: Feb 15, 2012 2:22 AM GMT
A data set containing Google Books n-gram corpora. This data set is freely available on Amazon S3 in a Hadoop-friendly file format and is licensed under a Creative Commons Attribution 3.0 Unported License. The original dataset is available from http://books.google.com/ngrams/.
Last Modified: Jan 21, 2012 2:12 AM GMT
The Sloan Digital Sky Survey is the most ambitious astronomical survey ever undertaken.
Last Modified: Jan 20, 2012 21:49 GMT
Whole Genome Shotgun Sequencing of the Cannabis sativa Cultivar "Chemdawg"
Last Modified: Aug 22, 2011 22:33 GMT
A collection of all publicly available Apache Software Foundation mail archives as of July 11, 2011
Last Modified: Aug 15, 2011 22:00 GMT
A data dump of the basic identifying facts about every topic in Freebase
Last Modified: Jun 24, 2011 18:08 GMT
A data dump of all the current facts and assertions in Freebase
Last Modified: Jun 24, 2011 18:04 GMT
This dataset contains a 150 GB sample of the data used to power trendingtopics.org. It includes a full 3 months of hourly page traffic statistics from Wikipedia (1/1/2011-3/31/2011).
Last Modified: Apr 28, 2011 0:00 AM GMT
230,000 Material Safety Data Sheets.
Last Modified: Apr 1, 2011 0:00 AM GMT
The Million Songs Collection is a collection of 28 datasets containing audio features and metadata for a million contemporary popular music tracks.
Last Modified: Feb 8, 2011 0:00 AM GMT
This is a 10,000-song subset of audio features and metadata from the Million Songs Collection, a collection of 28 datasets containing audio features and metadata for a million contemporary popular music tracks.
Last Modified: Feb 8, 2011 0:00 AM GMT
This dataset is an example of a social collaboration network based on the characters of the Marvel Universe, the fictional world in which the Marvel comic books take place.
Last Modified: Feb 3, 2011 0:00 AM GMT
The WestburyLab USENET corpus is an anonymized compilation of postings from 47,860 English-language newsgroups from 2005 to 2010.
Last Modified: Nov 17, 2010 1:17 AM GMT
Contains 16 months of hourly pageview statistics for all articles in Wikipedia
Last Modified: Sep 8, 2010 21:47 GMT
Human Liver Cohort characterizing gene expression in liver samples
Last Modified: Sep 8, 2010 20:56 GMT
C57BL/6J by C3H/HeJ mouse cross from the Jake Lusis lab at UCLA
Last Modified: Sep 8, 2010 20:53 GMT
©2014, Amazon Web Services, Inc. or its affiliates. All rights reserved.