Public Data Sets on AWS
AWS hosts a variety of public data sets that anyone can access for free.
Previously, large data sets such as the mapping of the Human Genome required hours or days to locate, download, customize, and analyze. Now, anyone can access these data sets and analyze them using Amazon EC2 instances or Amazon EMR (Hosted Hadoop) clusters. By hosting this important data where it can be quickly and easily processed with elastic computing resources, AWS hopes to enable more innovation, more quickly.
Available Public Data Sets on AWS
See the catalog for the detailed list of available data sets. Here are some examples of popular Public Data Sets:
- NASA NEX: A collection of Earth science data sets maintained by NASA, including climate change projections and satellite images of the Earth's surface
- Common Crawl Corpus: A corpus of web crawl data composed of over 5 billion web pages
- 1000 Genomes Project: A detailed map of human genetic variation
- Google Books Ngrams: A data set containing Google Books n-gram corpuses
- US Census Data: US demographic data from 1980, 1990, and 2000 US Censuses
- Freebase Data Dump: A data dump of all the current facts and assertions in the Freebase system, an open database covering millions of topics
How It Works
Each data set is hosted as an Amazon Elastic Block Store (Amazon EBS) snapshot, in an Amazon Simple Storage Service (Amazon S3) bucket, or both.
To access a data set hosted as an Amazon EBS snapshot: Sign up for an AWS account, launch an Amazon EC2 instance, and create an Amazon EBS volume from the Snapshot ID listed in the catalog above. Create the volume in the same Availability Zone as the instance, then attach it to the instance. The ElasticFox Getting Started Guide walks through launching an instance and creating an Amazon EBS volume using ElasticFox, a Firefox plug-in. Alternatively, see the Amazon EC2 Getting Started Guide.
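For readers who prefer to script the EBS steps, the following is a minimal sketch rather than an official procedure. It assumes the `ec2` argument behaves like a boto3 EC2 client (for example, `boto3.client("ec2", region_name="us-east-1")`); the function name, the default device name, and the IDs in the usage comment are hypothetical placeholders, not values from the catalog.

```python
def create_volume_from_snapshot(ec2, snapshot_id, availability_zone,
                                instance_id, device="/dev/sdf"):
    """Create an EBS volume from a snapshot and attach it to an instance.

    `ec2` is assumed to expose the boto3 EC2 client methods used below.
    Returns the ID of the new volume.
    """
    # The volume must live in the same Availability Zone as the instance.
    volume = ec2.create_volume(
        SnapshotId=snapshot_id,
        AvailabilityZone=availability_zone,
    )
    volume_id = volume["VolumeId"]

    # Block until the new volume is ready before trying to attach it.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

    ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)
    return volume_id


# Example usage (placeholder IDs, substitute the Snapshot ID from the catalog):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   create_volume_from_snapshot(ec2, "snap-0123456789abcdef0",
#                               "us-east-1a", "i-0123456789abcdef0")
```

Once the volume is attached, it still has to be mounted from inside the instance before the data can be read.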
To access a data set hosted in Amazon S3: You can make simple HTTP requests, use AWS Command Line Tools and SDKs (Ruby, Java, Python, .NET, PHP, etc.), download the data using Amazon EC2, or use Hadoop to process the data with Amazon EMR.
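As an illustration of the "simple HTTP requests" option, here is a minimal sketch using only the Python standard library. It assumes the object is publicly readable and that the bucket is reachable through the virtual-hosted-style `s3.amazonaws.com` endpoint; the helper names, bucket name, and key are hypothetical, so substitute the values listed for the data set you want.

```python
import urllib.parse
import urllib.request


def public_object_url(bucket, key):
    """Build the virtual-hosted-style HTTPS URL for an object in a public bucket."""
    # Percent-encode the key but keep "/" so the path structure is preserved.
    return "https://{}.s3.amazonaws.com/{}".format(bucket, urllib.parse.quote(key))


def fetch_first_bytes(url, num_bytes=1024, timeout=30):
    """Download only the first `num_bytes` of an object via an HTTP Range request."""
    request = urllib.request.Request(
        url, headers={"Range": "bytes=0-{}".format(num_bytes - 1)}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


# Example usage (hypothetical bucket and key):
#   url = public_object_url("example-public-bucket", "data/part 1.txt")
#   head = fetch_first_bytes(url, num_bytes=256)
```

Requesting a byte range first is a cheap way to inspect a large file's format before committing to a full download or an Amazon EMR job.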
If you have any questions or want to participate in our Public Data Sets community, please visit our Public Data Sets forum.
How to Share a Public Data Set on AWS
If you have a data set that you think would be interesting to the AWS community, please submit this form. The AWS team will review your submission and get back to you if we believe it is a good fit. You must have the right to make the data freely available, and if the data set is selected, you will need to provide a description of the data set, a description of its schema, and sample code that shows how one might analyze the data.