AWS hosts a variety of public data sets that anyone can access for free.
Previously, large data sets such as the mapping of the Human Genome required hours or days to locate, download, customize, and analyze. Now, anyone can access these data sets via the AWS centralized data repository and analyze them using Amazon EC2 instances or Amazon EMR (Hosted Hadoop) clusters. By hosting this important data where it can be quickly and easily processed with elastic computing resources, AWS hopes to enable more innovation, more quickly.
See the catalog for the detailed list of available data sets. Here are some examples of popular Public Data Sets:
- Landsat on AWS: An ongoing collection of moderate-resolution satellite imagery of all land on Earth produced by the Landsat 8 satellite
- SpaceNet on AWS: A corpus of commercial satellite imagery and labeled training data to foster innovation in the development of computer vision algorithms
- IRS 990 Filings on AWS: Machine-readable data from certain electronic 990 forms filed with the IRS from 2011 to present
- NEXRAD on AWS: Real-time and archival data from the Next Generation Weather Radar (NEXRAD) network
- NASA NEX: A collection of Earth science data sets maintained by NASA, including climate change projections and satellite images of the Earth's surface
- Common Crawl Corpus: A corpus of web crawl data composed of over 5 billion web pages
- TCGA on AWS: Raw and processed genomic, transcriptomic, and epigenomic data from The Cancer Genome Atlas (TCGA) available to qualified researchers via the Cancer Genomics Cloud
- ICGC on AWS: Whole genome sequence data available to qualified researchers via The International Cancer Genome Consortium (ICGC)
- 1000 Genomes Project: A detailed map of human genetic variation
- 3000 Rice Genome on AWS: Genome sequence of 3,024 rice varieties
- Multimedia Commons: A collection of nearly 100M images and videos with audio and visual features and annotations
- Google Books Ngrams: A data set containing Google Books n-gram corpora
Each public data set is hosted in one or both of two formats: Amazon Elastic Block Store (Amazon EBS) snapshots and Amazon Simple Storage Service (Amazon S3) buckets.
To access a data set hosted as an Amazon EBS snapshot: Sign up for an AWS account, launch an Amazon EC2 instance, and create an Amazon EBS volume using the Snapshot ID listed in the catalog above. For step-by-step instructions, see the Amazon EC2 Getting Started Guide.
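The EBS steps above can be sketched in Python as follows. This is a minimal sketch, not an official procedure: it assumes `ec2` is a boto3 EC2 client (for example, `boto3.client("ec2")`) created in the Region where the snapshot lives, and the function name, the snapshot ID, the instance ID, and the device name are all hypothetical placeholders. The volume must be created in the same Availability Zone as the instance it will be attached to.

```python
def volume_from_snapshot(ec2, snapshot_id, availability_zone, instance_id,
                         device="/dev/sdf"):
    """Create an EBS volume from a snapshot and attach it to an instance.

    `ec2` is assumed to be a boto3 EC2 client; all IDs are supplied by the
    caller (the snapshot ID comes from the data set's catalog entry).
    """
    # Create the volume in the same Availability Zone as the instance.
    volume = ec2.create_volume(
        SnapshotId=snapshot_id,
        AvailabilityZone=availability_zone,
    )
    volume_id = volume["VolumeId"]

    # Block until the new volume is ready to be attached.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

    # Attach the volume; it still has to be mounted from inside the instance.
    ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)
    return volume_id


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   volume_from_snapshot(ec2, "snap-EXAMPLE", "us-east-1a", "i-EXAMPLE")
```

After the volume is attached, it appears as a block device on the instance and can be mounted like any other disk.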
To access a public data set hosted in Amazon S3: You can make simple HTTP requests, use AWS Command Line Tools and SDKs (Ruby, Java, Python, .NET, PHP, etc.), download the data using Amazon EC2, or use Hadoop to process the data with Amazon EMR.
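As an illustration of the simple HTTP request option, here is a rough sketch that uses only the Python standard library. The helper names and the bucket name `example-public-bucket` are hypothetical, and the sketch assumes the bucket permits anonymous listing and reading, which not every public data set does. Bucket names that contain dots may not work with this virtual-hosted HTTPS URL style.

```python
from urllib.parse import quote
from urllib.request import urlopen
import xml.etree.ElementTree as ET

# XML namespace used by Amazon S3 list responses.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def object_url(bucket, key):
    """Build a virtual-hosted-style HTTPS URL for one object."""
    return f"https://{bucket}.s3.amazonaws.com/{quote(key)}"


def listing_url(bucket, prefix="", max_keys=10):
    """Build a ListObjectsV2 request URL for keys under a prefix."""
    return (f"https://{bucket}.s3.amazonaws.com/"
            f"?list-type=2&prefix={quote(prefix, safe='')}&max-keys={max_keys}")


def parse_listing(xml_text):
    """Extract object keys from a ListObjectsV2 XML response."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter(f"{S3_NS}Key")]


def list_keys(bucket, prefix="", max_keys=10):
    """Fetch and parse one page of keys (performs a network request)."""
    with urlopen(listing_url(bucket, prefix, max_keys)) as response:
        return parse_listing(response.read())


# Hypothetical usage (requires network access and a listable bucket):
#   for key in list_keys("example-public-bucket", prefix="data/"):
#       print(object_url("example-public-bucket", key))
```

For large transfers or requester-authenticated data sets, the AWS Command Line Tools and SDKs mentioned above are usually a better fit than raw HTTP requests.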
If you have any questions or want to participate in our Public Data Sets community, please email us at email@example.com.