Public Data Sets on AWS provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications. AWS hosts the public data sets at no charge for the community and, as with all AWS services, users pay only for the compute and storage they use for their own applications.
Previously, large data sets such as the mapping of the Human Genome and the US Census data required hours or days to locate, download, customize, and analyze. Now, anyone can access these data sets from their Amazon Elastic Compute Cloud (Amazon EC2) instances and start computing on the data within minutes. Users can also leverage the entire AWS ecosystem and easily collaborate with other AWS users. For example, users can produce or use prebuilt server images with tools and applications to analyze the data sets. Users can also discuss best practices and solutions in the dedicated Public Data Sets forum.
By hosting this important and useful data with cost-efficient services such as Amazon EC2, AWS hopes to provide researchers across a variety of disciplines and industries with tools to enable more innovation, more quickly.
AWS will continue to add to the available collection of public domain and non-proprietary data sets over time. The data sets currently available are shown below. The Linux/UNIX snapshots are in ISO9660 or EXT3 format and the Windows snapshots are in NTFS format.
You can obtain a full list of data sets in our Public Data Sets resource center.
Here are some examples of popular Public Data Sets:
Select public data sets are hosted on Amazon EC2 for free as Amazon Elastic Block Store (Amazon EBS) snapshots. Amazon EC2 customers can access this data by creating their own personal Amazon EBS volumes, using the public data set snapshots as a starting point. They can then access, modify, and perform computation on these volumes directly from their Amazon EC2 instances, paying only for the compute and storage resources that they use. Where available, researchers can also use pre-configured Amazon Machine Images (AMIs) with tools like Inquiry by BioTeam to perform their analysis.
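The snapshot-to-volume workflow described above can be sketched in Python. This is a minimal illustration, not an official procedure: it assumes the present-day boto3 SDK (which this page does not mention), and the snapshot ID, instance ID, Availability Zone, device name, and mount point are hypothetical placeholders to be replaced with values from the data set catalog and your own account. The helper names are also illustrative only.

```python
from typing import Any, List

# Hypothetical placeholders; none of these values come from the catalog.
SNAPSHOT_ID = "snap-EXAMPLE"       # snapshot ID listed for the data set
INSTANCE_ID = "i-EXAMPLE"          # your running Amazon EC2 instance
AVAILABILITY_ZONE = "us-east-1a"   # must match the instance's zone
DEVICE = "/dev/sdf"                # device name requested at attach time


def volume_from_snapshot(ec2: Any, snapshot_id: str, zone: str) -> str:
    """Create a personal EBS volume from a snapshot; return its volume ID."""
    response = ec2.create_volume(SnapshotId=snapshot_id, AvailabilityZone=zone)
    volume_id = response["VolumeId"]
    # Wait until the new volume is ready to be attached.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])
    return volume_id


def attach_volume(ec2: Any, volume_id: str, instance_id: str, device: str) -> None:
    """Attach the volume to an instance in the same Availability Zone."""
    ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)
    ec2.get_waiter("volume_in_use").wait(VolumeIds=[volume_id])


def mount_command(device: str, mount_point: str, fs_type: str) -> List[str]:
    """Build a Linux mount command for an EXT3 or ISO9660 snapshot volume."""
    if fs_type not in ("ext3", "iso9660"):
        raise ValueError(f"unsupported filesystem type: {fs_type}")
    command = ["mount", "-t", fs_type]
    if fs_type == "iso9660":
        command += ["-o", "ro"]  # ISO9660 is a read-only format
    return command + [device, mount_point]


def main() -> None:
    # boto3 is imported here so the helpers above stay usable without it.
    import boto3

    ec2 = boto3.client("ec2")
    volume_id = volume_from_snapshot(ec2, SNAPSHOT_ID, AVAILABILITY_ZONE)
    attach_volume(ec2, volume_id, INSTANCE_ID, DEVICE)
    # Run the printed command on the instance itself (as root). On some
    # Linux AMIs the attached device may appear under a different name,
    # such as /dev/xvdf.
    print(" ".join(mount_command(DEVICE, "/mnt/dataset", "ext3")))
```

Calling `main()` with valid AWS credentials and real identifiers would create a billable volume in your account, so delete the volume when you no longer need it. The helpers accept the client as a parameter so they can be exercised with a stand-in object before touching a real account.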
To get started using the Public Data Sets on AWS, simply perform these three easy steps:
The ElasticFox Getting Started Guide provides a simple walkthrough of how to launch an instance and create an Amazon EBS volume using ElasticFox, a convenient Firefox plug-in. Alternatively, see the Amazon EC2 Getting Started Guide.
If you have any questions or want to participate in our Public Data Sets community, please visit our Public Data Sets forum.
If you have a public domain or non-proprietary data set that you think is useful and interesting to the AWS community, please submit a request below; the AWS team will review your submission and get back to you. Data sets in the repository are typically between 1 GB and 1 TB in size (based on the Amazon EBS volume limit), but we can work with you to host larger data sets as well. You must have the right to make the data freely available.
To get started, please fill in the submission form linked here, and a member of our team will contact you regarding your public data set. We will walk you through publishing your data set to the data repository.