Announcing Registry of Open Data on AWS

We are excited to announce the launch of the Registry of Open Data on AWS (RODA). With RODA, you can find and share data staged publicly for analysis on AWS. Search for datasets by keyword or by tags for common types of data, such as genomics, satellite imagery, and transportation, on a simple web interface. Every dataset listed in RODA includes basic information about the data, how to access it on AWS, its license, a link to documentation, and contact information if you have questions. Many listings also include links to tutorials or applications that make use of the data.

The evolution of shared data on AWS

The AWS Public Datasets program was launched almost 10 years ago. Sharing huge volumes of data was an early use case for the cloud because the cloud allows people to work with data quickly, at any scale, and without downloading or storing their own copy. When the AWS Public Datasets program began, most datasets were shared as EBS Snapshots. Now, we share data from public Amazon S3 buckets, and have experimented with sharing data via Amazon SNS and Amazon RDS DB Snapshots.

The AWS Public Datasets program has grown from a way for us to showcase how to share data on AWS to a program that we use to collaborate with AWS customers, including NOAA, the U.S. Department of the Treasury, the UK Met Office, and teams within Amazon. The majority of the datasets you access on AWS are produced and maintained by AWS customers.

Over the years, we have seen the rise of new data formats and standards that make it faster and more cost-efficient to work with cloud object storage services, like Amazon S3. Formats like Apache Parquet, Apache ORC, and the Cloud Optimized GeoTIFF are creating efficiencies by allowing people to make precise queries for data from Amazon S3 and avoid transferring and storing unnecessary data. These community-led approaches are making it possible for people anywhere in the world to perform research and build services at any scale on top of data shared via Amazon S3.
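As a rough illustration of how a client can avoid transferring unnecessary data, the sketch below builds an HTTP request for only part of an object by using a standard Range header. It is a minimal sketch, not how any particular format library works internally: the URL is a placeholder, byte_range_request is a hypothetical helper, and the request is constructed but never sent.

```python
from urllib.request import Request


def byte_range_request(url: str, offset: int, length: int) -> Request:
    """Build (but do not send) a GET request for `length` bytes starting at `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive on both ends.
    last_byte = offset + length - 1
    return Request(url, headers={"Range": f"bytes={offset}-{last_byte}"})


# Placeholder URL for illustration only; it does not point at a real dataset.
EXAMPLE_URL = "https://example-bucket.s3.amazonaws.com/path/to/file.parquet"

req = byte_range_request(EXAMPLE_URL, offset=1024, length=512)
print(req.get_header("Range"))  # bytes=1024-1535
```

A reader for a columnar or tiled format would typically use the file's own metadata to decide which byte ranges to ask for, so only the columns or tiles it needs are fetched.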

Opening RODA to our community

Because RODA is completely open source, there are two main ways to get involved:

  • If you have data or usage examples that you would like to put into RODA, you can do so by adding them on GitHub. Full guidance on how to create a RODA entry is maintained on GitHub.
  • If you don’t have a dataset to share, you can still contribute to RODA by adding a dataset usage example. If you have built an application or tutorial on top of a dataset listed in RODA, you can add a link to it under the “DataAtWork” field. Simply provide a title for your usage example, its URL, the name of the person or organization that authored it, and an optional link for the author. You can do this with the edit button on GitHub, which forks the repository for you.
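The fields of a usage example described above can be sketched as a small data structure. This is only an illustration under stated assumptions: the key names (Title, URL, AuthorName, AuthorURL), the make_usage_example helper, and the sample values are all hypothetical, so check the guidance on GitHub for the exact format a RODA entry expects.

```python
from typing import Optional


def make_usage_example(
    title: str,
    url: str,
    author_name: str,
    author_url: Optional[str] = None,
) -> dict:
    """Assemble one usage example; title, URL, and author name are required."""
    required = (("title", title), ("url", url), ("author_name", author_name))
    for label, value in required:
        if not value or not value.strip():
            raise ValueError(f"{label} is required")

    # Key names are assumed for illustration, not taken from the RODA guidance.
    entry = {
        "Title": title.strip(),
        "URL": url.strip(),
        "AuthorName": author_name.strip(),
    }
    # The author link is optional, so include it only when one is given.
    if author_url and author_url.strip():
        entry["AuthorURL"] = author_url.strip()
    return entry


# Hypothetical values for illustration.
example = make_usage_example(
    title="Exploring an open dataset in a notebook",
    url="https://example.com/tutorial",
    author_name="Example Author",
)
print(example)
```

Validating the required fields before you open a pull request is a cheap way to catch an incomplete entry early.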

We look forward to a future with more experimentation as people discover new methods for sharing data in the cloud.