AWS Public Sector Blog

Street-scale global maps, orca sounds, and COVID-19 detection data: The latest open data on AWS

The AWS Open Data Sponsorship Program makes high-value, cloud-optimized datasets publicly available on Amazon Web Services (AWS). We work with data providers to democratize access to data by making it available to the public for analysis on AWS; to develop new cloud-native techniques, formats, and tools that lower the cost of working with data; and to encourage the development of communities that benefit from access to shared datasets.

Our full list of publicly available datasets is on the Registry of Open Data on AWS. This quarter, we released 19 new or updated datasets, including validated OpenStreetMap data, bioacoustic data, and COVID-19 detection data. Check out some highlights:

Daylight Map Distribution
The Daylight Map Distribution contains a validated subset of the OpenStreetMap (OSM) database with quality and consistency checks to create a free, stable, and simple-to-use street-scale global map. This project is the result of work by Meta to generate a monthly OSM dataset that has been corrected for vandalism, profanity, and other harmful edits. This latest release, v1.9, includes Parquet files optimized for loading data into Amazon Athena, in addition to the complete Daylight Map Distribution planet file available in OSM PBF format. Daylight joins existing OSM data on AWS, which collectively provide a rich source of community-maintained map data used by many people and applications around the world. Learn more about the Daylight Map Distribution of OpenStreetMap.
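As a rough illustration of what querying the Parquet files from Amazon Athena might look like, here is a minimal Python sketch. It assumes you have already created an Athena table over the Daylight Parquet files; the table name, the `tags` map column, the database name, and the results bucket are all hypothetical placeholders, not names taken from the Daylight release. The only AWS call used is the boto3 Athena client's `start_query_execution`.

```python
def build_tag_count_query(table: str, tag_key: str, limit: int = 10) -> str:
    """Build SQL that counts map elements per value of one OSM tag key.

    Assumes (hypothetically) that `table` has a `tags` column of type
    map<string, string>; adjust to the schema you actually created.
    """
    # Basic guards, because the values are interpolated into SQL text.
    if not table.replace("_", "").isalnum():
        raise ValueError("table must contain only letters, digits, and underscores")
    if not tag_key.replace("_", "").replace(":", "").isalnum():
        raise ValueError("tag_key must contain only letters, digits, ':' and '_'")
    return (
        f"SELECT element_at(tags, '{tag_key}') AS tag_value, COUNT(*) AS n "
        f"FROM {table} "
        f"WHERE element_at(tags, '{tag_key}') IS NOT NULL "
        f"GROUP BY 1 ORDER BY n DESC LIMIT {int(limit)}"
    )


def submit_query(sql: str, database: str, output_location: str) -> str:
    """Submit the query to Athena and return its query execution ID.

    Requires boto3 and AWS credentials; `database` and `output_location`
    (an s3:// URI for query results) are values you supply.
    """
    import boto3  # imported here so the query builder works without boto3

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

For example, `build_tag_count_query("daylight_osm_features", "highway")` returns a query that lists the most common `highway` values in a hypothetical `daylight_osm_features` table; passing that string to `submit_query` along with your own database and results location would start the query.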

Bioacoustic datasets
With the addition of Orcasound to the Registry of Open Data, we continue to see more bioacoustic datasets openly available on AWS. The previously available Pacific Ocean Sound Recordings and Sounds of Central African Landscapes provide sounds from a deep-ocean environment off Central California and rainforest landscapes in Central Africa, respectively. Orcasound contains live-streamed and archived audio data from underwater microphones containing marine biological signals as well as ambient ocean noise. These monitoring efforts prioritize detection of orca sounds and potentially harmful noise. In addition to the raw data, annotated data is provided in the dataset to use as inputs for machine learning models to automate detections in the future. Learn more about Orcasound.

STOIC2021 Training dataset
Managed by Radboud University Medical Center, the STOIC2021 Training dataset comprises 2,000 computed tomography (CT) studies of pneumonia patients accompanied by basic demographics, COVID-19 testing status, and some long-term follow-up data. The STOIC2021 Training dataset is curated by the Assistance Publique des Hôpitaux de Paris as part of the larger STOIC2021 project, described in detail in this flagship publication. The STOIC2021 Training dataset serves as the starting point for the STOIC2021 challenge hosted on platform, which gives scientists the task of predicting COVID-19 severity based on CT features. CT is a cheaper and faster alternative to magnetic resonance imaging (MRI) while still giving excellent tissue and spatial definition compared with X-rays, making it an attractive imaging modality in critical situations. Learn more about the STOIC2021 Training dataset.

Find these and other recently released datasets in the latest What’s New.

We’re excited to see how you can put these great datasets to work. If you have examples of tutorials, applications, tools, or publications that use these datasets, make sure to list them on the Registry of Open Data on AWS so the community can find them. Learn how to propose your dataset to the AWS Open Data Sponsorship Program and learn more about open data on AWS.


Joe Flasher

Joe Flasher is the open data lead at Amazon Web Services (AWS), helping organizations most effectively make data available for analysis in the cloud. The AWS Open Data program has democratized access to petabytes of data, including satellite imagery, climate and weather data, genomic data, and data used for natural language processing. He has been working with geospatial data and open source projects for the past decade, both as a contributor and maintainer. He has been a member of the Landsat Advisory Group and has worked on projects ranging from building GIS software to making the space shuttle fly. His background is in astrophysics, but he kindly requests you don’t ask him any questions about constellations.