AWS Public Sector Blog

36 new or updated datasets on the Registry of Open Data: AI analysis-ready datasets and more

The AWS Open Data Sponsorship Program makes high-value, cloud-optimized datasets publicly available on Amazon Web Services (AWS). AWS works with data providers to democratize access to data by making it available to the public for analysis on AWS; develop new cloud-native techniques, formats, and tools that lower the cost of working with data; and encourage the development of communities that benefit from access to shared datasets. Through this program, customers are making over 100 PB of high-value, cloud-optimized data available for public use.

The full list of publicly available datasets is on the Registry of Open Data on AWS and is now also discoverable on AWS Data Exchange. This quarter, AWS released 36 new or updated datasets. As July 16 is Artificial Intelligence (AI) Appreciation Day, the AWS Open Data team is highlighting three unique datasets that are analysis-ready for AI.

What will you build with these datasets?

Three AI analysis-ready datasets on the Registry of Open Data

NYUMets Brain Dataset from the NYU Langone Medical Center is one of the largest cranial imaging datasets in existence, and the largest dataset of metastatic cancer, containing over 8,000 brain MRI studies, clinical data, and treatment records from cancer patients. Over 2,300 images have been annotated for metastatic tumor segmentations, making NYUMets: Brain a valuable source of segmented medical imaging. An AI model for segmentation tasks and a longitudinal tracking tool are available for NYUMets through MONAI. Learn more about this dataset.

RACECAR Dataset from the University of Virginia is the first open dataset for full-scale and high-speed autonomous racing. RACECAR is suitable for exploring issues in localization, object detection and tracking (LiDAR, Radar, and Camera), and mapping that arise when an autonomous vehicle operates at its limits. You can get started with RACECAR with this SageMaker Studio Lab notebook.

Aurora Multi-Sensor Dataset from Aurora Operations, Inc. is a large-scale multi-sensor dataset with highly accurate localization ground truth, captured between January 2017 and February 2018 in the metropolitan area of Pittsburgh, PA, USA. The de-identified dataset contains rich metadata, such as weather and semantic segmentation, and spans all four seasons, rain, snow, overcast and sunny days, different times of day, and a variety of traffic conditions. This data can be used to develop and evaluate large-scale long-term approaches to autonomous vehicle localization. Aurora is applicable to many research areas including 3D reconstruction, virtual tourism, HD map construction, and map compression.

Full list of new or updated datasets

These three datasets join 33 other new or updated datasets on the Registry of Open Data in the following categories.

Climate and weather:

Geospatial:

Life sciences:

Machine learning:

What are people doing with open data?

How can you make your data available?

Looking to make your data available? The AWS Open Data Sponsorship Program covers the cost of storage for publicly available high-value, cloud-optimized datasets. We work with data providers who seek to:

  • Democratize access to data by making it available for analysis on AWS
  • Develop new cloud-native techniques, formats, and tools that lower the cost of working with data
  • Encourage the development of communities that benefit from access to shared datasets

Learn how to propose your dataset to the AWS Open Data Sponsorship Program.
Learn more about open data on AWS.

Read more about open data on AWS:
