SARS-CoV-2 viral genomes, storm surge forecasts, cloud-free satellite imagery: The latest open data on AWS
The AWS Open Data Sponsorship Program makes high-value, cloud-optimized datasets publicly available on Amazon Web Services (AWS). We work with data providers to democratize access to data by making it available for analysis on AWS; develop new cloud-native techniques, formats, and tools that lower the cost of working with data; and encourage the development of communities that benefit from access to shared datasets.
Our full list of publicly available datasets is on the Registry of Open Data on AWS. This quarter, we released 28 new datasets including data on SARS-CoV-2 viral genomes, storm surge forecasts, and US census data. Check out some highlights.
Sentinel-2 L2A 120 meter Mosaic managed by Sinergise
The Copernicus Programme Sentinel-2 satellite imagery made available on AWS by the geospatial company Sinergise is used in applications ranging from disaster response to agriculture to water body monitoring. At 10-meter resolution, it is among the highest-resolution satellite imagery publicly available.
At 10-meter resolution, clouds can obscure the area you are trying to image, and the volume of data can be difficult to process at global scale. Sinergise set out to address these issues with its latest dataset, the Sentinel-2 L2A 120m Mosaic. The mosaic provides a cloud-free time series produced by processing more than 5 PB of satellite imagery. The data is available in Cloud Optimized GeoTIFF format and can be quickly pulled into many machine learning (ML) models. At a resolution of 120 meters, it’s simpler to test global algorithms before moving on to higher-resolution datasets.
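To give a rough sense of why the coarser mosaic is easier to work with at global scale, the short Python sketch below compares pixel counts at the two resolutions mentioned above. It is only back-of-the-envelope arithmetic: the function name is illustrative, and it assumes square pixels and ignores bands, tiling, and compression.

```python
def pixel_reduction_factor(fine_res_m: float, coarse_res_m: float) -> float:
    """Return how many fine-resolution pixels cover the area of one coarse pixel.

    Assumes square pixels; resolutions are ground distances in meters.
    """
    if fine_res_m <= 0 or coarse_res_m <= 0:
        raise ValueError("resolutions must be positive")
    # Area scales with the square of the linear resolution ratio.
    return (coarse_res_m / fine_res_m) ** 2


factor = pixel_reduction_factor(10, 120)
print(f"One 120 m pixel covers the area of {factor:.0f} pixels at 10 m")
```

Under these assumptions, covering the same area at 120 meters takes roughly two orders of magnitude fewer pixels per band than at 10 meters, which is what makes quick global experiments practical.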
1940 US Census and National Archives Catalog from National Archives and Records Administration
Every ten years, the US collects a trove of demographic information about its citizens as part of its Decennial Census. Summaries are released to the general public soon after, but the original forms are kept private for a statutory 72 years to preserve the privacy of individuals recorded. Now, people can derive insights from any aspect of the 1940 Census Dataset using the full set of tools AWS provides. The 1940 Census dataset includes the metadata index, the population schedules, the enumeration district maps, and the enumeration district descriptions for the 1940 US Census records. The 1940 Census population schedules were created by the Bureau of the Census in an attempt to enumerate every person living in the United States on April 1, 1940, although some persons were missed. The population schedules were digitized by the National Archives and Records Administration (NARA) and publicly released on April 2, 2012.
The National Archives Catalog dataset on the Registry of Open Data on AWS includes the archival descriptions and authority records from the National Archives Catalog (as of November 20, 2020), including the URLs for over 127 million digital objects and data from citizen archivist contributions.
PubSeq COVID-19 Public Sequence Resource from University of Tennessee Health Science Center GeneNetwork
In recent months, leading genomic researchers have called for the submission of SARS-CoV-2 genomic sequences to open genomic data repositories as a key factor in winning the battle against COVID-19. PubSeq, an open access sequence repository, currently contains over 33,000 SARS-CoV-2 viral genomes with rich associated metadata (sample location, type, submitting lab information) that can be queried using Amazon Simple Storage Service (Amazon S3) Select or Amazon Athena. PubSeq also integrates seamlessly with Arvados for on-the-fly analysis of sequenced SARS-CoV-2 samples and rapid identification of novel viral strains. It joins the Coronavirus Genome Sequence Dataset provided by the National Center for Biotechnology Information in the Registry of Open Data on AWS, making this one of the richest sources of coronavirus genome sequence data freely available to the public.
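As an illustration of what an Amazon S3 Select query over sample metadata might look like, here is a minimal Python sketch that only builds the request parameters. Everything specific in it is an assumption for illustration: the bucket name, the object key, the `sample_location` field, and the idea that the metadata is stored as line-delimited JSON. Check the dataset's entry on the Registry of Open Data on AWS for the real layout before using it.

```python
def build_select_request(bucket: str, key: str, location: str) -> dict:
    """Build keyword arguments for an S3 Select query filtering by sample location.

    The field name `sample_location` and the JSON Lines input format are
    hypothetical; they are not taken from the PubSeq documentation.
    """
    # Double any single quotes so the value is safe inside a SQL string literal.
    safe_location = location.replace("'", "''")
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": (
            "SELECT * FROM S3Object s "
            f"WHERE s.sample_location = '{safe_location}'"
        ),
        "InputSerialization": {"JSON": {"Type": "LINES"}},
        "OutputSerialization": {"JSON": {}},
    }


# Hypothetical bucket and key, used only to show the shape of the request.
request = build_select_request("example-pubseq-bucket", "metadata.jsonl", "Tennessee")
print(request["Expression"])
```

If boto3 is installed and the assumptions above match the actual data, the resulting dictionary could be passed to `boto3.client("s3").select_object_content(**request)`. The same filter could also be written as an Amazon Athena query over a table defined on the metadata.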
NOAA Global Extratropical Surge and Tide Operational Forecast System (Global ESTOFS)
The Global ESTOFS is the highest resolution operational global tide and storm surge model available today. The coastline resolution ranges from 1.5 km to as fine as 80 meters globally, with higher resolution (25 meters) in key ports. The Global ESTOFS is an Advanced CIRCulation (ADCIRC) core hydrodynamic model. ADCIRC is a hydrodynamic modeling technology that conducts short- and long-term simulations of tide and storm surge elevations and velocities in the deep ocean, on continental shelves, and in coastal seas and small-scale estuarine systems. The Global ESTOFS provides seven-day water level forecasts using the 10-meter winds, mean sea level pressure, and sea ice variables from the NOAA Global Forecast System (GFS) atmospheric model. The water level forecasts include guidance for tides, storm surge, and their combinations over four cycles a day: 00:00, 06:00, 12:00, and 18:00 UTC.
Global ESTOFS serves the needs of the National Centers for Environmental Prediction’s (NCEP) Ocean Prediction Center (OPC) and the National Hurricane Center’s Tropical Analysis and Forecast Branch (NHC/TAFB), which are responsible for providing offshore marine forecasts. It also meets the needs of Weather Forecast Offices for issuing coastal inundation forecasts, and helps communicate to local officials the magnitude of any expected storm surge from tropical systems.
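For readers who want to line up their own data with the forecast schedule, the four daily cycles and the seven-day horizon described above can be sketched in Python as follows. The function names are illustrative, and the sketch works with nominal cycle times only. It assumes nothing about how long after each cycle the output actually becomes available on AWS.

```python
from datetime import datetime, timedelta, timezone

CYCLE_HOURS = (0, 6, 12, 18)          # nominal cycle start hours, UTC
FORECAST_LENGTH = timedelta(days=7)   # seven-day water level forecast


def latest_cycle(now: datetime) -> datetime:
    """Return the nominal start of the most recent forecast cycle at or before `now`.

    `now` should be timezone-aware; it is converted to UTC before rounding down.
    """
    now_utc = now.astimezone(timezone.utc)
    cycle_hour = max(h for h in CYCLE_HOURS if h <= now_utc.hour)
    return now_utc.replace(hour=cycle_hour, minute=0, second=0, microsecond=0)


def forecast_window(now: datetime) -> tuple[datetime, datetime]:
    """Return the (start, end) of the seven-day window for the latest cycle."""
    start = latest_cycle(now)
    return start, start + FORECAST_LENGTH


start, end = forecast_window(datetime.now(timezone.utc))
print(f"Latest nominal cycle: {start:%Y-%m-%d %H:%M} UTC, forecast through {end:%Y-%m-%d %H:%M} UTC")
```

In practice, the latest published cycle may lag the nominal time, so a real pipeline would confirm that the files for a cycle exist before using them.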
You can find a list of other recently released datasets in this What’s New post.
We’re excited to see how you can put these great datasets to work. If you have examples of tutorials, applications, tools, or publications that use these datasets, make sure to list them on the Registry of Open Data on AWS so the community can find them. Learn how to propose your dataset to the AWS Open Data Sponsorship Program and learn more about open data on AWS.