Climate data, koala genomes, analysis ready radar data, and highly-queryable genomic data: The latest open data on AWS

The AWS Open Data Sponsorship Program makes high-value, cloud-optimized datasets publicly available on Amazon Web Services (AWS). We work with data providers to democratize access to data by making it available to the public for analysis on AWS; develop new cloud-native techniques, formats, and tools that lower the cost of working with data; and encourage the development of communities that benefit from access to shared datasets.

Our full list of publicly available datasets is on the Registry of Open Data on AWS. This quarter, we released 26 new or updated datasets, including climate data, koala genomes, analysis ready radar data, and highly-queryable genomic data. Check out some highlights:

NOAA US Climate Normals & Gridded Dataset (NClimGrid)

These two datasets feature historical maximum temperature, minimum temperature, average temperature, and precipitation for the US. They are useful for understanding the expected seasonal temperature and precipitation ranges. This information can be used to decide if a home should include air conditioning, which plants will grow best with the expected rainfall, or how much energy needs to be produced to keep an office building warm during the winter.

The climate normals dataset includes data from almost 15,000 weather stations, derived over uniform 30-year periods and updated every 10 years. The most recent update was made available this summer and is included here. The gridded climate dataset features the same variables interpolated to a uniform grid over the US.
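To make the idea of a 30-year normal concrete, here is a minimal sketch that averages one variable over a 30-year window. The values are synthetic, the 1991 to 2020 window is chosen only for illustration, and the function name `climate_normal` is hypothetical; none of this reflects the actual file layout or the method used to derive the published dataset.

```python
from statistics import mean


def climate_normal(yearly_values, start_year, end_year):
    """Average a yearly series over start_year..end_year (inclusive)."""
    window = [yearly_values[year] for year in range(start_year, end_year + 1)]
    return mean(window)


# Synthetic July maximum temperatures (degrees C) for one imaginary station.
july_max_temp_c = {year: 28.0 + 0.02 * (year - 1991) for year in range(1991, 2021)}

# A 30-year window; the specific years are an assumption for this example.
normal = climate_normal(july_max_temp_c, 1991, 2020)
print(f"30-year normal: {normal:.2f} C")
```

Comparing a single year's value against `normal` gives the kind of "warmer or cooler than expected" reading that the planning examples above rely on.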

Analysis Ready Radar Dataset

This Synthetic Aperture Radar (SAR) derived dataset contains a collection of repeat-pass Sentinel-1 coherence and backscatter maps covering four seasons over continuous land areas across the globe for a single year capture period. When applied in agricultural and earth science research, it can be used to analyze variability in crop types, vegetation, soil moisture, and other landscape characteristics, irrespective of weather or illumination conditions. Available as 1×1 degree tiles in Cloud Optimized GeoTIFF (COG) format, this 90-meter resolution dataset complements a growing list of SAR open data including RADARSAT-1 and other Sentinel-1 datasets.
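Because the data is split into 1×1 degree tiles, a common first step is working out which tile covers a point of interest. The sketch below computes the bounding box of the tile that contains a latitude and longitude. The function name `tile_bounds` is hypothetical, and the dataset's real tile naming and corner convention are not assumed here, so check the dataset documentation before building file paths from these values.

```python
import math


def tile_bounds(lat, lon):
    """Return (west, south, east, north) of the 1x1 degree tile containing the point."""
    if not (-90 <= lat < 90 and -180 <= lon < 180):
        raise ValueError("latitude or longitude out of range")
    south = math.floor(lat)  # floor also handles negative coordinates correctly
    west = math.floor(lon)
    return (west, south, west + 1, south + 1)


# Example point near Sydney, Australia.
bounds = tile_bounds(-33.87, 151.21)
print(bounds)
```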

1000 Genomes Reanalysis with Illumina DRAGEN 3.5 & gnomAD Data lake house ready versions

Variant Call Format (VCF) files, the output of secondary genomic analyses, describe the sequence, past annotation, and sometimes predicted effects of genetic variants discovered in the DNA samples that have been sequenced. As population genomics datasets grow larger, VCFs fail to scale, yet being able to efficiently mine and query them is key to downstream applications of genomic sequencing, such as cohort generation for clinical trials or identifying appropriate gene targets for drug therapy. Conversion to a compressed columnar format like Apache Parquet improves query performance markedly with services like Amazon Athena and Amazon Redshift, making it easier to get straight to discovery. These datasets join ChEMBL, OpenTargets, and other life sciences datasets that are ready to enroll into data lake houses.
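The benefit of a columnar layout can be illustrated without any special tooling. The sketch below uses only the Python standard library to pivot a tiny, made-up VCF snippet from rows into per-column lists, then answers a query by touching just two columns. It is not Parquet and not the conversion pipeline used for these datasets; `vcf_to_columns` and the sample records are hypothetical and exist only to show the row-to-column idea.

```python
def vcf_to_columns(vcf_text):
    """Pivot tab-separated VCF records into a dict of column name -> list of values."""
    header = None
    columns = {}
    for line in vcf_text.splitlines():
        if not line.strip() or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        if line.startswith("#"):
            header = line.lstrip("#").split("\t")
            columns = {name: [] for name in header}
            continue
        if header is None:
            raise ValueError("data line found before the #CHROM header line")
        for name, value in zip(header, line.split("\t")):
            columns[name].append(value)
    return columns


# Made-up records for illustration only.
SAMPLE_VCF = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t10177\t.\tA\tAC\t50\tPASS\t.",
    "chr1\t10352\t.\tT\tTA\t20\tLowQual\t.",
    "chr2\t20500\t.\tG\tC\t99\tPASS\t.",
])

columns = vcf_to_columns(SAMPLE_VCF)

# The query reads only the POS and FILTER columns, not every field of every row.
passing_positions = [
    int(pos) for pos, flt in zip(columns["POS"], columns["FILTER"]) if flt == "PASS"
]
print(passing_positions)
```

A columnar file format applies the same idea on disk and adds compression per column, which is why a query engine can skip the fields a query never mentions.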

Australasian Genomes

Australasian Genomes is the genomic data repository for the Threatened Species Initiative (TSI). This repository contains reference genomes, transcriptomes, resequenced whole genomes, and reduced representation sequencing data from Australasian species such as koalas, bilbies, and many others. The first set of genomes is now available for researchers to access. “By generating this new, broader genome dataset, we can start to ask questions like, ‘Do koalas living to the west have gene variants that we don’t see on the east coast, where it’s colder and wetter—so are those genes potentially important for dealing with climate change?’” said Dr. Carolyn Hogg, senior research manager for the Australasian Wildlife Genomics Group at the University of Sydney. Learn more about the data and the project.

Find these and other recently released datasets in the latest What’s New.

We’re excited to see how you can put these great datasets to work. If you have examples of tutorials, applications, tools, or publications that use these datasets, make sure to list them on the Registry of Open Data on AWS so the community can find them. Learn how to propose your dataset to the AWS Open Data Sponsorship Program and learn more about open data on AWS.

Joe Flasher is the open data lead at Amazon Web Services (AWS), helping organizations most effectively make data available for analysis in the cloud. The AWS Open Data program has democratized access to petabytes of data, including satellite imagery, climate & weather data, genomic data, and data used for natural language processing. He has been working with geospatial data and open source projects for the past decade, both as a contributor and maintainer. He has been a member of the Landsat Advisory Group and has worked on projects ranging from building GIS software to making the space shuttle fly. His background is in astrophysics, but he kindly requests you don’t ask him any questions about constellations.