Climate data, koala genomes, analysis-ready radar data, and highly queryable genomic data: The latest open data on AWS
The AWS Open Data Sponsorship Program makes high-value, cloud-optimized datasets publicly available on Amazon Web Services (AWS). We work with data providers to democratize access to data by making it available to the public for analysis on AWS; develop new cloud-native techniques, formats, and tools that lower the cost of working with data; and encourage the development of communities that benefit from access to shared datasets.
Our full list of publicly available datasets is on the Registry of Open Data on AWS. This quarter, we released 26 new or updated datasets, including datasets on climate, koala genomes, analysis-ready radar data, and highly queryable genomic data. Check out some highlights:
These two datasets feature historical maximum temperature, minimum temperature, average temperature, and precipitation for the US. They are useful for understanding the expected seasonal temperature and precipitation ranges. This information can be used to decide whether a home should include air conditioning, which plants will grow best with the expected rainfall, or how much energy needs to be produced to keep an office building warm during the winter.
The climate normals dataset includes data from almost 15,000 weather stations, derived over uniform 30-year periods and updated every 10 years. The most recent update was made available this summer and is included here. The gridded climate dataset features the same variables interpolated to a uniform grid over the US.
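To make the idea of a 30-year normal concrete, here is a minimal Python sketch that averages a yearly series over a fixed 30-year window and then moves that window forward by 10 years. It is a simplification for illustration only: the function name `climate_normal` and the temperature values are made up, and the published normals are not necessarily computed as a plain average like this.

```python
from statistics import mean


def climate_normal(values_by_year, end_year, span=30):
    """Average a yearly series over the `span` years ending at `end_year`.

    Hypothetical helper: a plain mean, used here only to illustrate
    the "uniform 30-year period" idea.
    """
    years = range(end_year - span + 1, end_year + 1)
    missing = [y for y in years if y not in values_by_year]
    if missing:
        raise ValueError(f"missing years: {missing}")
    return mean(values_by_year[y] for y in years)


# Made-up yearly mean temperatures (degrees C) with a small upward trend.
series = {year: 10.0 + 0.02 * (year - 1981) for year in range(1981, 2021)}

normal_2010 = climate_normal(series, 2010)  # window 1981-2010
normal_2020 = climate_normal(series, 2020)  # window 1991-2020, ten years later

print(f"1981-2010 normal: {normal_2010:.2f}")
print(f"1991-2020 normal: {normal_2020:.2f}")
```

Because consecutive windows overlap by 20 years, each 10-year update shifts the normal gradually rather than replacing it outright.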
This Synthetic Aperture Radar (SAR) derived dataset contains a collection of repeat-pass Sentinel-1 coherence and backscatter maps covering four seasons over continuous land areas across the globe for a single-year capture period. When applied in agricultural and earth science research, it can be used to analyze variability in crop types, vegetation, soil moisture, and other landscape characteristics, irrespective of weather or illumination conditions. Available as 1×1 degree tiles in Cloud Optimized GeoTIFF (COG) format, this 90-meter resolution dataset complements a growing list of SAR open data including RADARSAT-1 and other Sentinel-1 datasets.
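As a rough illustration of how a 1×1 degree tiling can be addressed, the Python sketch below maps a latitude and longitude to the integer corner of the tile that contains it. The helper names and the label format (for example `S34E151`) are hypothetical and keyed to the south-west corner purely for illustration; check the dataset's own documentation for its actual tile naming and corner convention.

```python
import math


def tile_origin(lat, lon):
    """Return the integer (south, west) corner of the 1x1 degree tile
    containing the point. Hypothetical helper for illustration."""
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError("latitude or longitude out of range")
    return math.floor(lat), math.floor(lon)


def tile_label(lat, lon):
    """Build an illustrative label such as 'N48E002'.

    The format is an assumption, not the dataset's real naming scheme.
    """
    south, west = tile_origin(lat, lon)
    ns = "N" if south >= 0 else "S"
    ew = "E" if west >= 0 else "W"
    return f"{ns}{abs(south):02d}{ew}{abs(west):03d}"


print(tile_label(48.85, 2.35))     # a point in the northern/eastern hemispheres
print(tile_label(-33.87, 151.21))  # a point in the southern hemisphere
```

Using `math.floor` rather than `int` matters for negative coordinates: `int(-33.87)` truncates toward zero and would select the wrong tile.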
1000 Genomes Reanalysis with Illumina DRAGEN 3.5 & gnomAD — Data lake house ready versions
Variant call files (VCFs), the output of secondary genomic analyses, describe the sequence, past annotation, and sometimes predicted effects of genetic variants discovered in the DNA samples that have been sequenced. As population genomics datasets grow larger, VCFs fail to scale, yet the ability to efficiently mine and query them is key to downstream applications of genomic sequencing, such as cohort generation for clinical trials or identifying appropriate gene targets for drug therapy. Conversion to a compressed columnar format like Apache Parquet improves query performance markedly with services like Amazon Athena and Amazon Redshift, making it easier to get straight to discovery. These datasets join ChEMBL, OpenTargets, and other life sciences datasets that are ready to enroll into data lake houses.
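To show why a columnar layout helps, the following Python sketch turns a few VCF records into a column-oriented structure, where each field becomes its own list. This is only a conceptual illustration using the standard library: the records and the `vcf_to_columns` helper are made up, it handles only the fixed VCF columns, and writing actual Parquet files would require a Parquet library that is not shown here.

```python
# Made-up VCF snippet for illustration; real files carry many more
# header lines and per-sample genotype columns.
VCF_TEXT = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
chr1\t10177\t.\tA\tAC\t50\tPASS\tAF=0.42
chr1\t10352\t.\tT\tTA\t45\tPASS\tAF=0.44
chr2\t20001\t.\tG\tC\t60\tPASS\tAF=0.01
"""


def vcf_to_columns(text):
    """Convert VCF text into a dict mapping column name -> list of values.

    Hypothetical helper: skips '##' meta lines, reads the '#CHROM'
    header, and stores each field in its own list (column-oriented).
    """
    columns = {}
    header = None
    for line in text.splitlines():
        if not line.strip() or line.startswith("##"):
            continue
        if line.startswith("#"):
            header = line.lstrip("#").split("\t")
            columns = {name: [] for name in header}
            continue
        if header is None:
            raise ValueError("data line found before the #CHROM header")
        for name, value in zip(header, line.split("\t")):
            columns[name].append(value)
    if "POS" in columns:
        columns["POS"] = [int(p) for p in columns["POS"]]
    return columns


cols = vcf_to_columns(VCF_TEXT)

# A query such as "positions on chr1" touches only two columns,
# without parsing REF, ALT, INFO, or any other field.
chr1_positions = [p for c, p in zip(cols["CHROM"], cols["POS"]) if c == "chr1"]
print(chr1_positions)
```

In a row-oriented text file, every query has to read and split every line. With a columnar layout, a query engine can read only the columns a query names, which is the property formats like Parquet build on.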
Australasian Genomes is the genomic data repository for the Threatened Species Initiative (TSI). This repository contains reference genomes, transcriptomes, resequenced whole genomes, and reduced representation sequencing data from Australasian species such as koalas, bilbies, and many others. The first set of genomes is now available for researchers to access. “By generating this new, broader genome dataset, we can start to ask questions like, ‘Do koalas living to the west have gene variants that we don’t see on the east coast, where it’s colder and wetter—so are those genes potentially important for dealing with climate change?’” said Dr. Carolyn Hogg, senior research manager for the Australasian Wildlife Genomics Group at the University of Sydney. Learn more about the data and the project.
We’re excited to see how you can put these great datasets to work. If you have examples of tutorials, applications, tools, or publications that use these datasets, make sure to list them on the Registry of Open Data on AWS so the community can find them. Learn how to propose your dataset to the AWS Open Data Sponsorship Program and learn more about open data on AWS.