About AWS Open Data Sponsorship Program

This program exists to help people discover and share data sets that are available by using AWS resources. Unless specifically stated in the applicable data set documentation, data sets available through the Registry of Open Data on AWS are not provided or maintained by AWS. Data sets are provided and maintained by a variety of third parties under a variety of licenses. Please check data set licenses and related documentation to determine if a data set may be used for your application. If you have a project using a listed data set, please tell us about it at opendata@amazon.com.

AWS Open Data Sponsorship Program Products (295)

showing 81 - 90

Reference data for HiFi human WGS

Sold by AWS Open Data Sponsorship Program

Free | Publicly available

Reference data bundle for analyzing HiFi human whole genome sequencing data

Indexes for Kaiju

Sold by AWS Open Data Sponsorship Program

Free | Publicly available

This dataset comprises pre-built indexes for the bioinformatics software Kaiju, which is used for taxonomic classification of metagenomic sequencing data. Various indexes for different source reference databases are available.

ICGC on AWS

Sold by AWS Open Data Sponsorship Program

Free | Publicly available

The International Cancer Genome Consortium (ICGC) coordinates projects with the common aim of accelerating research into the causes and control of cancer. The PanCancer Analysis of Whole Genomes (PCAWG) study is an international collaboration to identify common patterns of mutation in whole genomes from ICGC. More than 2,400 consistently analyzed genomes corresponding to over 1,100 unique ICGC donors are now freely available on Amazon S3 to credentialed researchers subject to ICGC data sharing policies.

Exceptional Responders Initiative

Sold by AWS Open Data Sponsorship Program

Free | Publicly available

The Exceptional Responders Initiative is a pilot study to investigate the underlying molecular factors driving exceptional treatment responses of cancer patients to drug therapies. Study researchers will examine molecular profiles of tumors from patients either enrolled in a clinical trial for an investigational drug(s) and who achieved an exceptional response relative to other trial participants, or who achieved an exceptional response to a non-investigational chemotherapy. An exceptional response is defined as achievement of either a complete response or a partial response for at least 6 months duration in a trial or treatment where the overall response rate is < 10%. The hope is to discover underlying molecular features that can be further investigated and may eventually predict benefit from a given drug or class of drugs for a particular patient. This pilot project will successfully characterize approximately 100 cases of tumor tissue and, when available, case-matched germline[...]

Clinical Trial Sequencing Project - Diffuse Large B-Cell Lymphoma

Sold by AWS Open Data Sponsorship Program

Free | Publicly available

The goal of the project is to identify recurrent genetic alterations (mutations, deletions, amplifications, rearrangements) and/or gene expression signatures. The National Cancer Institute (NCI) used whole genome sequencing and/or whole exome sequencing in conjunction with transcriptome sequencing. The samples were processed and submitted for genomic characterization using pipelines and procedures established within The Cancer Genome Atlas (TCGA) project.

NapierOne Mixed File Dataset

Sold by AWS Open Data Sponsorship Program

Free | Publicly available

NapierOne is a modern cybersecurity mixed file data set, primarily aimed at, but not limited to, ransomware detection and forensic analysis. The dataset contains over 500,000 distinct files, representing 44 distinct popular file types. It was designed to address the known deficiency in research reproducibility and improve consistency by facilitating research replication and repeatability. The data set was inspired by the Govdocs1 data set, and it is intended that NapierOne be used as a complement to this original data set. An investigation was performed with the goal of determining the common file types currently in use. No specific research was found that explicitly provided this information, so an alternative consensus approach was employed. This involved combining the findings from multiple sources of file type usage into an overall ranked list. After this, 5,000 real-world example files were gathered, and a specific data subset was created, for each of the common file types [...]

Astrophysics Division Galaxy Segmentation Benchmark Dataset

Sold by AWS Open Data Sponsorship Program

Free | Publicly available

Pan-STARRS imaging data and associated labels for galaxy segmentation into galactic centers, galactic bars, spiral arms and foreground stars, derived from citizen scientist labels from the Galaxy Zoo: 3D project.

Allen Brain Observatory - Visual Coding AWS Public Data Set

Sold by AWS Open Data Sponsorship Program

Free | Publicly available

The Allen Brain Observatory – Visual Coding is a large-scale, standardized survey of physiological activity across the mouse visual cortex, hippocampus, and thalamus. It includes datasets collected with both two-photon imaging and Neuropixels probes, two complementary techniques for measuring the activity of neurons in vivo. The two-photon imaging dataset features visually evoked calcium responses from GCaMP6-expressing neurons in a range of cortical layers, visual areas, and Cre lines. The Neuropixels dataset features spiking activity from distributed cortical and subcortical brain regions, collected under analogous conditions to the two-photon imaging experiments. We hope that experimentalists and modelers will use these comprehensive, open datasets as a testbed for theories of visual information processing.

VENUS L2A Cloud-Optimized GeoTIFFs

Sold by AWS Open Data Sponsorship Program

Free | Publicly available

The Venµs science mission is a joint research mission undertaken by CNES and ISA, the Israel Space Agency. It aims to demonstrate the effectiveness of high-resolution multi-temporal observation optimised through Copernicus, the global environmental and security monitoring programme. Venµs was launched from the Centre Spatial Guyanais by a VEGA rocket on the night of August 1-2, 2017. Thanks to its multispectral camera (12 spectral bands in the visible and near-infrared ranges), it acquires imagery every 1-2 days over 100+ areas at a spatial resolution of 4 to 5 m. This dataset has been converted into Cloud Optimized GeoTIFFs (COGs). Additionally, SpatioTemporal Asset Catalog metadata are generated in a JSON file alongside the data. This dataset contains all of the Venus L2A datasets and will continue to grow as the Venu[...]

Seattle Alzheimer's Disease Brain Cell Atlas (SEA-AD)

Sold by AWS Open Data Sponsorship Program

Free | Publicly available

The Seattle Alzheimer's Disease Brain Cell Atlas (SEA-AD) consortium strives to gain a deep molecular and cellular understanding of the early pathogenesis of Alzheimer's disease and is funded by the National Institute on Aging (NIA U19AG060909). The SEA-AD datasets available here comprise single cell profiling (transcriptomics and epigenomics) and quantitative neuropathology. To explore gene expression and chromatin accessibility information, the single-cell profiling data includes snRNA-seq and snATAC-seq data from the SEA-AD donor cohort (aged brains which span the spectrum of Alzheimer's disease pathology) and neurotypical reference brains. To explore key pathological proteins and cell types of interest to Alzheimer's disease, the neuropathology data includes full resolution brightfield images, images processed and segmented in HALO image analysis software, image annotations, and quantification summary files for the relevant stains including Abeta (6E10), IBA1, a-Synuclein, G[...]
