Free | Publicly available
The 3000 Rice Genome Project is an international effort to sequence the genomes of 3,024 rice varieties from 89 countries.
This program exists to help people discover and share data sets that are available by using AWS resources. Unless specifically stated in the applicable data set documentation, data sets available through the Registry of Open Data on AWS are not provided or maintained by AWS. Data sets are provided and maintained by a variety of third parties under a variety of licenses. Please check data set licenses and related documentation to determine if a data set may be used for you application. If you have a project using a listed data set please tell us about it at opendata@amazon.com.
Free | Publicly available
The 3000 Rice Genome Project is an international effort to sequence the genomes of 3,024 rice varieties from 89 countries.
Free | Publicly available
recount3 is an online resource consisting of RNA-seq gene, exon, and exon-exon junction counts as well as coverage bigWig files for 8,679 and 10,088 different studies for human and mouse respectively. It is the third generation of the ReCount project and part of recount.bio. recount2 is also included for historical purposes. The pipeline used to generate the data in recount3 (but not recount2) is available here.
Free | Publicly available
Co-managed by Toyoko and the Structural Biology Group at the Universidad Nacional de Quilmes, this dataset allows us to explore the conformational space of all possible peptides using the 20 common amino acids. It consists of a collection of exhaustive molecular dynamics simulations of tripeptides and pentapeptides.
Free | Publicly available
The Foundation Medicine Adult Cancer Clinical Dataset (FM-AD) is a study conducted by Foundation Medicine Inc (FMI). Genomic profiling data for approximately 18,000 adult patients with a diverse array of cancers was generated using FoundationeOne, FMI's commercially available, comprehensive genomic profiling assay. This dataset contains open Clinical and Biospecimen data.
Free | Publicly available
The WSA-Enlil heliospheric model provides critical information regarding the propagation of solar Coronal Mass Ejections (CMEs) and transient structures within the heliosphere. Two distinct models comprise the WSA-Enlil modeling system; 1) the Wang-Sheeley-Arge (WSA) semi-empirical solar coronal model, and 2) the Enlil magnetohydrodynamic (MHD) heliospheric model. MHD modeling of the full domain (solar photosphere to Earth) is extremely computationally demanding due to the large parameter space and resulting characteristic speeds within the system. To reduce the computational burden and improve the timeliness (and hence the utility in forecasting space weather disturbances) of model results, the domain of the MHD model (Enlil) is limited from 21.5 Solar Radii (Rs) to just beyond the orbit of Earth, while the inner portion, spanning from the solar photosphere to 21.5Rs, is characterized by the WSA model. This coupled modeling system is driven by solar synoptic maps composed of nu[...]
Free | Publicly available
The Transiting Exoplanet Survey Satellite (TESS) is a multi-year survey that has discovered exoplanets in orbit around bright stars across the entire sky using high-precision photometry. The survey also enables a wide variety of stellar astrophysics, solar system science, and extragalactic variability studies. More information about TESS is available at MAST and the TESS Science Support Center.
Free | Publicly available
The Genome Institute at Washington University has developed a high-throughput, fault-tolerant analysis information management system called the Genome Modeling System (GMS), capable of executing complex, interdependent, and automated genome analysis pipelines at a massive scale. The GMS framework provides detailed tracking of samples and data coupled with reliable and repeatable analysis pipelines. GMS includes a full system image with software and services, expandable from one workstation to a large compute cluster.
Free | Publicly available
Japanese dictionaries and pre-trained models (word embeddings and language models) for natural language processing. SudachiDict is the dictionary for a Japanese tokenizer (morphological analyzer) Sudachi. chiVe is Japanese pretrained word embeddings (word vectors), trained using the ultra-large-scale web corpus NWJC by National Institute for Japanese Language and Linguistics, analyzed by Sudachi. chiTra is a library for using large-scale pre-trained language models with the Japanese tokenizer SudachiPy.
Free | Publicly available
The dataset contains reference samples that will be useful for benchmarking and comparing bioinformatics tools for genome analysis. Currently, there are two samples, which are NA12878 (HG001) and NA24385 (HG002), sequenced on an Oxford Nanopore Technologies (ONT) PromethION using the latest R10.4.1 flowcells. Raw signal data output by the sequencer is provided for these datasets in BLOW5 format, and can be rebasecalled when basecalling software updates bring accuracy and feature improvements over the years. Raw signal data is not only for rebasecalling, but also can be used for emerging bioinformatics tools that directly analyse raw signal data. We also provide the basecalled data alongside the raw signal data and will continue to provide updated basecalls when there is a major update to the basecalling software. In the future, we plan to extend this open dataset with additional samples, including sequencing runs from vendors other than ONT.
Free | Publicly available
This is a retrospective dataset of 1523 H&E-stained whole slide images (WSI) of lymph nodes from breast cancer patients. The cohort consisted of 177 patients (122 LN-positive - metastasis was reported in at least 1 LN - and 55 LN-negative patients) with invasive breast carcinoma treated between 1984 and 2002 at Guy’s Hospital London, UK. Slides were scanned and digitised at 40x magnification (0.23 µm/pixel), NanoZoomer H.T2.0 2.0-HT (Hamamatsu Photonics UK, Ltd, Welwyn Garden City, UK). WSIs are in .ndpi format.
showing 71 - 80