AWS Public Sector Blog

Celebrate Open Science Week with the Allen Institute and available open datasets

The Allen Institute seeks to understand how our brains, cells, and immune systems work when we are healthy and, ultimately, how they go wrong in disease. In the course of their studies, Allen researchers have generated and shared atlases that map the brain, gene-edited stem cell lines, and many more publicly available resources that have been used by millions of scientists around the world to accelerate their research.

The Allen Institute collaborates with Amazon Web Services (AWS) and the Registry of Open Data on AWS to make many of their datasets publicly available. In celebration of Open Science Week, check out some of these open datasets from the Allen Institute, and their impact on the research community.

The Allen Institute for Brain Science

The Allen Brain Observatory

Released in 2016, the Allen Brain Observatory presents the first standardized survey of physiological activity in the visual cortex of the living mouse. The dataset captures visually evoked calcium responses from neurons in selected regions of the mouse brain, in particular the visual area, using genetically engineered lines of mice. The Allen Brain Observatory serves as a data resource for the annual Summer Workshop on the Dynamic Brain, an intensive two-week course co-hosted by the Allen Institute and the University of Washington, now in its eighth year.

The Allen Mouse Brain Atlas

Figure 1. This video was captured from the Allen Mouse Brain Atlas dataset via mouse.brain-map.org.

The Allen Mouse Brain Atlas contains gene expression profiles of over 20,000 genes in the mouse brain. Using fluorescent in situ hybridization (FISH), the Mouse Brain Atlas allows users to visualize the location and relative abundance of any gene product. By overlaying several FISH preparations at once, users can also draw relationships between gene products: for example, if one gene is being expressed at the same time as another in the same cell, it’s possible that their proteins could be working together in a cellular mechanism. The Mouse Brain Atlas is also one of the most comprehensive resources available for brain anatomy, with full-color brain sections in two planes and brain regions labeled for easy reference.

Professors at universities worldwide use the Mouse Brain Atlas alongside the Allen Brain Observatory, not just as a teaching resource for neuroanatomy and physiology but also as a source of big data for open questions in neural coding.

Ivy Glioblastoma Atlas Project (Ivy GAP)

Figure 2. Images of a human glioblastoma brain tumor taken from the Ivy Glioblastoma Atlas Project (Ivy GAP).

Glioblastoma is an aggressive cancer of the supportive cells of the central nervous system. While most treatment plans for glioblastoma include surgery and radiation therapy to physically reduce the size of the tumor, the molecular basis of glioblastoma can differ widely across individual patients.

What does that mean for the patient? The heterogeneity of molecular markers across glioblastoma cases means that oncologists—and patients—spend time on diagnostics that could be spent in treatment. The Ivy Glioblastoma Atlas Project (Ivy GAP) aimed to reduce that time by characterizing over 40 glioblastoma samples at the molecular and genetic level. Each image has been annotated for tumor features by a machine learning (ML) process trained by medical experts. Ivy GAP is accompanied by a patient genomic and clinical database which users can register for and request access to, as well as RNA sequencing of each tumor. Ivy GAP gives researchers around the world a foundational resource for exploring the anatomic and genetic basis of glioblastoma at the cellular and molecular levels.

The Machine Intelligence from Cortical NetworkS (MICrONS) program

Figure 3. This image from the Allen Institute shows several different mouse neurons virtually reconstructed in 3D, which reveal the complexity of tracing the shapes and branching axons and dendrites within a small piece of the brain.

MICrONS, an IARPA-funded collaboration between the Allen Institute, Princeton University, and Baylor College of Medicine, with public data storage support by BossDB, AWS, and Google, aims to improve ML by reverse engineering the greatest learner of all: the brain.

The recently released MICrONS dataset maps the fine structures and connectivity of 200,000 brain cells and close to 500 million synapses all contained in a cubic millimeter chunk of mouse brain—approximately the size of a grain of sand—from the visual neocortex, the part of the mammalian brain that processes what we see. Many of these cells had never been captured in their complete form before.

While the goal of MICrONS was to collect and mine brain-wiring information to improve machine learning, the resulting dataset also serves as a rich resource for neuroscientists studying brain circuitry, and for those studying disorders of brain connectivity such as Parkinson’s disease and schizophrenia.

The Allen Institute for Cell Science

The Allen Cell Imaging Collection

Figure 4. This image, available with Quilt Data through an AWS partnership, shows the endoplasmic reticulum illuminated in human iPS cells as part of the Allen Cell Collection.

The Allen Cell Imaging Collection includes four datasets that together allow scientists to track the behavior of stem cells undergoing differentiation. First, the Allen Cell Gene Editing team genetically engineers cell lines that express fluorescently labeled proteins to tag certain cell structures. These cell lines are then imaged, processed, and analyzed together to understand overall cellular structure, organization, and behavior. For the same cell samples, the Allen Institute has produced field-of-view images from glass plates; annotated segmentation and contouring of the cell membrane, DNA, and cell structures; and finally, machine learning imaging predictions of segmentation and contouring.

“You have to start with good data to build good models. Part of generating good data is having high quality cells that are made in a standardized and well characterized way,” Ruwanthi Gunawardane, Ph.D., executive director of the Allen Institute for Cell Science, said in an article about the institute’s cell lines. “These cells also allow people to compare data from lab to lab. Everybody’s going to benefit, not just us.”

Allen Institute for Artificial Intelligence (AI2)

Allen Institute Benchmark Datasets for Machine Learning

The Allen Institute for Artificial Intelligence (AI2), an independent nonprofit organization, builds artificial intelligence (AI) for the common good, producing breakthrough research and tools that move the needle in AI, empower the research community, and benefit society. To that end, AI2 produces benchmarking datasets for natural language processing (NLP) tasks, ranging from reading comprehension to diagram interpretation. AI2 collaborates with the AWS Open Data Program to distribute these datasets, which are accessed by a global population of ML enthusiasts.

COVID-19 Open Research Dataset (CORD-19)

At the start of the COVID-19 pandemic, there was a rush for information about all things COVID-19 (its mechanism of action, potential solutions for prevention, drug targets for therapy, risk factors, and prognostic indicators), leading to an estimated 200,000 COVID-19-related preprints and papers published in 2020.

CORD-19 represents a White House-driven effort to facilitate the development of ML algorithms that can systematically text-mine and retrieve information from this wealth of COVID-19 data, with the intent to draw insights that could inform public health policy. Since its release in April 2020, CORD-19 has been cited in over 1,000 research publications that pose incredibly diverse research questions. CORD-19 has also served as the foundation for enriched datasets such as Imperial College London’s REDASA, also available on AWS.

Ready to get started? Plug into Open Science Week and join the conversation on social media using the #OpenScienceWeek hashtag. If you’re ready to start exploring Allen Institute and Allen Institute for Artificial Intelligence datasets, head over to the Registry of Open Data on AWS.

Subscribe to the AWS Public Sector Blog newsletter to get the latest in AWS tools, solutions, and innovations from the public sector delivered to your inbox, or contact us.

Jenny Burns

Jenny joined the Allen Institute in 2017. As digital content manager in the Communications department, she manages website content for alleninstitute.org and oversees the institute’s social media. Jenny is passionate about engaging audiences with digital content and supporting mission-driven objectives that benefit humankind. In 2015, she received a Master of Communication in Communities and Networks degree from the University of Washington’s Communication Leadership program.

Erin Chu

Erin Chu is the life sciences lead on the Amazon Web Services (AWS) open data team. Trained to bridge the gap between the clinic and the lab, Erin is a veterinarian and a molecular geneticist, and spent the last four years in the companion animal genomics space. She is dedicated to helping speed time to science through interdisciplinary collaboration, communication, and learning.