Teaching the Allen Brain Observatory: technical challenges, cloud solutions
A guest post by Allen Institute Staff
The mission of the Allen Institute for Brain Science is to accelerate our understanding of how the human brain works in health and disease. As part of this mission, scientists collect massive amounts of data, which are publicly released to advance research across the field of neuroscience. Massive datasets can be challenging to share, so the Allen Institute uses AWS to make them available around the world.
Moving data between users can be a challenging, expensive process. Cloud-based compute environments like AWS allow users to bring analysis to the data instead. The Allen Institute has collaborated with the AWS Public Dataset Program to make data from the Allen Brain Observatory (which includes nearly 100TB of neurophysiology data representing tens of thousands of neurons in the mouse visual system) available to users in a public Amazon Simple Storage Service (Amazon S3) bucket. Interested users can spin up an Amazon Elastic Compute Cloud (Amazon EC2) instance and have access to that entire dataset in minutes, rather than spending weeks downloading (and duplicating) data locally. Through the AWS Public Dataset Program, what started as an experimental side project to support the intensive Summer Workshop on the Dynamic Brain has grown into a broad platform for Allen Institute researchers to share data more generally.
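As a rough illustration of what "bringing analysis to the data" can look like, the sketch below lists a few objects in a public S3 bucket over plain HTTPS, with no AWS credentials and no bulk download. It uses only the Python standard library and the S3 ListObjectsV2 REST call. The bucket name is an assumption (it is not given in this post) and should be checked against the Registry of Open Data on AWS before use; the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: bucket name is not stated in this post; verify it before use.
BUCKET = "allen-brain-observatory"

# XML namespace used by S3 list responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def list_url(bucket, prefix="", max_keys=10):
    """Build an unsigned ListObjectsV2 URL for a publicly readable bucket."""
    query = urlencode(
        {"list-type": "2", "prefix": prefix, "max-keys": str(max_keys)}
    )
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_listing(xml_text):
    """Return (key, size_in_bytes) pairs from a ListObjectsV2 XML response."""
    root = ET.fromstring(xml_text)
    return [
        (
            item.findtext("s3:Key", namespaces=S3_NS),
            int(item.findtext("s3:Size", namespaces=S3_NS)),
        )
        for item in root.findall("s3:Contents", S3_NS)
    ]


def list_public_objects(bucket=BUCKET, prefix="", max_keys=10):
    """Fetch and parse one page of object listings (requires network access)."""
    with urlopen(list_url(bucket, prefix, max_keys), timeout=30) as resp:
        return parse_listing(resp.read())
```

Calling list_public_objects() from an EC2 instance or a laptop returns only object names and sizes, so a user can inspect what is available and then read just the files an analysis needs.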
Part of the goal of sharing open, big data resources is to help other scientists who want to use data in their own research. The Allen Institute for Brain Science team conducted a tutorial on the Allen Brain Observatory at Cosyne 2019, an international computational neuroscience conference. The Allen Brain Observatory explores the interactions between visual images, cortical representations of visual input, and behavioral responses. This data helps scientists understand how the brain processes shapes, colors, moving objects, and more.
The goal for the tutorial was to have the participants prepared to use the Allen Brain Observatory data and data analysis tools on their own to conduct research. However, setting up analytics software environments is complex and time-consuming. Users frequently need one-on-one guidance to make sure they install the correct languages and software dependencies, and this doesn’t scale to large tutorials of 100 or more participants. Even assuming that process goes well, the curriculum would require students to immediately begin downloading data, and most conference Wi-Fi systems cannot handle that amount of data transfer. Poor internet connectivity resulting from the sudden demand on network infrastructure hinders the workshop experience for everyone.
The AWS storage and compute infrastructure addresses these issues. Amazon SageMaker provides pre-built analytics environments that we customized to include our analysis tools and transparent access to our 100TB of physiology data stored in an S3 bucket. We built a lightweight authentication website that allows users with GitHub credentials to provision single-user SageMaker notebook instances. Within 10 minutes of logging in, approximately 70 tutorial participants had their own Jupyter notebooks open with a complete analysis environment, data included. Instead of giving every participant a small collection of example datasets to play with, each participant had access to the entire Allen Brain Observatory.
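This post does not describe how the authentication website was implemented, so the following is only a minimal sketch of one way a backend could provision a single-user notebook instance once a GitHub login has been verified. It assumes the boto3 SageMaker client and its create_notebook_instance call; the role ARN, lifecycle configuration name, name prefix, instance type, and all function names are hypothetical placeholders, not the Allen Institute's actual setup.

```python
import re


def notebook_name(github_login, prefix="tutorial"):
    """Derive a notebook instance name from a GitHub login.

    SageMaker notebook instance names allow letters, digits, and hyphens,
    up to 63 characters, so other characters are collapsed to hyphens.
    """
    slug = re.sub(r"[^A-Za-z0-9]+", "-", github_login).strip("-")
    if not slug:
        raise ValueError("login contains no usable characters")
    return f"{prefix}-{slug}"[:63].rstrip("-")


def build_request(github_login, role_arn, lifecycle_config,
                  instance_type="ml.t3.medium"):
    """Assemble keyword arguments for create_notebook_instance."""
    return {
        "NotebookInstanceName": notebook_name(github_login),
        "InstanceType": instance_type,  # assumed size, not from the post
        "RoleArn": role_arn,  # hypothetical IAM role with read access to the data
        # Hypothetical lifecycle configuration that installs analysis tools.
        "LifecycleConfigName": lifecycle_config,
        "Tags": [{"Key": "github-login", "Value": github_login}],
    }


def provision(github_login, role_arn, lifecycle_config):
    """Create the notebook instance (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without it

    client = boto3.client("sagemaker")
    return client.create_notebook_instance(
        **build_request(github_login, role_arn, lifecycle_config)
    )
```

Keeping request construction separate from the AWS call makes the naming logic easy to check without touching an AWS account, and tagging each instance with the login makes it simpler to find and shut down instances after a tutorial ends.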
Because the tutorial participants could get started immediately, they spent more time exploring the dataset, completing a mock experiment, and experiencing the scientific value of the data. They can then return to their labs and continue working with the complete Allen Brain Observatory dataset in AWS to generate new insights into how the visual system works.
To learn more, visit brainscience.alleninstitute.org.