The Boss: A Petascale Database for Large-Scale Neuroscience Powered by Serverless Technologies
The Intelligence Advanced Research Projects Activity (IARPA) Machine Intelligence from Cortical Networks (MICrONS) program seeks to revolutionize machine learning by better understanding the representations, transformations, and learning rules employed by the brain.
We spoke with Dean Kleissas, Research Engineer working on the IARPA MICrONS Project at the Johns Hopkins University Applied Physics Laboratory (JHU/APL), and he shared more about the project, what makes it unique, and how the team leverages serverless technology.
Could you tell us about the IARPA MICrONS Project?
This project partners computer scientists, neuroscientists, biologists, and other researchers from over 30 different institutions to tackle problems in neuroscience and computer science towards improving artificial intelligence. These researchers are developing machine learning frameworks informed and constrained by large-scale neuroimaging and experimentation at a spatial size and resolution never before achieved.
Why is this program different from other attempts to build machine learning based on biological principles?
While current approaches to neural network-based machine learning algorithms are “neurally inspired,” they are not “biofidelic” or “neurally plausible,” meaning they could not be directly implemented using a biological system. Previous attempts to incorporate the brain’s inner workings into machine learning have used statistical summaries of properties of the brain, or measurements taken either at low resolution (brain regions) or at high resolution (individual neurons or populations of hundreds to about 1,000 neurons).
With MICrONS, researchers are attempting to inform machine learning frameworks by interrogating the brain at the “mesoscale,” the scale at which the hypothesized unit of computation, the cortical column, should exist. Teams will measure the functional (how a neuron fires) and structural (how neurons connect) properties of every neuron in a cubic millimeter of mammalian tissue. While a cubic millimeter may sound small, these datasets will be some of the largest ever collected and will contain about 50k-100k neurons and over 100 million synapses. On disk, this results in roughly 2-3 petabytes of image data to store and analyze per tissue sample.
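To get a feel for why a cubic millimeter turns into petabytes, the back-of-the-envelope estimate below may help. The voxel size (4 x 4 x 40 nanometers) and the one byte per voxel are assumptions chosen for illustration only; the interview does not state the imaging resolution, and the function name is hypothetical.

```python
def raw_volume_petabytes(side_mm=1.0, voxel_nm=(4.0, 4.0, 40.0), bytes_per_voxel=1):
    """Estimate the uncompressed size of a cubic image volume in petabytes.

    voxel_nm and bytes_per_voxel are ASSUMED values for illustration,
    not figures taken from the MICrONS program.
    """
    side_nm = side_mm * 1e6  # 1 mm = 1,000,000 nm
    voxels = 1.0
    for edge_nm in voxel_nm:
        voxels *= side_nm / edge_nm  # voxels along each axis
    return voxels * bytes_per_voxel / 1e15  # decimal petabytes


if __name__ == "__main__":
    print(raw_volume_petabytes())
```

Under these assumed values the estimate comes out at about 1.6 petabytes for a single image channel, which is the same order of magnitude as the 2-3 petabytes quoted above.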
To manage the challenges created by both the collaborative nature of this program and massive amounts of multi-dimensional imaging, the JHU/APL team developed and deployed a novel spatial database called the Boss.
What is the Boss and some of its key features?
The Boss is a multi-dimensional spatial database provided as a managed service on AWS. It stores image data of different modalities along with associated annotation data, that is, the output of analyses that label source image data with unique 64-bit identifiers. The Boss leverages a storage hierarchy to balance cost with performance. Data is migrated using AWS Lambda from Amazon Simple Storage Service (Amazon S3) to a fast in-memory cache as needed. Image and annotation data are spatially indexed for efficient, arbitrary access to sub-regions of petascale datasets. The Boss provides Single Sign-On authentication for third-party integrations, a fine-grained access control system, built-in 2D and 3D web-based visualization, a rich REST API, and the ability to auto-scale with varying load.
The Boss is able to auto-scale by leveraging serverless components to provide on-demand capacity. Since users can choose to perform different high-bandwidth operations, like data ingest or image downsampling, we needed the Boss to scale to meet each team’s needs while remaining affordable and operating within a fixed budget.
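The two ideas above, spatial indexing and a storage hierarchy, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Boss implementation: the interview does not say which index is used, so the sketch uses a Morton (Z-order) curve, one common choice for 3D blocks. The block size, the function and class names, and the dictionary standing in for object storage are all hypothetical.

```python
from collections import OrderedDict
from itertools import product

CUBOID_SHAPE = (512, 512, 16)  # ASSUMED (x, y, z) block size in voxels


def morton3d(x, y, z, bits=21):
    """Interleave the bits of x, y and z into a single Z-order key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (3 * i)
        key |= ((y >> i) & 1) << (3 * i + 1)
        key |= ((z >> i) & 1) << (3 * i + 2)
    return key


def cuboids_for_region(start, stop, shape=CUBOID_SHAPE):
    """Return sorted keys of every block overlapping the voxel box [start, stop)."""
    ranges = [
        range(lo // size, (hi - 1) // size + 1)
        for lo, hi, size in zip(start, stop, shape)
    ]
    return sorted(morton3d(cx, cy, cz) for cx, cy, cz in product(*ranges))


class ReadThroughCache:
    """Small LRU cache in front of a slower store (a dict stands in for it here)."""

    def __init__(self, backing_store, capacity=2):
        self.backing_store = backing_store
        self.capacity = capacity
        self.misses = 0
        self._cache = OrderedDict()

    def get(self, key):
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        self.misses += 1
        value = self.backing_store[key]  # fetch from the slow tier
        self._cache[key] = value
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict least recently used
        return value


if __name__ == "__main__":
    keys = cuboids_for_region((500, 0, 0), (600, 10, 5))
    store = {k: f"block-{k}" for k in range(8)}
    cache = ReadThroughCache(store)
    print([cache.get(k) for k in keys], cache.misses)
```

A request for a sub-region is first translated into the block keys it touches, and only those blocks are pulled from the slow tier into the cache. That is the general pattern that lets arbitrary cutouts stay cheap even when the full dataset is very large.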
How did your team leverage serverless services when building the data ingest system for the Boss?
During ingest, we move large amounts of data (ranging from terabytes to petabytes) from on-premises temporary storage into the Boss. These data are image stacks in various formats stored locally in different ways. The job of the ingest service is to upload these image files while converting them into the Boss’ internal 3D data representation that allows for more efficient IO and storage.
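As a rough illustration of the reformatting step, the sketch below groups 2D image tiles from a stack into fixed-depth 3D blocks. The key layout, the block depth of 16 slices, and the function name are assumptions made for the example; the actual internal representation of the Boss is not described in this interview.

```python
from collections import defaultdict


def group_tiles_into_cuboids(tiles, depth=16):
    """Group 2D tiles into 3D blocks of `depth` consecutive z-slices.

    tiles maps (tile_x, tile_y, z) -> tile payload (any object).
    Returns a dict mapping (tile_x, tile_y, z_block) -> list of `depth`
    payloads ordered by z. Slices that have not arrived are left as None,
    so a partially uploaded block is easy to spot.
    """
    grouped = defaultdict(dict)
    for (tile_x, tile_y, z), payload in tiles.items():
        grouped[(tile_x, tile_y, z // depth)][z % depth] = payload

    return {
        key: [slices.get(i) for i in range(depth)]
        for key, slices in grouped.items()
    }


if __name__ == "__main__":
    stack = {(0, 0, z): f"tile-z{z}" for z in range(5)}
    for key, block in group_tiles_into_cuboids(stack, depth=4).items():
        print(key, block)
```

Storing neighboring slices together means that a read of a small 3D sub-volume touches a few blocks rather than one file per slice, which is the kind of I/O saving the 3D representation is aiming for.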
Since these workflows can be spiky, driven both by researchers’ progress and program timelines, we use serverless services. We do not have to maintain running servers when ingest workflows are not executing, and we can massively scale processing for short periods of time, on demand.
We use Amazon S3 for both the temporary storage of image tiles as they are uploaded and the final storage of compressed, reformatted data. Amazon DynamoDB tracks upload progress and maintains indexes of reformatted data stored in the Boss. Amazon Simple Queue Service (SQS) provides scalable task queues so that our distributed upload client application can reliably transfer data into the Boss. AWS Step Functions manages high-level workflows during ingest, such as populating task queues and downsampling data after upload. After working with Step Functions and finding the native JSON format challenging to maintain, we created an open source Python package called Heaviside to manage Step Functions development and use. AWS Lambda provides scalable, on-demand compute to monitor and update ingest indexes, process and index image data as it is loaded into the Boss, and downsample data for visualization after ingest is complete. By leveraging these services, we have been able to achieve sustained ingest rates of over 4 Gbps from a single user while managing our overall monthly costs.
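The flow described above can be mimicked in memory to show how the pieces relate. In this sketch a deque stands in for the SQS task queue, a dictionary of sets stands in for the DynamoDB progress table, and a plain method call stands in for the Lambda function that fires when a block is complete. None of this is AWS code, and every name, as well as the block depth, is hypothetical.

```python
from collections import defaultdict, deque


def populate_queue(num_x, num_y, num_z):
    """Create one upload task per tile (stand-in for filling a task queue)."""
    return deque(
        (x, y, z)
        for z in range(num_z)
        for y in range(num_y)
        for x in range(num_x)
    )


class IngestTracker:
    """Tracks which z-slices of each block have arrived (stand-in for a progress table)."""

    def __init__(self, depth):
        self.depth = depth
        self.received = defaultdict(set)
        self.completed = set()
        self.completed_order = []

    def record_upload(self, tile):
        """Record one uploaded tile; return the block key if this tile completed it."""
        x, y, z = tile
        block = (x, y, z // self.depth)
        if block in self.completed:
            return None  # duplicate delivery of an already processed tile
        self.received[block].add(z % self.depth)
        if len(self.received[block]) == self.depth:
            del self.received[block]
            self.completed.add(block)
            self.completed_order.append(block)  # the point where processing would be triggered
            return block
        return None


def run_ingest(num_x, num_y, num_z, depth):
    """Drain the task queue as an upload client would and return the tracker."""
    queue = populate_queue(num_x, num_y, num_z)
    tracker = IngestTracker(depth)
    while queue:
        tracker.record_upload(queue.popleft())
    return tracker


if __name__ == "__main__":
    tracker = run_ingest(num_x=2, num_y=1, num_z=4, depth=2)
    print(tracker.completed_order)
```

Tracking arrivals as a set per block makes repeated delivery of the same task harmless, which matters with queues that may deliver a message more than once. In a real deployment the queue, the table, and the trigger would each be a managed service, so none of them needs a server running while no ingest is in progress.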
Thanks for sharing, Dean! Learn more about the system by watching Dean’s session at the AWS Public Sector Summit here.
Attending re:Invent 2017? Don’t miss this session from Dean. Save your seat!