Desktop and Application Streaming

Collaborative Neuron Tracing on AWS

HHMI’s Janelia Research Campus in Ashburn, Virginia has an integrated team of lab scientists and tool-builders who pursue a small number of scientific questions with potential for transformative impact. To drive science forward, we share our methods, results, and tools with the scientific community.

Introduction

To study how the brain works, researchers often begin with neural anatomy: how do neurons traverse the brain and how do they form circuits? To answer these questions, a brain can be imaged in 3D using light microscopy and the neurons in the image volumes can be traced out to form a map of the brain’s neural network. Software tools for visualization and annotation of the image volumes enable annotators to follow the neurons throughout the brain and reconstruct their structures. However, tracing neurons in large image volumes relies on direct volume rendering, which requires terabytes of data to be moved from storage to a graphics processing unit (GPU). In addition, moving the data across the internet to collaborating institutions can take weeks. To address these issues, we moved the large brain image volumes to AWS and brought the users to the data using Amazon AppStream 2.0. The cloud-based architecture enables our application to perform 3D visualization of this complex scientific data completely within the cloud, streaming interactive scientific visualizations to users around the world.


Figure 1: Neurons in a mouse brain labeled with fluorescent markers with overlaid annotations
(data collected and annotated by the MouseLight Project Team, Janelia Research Campus)

Tracing 85 meters of neurons

Advances in light microscopy are producing larger images at finer resolution. Bioimaging techniques such as serial two-photon tomography and light sheet microscopy can produce image volumes that not only capture large sections of tissue such as whole mouse brains, but do so at multiple wavelengths, resulting in several fluorescent signals for each spatial voxel. These volumes are often imaged and reviewed slice-by-slice, but are ultimately reconstructed into complete three-dimensional (3D) volumes, often many terabytes in size.

Light microscopy imaging typically produces relatively sparse image volumes, which can therefore be visualized using direct volume rendering. This type of visualization lets you see through the entire data volume, rather than only a surface rendering (Figure 1). However, it requires the entire volume to be loaded into GPU memory at once, demanding powerful hardware and fast access to large data sets.

The MouseLight project team at Janelia Research Campus built the Horta desktop application to provide visualization and collaborative annotation of large (~30 TB) mouse brain volumes. Using this software, annotators on the team worked together to reconstruct the long-range axonal projections of more than 1,000 neurons, many traversing the entire mouse brain from end to end. The Horta application was built using Java and OpenGL, with a feature-rich thick client that relies on a set of backend microservices for data access and persistence. We initially deployed this application on-premises for internal users. It worked well, but deploying and managing it required extensive DevOps expertise that is not common in biology labs. In the spirit of Open Science, we asked ourselves, “how can we make these powerful tools available to other research groups for software reuse and cross-institute collaboration?”

Moving users to data with HortaCloud

Moving these tools into AWS was the logical next step for the following reasons:

  1. Large data sets take a long time to transfer — in some cases, shipping hard drives was faster than waiting on a data transfer that could take months. They also consume large amounts of disk space, especially when duplicated in multiple places. Rather than copying the data, it makes more sense to host the data in the cloud and bring the users to the data.
  2. We can avoid data transfer costs by doing all of the data visualization in the cloud. Our terabytes of image data do not have to leave the AWS network.
  3. Supporting client/server software systems on arbitrary hardware configurations complicates deployment and increases the cost of maintenance. In the cloud, deployment and operation of the system are predictable and manageable. We have complete control of the instance types, GPU cards, storage devices, and other infrastructure.
  4. Many potential users don’t want to invest in the expensive hardware necessary to run the software. In the cloud, they can lease as much hardware as they need for as long as they need it. With Amazon AppStream 2.0, they can access the application from any modern web browser, without installing any software.
  5. Under the Open Data Sponsorship Program, AWS covers the cost of storing the MouseLight project’s brain image data sets on Amazon S3. These data sets are an extremely valuable resource for the neuroscience community and contain extensive information that has yet to be extracted.
  6. AWS storage services provide a high level of data durability and scale automatically with our data size. This lets us focus our efforts on scientific use cases, instead of worrying about infrastructure.

With these motivations in mind, we built HortaCloud, an AWS-based deployment of the Horta application for collaborative neuron tracing.

Figure 2: HortaCloud architecture

Deployment architecture

HortaCloud relies on Amazon AppStream 2.0’s GPU capability to render interactive scientific visualizations. The Horta client renders the 3D volumes into a 2D representation, and then AppStream 2.0 streams the 2D representation to a user working in a web browser. This approach allows us to keep large data in the cloud where it can be moved quickly and efficiently.

Let’s take a closer look at how the architecture works (Figure 2). The 3D volume data lives on Amazon S3, mostly in Open Data buckets. The image volumes are served through our backend services running on one or more Amazon EC2 instances. Importantly, the EC2 instances are deployed on a private subnet inside an Amazon Virtual Private Cloud (VPC). With this configuration, our backend services are secure and not accessible from the internet, yet they can access S3 through an S3 VPC endpoint. These backend services implement the state management of our system, including a MongoDB database for storing user annotations and a RabbitMQ instance for asynchronous messaging between users. The backend services are orchestrated using Docker Swarm and can be scaled across multiple EC2 instances.
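As a rough illustration of this network layout (a minimal sketch, not the actual HortaCloud CDK code), the following TypeScript stack creates a VPC with private subnets and an S3 gateway endpoint so that instances in the private subnets can read image volumes without their traffic leaving the AWS network. Construct and stack names such as HortaNetworkStack are illustrative assumptions.

```typescript
import { Stack, StackProps } from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import { Construct } from 'constructs';

// Illustrative sketch of the HortaCloud network layout (names are assumptions).
export class HortaNetworkStack extends Stack {
  public readonly vpc: ec2.Vpc;

  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // VPC with a public subnet for the NAT gateway and private subnets where
    // the backend EC2 instances and the AppStream 2.0 fleet run.
    this.vpc = new ec2.Vpc(this, 'HortaVpc', {
      maxAzs: 2,
      natGateways: 1,
      subnetConfiguration: [
        { name: 'public', subnetType: ec2.SubnetType.PUBLIC },
        { name: 'private', subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS },
      ],
    });

    // Gateway endpoint so the private subnets can reach Amazon S3
    // (e.g. the Open Data buckets) without traversing the internet.
    this.vpc.addGatewayEndpoint('S3Endpoint', {
      service: ec2.GatewayVpcEndpointAwsService.S3,
    });
  }
}
```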

We deploy the Horta client application on Amazon AppStream 2.0 (Figure 3). AppStream 2.0 auto-scales GPU instances based on usage, so we pay only for the GPUs we use. To run the Horta client, we use a “stream.graphics-pro.4xlarge” instance, which gives us both a GPU and the extra RAM required to cache images for a more responsive user experience. The AppStream 2.0 instances are deployed in the same private subnet as the services. The instances have internet access through a NAT gateway so that files can be moved to and from the instances using OneDrive integration.
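As a hedged sketch of how such a fleet could be declared with the CDK (AppStream 2.0 currently exposes only L1 “Cfn” constructs), the following stack creates an on-demand GPU fleet in the same private subnets and associates it with an AppStream stack. The image name, resource names, and desired instance count are placeholder assumptions, not the project’s actual configuration.

```typescript
import { Stack, StackProps } from 'aws-cdk-lib';
import * as appstream from 'aws-cdk-lib/aws-appstream';
import { Construct } from 'constructs';

export interface HortaAppStreamProps extends StackProps {
  // Private subnet IDs shared with the backend services (see the network sketch above).
  privateSubnetIds: string[];
}

// Illustrative sketch of the AppStream 2.0 resources (names are assumptions).
export class HortaAppStreamStack extends Stack {
  constructor(scope: Construct, id: string, props: HortaAppStreamProps) {
    super(scope, id, props);

    // GPU fleet: graphics-pro instances provide the GPU plus the extra RAM
    // used to cache image data for a responsive Horta session.
    const fleet = new appstream.CfnFleet(this, 'HortaFleet', {
      name: 'horta-fleet',
      instanceType: 'stream.graphics-pro.4xlarge',
      fleetType: 'ON_DEMAND',
      computeCapacity: { desiredInstances: 1 },
      imageName: 'horta-image', // custom image with the Horta client installed (assumed name)
      vpcConfig: { subnetIds: props.privateSubnetIds },
    });

    // AppStream stack that users connect to, associated with the fleet.
    const asStack = new appstream.CfnStack(this, 'HortaStack', { name: 'horta-stack' });
    const assoc = new appstream.CfnStackFleetAssociation(this, 'HortaAssoc', {
      fleetName: 'horta-fleet',
      stackName: 'horta-stack',
    });
    assoc.addDependency(fleet);
    assoc.addDependency(asStack);
  }
}
```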

To provide our scientific end users with a seamless experience, we also integrated Amazon Cognito authentication with our underlying system. We built a React-based website where users log in to the system via Amazon Cognito. By using AWS Lambda to automate creation of the streaming URL based on the Cognito credentials, we were able to give users the ability to launch an AppStream 2.0 instance running Horta with a single button click. The website also has pages for administrators so that they can easily manage users’ access to the system. These admin pages call AWS Lambda functions to synchronize user state between Cognito and our MongoDB database.
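A minimal sketch of what such a Lambda handler might look like with the AWS SDK for JavaScript v3 is shown below, assuming an API Gateway REST endpoint with a Cognito user pool authorizer in front of it. The stack and fleet names and the claim lookup are illustrative assumptions, not the actual HortaCloud code.

```typescript
import { AppStreamClient, CreateStreamingURLCommand } from '@aws-sdk/client-appstream';
import type { APIGatewayProxyEvent, APIGatewayProxyResult } from 'aws-lambda';

const appstream = new AppStreamClient({});

export const handler = async (event: APIGatewayProxyEvent): Promise<APIGatewayProxyResult> => {
  // With a Cognito user pool authorizer, the verified username is available
  // in the request context claims (assumed integration).
  const userId = event.requestContext.authorizer?.claims['cognito:username'];
  if (!userId) {
    return { statusCode: 401, body: JSON.stringify({ message: 'Not authenticated' }) };
  }

  // Mint a short-lived streaming URL tied to this user.
  const { StreamingURL } = await appstream.send(new CreateStreamingURLCommand({
    StackName: 'horta-stack', // assumed AppStream stack name
    FleetName: 'horta-fleet', // assumed fleet name
    UserId: userId,
    Validity: 300,            // URL valid for 5 minutes
  }));

  return { statusCode: 200, body: JSON.stringify({ url: StreamingURL }) };
};
```

The website’s “launch” button can then simply open the returned URL in a new browser tab to start the streaming session.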

This entire infrastructure is automatically provisioned and configured using the AWS CDK. We used the CDK to build a command-line deployment tool in TypeScript that creates the entire infrastructure on AWS and automatically deploys our software. This approach simplifies deployment for anyone who wants to run their own HortaCloud instance. It’s much easier than deploying the software on bare metal, because we’re able to control the deployment environment to keep things consistent.
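To illustrate how such a tool wires everything together, here is a minimal CDK app entry point combining the stack sketches above. The import paths and stack names are assumptions, not the project’s actual layout.

```typescript
import { App } from 'aws-cdk-lib';
import { HortaNetworkStack } from './horta-network-stack';     // sketch above (assumed path)
import { HortaAppStreamStack } from './horta-appstream-stack'; // sketch above (assumed path)

const app = new App();

// Create the network first, then place the AppStream fleet in its private subnets.
const network = new HortaNetworkStack(app, 'HortaNetworkStack');
new HortaAppStreamStack(app, 'HortaAppStreamStack', {
  privateSubnetIds: network.vpc.privateSubnets.map((s) => s.subnetId),
});

app.synth();
```

From here, a single `cdk deploy --all` provisions the whole environment, which is what makes the deployment repeatable for other institutions.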

Figure 3: Screenshot of HortaCloud running in a web browser using AppStream 2.0

Conclusion

Data volumes will continue to grow as microscopy technology evolves, continuously yielding higher resolution and larger sample sizes. Moving large image volumes is fundamentally impractical, and becomes effectively impossible as data sizes climb into petabyte territory. Enabling efficient visualization and analysis of large image volumes requires moving users to the data, rather than moving data to the users. Our new HortaCloud architecture, enabled by AWS, demonstrates one effective way to move users to large data sets hosted in the cloud. It paves the way for future scientific applications with large data requirements.

We would like to express our thanks to AWS Solutions Architect Scott Glasser, and to Tiago Ferreira for their valuable input in producing this write-up. Special thanks to Emily Tenshaw for creating the animated visualization.

Source Code

All of the application code described in this article is open source and licensed for reuse:

The data and imagery are shared publicly on the Registry of Open Data on AWS.

Konrad Rokicki is the Software Engineering Manager at the Howard Hughes Medical Institute’s Janelia Research Campus.