Allen Cell Explorer Distributes Terabytes of Data on AWS to Empower Scientists Worldwide

With Quilt Data, an APN Advanced Technology Partner

Embracing Open Science to Democratize Scientific Exploration

Dedicated to understanding and predicting the behavior of cells, the Allen Institute for Cell Science believes in scientific transparency, accessibility, and reproducibility. The team’s mission is to make its comprehensive and large-scale research and image data available to scientists worldwide to drive collaboration and research.

“We want to share everything we do, even before it’s published. We want to put everything out there and share the data we use to drive internal work—such as software and tool development—and to help create resources, tools, and information for others,” says Rick Horwitz, Ph.D., executive director of the Allen Institute for Cell Science.

Mitotic Stem Cell

Allen Cell Explorer presents the Integrated Mitotic Stem Cell as a data-driven model and visualization tool that captures a holistic view of human cell division for the first time.

Through its work, the Allen Cell Explorer produces terabytes (TB) of imaging and object data as well as enormous amounts of metadata. With its focus on accessibility and reproducibility, the team struggled to find a solution that could provide easy access to its data at scale.

"The complexity of our data, the amount of data we produce, and the metadata that surrounds it have made sharing and distributing data access challenging," says Horwitz. "For both internal users and the broader ecosystem, we wanted to find a solution to make our data easy to access, search, and redistribute. We wanted to make it simple for users to take advantage of integrations with data science tools such as Python and Jupyter and provide versioning features for seamless reproducibility.”

Making Data Accessible Through Seamless Data Distribution

The Allen Institute for Cell Science team began to search for a solution that would help the organization distribute its data and metadata. One of the team’s main goals was to find a solution that would ease both internal and external users’ struggles with importing data into their environments.

"We could not find a solution that was a good fit for what we specifically needed," says Jackson Brown, a research engineer at the Allen Institute for Cell Science. “We had trouble finding a system that was well-suited to handle the coupling of large files and the metadata attached.”

That changed when Brown became familiar with Quilt Data, an Amazon Web Services (AWS) Partner Network (APN) Advanced Technology Partner that provides a versioned data portal for Amazon Simple Storage Service (Amazon S3). “Early on, Quilt provided features we sought such as usage statistics, scalability, auto-documentation, versioning datasets, and ease-of-access. As the platform has evolved, more features have been added that continue to meet our needs,” says Brown.

After learning about Quilt, Brown tried out the solution and then brought it to his team. “I felt very optimistic after exploring the solution. When I brought it to the team, I received positive feedback. When I first reached out to Quilt, they were very receptive,” says Brown. “To this day, I talk to the team nearly every day and feel we’ve built an active collaboration with them.”

“Using Quilt on AWS, our data is much more discoverable and we see more usage and downloads of data than before. We believe this is just the beginning and that usage will continue to scale as more people become familiar with the data we provide.”

- Jackson Brown, research engineer at the Allen Institute for Cell Science

Using Quilt on AWS

Quilt Data, Inc. built and deployed the AWS infrastructure that the Allen Cell Explorer uses. Quilt consists of a web client, a Python client, and a suite of cloud services that summarize, visualize, version, and search data in Amazon S3. Quilt’s cloud services are deployed as an AWS CloudFormation stack that utilizes AWS services including AWS Lambda, Amazon Elasticsearch Service, Amazon Elastic Container Registry (ECR), Amazon Athena, and AWS CloudTrail. The Institute leverages the Quilt instance at https://open.quiltdata.com/ to manage and release terabyte-scale data pipelines. “With Quilt, we were up and running at a terabyte-scale right away,” says Brown.

At present, the Allen Institute for Cell Science offers more than 7 terabytes of data and over 288,000 objects through Quilt. “We have one primary production dataset on Quilt. When we take an image, it goes to this dataset,” says Brown. “We have other datasets that will soon be merged with the production dataset. Another reason I like Quilt is that it can be used across the board for production or research and development data. You can structure the data in any way you want. Having that flexibility is nice.”

Quilt extends Amazon S3’s object versioning to snapshot large datasets, with each snapshot uniquely identified by a short signature, so that collaborators share a common frame of reference. This capability is vital in the scientific world, where data drives publications, predictions, and recommendations. Using Quilt on AWS, the Allen Institute is also able to take advantage of machine learning (ML) tooling to predict labels for its imaging collection with 80 to 90 percent accuracy, saving the team and its collaborators enormous amounts of time that would otherwise have been spent on manual labeling.
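To make the idea of a snapshot identified by a signature concrete, the following is a minimal sketch in Python, using only the standard library. It is not Quilt's implementation: the manifest layout, the function names (`file_entry`, `snapshot_signature`), and the sample file names and metadata are all assumptions made for illustration. The sketch records each file's name, size, content hash, and attached metadata, then hashes a canonical form of that record so that the same dataset always yields the same signature.

```python
import hashlib
import json


def file_entry(logical_key: str, content: bytes, meta: dict) -> dict:
    """Describe one object: its name, size, content hash, and user metadata."""
    return {
        "logical_key": logical_key,
        "size": len(content),
        "sha256": hashlib.sha256(content).hexdigest(),
        "meta": meta,
    }


def snapshot_signature(entries: list[dict]) -> str:
    """Hash a canonical serialization of the manifest.

    Entries are sorted by logical key and serialized with sorted JSON keys,
    so the same files and metadata always produce the same signature.
    """
    ordered = sorted(entries, key=lambda e: e["logical_key"])
    canonical = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical two-image dataset (file names, bytes, and metadata are made up).
v1 = [
    file_entry("images/cell_001.tiff", b"pixels-a", {"structure": "nucleus"}),
    file_entry("images/cell_002.tiff", b"pixels-b", {"structure": "membrane"}),
]

# The same entries listed in a different order describe the same snapshot.
v1_reordered = list(reversed(v1))

# Changing one file's bytes produces a different snapshot.
v2 = [
    v1[0],
    file_entry("images/cell_002.tiff", b"pixels-b2", {"structure": "membrane"}),
]

sig_v1 = snapshot_signature(v1)
sig_v1_reordered = snapshot_signature(v1_reordered)
sig_v2 = snapshot_signature(v2)

# Print a shortened signature plus the two comparisons of interest.
print(sig_v1[:8], sig_v1 == sig_v1_reordered, sig_v1 != sig_v2)
```

In a scheme like this, a collaborator who cites the signature is pointing at one exact set of files and metadata, which is what makes an analysis reproducible. Because the metadata is part of the hashed record, a change to a label alone would also yield a new signature.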

Providing an Unprecedented View into Cell Data at Scale

The Allen Institute for Cell Science has been using Quilt in production for about six months. Using Quilt on AWS, the Institute can upload its data with ease and make its data much easier to browse. "Absent Quilt, we weren't able to drive as much usage of our data as we would like due to the challenges we faced for distribution and searchability," says Brown. "Using Quilt, our data is much more discoverable and we see more usage and downloads of data than we have before. We believe this is just the beginning and that usage will continue to scale as more people become familiar with the data we provide."

The team also benefits from the ease of managing Quilt. “For my projects, I wanted to work with a data management system that doesn’t need a team of engineers to deploy and manage. Quilt provides that capability,” says Brown. “It feels as though we only need one person to manage it, and, even then, showing people how to use the system quickly gives them independence and ownership over it.”

"Quilt is straightforward to use," says Kimberly Metzler, scientific program manager at the Allen Institute for Cell Science. "It makes it easy to upload and connect our data and files to different systems and it is very user-friendly. It’s helping to standardize our data access across the institute.”

Today, the Institute is working with Quilt and AWS to make more of its datasets available via the AWS Open Data Registry, which covers the cost of storage for publicly available, high-value, cloud-optimized datasets. “The Allen Institute for Cell Science is producing the highest quality cell images in the world. It’s thrilling to make this data accessible to researchers around the world,” says Aneesh Karve, chief technology officer at Quilt Data. “We were able to build Quilt faster thanks to AWS’ broad service line-up.”

Moving forward, the Allen Institute for Cell Science team plans to continue to optimize its use of Quilt and take advantage of the solution by uploading new data for external users. For scientists, having better access to the data Allen Cell provides opens up a wealth of opportunities to drive new research and increase collaboration.

The Institute’s leadership believes that tools like Quilt help them to embody their focus on open science and get employees more excited about the work they do. “When I interview employees and ask them why they came to our Institute, almost inevitably the top item on the list is the open science that we do,” says Horwitz. “People are very passionate about that.”

"We were very frustrated by these enormous datasets that others couldn't quite seem to get at easily,” says Horwitz. “Using Quilt, we've been able to solve that challenge.”


About the Allen Institute for Cell Science

The Allen Institute for Cell Science is a division of the Allen Institute, a nonprofit research institution. The organization uses diverse technologies and approaches at a large scale to study the cell and its components as an integrated system. Its 3D live cell imaging data of the major cell structures, tagged by genome-editing human stem cells, is used to develop predictive models of cell states and behaviors. One of the Allen Institute’s founding credos is open science; therefore, all of its data and methods at the Allen Cell Explorer are publicly available to scientists for research worldwide. 

Challenge

Dedicated to sharing its learnings and data, the Allen Institute for Cell Science produces terabytes of data on a regular basis. The organization found itself challenged to find a solution that could provide easy access to its data on a large scale.

Solution

The Allen Institute for Cell Science began using a solution from Quilt Data, an APN Advanced Technology Partner that provides a versioned data portal for Amazon Simple Storage Service (Amazon S3). At present, the Allen Institute for Cell Science offers more than 7 terabytes of data and over 288,000 objects through Quilt.

Benefit

Using Quilt on AWS, the Institute can upload its data with ease and make its data much easier to browse. The team also benefits from the ease of managing Quilt. Moving forward, the Allen Institute for Cell Science team plans to continue to optimize its use of Quilt and take advantage of the solution by uploading new data for external users. For scientists, having better access to the data Allen Cell provides opens up a wealth of opportunities to drive new research and increase collaboration.

About Quilt Data

Quilt Data’s mission is to empower organizations to make data-driven decisions faster by making the data in S3 discoverable, visual, versioned, and collaborative. Quilt handles both public and private data. Organizations can publish unlimited data, free of charge, on open.quiltdata.com. For private data, Quilt offers a Virtual Private Cloud license through the AWS Marketplace. Common use cases for Quilt include data-intensive projects in machine learning, predictive analytics, and data engineering.