AWS for Industries

Scientific Data Management on AWS with Open Source Quilt Data Packages

End-to-end scientific data management on the cloud is an area of increasing interest as life sciences companies aim to increase the utility of their data. Research, development, and manufacturing teams need a holistic set of controls for the scientific data lifecycle. These controls include data integrity, metadata tracking, lineage, permissions, and security, while still allowing teams to realize the agility and performance benefits of the cloud.

In this blog, we explore the concept of data packages for managing scientific data, both for wet-lab and computational data. Data packages are logical collections that can include any number of objects, metadata annotations, charts, and explanatory documentation in reusable and linkable units. We show how packaging data can preserve data integrity, lineage, and metadata without relying on complex naming conventions and folder hierarchies. Because data packages are linkable, they serve as points of integration and data exchange between applications, further increasing agility.

We will discuss how using AWS and packaging data with the open source Quilt SDK shortens the time to deploy a scientific data management system on the cloud. Biotech, pharma, and agtech companies use AWS and Quilt to deploy systems that manage their scientific data’s complete journey, from instrument to processing and analysis, and finally to scientific reports and filings. Quilt Data Packages enable leaders, scientists, and analysts to increase visibility and trust across the scientific data lifecycle and are an accelerator for a Digital Labs Strategy.

Overview

In this post we describe how companies can:

  1. Set up Amazon Simple Storage Service (Amazon S3) for scientific data management
  2. Improve data visibility and trust by packaging data
  3. Move on-premises data into the AWS Cloud while capturing metadata
  4. Leverage event-driven automation to connect applications

We share a case study of how Inari Agriculture uses Quilt Data Packages and AWS to accelerate product development and commercialization.

Setting up Amazon Simple Storage Service (Amazon S3) for scientific data management

Scientific data can progress across lifecycle stages that may include a raw stage (for original data from instruments and CROs), a processed stage (for data that has been analyzed with scientist-facing apps or pipelines), and a final stage (for production artifacts linked to electronic lab notebooks (ELNs) and laboratory information management systems (LIMS)). The object storage service Amazon S3 has become a natural choice for modern scientific data management, handling everything from unstructured instrument files to public datasets to small, structured office files.

When configuring data lifecycles in Amazon S3 buckets, the following features should be enabled, as they are central to a scientific data integrity plan (a configuration sketch follows the list):

  • S3 Object Versioning tracks changes to objects for auditing, compliance, and debugging, and provides some protection against deletion and data loss via delete markers.
  • S3 Object Lock ensures that critical files cannot be deleted by enforcing a “Write Once, Read Many” (WORM) model.
  • S3 Checksums ensure data integrity by giving each file a unique, verifiable digital fingerprint. For purposes of compliance and auditing, we recommend SHA-256.
  • S3 Lifecycle Configuration offers controls for cost savings and compliance by transitioning objects to more cost-effective storage tiers, or by deleting objects according to user-customizable rules.
  • S3 Data Encryption protects your data both in transit and at rest.
  • AWS Key Management Service (AWS KMS) lets you create, manage, and control cryptographic keys across your applications and AWS services.
  • AWS CloudTrail provides audit trails and usage logs for S3 buckets and can be connected to Amazon Athena for querying.
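
As an illustration, the minimal boto3 sketch below enables several of these controls on a bucket. The bucket name, KMS key alias, lifecycle rule, and object key are hypothetical placeholders; note that S3 Object Lock cannot be configured this way and must instead be enabled when the bucket is created.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-scientific-raw-data"  # hypothetical bucket name

# Track every object revision for auditing, compliance, and debugging
s3.put_bucket_versioning(
    Bucket=BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# Encrypt at rest by default with a customer-managed KMS key (hypothetical alias)
s3.put_bucket_encryption(
    Bucket=BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": "alias/scientific-data",
            }
        }]
    },
)

# Transition raw instrument files to a cheaper storage tier after 90 days
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-raw-instrument-files",
            "Status": "Enabled",
            "Filter": {"Prefix": "raw/"},
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        }]
    },
)

# Upload with a SHA-256 checksum so S3 can verify integrity end to end.
# S3 Object Lock is not set here; it requires create_bucket with
# ObjectLockEnabledForBucket=True at bucket creation time.
with open("run-001.fastq.gz", "rb") as f:
    s3.put_object(
        Bucket=BUCKET,
        Key="raw/run-001.fastq.gz",
        Body=f,
        ChecksumAlgorithm="SHA256",
    )
```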

These are key capabilities for maintaining the integrity of scientific data on Amazon S3. More information is available on how to enable GxP data integrity and how to build the foundations for GxP-regulated workloads.

Ensuring logical organization through data packages

Life sciences datasets are highly heterogeneous in format, audience, and application. Additionally, scientific data, results, and reports are rarely captured in a single file or archive. As a result, there is no single folder structure that works for large, cross-functional teams. To overcome this heterogeneity, researchers need to be able to distinguish between lightweight logical views of data and durable physical locations for storage; this makes it possible to find the data you need, no matter where it is stored. Data needs to be organized in a way that can span a variety of files and objects, data types, and locations, creating logical packages based on common attributes.

Quilt Data Packages store metadata adjacent to raw data in an open format that is accessible to data analysts through Amazon Athena and other data services. In contrast to common practices such as implicitly capturing metadata in deep folder path hierarchies, or in detached spreadsheets, metadata in Quilt is both durable and queryable. Quilt Data Packages also have a concept of a revision, which includes a top-hash (that is, a top-level checksum). The top-hash serves as a digital fingerprint that data producers and data consumers can use to verify the integrity of every data package at every phase of the scientific data lifecycle. The cryptographic top-hash is one level above S3 checksums and has the benefit of encompassing multiple objects and metadata, whereas S3 checksums encompass individual objects only.
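
To make this concrete, here is a minimal quilt3 sketch that assembles a package, attaches object-level and package-level metadata, and pushes a revision whose top-hash can be verified later. All bucket, package, and metadata names are hypothetical.

```python
import quilt3

# Build a logical package from physical objects (all names hypothetical)
pkg = quilt3.Package()
pkg.set(
    "reads/run-001.fastq.gz",
    "s3://my-raw-bucket/runs/run-001.fastq.gz",
    meta={"instrument": "NovaSeq", "operator": "jdoe"},  # object-level metadata
)
pkg.set("README.md", "README.md")
pkg.set_meta({"project": "oncology-screen", "eln_id": "EXP-1234"})  # package-level

# Push a new revision; the registry records the manifest and its top-hash
pkg.push(
    "genomics/run-001",
    registry="s3://my-raw-bucket",
    message="Initial upload from sequencer",
)
print(pkg.top_hash)  # the digital fingerprint for this revision
```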

Data packages, like objects, can be moved across buckets with lifecycle operations that reflect how complete the packages are, as they transition from raw instrument stages to processed analysis to sealed final packages that are linked to project milestones, findings, and filings.
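
A promotion between lifecycle stages can be scripted as a browse-and-push; a minimal sketch, with hypothetical bucket and package names:

```python
import quilt3

# Promote a processed package to the final-stage bucket (names hypothetical)
pkg = quilt3.Package.browse("genomics/run-001", registry="s3://my-processed-bucket")
pkg.push(
    "genomics/run-001",
    registry="s3://my-final-bucket",
    message="Sealed for project milestone",
)
```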

Moving scientific data and metadata from on premises to the cloud

Cloud-generated and on-premises datasets alike can be captured as Quilt packages with the help of AWS services. AWS DataSync transfers large datasets and is ideal for rapidly moving your instrument-generated files to AWS. AWS Storage Gateway, together with Amazon S3 events, transparently synchronizes on-premises storage to S3 buckets, creating a familiar interface for ad hoc dataset creation that looks to end users like files and folders on a local machine. To provide near real-time availability of instrument data to scientists, you can use AWS Lambda and AWS IoT Greengrass to start transfers as soon as new files are available. For further instrument-to-cloud design patterns, refer to the Life Sciences Data Transfer Architecture or the webinar “Get Life Sciences Data to the Cloud”.
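
For example, a Lambda function can start a pre-configured DataSync task whenever an instrument reports a completed run. This is a sketch only: the task ARN is a hypothetical placeholder, and the triggering event (from S3, IoT, or another source) is assumed to be wired up separately.

```python
import boto3

datasync = boto3.client("datasync")

# Hypothetical ARN of a DataSync task that mirrors the instrument share to S3
TASK_ARN = "arn:aws:datasync:us-east-1:123456789012:task/task-0123456789abcdef0"

def handler(event, context):
    """Start a transfer as soon as the instrument reports a completed run."""
    response = datasync.start_task_execution(TaskArn=TASK_ARN)
    return {"executionArn": response["TaskExecutionArn"]}
```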

Quilt Data Platform

Quilt Data Platform is a comprehensive data management platform that helps organizations centralize and manage data from a variety of sources, including structured, semi-structured, and unstructured data. It improves collaboration through its intuitive web-based catalog with an in-platform document editor and visualizations. Embedded support for tools like Integrative Genomics Viewer (IGV), Apache ECharts, Voila, and customizable JavaScript visualizations assists customers in creating tailored documentation. It also enhances findability through its advanced search capabilities, making search across terabytes or petabytes of data and associated metadata rapid and intuitive.

Quilt Data Platform implements data version control and traceability using Quilt Data Packages to ensure data integrity and provenance. As a result, it provides a shared source of truth for all members of a life sciences organization, whether they are bench scientists, computational biologists, or business stakeholders. Quilt Data Platform helps cross-functional teams make critical decisions with their data using data visualization, embedded documentation, data verification, and lineage, all available through a web application, a Python SDK, and an API.

Data discovery and cataloging

A prime motivation for moving scientific data to the cloud is to reduce data silos. A centralized data discovery tool that supports collaboration across experimental (wet-lab) and computational (dry-lab) science teams, within the same enterprise and across organizations, makes it possible to think about the scientific subject holistically and to ask novel questions of the data.

To discover and interact with datasets within packages, Quilt Data Platform has search and discovery functions that are powered by three key technologies:

  • Amazon OpenSearch Service for full text search of data, metadata, and packages in Amazon S3. Within the graphical user interface, this lets users search for “BRCA1” across all data.
  • Amazon Athena for SQL queries of metadata. Within the graphical user interface, this lets users retrieve a table of all files that came from a given CRO, from a given ELN, or from a given instrument.
  • Amazon API Gateway for programmatic integration of Quilt packages into your own applications using APIs.

With these options, coders and non-coders alike can take advantage of Quilt packages as the anchor point for findability.
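
For instance, a metadata query like the Athena example above can be scripted with boto3. The database, table, and metadata field names below are hypothetical; actual names depend on how your Quilt manifests are registered with Athena.

```python
import boto3

athena = boto3.client("athena")

# Hypothetical Glue database/table over Quilt package manifests
QUERY = """
SELECT logical_key, meta
FROM "quilt_manifests"."genomics_packages"
WHERE json_extract_scalar(meta, '$.user_meta.instrument') = 'NovaSeq'
LIMIT 100
"""

response = athena.start_query_execution(
    QueryString=QUERY,
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(response["QueryExecutionId"])  # poll get_query_execution for completion
```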

Connecting systems in a laboratory data mesh

Quilt packages can be a central component of a laboratory data mesh, which integrates laboratory systems, data, and bioinformatics pipelines. Quilt packages do this by consuming and producing events in the cloud.

For example, whenever a user creates a new experiment in their ELN, Amazon API Gateway or Amazon EventBridge can trigger the creation of a new Quilt package. As a result, each ELN experiment is backed by a versioned data package that contains the full data context, including instrument files that are too large to accommodate in ELNs or document stores.
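
A minimal sketch of such an integration follows: a Lambda function, subscribed to the ELN's events, creates a versioned package annotated with the experiment identifier. The event shape, package naming scheme, and bucket are assumptions, not a prescribed format.

```python
import quilt3

def handler(event, context):
    """Create a versioned data package when the ELN emits a new-experiment event."""
    detail = event["detail"]                 # hypothetical ELN event shape
    experiment_id = detail["experimentId"]

    pkg = quilt3.Package()
    pkg.set_meta({"eln_experiment": experiment_id, "status": "raw"})
    pkg.push(
        f"eln/{experiment_id}",              # hypothetical naming scheme
        registry="s3://my-raw-bucket",
        message=f"Created for ELN experiment {experiment_id}",
    )
    return {"top_hash": pkg.top_hash}
```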

Event-driven data packaging improves the quality and discoverability of your scientific datasets by enabling your pipelines to annotate your data packages with the same unique identifiers and metadata found in your ELN. Conversely, when new data arrives in S3, Amazon EventBridge events from Quilt and S3 can send notifications to an external ELN or LIMS, ensuring that those systems are linked to your latest and most complete data.
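
As a sketch of that reverse direction, the boto3 snippet below creates an EventBridge rule that matches S3 “Object Created” events on a final-stage bucket and routes them to an API destination wrapping an ELN webhook. All ARNs and names are hypothetical, and the bucket is assumed to have EventBridge notifications enabled.

```python
import json
import boto3

events = boto3.client("events")

# Match S3 "Object Created" events on the final-stage bucket (name hypothetical)
events.put_rule(
    Name="notify-eln-on-new-data",
    EventPattern=json.dumps({
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {"bucket": {"name": ["my-final-bucket"]}},
    }),
)

# Route matched events to an API destination wrapping the ELN's webhook
events.put_targets(
    Rule="notify-eln-on-new-data",
    Targets=[{
        "Id": "eln-webhook",
        "Arn": "arn:aws:events:us-east-1:123456789012:api-destination/eln-webhook/1a2b3c4d-5e6f-7a8b-9c0d-1e2f3a4b5c6d",
        "RoleArn": "arn:aws:iam::123456789012:role/eventbridge-eln-invoke",
    }],
)
```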

The Quilt packaging APIs are fully open source so that your business processes are insulated from proprietary file formats and software dependencies that may change or break over time. In the interest of customer autonomy, we recommend that users store all data in open file formats, accessible with open APIs, so that your intellectual property remains vendor-neutral.

Data governance

Beyond managing lifecycle stages that indicate how “complete” data is, and top-hashes that verify the integrity of your datasets, the Quilt Data Platform supports the compliance requirements that govern who can access which data, as well as data retention rules. In addition to using AWS Identity and Access Management (IAM) and resource policies for role- and attribute-based access control, you can use AWS Lake Formation for fine-grained permissions (for example, column-level access control) on tables and objects in Amazon S3. For additional layers of governance, you can use Amazon DataZone to route access requests to your data stewards for approval.
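
As an illustration of column-level control, here is a hedged Lake Formation sketch; the principal, database, and table names are hypothetical.

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Grant column-level SELECT on a curated metadata table (names hypothetical)
lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analysts"
    },
    Resource={
        "TableWithColumns": {
            "DatabaseName": "quilt_manifests",
            "Name": "genomics_packages",
            "ColumnNames": ["logical_key", "meta"],
        }
    },
    Permissions=["SELECT"],
)
```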

Case Study: How Inari Agriculture accelerated product development and commercialization

Inari Agriculture is an AgTech company focused on new plant breeding technology to design seeds for a more sustainable global food system. Inari maintains large amounts of laboratory and field data and is constantly evolving its scientific data management on the cloud to meet new product development and commercialization goals.

From product concept through design, selection, and validation, large amounts of scientific data need to be captured and made available to collaborating teams. Quilt Data Platform enables Inari to share versioned data packages with embedded reports and rich metadata on a single platform that is easily consumed by both programmers and web users.

Quilt’s versatility empowers individual teams to define their own workflows, to gather and curate data and reports from research to lab to field in varied ways, and to share them seamlessly without a large amount of central coordination.

Getting started

Quilt Data Packages can be adopted by your organization with free-to-use open source libraries. Developers can start creating packages from the command line or by using the open source quilt3 Python client. These Quilt packages can be used in a local or cloud environment and interact with other AWS services, as shown in Figure 1.
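
A minimal consumption sketch with quilt3, assuming a hypothetical package and registry:

```python
# pip install quilt3
import quilt3

# Browse a package revision without downloading its data (names hypothetical)
pkg = quilt3.Package.browse("genomics/run-001", registry="s3://my-raw-bucket")
print(pkg.meta)          # package-level metadata
print(list(pkg.keys()))  # top-level logical keys

# Download the full package contents, pinned to its verifiable top-hash
quilt3.Package.install(
    "genomics/run-001",
    registry="s3://my-raw-bucket",
    dest="./run-001",
)
```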

Figure 1: Overall concept showing how Quilt Data connects on-premises scientific lab resources to cloud storage and the compute required for analysis. The Quilt Data Platform creates an easy way to keep track of your scientific data throughout its lifecycle.

For customers seeking more off-the-shelf functionality for working with Quilt packages, the Quilt Data Platform, a commercial offering available on AWS Marketplace, provides the Quilt Catalog: a managed graphical user experience for your organization with single sign-on (SSO), search, and query functions. This private data portal runs in your VPC and is called a Quilt instance.

The installation of a Quilt instance consists of an AWS CloudFormation stack that is privately hosted in your AWS account. The stack includes backend services for the catalog, an S3 proxy, SSO, user identities and IAM policies, an Amazon OpenSearch Service cluster, and more. The Quilt CloudFormation template automatically configures appropriate instance sizes for Amazon RDS, Amazon ECS on AWS Fargate, AWS Lambda, and Amazon OpenSearch Service. Users may choose to adjust the size and configuration of their OpenSearch cluster. You may host the Quilt instance in your own Amazon Virtual Private Cloud (Amazon VPC), or have the Quilt stack create its own VPC.

Conclusion

Quilt Data Packages can accelerate the creation of a laboratory data mesh on AWS by providing logical data organization, metadata management, and data integrity controls throughout the scientific data lifecycle. Moreover, packages enable you to adopt a vendor-agnostic approach, as packages are stored in an open format and are compatible across vendor file types. Finally, Quilt packages are suited for data management at scale, integrating with Amazon S3 as the underlying data store. You can begin working today with the quilt3 client, or contact your Quilt Data or AWS representative to get started.

Aneesh Karve

Aneesh Karve is Chief Technology Officer at Quilt Data. His academic background spans visualization, number theory, and chemistry. Aneesh has 15 years of experience managing and delivering software across startups and enterprises including Y Combinator, Matterport, and NVIDIA. He currently focuses on the interface between humans and data, with the goal of producing shared understanding across large, cross-functional teams in drug discovery.

Ariella Sasson

Ariella Sasson, Ph.D., is a Principal Solutions Architect specializing in genomics and life sciences. Ariella has a background in math and computer science, a PhD in Computational Biology, and over a decade of experience working in clinical genomics, oncology, and pharma. She is passionate about using technology and big data to accelerate HCLS research, genomics, and personalized medicine.

Lee Tessler

Lee Tessler, Ph.D., is a Principal Technology Strategist for the Healthcare & Life Sciences industry at AWS. His focus is on cloud architectures for modernizing R&D, clinical trials, manufacturing, and patient engagement. Prior to joining AWS, he launched products in the areas of bioinformatics, drug discovery, diagnostics, lab instruments, and pharma manufacturing. Lee holds a Ph.D. in computational biology from Washington University in St. Louis and an Sc.B. from Brown University.