Guidance for a Laboratory Data Mesh on AWS

Overview

This Guidance demonstrates how to build a scientific data management system that integrates both laboratory instrument data and software with cloud data governance, data discovery, and bioinformatics pipelines, capturing key metadata events along the way. It starts once an experiment or project is initiated and electronic lab notebooks (ELNs) or lab information management systems (LIMS) notify a metadata catalog on AWS. After data is collected from instruments, the data moves to a data store that is associated with the metadata catalog. Bioinformatics results are captured in the data store, with all new files linked to the ELN or LIMS through the metadata store. All data is governed and discoverable by searching the metadata to find data assets, or by configuring a natural language search with a chat interface.
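The flow described above (an ELN or LIMS registers an experiment, files are linked to it as they arrive, and assets are found by searching metadata) can be sketched with a small in-memory model. This is only an illustration: `MetadataCatalog`, `Experiment`, the experiment ID, and the bucket name are hypothetical and are not part of the Guidance, which uses managed AWS services for the catalog and data store.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Experiment:
    """One ELN or LIMS experiment and the data files linked to it."""
    experiment_id: str
    source_system: str  # for example "ELN" or "LIMS"
    description: str
    files: List[str] = field(default_factory=list)


class MetadataCatalog:
    """In-memory stand-in for the metadata catalog (hypothetical)."""

    def __init__(self) -> None:
        self._experiments: Dict[str, Experiment] = {}

    def register_experiment(self, experiment_id: str, source_system: str,
                            description: str) -> Experiment:
        # Called when the ELN or LIMS announces a new experiment.
        experiment = Experiment(experiment_id, source_system, description)
        self._experiments[experiment_id] = experiment
        return experiment

    def link_file(self, experiment_id: str, uri: str) -> None:
        # Called when instrument data or a pipeline result lands in the data store.
        # Raises KeyError if the experiment was never registered.
        self._experiments[experiment_id].files.append(uri)

    def search(self, term: str) -> List[Experiment]:
        # Case-insensitive match on the experiment ID or description.
        needle = term.lower()
        return [
            e for e in self._experiments.values()
            if needle in e.experiment_id.lower() or needle in e.description.lower()
        ]


catalog = MetadataCatalog()
catalog.register_experiment("EXP-001", "ELN", "Tumor RNA-seq pilot")
catalog.link_file("EXP-001", "s3://example-lab-bucket/raw/EXP-001/run1.fastq.gz")
catalog.link_file("EXP-001", "s3://example-lab-bucket/results/EXP-001/counts.tsv")
hits = catalog.search("rna-seq")
```

The point of the sketch is the linkage: every file, raw or derived, is attached to the experiment record that originated in the ELN or LIMS, so a metadata search returns the experiment together with its data assets.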

How it works

Overview

This architecture diagram gives an overview of how you can accelerate the launch of a scientific data management system that integrates your laboratory instruments and software with cloud data governance, data discovery, and bioinformatics pipelines, capturing key metadata events along the way. For more details on each component, see the Main architecture section.

Diagram illustrating the AWS Laboratory Data Mesh architecture, showing how scientific data moves from on-premises lab instruments and systems to the AWS Cloud. It highlights governance and data discovery with Amazon Kendra and Amazon DataZone, connectivity through Amazon API Gateway and AWS Lambda, data movement using AWS DataSync and Amazon S3, and bioinformatics pipelines using AWS HealthOmics and AWS Step Functions.

Main architecture

This architecture diagram shows the main architecture and provides more details about each component. For more details and architectural considerations, visit the Implementation Guide.

Architecture diagram illustrating the main components and data flows of an AWS-based laboratory data mesh, showing integration between on-premises lab systems, AWS Cloud services like S3, Lambda, DataSync, Step Functions, HealthOmics, DataZone, and Kendra for bioinformatics and life sciences data management.

Well-Architected Pillars

The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.

This Guidance was configured with Amazon API Gateway and Step Functions, two AWS services that help you run and monitor your research systems effectively, gain insights into operations, and continually improve your processes. API Gateway creates RESTful APIs that enable two-way communication between AWS and your lab software. It acts as a front door through which lab software and AWS share logic and metadata, so datasets stay contextualized and up to date in both the ELN and AWS. Step Functions is a visual workflow service that automates microservice processes between the data store, the metadata store, and lab software. It orchestrates these steps so that metadata no longer needs manual updates and the ELN, the data store, and the metadata store stay in sync with one another.
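As an illustration of the "front door" role, the following is a minimal sketch of an AWS Lambda handler behind an API Gateway REST API that uses Lambda proxy integration. It assumes the ELN posts a JSON body; the field names (`experiment_id`, `source_system`, `description`) are hypothetical and are not taken from the Guidance. The sketch only validates the notification and echoes a normalized record; writing to the metadata store or starting a workflow is left out.

```python
import json

# Hypothetical fields that an ELN or LIMS notification is assumed to carry.
REQUIRED_FIELDS = ("experiment_id", "source_system")


def _response(status_code, payload):
    # Lambda proxy integration expects statusCode, optional headers, and a string body.
    return {
        "statusCode": status_code,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
    }


def handler(event, context):
    """Validate an ELN/LIMS notification delivered through API Gateway."""
    try:
        body = json.loads(event.get("body") or "{}")
    except json.JSONDecodeError:
        return _response(400, {"error": "body is not valid JSON"})
    if not isinstance(body, dict):
        return _response(400, {"error": "body must be a JSON object"})

    missing = [name for name in REQUIRED_FIELDS if not body.get(name)]
    if missing:
        return _response(400, {"error": "missing fields", "fields": missing})

    record = {
        "experiment_id": body["experiment_id"],
        "source_system": body["source_system"],
        "description": body.get("description", ""),
    }
    # A real deployment would persist the record or start a workflow here.
    return _response(202, {"accepted": record})


ok = handler({"body": json.dumps({"experiment_id": "EXP-001", "source_system": "ELN"})}, None)
bad = handler({"body": json.dumps({"experiment_id": "EXP-001"})}, None)
```

Returning 202 rather than 200 is a design choice in this sketch: it signals that the notification was accepted for asynchronous processing, which fits a flow where a downstream workflow does the actual synchronization.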

Read the Operational Excellence whitepaper

Amazon DataZone and Storage Gateway work in concert to improve your security posture, protecting your data, systems, and assets. Amazon DataZone lets users access data in accordance with their organization’s security and compliance regulations, providing unified access controls to scientific data across multiple data domains and third-party data stores. Storage Gateway supports data integrity efforts with encryption, audit logging, and write-once, read-many (WORM) storage from on-premises applications to the data mesh. It provides lab users access to cloud-backed files for use in report generation or local analysis, while making it easy to maintain metadata tagging in the data mesh.

Read the Security whitepaper

Amazon S3 and DataSync are built to ensure your workloads perform their intended functions correctly and consistently while allowing you to recover quickly from failure. Amazon S3 is a highly available and durable object store with cross-Region options for global organizations. DataSync provides managed data transfer with advanced features, including bandwidth throttling, migration scheduling, task filtering, and task reporting. By liberating data from on-premises file stores, DataSync and Amazon S3 provide a reusable transfer and storage architecture that can scale from small to large.
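The DataSync features named above (bandwidth throttling, scheduling, and task filtering) map onto task settings. The sketch below builds a request dictionary shaped like the keyword arguments of the DataSync `create_task` call in the AWS SDK for Python; it does not call AWS. The function name, task name, ARNs, file patterns, and schedule are placeholders chosen for illustration, and the parameter names should be checked against the current DataSync API reference before use.

```python
from typing import Dict, List


def build_datasync_task_request(
    source_location_arn: str,
    destination_location_arn: str,
    max_mib_per_second: int,
    include_patterns: List[str],
    schedule_expression: str,
) -> Dict:
    """Assemble DataSync task settings for an instrument-to-S3 transfer."""
    if max_mib_per_second <= 0:
        raise ValueError("max_mib_per_second must be positive")
    return {
        "Name": "instrument-data-to-s3",  # placeholder task name
        "SourceLocationArn": source_location_arn,
        "DestinationLocationArn": destination_location_arn,
        "Options": {
            # Bandwidth throttling: cap throughput so lab networks stay usable.
            "BytesPerSecond": max_mib_per_second * 1024 * 1024,
            "VerifyMode": "ONLY_FILES_TRANSFERRED",
        },
        # Task filtering: patterns are joined with "|" into one filter value.
        "Includes": [
            {"FilterType": "SIMPLE_PATTERN", "Value": "|".join(include_patterns)}
        ],
        # Scheduling: run the transfer on a recurring schedule.
        "Schedule": {"ScheduleExpression": schedule_expression},
    }


request = build_datasync_task_request(
    source_location_arn="arn:aws:datasync:us-east-1:111122223333:location/loc-source-example",
    destination_location_arn="arn:aws:datasync:us-east-1:111122223333:location/loc-dest-example",
    max_mib_per_second=50,
    include_patterns=["/raw/*.fastq.gz", "/raw/*.bam"],
    schedule_expression="cron(0 2 * * ? *)",  # 02:00 UTC every day
)
```

Keeping the settings in a plain dictionary makes the same transfer definition reusable across instruments: only the locations and patterns change.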

Read the Reliability whitepaper

AWS Batch and HealthOmics both help you monitor performance and maintain efficiency for your workloads as business needs evolve. AWS Batch offers a flexible, high-performance computing configuration that bioinformatics groups can tune and scale as life science workloads dictate. It provides on-demand access to virtually unlimited computing resources to accelerate genomics, proteomics, cell imaging, electron microscopy, and high-throughput simulation.

HealthOmics supports both Ready2Run workflows and private, bring-your-own bioinformatics workflows, which simplifies the deployment of high-performance compute workflows. It includes pre-built workflows designed by industry-leading third-party software companies along with common, open-source pipelines to help you get started quickly.
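The two workflow options differ mainly in where the workflow definition comes from, so a run request can look nearly identical for both. The sketch below builds such a request as a plain dictionary. The key names follow my understanding of the HealthOmics `start_run` call in the AWS SDK for Python and should be verified against the current API reference; the function name, workflow ID, role ARN, bucket, and parameter names are hypothetical. No AWS call is made.

```python
from typing import Dict

# The two workflow kinds described above.
WORKFLOW_TYPES = ("READY2RUN", "PRIVATE")


def build_run_request(
    workflow_id: str,
    workflow_type: str,
    role_arn: str,
    output_uri: str,
    parameters: Dict[str, str],
    name: str,
) -> Dict:
    """Assemble a HealthOmics run request for a Ready2Run or private workflow."""
    if workflow_type not in WORKFLOW_TYPES:
        raise ValueError(f"workflow_type must be one of {WORKFLOW_TYPES}")
    if not output_uri.startswith("s3://"):
        raise ValueError("output_uri must be an s3:// URI")
    return {
        "workflowId": workflow_id,
        "workflowType": workflow_type,
        "roleArn": role_arn,
        "name": name,
        "outputUri": output_uri,      # results land in the data store
        "parameters": parameters,     # inputs defined by the workflow itself
    }


run_request = build_run_request(
    workflow_id="1234567",  # placeholder ID
    workflow_type="PRIVATE",
    role_arn="arn:aws:iam::111122223333:role/ExampleOmicsRunRole",
    output_uri="s3://example-lab-bucket/results/EXP-001/",
    parameters={"sample_name": "EXP-001-S1"},
    name="EXP-001-alignment",
)
```

Writing results under a prefix that carries the experiment ID is one simple way to keep pipeline outputs traceable to the originating ELN or LIMS record.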

Read the Performance Efficiency whitepaper

The Amazon S3 Intelligent-Tiering storage class automatically reduces storage costs as data access patterns change over the lifecycle of instrument data, so savings align with the way that scientific data is used. For example, you can move raw instrument data to storage classes designed for less frequent access once that data has been processed. This Guidance also optimizes cost with HealthOmics sequence stores. These are genomics-aware data stores that support large-scale analysis and collaborative research across entire populations, and they reduce long-term storage costs by automatically moving data objects that have not been accessed within 30 days to an archive storage class. HealthOmics can also store petabytes of omics data efficiently and cost effectively, supporting scientific discovery at population scale.
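One way to express "move raw instrument data once it has been processed" is an S3 lifecycle rule that applies only to objects carrying a "processed" tag. The sketch below builds a lifecycle configuration dictionary in the shape that the S3 `put_bucket_lifecycle_configuration` call accepts; it does not call AWS. The function name, rule ID, prefix, and the tag key and value are assumptions for illustration, not values from the Guidance.

```python
from typing import Dict

# Storage classes that S3 lifecycle transitions can target.
TRANSITION_CLASSES = {
    "STANDARD_IA", "ONEZONE_IA", "INTELLIGENT_TIERING",
    "GLACIER_IR", "GLACIER", "DEEP_ARCHIVE",
}


def build_lifecycle_configuration(
    raw_prefix: str,
    tag_key: str,
    tag_value: str,
    days: int,
    storage_class: str,
) -> Dict:
    """Build a rule that transitions tagged raw data to another storage class."""
    if storage_class not in TRANSITION_CLASSES:
        raise ValueError(f"unsupported storage class: {storage_class}")
    if days < 0:
        raise ValueError("days must not be negative")
    # S3 requires objects to be at least 30 days old before moving to the IA classes.
    if storage_class in ("STANDARD_IA", "ONEZONE_IA") and days < 30:
        raise ValueError("transitions to IA classes require days >= 30")
    return {
        "Rules": [
            {
                "ID": "processed-raw-data-transition",  # hypothetical rule ID
                "Status": "Enabled",
                # Match only objects under the prefix that carry the tag.
                "Filter": {
                    "And": {
                        "Prefix": raw_prefix,
                        "Tags": [{"Key": tag_key, "Value": tag_value}],
                    }
                },
                "Transitions": [{"Days": days, "StorageClass": storage_class}],
            }
        ]
    }


lifecycle = build_lifecycle_configuration(
    raw_prefix="raw/",
    tag_key="processing-status",
    tag_value="processed",
    days=30,
    storage_class="STANDARD_IA",
)
```

With a rule like this, the pipeline only has to tag an object when processing finishes; S3 then applies the transition on its own schedule.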

Read the Cost Optimization whitepaper

DataSync and HealthOmics work in tandem to minimize the environmental impacts of running cloud workloads. For example, DataSync rapidly migrates instrument files to the cloud for data storage and archival, reducing the need to expand an on-premises data center. HealthOmics automatically provisions and scales your compute infrastructure, which removes the need to manage servers and releases unused compute resources, reducing waste.

Read the Sustainability whitepaper

Implementation resources

A detailed guide is provided for you to experiment with and use within your AWS account. It walks through each stage of building the Guidance, including deployment, usage, and cleanup.

Open implementation guide

Disclaimer

The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.

Related content

Guidance for Development, Automation, Implementation, and Monitoring of Bioinformatics Workflows on AWS

Guidance for Digital Connected Lab on AWS