This Guidance demonstrates how to build a scientific data management system that integrates both laboratory instrument data and software with cloud data governance, data discovery, and bioinformatics pipelines, capturing key metadata events along the way. It starts once an experiment or project is initiated and electronic lab notebooks (ELNs) or lab information management systems (LIMS) notify a metadata catalog on AWS. After data is collected from instruments, the data moves to a data store that is associated with the metadata catalog. Bioinformatics results are captured in the data store, with all new files linked to the ELN or LIMS through the metadata store. All data is governed and discoverable by searching the metadata to find data assets, or by configuring a natural language search with a chat interface.
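
As a concrete illustration of this flow, the following minimal sketch uploads an instrument output file to the data store (Amazon S3) and tags it with identifiers from the ELN so the metadata catalog can link the object back to the originating experiment. The bucket, key, and tag names are hypothetical placeholders, not values defined by this Guidance.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical names: replace with your own bucket, key layout, and ELN IDs.
BUCKET = "example-scientific-data-store"
KEY = "instrument-runs/plate-reader/2024-06-01/run-0042.csv"

# Upload an instrument output file to the data store.
s3.upload_file("run-0042.csv", BUCKET, KEY)

# Tag the object with the ELN experiment ID so the metadata catalog can
# associate the file with the notebook entry that produced it.
s3.put_object_tagging(
    Bucket=BUCKET,
    Key=KEY,
    Tagging={
        "TagSet": [
            {"Key": "eln_experiment_id", "Value": "EXP-2024-0042"},
            {"Key": "instrument", "Value": "plate-reader-01"},
            {"Key": "data_classification", "Value": "raw"},
        ]
    },
)
```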

Please note: Refer to the Disclaimer section at the end of this page.

Architecture Diagram

Download the architecture diagram PDF 
  • Overview
  • This architecture diagram shows an overview of how you can accelerate the launch of a scientific data management system that integrates your laboratory instruments and software with cloud data governance, data discovery, and bioinformatics pipelines, capturing key metadata events along the way. For more details on each component, see the Main Architecture tab.

  • Main Architecture
  • This architecture diagram shows the main architecture and provides more details about each component. For more details and architectural considerations, visit the Implementation Guide.

Well-Architected Pillars

The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.

The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.

  • This Guidance was configured with Amazon API Gateway and Step Functions, two AWS services that are purpose-built to help you run and monitor your research systems effectively, gain insights into operations, and continually improve your processes. Specifically, API Gateway creates RESTful APIs that enable two-way communication between AWS and your lab software. It acts as a front door through which lab software and AWS share logic and metadata, keeping the contextualization of datasets in the ELN and AWS up to date. Step Functions is a visual workflow service that automates microservice processes between the data store, the metadata store, and lab software, orchestrating metadata events so that manual metadata updates are no longer needed and the ELN, the data store, and the metadata store stay in sync with one another (a minimal orchestration sketch follows this list).

    Read the Operational Excellence whitepaper 
  • Amazon DataZone and Storage Gateway work in concert to improve your security posture, protecting your data, systems, and assets. Amazon DataZone lets users access data in accordance with their organization’s security and compliance regulations, providing unified access controls to scientific data across multiple data domains and third-party data stores (a data discovery sketch follows this list). Storage Gateway supports data integrity efforts with encryption, audit logging, and write-once, read-many (WORM) storage from on-premises applications to the data mesh. It provides lab users with access to cloud-backed files for report generation or local analysis, while making it easy to maintain metadata tagging in the data mesh.

    Read the Security whitepaper 
  • Amazon S3 and DataSync are built to ensure your workloads perform their intended functions correctly and consistently while allowing you to recover quickly from failure. Amazon S3 is a highly available and durable object store with cross-Region options for global organizations. DataSync provides managed data transfer with advanced features, including bandwidth throttling, migration scheduling, task filtering, and task reporting (a data transfer sketch follows this list). By liberating data from on-premises file stores, DataSync and Amazon S3 provide a reusable transfer and storage architecture that can scale from small to large deployments.

    Read the Reliability whitepaper 
  • AWS Batch and HealthOmics both help you monitor performance and maintain efficiency for your workloads as business needs evolve. AWS Batch offers a flexible, high-performance computing configuration and virtually unlimited scale, allowing bioinformatics groups to tune and scale infrastructure as life science workloads dictate. It brings instant, on-demand access to compute resources to accelerate genomics, proteomics, cell imaging, electron microscopy, and high-throughput simulation.

    HealthOmics supports both Ready2Run workflows and private, bring-your-own bioinformatics workflows, simplifying the deployment of high-performance compute pipelines (a workflow run sketch follows this list). It includes pre-built workflows designed by industry-leading third-party software companies, along with common open-source pipelines, to help you get started quickly.

    Read the Performance Efficiency whitepaper 
  • The Amazon S3 Intelligent-Tiering storage class delivers automatic storage cost savings as access patterns change over the lifecycle of instrument data, aligning costs with the way scientific data is actually used. For example, you can move raw instrument data to lower-access-frequency storage classes once that data has been processed (a lifecycle configuration sketch follows this list). Cost is also optimized in this Guidance with HealthOmics sequence stores: genomics-aware data stores that support large-scale analysis and collaborative research across entire populations, reducing long-term storage costs by automatically moving data objects that have not been accessed within 30 days to an archive storage class. HealthOmics also allows petabytes of omics data to be stored efficiently and cost-effectively, enabling scientific discovery at population scale.

    Read the Cost Optimization whitepaper 
  • DataSync and HealthOmics work in tandem to minimize the environmental impacts of running cloud workloads. For example, DataSync rapidly migrates instrument files to the cloud for storage and archival, relieving the need for an expanding on-premises data center. And HealthOmics automatically provisions and scales your compute infrastructure, removing the need to manage servers and releasing unused compute capacity back to the service, which reduces wasted resources.

    Read the Sustainability whitepaper 
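
The following sketches expand on the pillars above under stated assumptions; they are starting points, not the Guidance's deployed implementation. First, a minimal sketch of the operational excellence orchestration: a Lambda-style handler behind an API Gateway REST API that starts a Step Functions execution whenever the ELN or LIMS reports a metadata event. The state machine ARN, event fields, and payload shape are hypothetical.

```python
import json

import boto3

sfn = boto3.client("stepfunctions")

# Hypothetical state machine that keeps the ELN, data store, and metadata
# store in sync; replace the ARN with your own.
STATE_MACHINE_ARN = (
    "arn:aws:states:us-east-1:111122223333:stateMachine:MetadataSyncWorkflow"
)


def handler(event, context):
    """Lambda handler behind an API Gateway REST API (proxy integration).

    The ELN or LIMS calls the API when an experiment is created or updated;
    the handler starts a Step Functions execution that propagates the
    metadata event to the data store and the metadata catalog.
    """
    body = json.loads(event.get("body") or "{}")
    execution = sfn.start_execution(
        stateMachineArn=STATE_MACHINE_ARN,
        input=json.dumps(
            {
                "experimentId": body.get("experimentId"),
                "eventType": body.get("eventType", "EXPERIMENT_CREATED"),
                "payload": body,
            }
        ),
    )
    return {
        "statusCode": 202,
        "body": json.dumps({"executionArn": execution["executionArn"]}),
    }
```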
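
Next, a data discovery sketch for the security pillar. It assumes the Amazon DataZone SearchListings API and a hypothetical domain ID; access to the underlying data is still governed by DataZone's subscription and access controls.

```python
import boto3

datazone = boto3.client("datazone")

# Hypothetical Amazon DataZone domain; replace with your own domain ID.
DOMAIN_ID = "dzd_exampledomain"

# Search the domain's published listings for assay datasets. Results are
# catalog entries; reading the data itself requires an approved subscription.
response = datazone.search_listings(
    domainIdentifier=DOMAIN_ID,
    searchText="plate reader assay results",
    maxResults=10,
)

for item in response.get("items", []):
    listing = item.get("assetListing", {})
    print(listing.get("name"), listing.get("entityId"))
```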
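
For the reliability pillar, this sketch starts a one-off DataSync task execution that transfers only the files for a single instrument run and throttles bandwidth to roughly 100 Mbps. The task ARN, filter pattern, and bandwidth limit are placeholders for an existing DataSync task in your account.

```python
import boto3

datasync = boto3.client("datasync")

# Hypothetical DataSync task that copies instrument files from an on-premises
# file share to Amazon S3; replace the ARN with your own task.
TASK_ARN = "arn:aws:datasync:us-east-1:111122223333:task/task-0123456789abcdef0"

# Start an execution that only transfers one run's folder, with bandwidth
# throttled to about 100 Mbps (12,500,000 bytes per second).
execution = datasync.start_task_execution(
    TaskArn=TASK_ARN,
    Includes=[{"FilterType": "SIMPLE_PATTERN", "Value": "/runs/2024-06-01/*"}],
    OverrideOptions={"BytesPerSecond": 12_500_000},
)
print("Started:", execution["TaskExecutionArn"])
```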
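
For the performance efficiency pillar, this sketch starts a HealthOmics workflow run with boto3. The workflow ID, IAM role, parameters, and output location are hypothetical, and each Ready2Run or private workflow defines its own required parameters.

```python
import boto3

omics = boto3.client("omics")

# Hypothetical identifiers: a Ready2Run (or private) workflow ID, an IAM role
# that HealthOmics assumes during the run, and an S3 prefix for results.
WORKFLOW_ID = "1234567"
ROLE_ARN = "arn:aws:iam::111122223333:role/OmicsWorkflowRole"
OUTPUT_URI = "s3://example-scientific-data-store/bioinformatics/results/"

# Start the run; HealthOmics provisions and scales the underlying compute,
# so there is no cluster for the bioinformatics team to manage.
run = omics.start_run(
    workflowId=WORKFLOW_ID,
    workflowType="READY2RUN",
    roleArn=ROLE_ARN,
    name="fastq-to-vcf-EXP-2024-0042",
    parameters={"sample_id": "EXP-2024-0042"},
    outputUri=OUTPUT_URI,
)
print("Run ID:", run["id"])
```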
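
Finally, a cost optimization sketch: an S3 lifecycle configuration that moves raw instrument data into S3 Intelligent-Tiering immediately and transitions older processed runs to an archive storage class. The bucket name, prefixes, and transition timings are illustrative assumptions, not recommendations from this Guidance.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and prefixes; adjust to match how instrument data is
# organized in your own data store.
BUCKET = "example-scientific-data-store"

s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [
            {
                # Let S3 Intelligent-Tiering place raw instrument data in the
                # right access tier as usage patterns change.
                "ID": "raw-data-to-intelligent-tiering",
                "Status": "Enabled",
                "Filter": {"Prefix": "instrument-runs/"},
                "Transitions": [
                    {"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}
                ],
            },
            {
                # Archive processed results a year after they are written.
                "ID": "archive-processed-runs",
                "Status": "Enabled",
                "Filter": {"Prefix": "processed/"},
                "Transitions": [
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"}
                ],
            },
        ]
    },
)
```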

Implementation Resources

A detailed implementation guide is provided for you to experiment with and use within your AWS account. It examines each stage of building the Guidance, including deployment, usage, and cleanup, to prepare it for deployment in your environment.

Related Content

Healthcare & Life Sciences | Guidance
Guidance for Development, Automation, Implementation, and Monitoring of Bioinformatics Workflows on AWS
This Guidance shows how you can build and run production-grade bioinformatics workflows at scale.

Healthcare & Life Sciences | Guidance
Guidance for Digital Connected Lab on AWS
This Guidance helps you learn how to transfer life sciences data files to the cloud and provide data access using Amazon Web Services (AWS).

Disclaimer

The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.

References to third-party services or organizations in this Guidance do not imply an endorsement, sponsorship, or affiliation between Amazon or AWS and the third party. Guidance from AWS is a technical starting point, and you can customize your integration with third-party services when you deploy the architecture.
