AWS for Industries

Automated genomic data processing with AWS HealthOmics & TetraScience

Data governance strategies based on the FAIR Data Principles can increase operational efficiency and accelerate innovation in R&D. Amazon Web Services (AWS), along with partners such as TetraScience, enables pharmaceutical and biotech customers to adopt FAIR data governance practices with solutions that help process and store large-scale data for research and diagnostic purposes.

This blog shows how AWS HealthOmics and TetraScience can make lab instrument and file datasets available for downstream analysis using FAIR principles. HealthOmics is a managed service for storing, querying, and analyzing genomics, transcriptomics, and other omics data.

TetraScience provides solutions to capture and enable centralized, standardized access to scientific data via the Tetra Scientific Data Cloud, a purpose-built solution consisting of three pillars: the Tetra Data Platform (TDP), an AWS-native cloud data solution; the Tetra Partner Network; and Tetra Catalysts, embedded analysts with deep domain expertise in both science and technology. Together, these facilitate secure and reliable collection of data from the diverse range of laboratory instruments used in pharmaceutical and biotech organizations.

The FAIR Data Principles for scientific data management and stewardship were published in 2016. They were intended to break down data silos by providing guidelines for making data Findable, Accessible, Interoperable, and Reusable. Following them enables life science organizations to improve data sharing, educate data consumers about the datasets available to them, and improve transparency and provenance. Together, these principles improve overall data governance.

Diversity and variation in data sources, storage, and coding systems make acquiring, managing, and harmonizing life science datasets complex. The digital immaturity of laboratory instrumentation, and of the data files it produces, creates challenges for implementing the FAIR Principles and increases the risk of data becoming siloed. On-premises workloads that implement FAIR principles are expensive to run and operate due to throughput bottlenecks and manual-input requirements. Gaps in effective data governance and management (accessibility, accuracy, and consistency) can result in a long time-to-market and billions of dollars in cost.

Migrating digital lab workloads to the cloud provides the scalability, flexibility, and automation needed to deliver insights from large and diverse datasets. Pairing AWS HealthOmics with TetraScience allows bioinformatics teams to overcome existing challenges, implement a FAIR data governance strategy, and automate data processing, which can reduce time-to-market and overall costs. TetraScience can be a central component of a lab data mesh architecture and accelerate an organization’s Digital Labs Strategy.

Here we walk through a solution that ingests raw genomic sequencing data into the TDP, processes it with HealthOmics, and returns the results to the TDP for further analysis. For data processing, it uses HealthOmics Ready2Run workflows implementing the Broad Institute’s Genome Analysis Toolkit (GATK) Best Practices for data pre-processing and germline variant discovery. Orchestration via AWS Lambda, AWS Step Functions, Amazon DynamoDB, and Amazon Simple Notification Service (Amazon SNS) integrates HealthOmics with the TDP. Code to deploy this solution into your account is available on GitHub.

Solution Overview

The solution uses the TDP as a data lake for governed access (Accessible) to both raw and processed data, automatically records data provenance (Findable), and enables data re-use with other projects (Reusable/Interoperable).

The provided solution consists of four parts:

  1. Genomics files from the output directories of sequencers are collected and uploaded to the TDP.
  2. Tetra Data Pipelines stage data from the TDP to Amazon Simple Storage Service (Amazon S3).
  3. Orchestration with SNS and Step Functions triggers HealthOmics workflows to process the staged data in S3. Results are written back to S3 or HealthOmics Analytics stores.
  4. Processed results are published back to the TDP (“loopback”).
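The choreography between steps 3 and 4 can be pictured with a minimal Amazon EventBridge event pattern that matches HealthOmics run status-change events. This is an illustrative sketch only; the exact `detail` fields are an assumption and should be checked against the event shape emitted in your account:

```python
# Sketch: an EventBridge event pattern that could trigger the TDP loopback
# (step 4) once a HealthOmics workflow run (step 3) finishes.
# The "detail" contents below are illustrative, not taken from the solution.

event_pattern = {
    "source": ["aws.omics"],              # events emitted by HealthOmics
    "detail-type": ["Run Status Change"],
    "detail": {"status": ["COMPLETED"]},  # only react to successful runs
}

# This pattern would be attached to an EventBridge rule whose target is the
# Step Functions state machine that publishes results back to the TDP.
```

A failed-run branch could be added by including `"FAILED"` in the status list and routing it to an SNS topic for notification.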

Figure 1: Solution Architecture integrating the Tetra Data Platform with AWS HealthOmics.

Prerequisites

The solution has the following prerequisites:

  • An AWS account with permissions to create the AWS resources used in this blog (at a minimum, you need permission to create IAM roles, granted through the CAPABILITY_NAMED_IAM capability during the AWS CloudFormation deployment)
  • A Tetra Data Platform tenant (requires a subscription)

Deployment

Detailed deployment steps can be found in the README file in the GitHub repository associated with this solution. It walks you through deploying the solution with an AWS SAM template, which initiates an AWS CloudFormation stack (you choose the stack name); you can reference that name later in this document when cleaning up resources.

Tetra Data Pipelines

Tetra Data Pipelines automate data operations and transformations by taking actions on files as they are loaded into the Tetra Data Platform. TetraScience provides a pipeline artifact for pushing data to S3. TetraScience also provides a library of pipeline artifacts within the TDP instance (navigate to Artifacts, Protocols) that includes:

  • Parsers and normalized data models for hundreds of scientific instruments and proprietary data formats, enabling interoperability for integration, analytics, and data science use cases
  • Common integration endpoints, such as electronic lab notebooks (ELNs), laboratory information management systems (LIMS), and cloud services like Amazon S3

Customers can also create their own artifacts using the TetraScience Python SDK.

Data staging in Amazon S3

Raw genomic sequencing data received by TDP is staged in S3 buckets before downstream processing with HealthOmics. Raw and processed data in S3 can also be imported into purpose-built HealthOmics Sequence and Analytics stores. HealthOmics Workflows can process data from S3 or HealthOmics Sequence stores. Workflow results are written back to S3 and the TDP.
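As an illustration of the Sequence store import, the sketch below builds a request payload for the boto3 `start_read_set_import_job` call. It is a hedged example, not the solution’s code: the store ID, role ARN, S3 URI, and sample identifiers are placeholders.

```python
# Sketch: build a HealthOmics read-set import request for a FASTQ file
# staged in S3. All IDs, ARNs, and URIs below are placeholders.

def build_read_set_import(sequence_store_id, role_arn, s3_uri,
                          sample_id, subject_id):
    """Return keyword arguments for omics.start_read_set_import_job."""
    return {
        "sequenceStoreId": sequence_store_id,
        "roleArn": role_arn,  # role HealthOmics assumes to read the bucket
        "sources": [
            {
                "sourceFiles": {"source1": s3_uri},
                "sourceFileType": "FASTQ",
                "sampleId": sample_id,
                "subjectId": subject_id,
                "name": f"{sample_id}-reads",
            }
        ],
    }

request = build_read_set_import(
    "1234567890",                                    # placeholder store ID
    "arn:aws:iam::111122223333:role/OmicsImport",    # placeholder role ARN
    "s3://example-staging-bucket/sample1.fastq.gz",  # staged raw data
    sample_id="sample1",
    subject_id="subject1",
)
# To submit: boto3.client("omics").start_read_set_import_job(**request)
```

Paired-end reads would add a `source2` entry next to `source1` in `sourceFiles`.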

Data processing with AWS service orchestration

A user creates a metadata file that specifies which HealthOmics workflow to run and which workflow parameters to use, such as input files. Uploading this file to TDP triggers processing in AWS. First, raw genomic sequencing files are imported into a HealthOmics Sequence store. Second, the specified workflow is run using data from S3 and the Sequence store. Third, the results of the workflow are sent back to TDP. These steps are coordinated using a hybrid of the orchestration and choreography architectural patterns: each step uses a separate Step Functions state machine to orchestrate the services it invokes, and the steps are choreographed with one another using events emitted by S3 and HealthOmics via Amazon EventBridge.
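To make the metadata-driven step concrete, here is a hypothetical metadata file and a helper that maps it onto the arguments of the boto3 `start_run` call. The field names in the metadata document, the workflow ID, and the ARNs are illustrative assumptions, not the solution’s actual schema:

```python
import json

# Sketch: parse a (hypothetical) metadata file uploaded to the TDP and turn
# it into a HealthOmics StartRun request. Field names and IDs are
# illustrative; the real solution defines its own metadata schema.

metadata_json = """
{
    "workflowId": "9876543",
    "workflowType": "READY2RUN",
    "parameters": {
        "sample_name": "sample1",
        "fastq_1": "s3://example-staging-bucket/sample1_R1.fastq.gz",
        "fastq_2": "s3://example-staging-bucket/sample1_R2.fastq.gz"
    }
}
"""

def build_start_run(metadata, role_arn, output_uri):
    """Map uploaded metadata to omics.start_run keyword arguments."""
    return {
        "workflowId": metadata["workflowId"],
        "workflowType": metadata["workflowType"],  # Ready2Run workflow
        "roleArn": role_arn,
        "outputUri": output_uri,  # where HealthOmics writes run results
        "name": f"run-{metadata['parameters']['sample_name']}",
        "parameters": metadata["parameters"],
    }

run_request = build_start_run(
    json.loads(metadata_json),
    role_arn="arn:aws:iam::111122223333:role/OmicsWorkflow",  # placeholder
    output_uri="s3://example-results-bucket/runs/",           # placeholder
)
# To submit: boto3.client("omics").start_run(**run_request)
```

In the solution this mapping would happen inside a Lambda task within the first Step Functions state machine, with the resulting run ID recorded (for example, in DynamoDB) so later steps can correlate events back to the originating metadata file.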

Tetra Data Platform loop back

Workflow results, labeled with the TDP file ID of the original metadata file, are sent back to the TDP. This enables automated maintenance of data provenance: results are correlated with their original inputs and metadata.
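One simple way to carry that label is as object metadata on the result files. The sketch below shows this idea; the metadata key, file ID, and bucket name are hypothetical, not the keys the solution actually uses:

```python
# Sketch: attach the originating TDP file ID to a workflow result before
# uploading it, so provenance survives the loopback. The metadata key
# "tdp-source-file-id" and the IDs below are illustrative placeholders.

def label_result(result_key, tdp_file_id):
    """Return S3 put_object keyword arguments carrying provenance metadata."""
    return {
        "Key": result_key,
        "Metadata": {"tdp-source-file-id": tdp_file_id},
    }

upload_args = label_result("results/sample1.vcf.gz", "c0ffee-1234")
# boto3.client("s3").put_object(Bucket="example-results-bucket",
#                               Body=b"...", **upload_args)
```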

Cleaning up resources (Optional)

To help prevent unwanted charges to your AWS account, you can delete the AWS resources that you used for this walkthrough. To do this, you can use the cleanup instructions in the GitHub repository. These instructions walk you through deleting data staged in S3 and HealthOmics, and removing remaining resources via AWS CloudFormation.

Conclusion

Integrating the Tetra Data Platform with AWS HealthOmics enables automated processing of genomics data and storage of analysis results with provenance back to the originating laboratory equipment. This helps researchers adhere to the FAIR Data Principles, increasing operational efficiency and accelerating innovation in R&D. The integration of a FAIR data hub with bioinformatics pipelines can form a central component of a laboratory data mesh, managing metadata and scientific data governance for an organization.

Visit TetraScience on the AWS Marketplace and check out our GitHub repository to use this solution as a template to enable your organization to automate your genomics processing and implement a FAIR data governance strategy.


Modood Alvi

Modood Alvi is a Senior Solutions Architect at Amazon Web Services (AWS). Modood is passionate about digital transformation and is committed to helping large enterprise customers across the globe accelerate their adoption of and migration to the cloud. Modood brings more than a decade of experience in software development, having held a variety of technical roles at companies like SAP and Porsche Digital. Modood earned his Diploma in Computer Science from the University of Stuttgart.

Abril Trejo

In her current role, Abril helps healthcare and life science organizations utilize the flexibility and scalability of the cloud to innovate solutions. With over a decade of experience working with cloud technologies, she has developed deep expertise across regulated industries. Prior to joining Amazon, she held various IT leadership positions at healthcare and pharmaceutical companies. In her spare time, Abril enjoys traveling and watching tennis.

Ahmer Memon

Ahmer Memon is a Principal Solutions Architect at Amazon Web Services, focusing on the Healthcare and Life Sciences sector. With over two decades of experience in the technology industry, he has worn many hats, from new product development and research to technical design, architecture, and leadership roles. Ahmer's career has spanned several industries, including finance, telecommunications, research and development, customer contact, and media and entertainment. This diverse background allows him to bring a wealth of knowledge and perspective to his current position developing solutions for healthcare organizations and life sciences companies. An alumnus of King's College London, where he studied Physics, Ahmer is an avid traveler who enjoys exploring new places and immersing himself in local cultures. When he's not on the road, he can be found reading broadly on topics from history to personal development or spending time with his family.

Amir Kader

Amir Kader has over 20 years of experience assisting organizations with their digital transformation initiatives.

Sam Kool

Sam Kool is a Senior Solutions Architect at Amazon Web Services (AWS), specializing in healthcare and life sciences. In this role, Sam partners with pharmaceutical companies to optimize their drug discovery processes and accelerate time-to-market using AWS cloud technologies. Sam studied Medical Informatics at the University of Amsterdam. Combined with over five years of experience across healthcare, pharma, and cloud architecture, this background enables him to bridge the gap between patient care and technology.

Wajahat Aziz

Wajahat Aziz is a London-based ML/HPC Research Solutions Architect in AWS’s Healthcare and Life Sciences practice. Having worked for a number of life sciences organizations, he leverages his industry expertise to build innovative solutions for customers. His areas of focus are machine learning, analytics, and high-performance computing. Wajahat received a Master's in Software Engineering, with distinction, from the University of Oxford.