AWS for Industries

Automated genomic data processing with AWS HealthOmics & TetraScience

Data governance strategies based on the FAIR Data Principles can increase operational efficiency and accelerate innovation in R&D. Amazon Web Services (AWS), along with partners such as TetraScience, enables pharmaceutical and biotech customers to adopt FAIR data governance practices with solutions that help process and store large-scale data for research and diagnostic purposes.

This blog shows how AWS HealthOmics and TetraScience can make lab instrument and file datasets available for downstream analysis using FAIR principles. HealthOmics is a managed service for storing, querying, and analyzing genomics, transcriptomics, and other omics data.

TetraScience provides solutions to capture and enable centralized, standardized access to scientific data via the Tetra Scientific Data Cloud, a purpose-built solution consisting of three pillars: the Tetra Data Platform (TDP), an AWS-native cloud data solution; the Tetra Partner Network; and Tetra Catalysts, embedded analysts with deep domain expertise in both science and technology. Together, these facilitate secure and reliable collection of data from the diverse range of laboratory instruments used in pharmaceutical and biotech organizations.

The FAIR Data Principles for scientific data management and stewardship were published in 2016. They were intended to break down data silos by providing guidelines for making data Findable, Accessible, Interoperable, and Reusable. Following them enables life science organizations to improve data sharing, educate data consumers about the datasets available to them, and improve transparency and provenance. Together, these principles improve overall data governance.

Diversity and variation in data sources, storage, and coding systems make acquiring, managing, and harmonizing life science datasets complex. The digital immaturity of laboratory instrumentation, and of the data files it produces, creates challenges for implementing the FAIR Principles and increases the risk of data becoming siloed. On-premises workloads that implement FAIR principles are expensive to run and operate due to throughput bottlenecks and manual-input requirements. Gaps in effective data governance and management (accessibility, accuracy, and consistency) can result in a long time-to-market and billions of dollars in cost.

Migrating digital lab workloads to the cloud provides the scalability, flexibility, and automation needed to deliver insights from large and diverse datasets. Pairing AWS HealthOmics with TetraScience allows bioinformatics teams to overcome existing challenges, implement a FAIR data governance strategy, and automate data processing, which can reduce time-to-market and overall costs. TetraScience can be a central component of a lab data mesh architecture and accelerate an organization’s Digital Labs Strategy.

Here we walk through a solution that ingests raw genomic sequencing data into the TDP, processes it with HealthOmics, and returns the results to the TDP for further analysis. For data processing, it uses HealthOmics Ready2Run workflows implementing the Broad Institute’s Genome Analysis Toolkit (GATK) Best Practices for data pre-processing and germline variant discovery. Orchestration via AWS Lambda, AWS Step Functions, Amazon DynamoDB, and Amazon Simple Notification Service (Amazon SNS) integrates HealthOmics with the TDP. Code to deploy this solution into your account is available on GitHub.

Solution Overview

The solution uses the TDP as a data lake for governed access (Accessible) to both raw and processed data, automatically records data provenance (Findable), and enables data re-use with other projects (Reusable/Interoperable).

The provided solution consists of four parts:

  1. Genomics files from the output directories of sequencers are collected and uploaded to the TDP.
  2. Tetra Data Pipelines stage data from the TDP to Amazon Simple Storage Service (Amazon S3).
  3. Orchestration with SNS and Step Functions triggers HealthOmics workflows to process the staged data in S3. Results are written back to S3 or HealthOmics Analytics stores.
  4. Processed results are published back to the TDP (“loopback”).
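The choreography between steps 3 and 4 can be pictured with a minimal Amazon EventBridge event pattern that matches HealthOmics run status-change events. This is an illustrative sketch only; the exact `detail` fields are an assumption and should be checked against the event shape emitted in your account:

```python
# Sketch: an EventBridge event pattern that could trigger the TDP loopback
# (step 4) once a HealthOmics workflow run (step 3) finishes.
# The "detail" contents below are illustrative, not taken from the solution.

event_pattern = {
    "source": ["aws.omics"],              # events emitted by HealthOmics
    "detail-type": ["Run Status Change"],
    "detail": {"status": ["COMPLETED"]},  # only react to successful runs
}

# This pattern would be attached to an EventBridge rule whose target is the
# Step Functions state machine that publishes results back to the TDP.
```

A failed-run branch could be added by including `"FAILED"` in the status list and routing it to an SNS topic for notification.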

Figure 1: Solution Architecture integrating the Tetra Data Platform with AWS HealthOmics.

Prerequisites

The solution has the following prerequisites:

  • An AWS account with permissions to create the AWS resources used in this blog (at a minimum, you need permission to create IAM roles, granted through the CAPABILITY_NAMED_IAM capability during the AWS CloudFormation deployment)
  • A Tetra Data Platform tenant (requires a subscription)

Deployment

Detailed deployment steps can be found in the README file in the GitHub repository associated with this solution. It walks you through deploying the solution with an AWS SAM template, which initiates an AWS CloudFormation stack (you choose the stack name); you can reference that name later in this document when cleaning up resources.

Tetra Data Pipelines

Tetra Data Pipelines automate data operations and transformations by taking actions on files as they are loaded into the Tetra Data Platform. TetraScience provides a pipeline artifact for pushing data to S3. TetraScience also provides a library of pipeline artifacts within the TDP instance (navigate to Artifacts, Protocols) that includes:

  • Parsers and normalized data models for hundreds of scientific instruments and proprietary data formats, enabling interoperability for integration, analytics, and data science use cases
  • Common integration endpoints, such as electronic lab notebooks (ELNs), laboratory information management systems (LIMS), and cloud services like Amazon S3

Customers can also create their own artifacts using the TetraScience Python SDK.

Data staging in Amazon S3

Raw genomic sequencing data received by TDP is staged in S3 buckets before downstream processing with HealthOmics. Raw and processed data in S3 can also be imported into purpose-built HealthOmics Sequence and Analytics stores. HealthOmics Workflows can process data from S3 or HealthOmics Sequence stores. Workflow results are written back to S3 and the TDP.
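As an illustration of the Sequence store import, the sketch below builds a request payload for the boto3 `start_read_set_import_job` call. It is a hedged example, not the solution’s code: the store ID, role ARN, S3 URI, and sample identifiers are placeholders.

```python
# Sketch: build a HealthOmics read-set import request for a FASTQ file
# staged in S3. All IDs, ARNs, and URIs below are placeholders.

def build_read_set_import(sequence_store_id, role_arn, s3_uri,
                          sample_id, subject_id):
    """Return keyword arguments for omics.start_read_set_import_job."""
    return {
        "sequenceStoreId": sequence_store_id,
        "roleArn": role_arn,  # role HealthOmics assumes to read the bucket
        "sources": [
            {
                "sourceFiles": {"source1": s3_uri},
                "sourceFileType": "FASTQ",
                "sampleId": sample_id,
                "subjectId": subject_id,
                "name": f"{sample_id}-reads",
            }
        ],
    }

request = build_read_set_import(
    "1234567890",                                    # placeholder store ID
    "arn:aws:iam::111122223333:role/OmicsImport",    # placeholder role ARN
    "s3://example-staging-bucket/sample1.fastq.gz",  # staged raw data
    sample_id="sample1",
    subject_id="subject1",
)
# To submit: boto3.client("omics").start_read_set_import_job(**request)
```

Paired-end reads would add a `source2` entry next to `source1` in `sourceFiles`.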

Data processing with AWS service orchestration

A user creates a metadata file that specifies which HealthOmics workflow to run and which workflow parameters to use, such as input files. Uploading this file to TDP triggers processing in AWS. First, raw genomic sequencing files are imported into a HealthOmics Sequence store. Second, the specified workflow is run using data from S3 and the Sequence store. Third, the results of the workflow are sent back to TDP. These steps are coordinated using a hybrid of the orchestration and choreography architectural patterns: each step uses a separate Step Functions state machine to orchestrate the services it invokes, and the steps are choreographed with one another using events emitted by S3 and HealthOmics via Amazon EventBridge.
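To make the metadata-driven step concrete, here is a hypothetical metadata file and a helper that maps it onto the arguments of the boto3 `start_run` call. The field names in the metadata document, the workflow ID, and the ARNs are illustrative assumptions, not the solution’s actual schema:

```python
import json

# Sketch: parse a (hypothetical) metadata file uploaded to the TDP and turn
# it into a HealthOmics StartRun request. Field names and IDs are
# illustrative; the real solution defines its own metadata schema.

metadata_json = """
{
    "workflowId": "9876543",
    "workflowType": "READY2RUN",
    "parameters": {
        "sample_name": "sample1",
        "fastq_1": "s3://example-staging-bucket/sample1_R1.fastq.gz",
        "fastq_2": "s3://example-staging-bucket/sample1_R2.fastq.gz"
    }
}
"""

def build_start_run(metadata, role_arn, output_uri):
    """Map uploaded metadata to omics.start_run keyword arguments."""
    return {
        "workflowId": metadata["workflowId"],
        "workflowType": metadata["workflowType"],  # Ready2Run workflow
        "roleArn": role_arn,
        "outputUri": output_uri,  # where HealthOmics writes run results
        "name": f"run-{metadata['parameters']['sample_name']}",
        "parameters": metadata["parameters"],
    }

run_request = build_start_run(
    json.loads(metadata_json),
    role_arn="arn:aws:iam::111122223333:role/OmicsWorkflow",  # placeholder
    output_uri="s3://example-results-bucket/runs/",           # placeholder
)
# To submit: boto3.client("omics").start_run(**run_request)
```

In the solution this mapping would happen inside a Lambda task within the first Step Functions state machine, with the resulting run ID recorded (for example, in DynamoDB) so later steps can correlate events back to the originating metadata file.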

Tetra Data Platform loop back

Workflow results, labeled with the TDP file ID of the original metadata file, are sent back to the TDP. This enables automated maintenance of data provenance: results are correlated with their original inputs and metadata.
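One simple way to carry that label is as object metadata on the result files. The sketch below shows this idea; the metadata key, file ID, and bucket name are hypothetical, not the keys the solution actually uses:

```python
# Sketch: attach the originating TDP file ID to a workflow result before
# uploading it, so provenance survives the loopback. The metadata key
# "tdp-source-file-id" and the IDs below are illustrative placeholders.

def label_result(result_key, tdp_file_id):
    """Return S3 put_object keyword arguments carrying provenance metadata."""
    return {
        "Key": result_key,
        "Metadata": {"tdp-source-file-id": tdp_file_id},
    }

upload_args = label_result("results/sample1.vcf.gz", "c0ffee-1234")
# boto3.client("s3").put_object(Bucket="example-results-bucket",
#                               Body=b"...", **upload_args)
```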

Cleaning up resources (Optional)

To help prevent unwanted charges to your AWS account, you can delete the AWS resources that you used for this walkthrough. To do this, you can use the cleanup instructions in the GitHub repository. These instructions walk you through deleting data staged in S3 and HealthOmics, and removing remaining resources via AWS CloudFormation.

Conclusion

Integrating the Tetra Data Platform with AWS HealthOmics enables automated processing of genomics data and storage of analysis results with provenance back to the originating laboratory equipment. This helps researchers adhere to the FAIR Data Principles, increasing operational efficiency and accelerating innovation in R&D. The integration of a FAIR data hub with bioinformatics pipelines can form a central component of a laboratory data mesh, managing metadata and scientific data governance for an organization.

Visit TetraScience on the AWS Marketplace and check out our GitHub repository to use this solution as a template to enable your organization to automate your genomics processing and implement a FAIR data governance strategy.


Modood Alvi

Modood Alvi is a Senior Solutions Architect at Amazon Web Services (AWS). Modood is passionate about digital transformation and is committed to helping large enterprise customers across the globe accelerate their adoption of and migration to the cloud. Modood brings more than a decade of experience in software development, having held a variety of technical roles at companies like SAP and Porsche Digital. Modood earned his Diploma in Computer Science from the University of Stuttgart.

Abril Trejo

In her current role, Abril helps healthcare and life science organizations utilize the flexibility and scalability of the cloud to innovate solutions. With over a decade of experience working with cloud technologies, she has developed deep expertise across regulated industries. Prior to joining Amazon, she held various IT leadership positions at healthcare and pharmaceutical companies. In her spare time, Abril enjoys traveling and watching tennis.

Ahmer Memon

Ahmer Memon is a Principal Solutions Architect at Amazon Web Services, focusing on the Healthcare and Life Sciences sector. With over two decades of experience in the technology industry, he has worn many hats, from new product development and research to technical design, architecture, and leadership roles. Ahmer's career has spanned several industries, including finance, telecommunications, research and development, customer contact, and media and entertainment. This diverse background allows him to bring a wealth of knowledge and perspective to his current position developing solutions for healthcare organizations and life sciences companies. An alumnus of King's College London, where he studied Physics, Ahmer is an avid traveler who enjoys exploring new places and immersing himself in local cultures. When he's not on the road, he can be found reading broadly on topics from history to personal development or spending time with his family.

Amir Kader

Amir Kader has over 20 years of experience assisting organizations with their digital transformation initiatives.

Sam Kool

Sam Kool is a Senior Solutions Architect at Amazon Web Services (AWS), specializing in healthcare and life sciences. In this role, Sam partners with pharmaceutical companies to optimize their drug discovery processes and accelerate time-to-market using AWS cloud technologies. Sam studied Medical Informatics at the University of Amsterdam. Combined with over five years of experience across healthcare, pharma, and cloud architecture, this background enables him to bridge the gap between patient care and technology.

Wajahat Aziz

Wajahat Aziz is a London-based ML/HPC Research Solutions Architect in AWS’s Healthcare and Life Sciences practice. Having worked for a number of life sciences organizations, he leverages his industry expertise to build innovative solutions for customers. His areas of focus are machine learning, analytics, and high-performance computing. Wajahat received a Master's in Software Engineering, with distinction, from the University of Oxford.