Designing an event-driven architecture for Bioinformatics workflows using AWS HealthOmics and Amazon EventBridge

Healthcare and life science organizations have a mission to improve human health and make groundbreaking scientific discoveries. To achieve this goal, they often rely on genomic, transcriptomic, or other omics data, which they process with a collection of bioinformatics tools and methods to derive scientific insights. Bioinformatics workflows orchestrate these tools to produce the data required for reporting or further analysis. These workflows are complex, multi-step pipelines with varied compute resource requirements and software dependencies. Customers need a reliable, secure, and scalable environment in which to run them. They also want to reduce the time spent on undifferentiated tasks such as managing infrastructure and workflow management software. To address this need, AWS introduced AWS HealthOmics, a purpose-built service that helps healthcare and life science organizations and their software partners store, query, and analyze genomic, transcriptomic, and other omics data and then generate insights from that data to improve health. It supports large-scale analysis and collaborative research.

Customers often run more than one bioinformatics workflow to produce report-ready or research-ready datasets. They need these workflows to be triggered by events such as the arrival of new raw data from sequencing instruments or external vendors, the completion of a previous bioinformatics workflow, or events from third-party software such as Laboratory Information Management Systems (LIMS) or Electronic Lab Notebooks (ELN). Automating these processes reduces manual intervention, which in turn lowers the potential for errors and shortens the time needed to get results from all the workflows. In addition, event-based notifications let customers react to events such as failures or completed analyses as they occur, which improves operational efficiency.

In this blog post, we demonstrate how to automate running multiple bioinformatics workflows and receive event-based notifications using an event-driven architecture. Sample code for the solution, with instructions on how to set up and use it, is available on GitHub.

Overview

Customers typically develop bioinformatics workflows in an interactive manner, running them manually to ensure they produce the right outputs with no errors. Once they are ready for production at scale, these workflows need to be automated for maximum efficiency. There are several scenarios where event-driven automation is beneficial based on the result or the status of a workflow, such as:

A user or group of users needs to be notified if a workflow run has failed so that they can inspect, debug, and re-run the workflow if needed.
A system, such as a Laboratory Information Management System (LIMS) or an Electronic Lab Notebook (ELN), needs to be updated with a sample’s workflow run status.
Outputs of a completed workflow run need to be automatically transferred to a customer or collaborator.
Multiple workflows need to run in a certain order, such as demultiplexing, followed by secondary analysis, followed by joint variant calling.

An event-driven architecture enables robust, automated solutions for these use cases. Customers can use events generated by supported AWS services to trigger an action in near real time.

Architectural Overview

In this blog post, you will learn how events from AWS HealthOmics can be captured using Amazon EventBridge to trigger actions as needed. As part of the example solution, we demonstrate the automated launch of a genomics secondary analysis workflow when raw genomic sequencing data lands in Amazon S3. Upon successful completion, we launch the next workflow, which performs tertiary analysis using output data from the secondary analysis workflow. For operational visibility, if either workflow fails, a user or a group of users is notified about the failure via email.

Figure 1: Overall architecture of the solution that illustrates how the event of input data landing in Amazon S3 automatically launches bioinformatics workflows and performs required downstream actions using events.
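
As a rough illustration of how EventBridge rules could separate successful runs from failed ones, the sketch below builds two event patterns and checks them against a sample event locally. It assumes that HealthOmics run events carry the source "aws.omics", the detail-type "Run Status Change", and a "status" field in the event detail; confirm these values against a real event from your account. The helper names and the sample event are hypothetical and are not taken from the sample code on GitHub.

```python
import json

# Assumption: HealthOmics run events use this source and detail-type.
OMICS_SOURCE = "aws.omics"
RUN_STATUS_DETAIL_TYPE = "Run Status Change"


def run_status_pattern(statuses):
    """Build an EventBridge event pattern that matches the given run statuses."""
    return {
        "source": [OMICS_SOURCE],
        "detail-type": [RUN_STATUS_DETAIL_TYPE],
        "detail": {"status": list(statuses)},
    }


def matches(pattern, event):
    """Local matcher for exact-value patterns (a small subset of EventBridge matching)."""
    for key, expected in pattern.items():
        actual = event.get(key)
        if isinstance(expected, dict):
            # Nested pattern: the event must have a nested object that matches.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf pattern: the event value must be one of the listed values.
            return False
    return True


SUCCESS_PATTERN = run_status_pattern(["COMPLETED"])
FAILURE_PATTERN = run_status_pattern(["FAILED"])

# Hypothetical event, trimmed to the fields the patterns inspect.
sample_event = {
    "source": "aws.omics",
    "detail-type": "Run Status Change",
    "detail": {
        "status": "FAILED",
        "arn": "arn:aws:omics:us-east-1:111122223333:run/1234567",
    },
}

print(json.dumps(FAILURE_PATTERN, indent=2))
print("failure rule matches:", matches(FAILURE_PATTERN, sample_event))
print("success rule matches:", matches(SUCCESS_PATTERN, sample_event))
```

Keeping the success and failure patterns in separate rules lets each rule route to its own target, for example one Lambda function that launches the next workflow and another that sends the notification.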

This solution creates specific resources using the following AWS services:

1. Amazon Simple Storage Service (S3) – We use Amazon S3 for data storage due to its industry-leading scalability, data availability, security, and performance. We created two buckets, one to receive input FASTQ files and sample manifest CSVs, and the other to store the bioinformatics workflow outputs. We also use S3 Event Notifications to launch a workflow when new data lands in Amazon S3.

2. AWS HealthOmics – We use AWS HealthOmics to run the bioinformatics workflows. As a purpose-built service for storing, querying, and analyzing omics data, it enables customers to focus on science and not the infrastructure. In this example, we use two workflows, one Private and one Ready2Run:

a. Private workflows enable you to bring your own bioinformatics scripts that are written in any of the three most commonly used workflow languages: Nextflow, WDL, and CWL.
b. Ready2Run workflows are pre-built workflows created by industry-leading third-party software companies such as Sentieon, Inc., NVIDIA, and Element Biosciences, along with common open-source pipelines maintained by AWS, such as the Broad Institute’s GATK Best Practices workflow, nf-core scRNAseq, and AlphaFold for protein structure prediction.

3. Amazon Simple Notification Service (SNS) – We use Amazon SNS, a web service that makes it easy to set up, operate, and send notifications from the cloud. We created an SNS topic to notify an email distribution list when a HealthOmics workflow fails.

4. AWS Lambda – We use AWS Lambda, a serverless, event-driven compute service, to keep the solution modular. We created two Lambda functions, each of which processes events and triggers a downstream action such as launching a HealthOmics workflow or sending a notification through SNS.

5. Amazon EventBridge – We use Amazon EventBridge, a service that provides real-time access to changes in data in AWS services, to build event-driven and loosely coupled applications across AWS. We created an EventBridge bus and rules for each workflow to capture success and failure.

6. Amazon CloudWatch – We use Amazon CloudWatch to observe and monitor resources and applications on AWS. A CloudWatch log group is created to capture HealthOmics workflow logs and present them to users upon workflow failure.

7. Amazon Elastic Container Registry (ECR) – We use Amazon ECR to easily store, share, and deploy Docker images. We created an ECR repository for each Docker image used by the tasks in our HealthOmics private workflow. The Ready2Run workflow ECR images are managed by the service.

This solution leverages the services listed above to ensure that the sequencing data gets processed in an automated manner with notifications built in to be operationally efficient.
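
To make the S3-triggered launch more concrete, here is a minimal sketch of the pure-Python part of a Lambda function that reacts to an S3 Event Notification and prepares HealthOmics run requests from a sample manifest. It assumes a manifest with the columns sample_name, fastq_1, and fastq_2, and the workflow parameter names are hypothetical; the actual manifest layout and parameter names are defined by the sample code and by the workflow you run. The workflow ID, role ARN, and bucket names are placeholders.

```python
import csv
import io
from urllib.parse import unquote_plus


def manifest_locations(s3_event):
    """Return (bucket, key) pairs for the objects named in an S3 event notification."""
    locations = []
    for record in s3_event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 events (spaces become '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        locations.append((bucket, key))
    return locations


def run_requests(manifest_text, workflow_id, role_arn, output_uri):
    """Turn manifest rows into keyword arguments for the HealthOmics StartRun API."""
    requests = []
    for row in csv.DictReader(io.StringIO(manifest_text)):
        requests.append({
            "workflowId": workflow_id,
            "workflowType": "READY2RUN",
            "roleArn": role_arn,
            "name": row["sample_name"],
            "outputUri": output_uri,
            # Hypothetical parameter names; use the ones your workflow defines.
            "parameters": {
                "sample_name": row["sample_name"],
                "fastq_1": row["fastq_1"],
                "fastq_2": row["fastq_2"],
            },
        })
    return requests


# Hypothetical event and manifest, for illustration only.
event = {"Records": [{"s3": {"bucket": {"name": "example-input"},
                             "object": {"key": "manifests/run+1/samples.csv"}}}]}
manifest = (
    "sample_name,fastq_1,fastq_2\n"
    "NA12878,s3://example-input/NA12878_R1.fastq.gz,s3://example-input/NA12878_R2.fastq.gz\n"
)

print(manifest_locations(event))
for request in run_requests(manifest, "1234567",
                            "arn:aws:iam::111122223333:role/ExampleOmicsRole",
                            "s3://example-output/gatk/"):
    print(request["name"], request["parameters"]["fastq_1"])
```

In a deployed function, the manifest text would be read from S3 at each returned location, and each request would be passed to the boto3 HealthOmics client as omics.start_run(**request). Keeping the parsing separate from the AWS calls makes the logic easy to test without credentials.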

Bioinformatics Workflows, Data, and Automation

A common pattern of genomic sequencing data analysis starts with FASTQ files. These files are processed using secondary analysis workflows that generate variant data as a VCF file. VCFs are analyzed further using variant annotation and interpretation tools. In our example implementation, we use raw genomic FASTQ files hosted in a public Amazon S3 bucket and made available by AWS. We demonstrate how to process this data using the GATK Best Practices workflow for secondary analysis. This workflow is available as a HealthOmics Ready2Run workflow and produces a genomic VCF (gVCF) file as output. The output gVCF data is further analyzed using Variant Effect Predictor (VEP), which is implemented as a HealthOmics Private workflow using publicly available code.

To achieve end-to-end automation and real-time notifications, we perform the following steps:

1. Upload a sample manifest CSV to a specific S3 location. This file includes S3 locations of FASTQ files.

2. S3 Event Notifications trigger a Lambda function to launch the HealthOmics Ready2Run workflow – GATK BP Fastq2Vcf. (Note that AWS HealthOmics sequence stores can also be used instead of S3 to store FASTQ files, potentially providing additional cost benefits along with similar integrations to enable automation.)

3. Successful completion or failure of the workflow produces an event which is captured by an Amazon EventBridge rule (example event shown here).

4. A successful workflow run event triggers a Lambda function, which prepares inputs and launches the HealthOmics private workflow – VEP.

5. If either workflow fails, the failed workflow run event triggers an email notification to a group of users through Amazon SNS.
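
Steps 3 through 5 above can be sketched as a single handler that receives a run status event and either chains the next workflow or sends a failure notification. This is a simplified illustration, not the implementation in the sample code, which uses separate Lambda functions. It assumes that the event detail carries the run ARN and a status field, and that run outputs are written under the output URI followed by the run ID. The configuration keys, the downstream parameter name, and the Recorder stand-in for boto3 clients are all hypothetical.

```python
def run_id_from_arn(run_arn):
    """Extract the run ID from an ARN that ends in run/<id>."""
    return run_arn.rsplit("/", 1)[-1]


def handle_run_status(event, omics, sns, config):
    """Chain the next workflow on success, or notify by email on failure."""
    detail = event["detail"]
    run_id = run_id_from_arn(detail["arn"])  # assumes the detail carries the run ARN
    status = detail["status"]

    if status == "FAILED":
        sns.publish(
            TopicArn=config["topic_arn"],
            Subject=f"HealthOmics run {run_id} failed",
            Message=f"Run {run_id} ended with status FAILED. Check its CloudWatch logs.",
        )
        return "notified"

    if status == "COMPLETED":
        run = omics.get_run(id=run_id)
        # Only the first workflow launches a successor; otherwise the second
        # workflow's own completion would start yet another run.
        if run["workflowId"] != config["first_workflow_id"]:
            return "ignored"
        # Assumption: outputs are written under <outputUri>/<run id>.
        upstream = f"{run['outputUri'].rstrip('/')}/{run_id}"
        omics.start_run(
            workflowId=config["next_workflow_id"],
            workflowType="PRIVATE",
            roleArn=config["role_arn"],
            outputUri=config["output_uri"],
            parameters={"upstream_output": upstream},  # hypothetical parameter name
        )
        return "started"

    return "ignored"


class Recorder:
    """Stand-in for a boto3 client that records calls instead of contacting AWS."""

    def __init__(self, responses=None):
        self.calls = []
        self.responses = responses or {}

    def __getattr__(self, name):
        def call(**kwargs):
            self.calls.append((name, kwargs))
            return self.responses.get(name, {})
        return call


config = {
    "topic_arn": "arn:aws:sns:us-east-1:111122223333:example-topic",
    "first_workflow_id": "1111111",
    "next_workflow_id": "2222222",
    "role_arn": "arn:aws:iam::111122223333:role/ExampleOmicsRole",
    "output_uri": "s3://example-output/vep/",
}
omics = Recorder({"get_run": {"workflowId": "1111111",
                              "outputUri": "s3://example-output/gatk/"}})
sns = Recorder()
completed = {"detail": {"status": "COMPLETED",
                        "arn": "arn:aws:omics:us-east-1:111122223333:run/7654321"}}

print(handle_run_status(completed, omics, sns, config))
print(omics.calls[-1][1]["parameters"])
```

In a Lambda function, the Recorder objects would be replaced with boto3.client("omics") and boto3.client("sns"). Passing the clients in as arguments keeps the decision logic testable without AWS access.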

Conclusion

In this blog post, we demonstrated how customers can build an event-driven architecture to:

Automate the launch of bioinformatics workflows.
Chain multiple bioinformatics workflows for end-to-end analysis of sequencing data.
Receive real-time notifications to be operationally efficient.

The sample code is available on GitHub. It includes detailed instructions on how to set up the solution in your AWS account and use it. We encourage users to try the solution and extend it as needed.