Genomics workflows, Part 3: automated workflow manager

Genomics workflows are high-performance computing workloads. Life-science research teams make use of various genomics workflows. With each invocation, they specify custom sets of data and processing steps, and translate them into commands. Furthermore, team members stay to monitor progress and troubleshoot errors, which can be cumbersome, non-differentiated, administrative work.

In Part 3 of this series, we describe the architecture of a workflow manager that simplifies the administration of bioinformatics data pipelines. The workflow manager dynamically generates the launch commands based on user input and keeps track of the workflow status. This workflow manager can be adapted to many scientific workloads—effectively becoming a bring-your-own-workflow-manager for each project.

Use case

In Part 1, we demonstrated how life-science research teams can use Amazon Web Services to remove the heavy lifting of conducting genomic studies, and our design pattern was built on AWS Step Functions with AWS Batch. We mentioned that we’ve worked with life-science research teams to put failed job logs onto Amazon DynamoDB. Some teams prefer to use command-line interface tools, such as the AWS Command Line Interface; other interfaces, such as PyBDA with Apache Spark, or CWL experimental grammar in combination with the Amazon Simple Storage Service (Amazon S3) API, are also used when access to the AWS Management Console is prohibited. In our use case, scientists used the console to easily update table items, plus initiate retry via DynamoDB streams.

In this blog post, we extend this idea to a new frontend layer in our design pattern. This layer automates command generation and monitors the invocations of a variety of workflows—becoming a workflow manager. Life-science research teams use multiple workflows for different datasets and use cases, each with different syntax and commands. The workflow manager we create removes the administrative burden of formulating workflow-specific commands and tracking their launches.

Solution overview

We allow scientists to upload their requested workflow configuration as objects in Amazon S3. We use S3 Event Notifications on PUT requests to invoke an AWS Lambda function. The function parses the uploaded S3 object and registers the new launch request as a DynamoDB item using the PutItem operation. Each item corresponds with a distinct launch request, stored as key-value pair. Item values store the:

S3 data path containing genomic datasets
Workflow endpoint
Preferred compute service (optional)

Another Lambda function monitors for change data captures in the DynamoDB Stream (Figure 1). With each PutItem operation, the Lambda function prepares a workflow invocation, which includes translating the user input into the syntax and launch commands of the respective workflow.

In the case of Snakemake (discussed in Part 2), the function creates a Snakefile that declares processing steps and commands. The function spins up an AWS Fargate task that builds the computational tasks, distributes them with AWS Batch, and monitors for completion. An AWS Step Functions state machine orchestrates job processing, for example, initiated by Tibanna.

Amazon CloudWatch provides a consolidated overview of performance metrics, like time elapsed, failed jobs, and error types. We store log data, including status updates and errors, in Amazon CloudWatch Logs. A third Lambda function parses those logs and updates the status of each workflow launch request in the corresponding DynamoDB item (Figure 1).

Figure 1. Workflow manager for genomics workflows

Implementation considerations

In this section, we describe some of our past implementation considerations.

Register new workflow requests

DynamoDB items are key-value pairs. We use launch IDs as key, and the value includes the workflow type, compute engine, S3 data path, the S3 object path to the user-defined configuration file and workflow status. Our Lambda function parses the configuration file and generates all commands plus ancillary artifacts, such as Snakefiles.

Launch workflows

Launch requests are picked by a Lambda function from the DynamoDB stream. The function has the following required parameters:

Launch ID: unique identifier of each workflow launch request
Configuration file: the Amazon S3 path to the configuration sheet with launch details (in s3://bucket/object format)
Compute service (optional): our workflow manager allows to select a particular service on which to run computational tasks, such as Amazon Elastic Compute Cloud (Amazon EC2) or AWS ParallelCluster with Slurm Workload Manager. The default is the pre-defined compute engine.

These points assume that the configuration sheet is already uploaded into an accessible location in an S3 bucket. This will issue a new Snakemake Fargate launch task. If either of the parameters is not provided or access fails, the workflow manager returns MissingRequiredParametersError.

Log workflow launches

Logs are written to CloudWatch Logs automatically. We write the location of the CloudWatch log group and log stream into the DynamoDB table. To send logs to Amazon CloudWatch, specify the awslogs driver in the Fargate task definition settings in your provisioning template.

Our Lambda function writes Fargate task launch logs from CloudWatch Logs to our DynamoDB table. For example, OutOfMemoryError can occur if the process utilizes more memory than the container is allocated.

AWS Batch job state logs are written to the following log group in CloudWatch Logs: /aws/batch/job. Our Lambda function writes status updates to the DynamoDB table. AWS Batch jobs may encounter errors, such as being stuck in RUNNABLE state.

Manage state transitions

We manage the status of each job in DynamoDB. Whenever a Fargate task changes state, it is picked up by a CloudWatch rule that references the Fargate compute cluster. This CloudWatch rule invokes a notifier Lambda function that updates the workflow status in DynamoDB.

Conclusion

In this blog post, we demonstrated how life-science research teams can simplify genomic analysis across an array of workflows. These workflows usually have their own command syntax and workflow management system, such as Snakemake. The presented workflow manager removes the administrative burden of preparing and formulating workflow launches, increasing reliability.

The pattern is broadly reusable with any scientific workflow and related high-performance computing systems. The workflow manager provides persistence to enable historical analysis and comparison, which enables us to automatically benchmark workflow launches for cost and performance.

Stay tuned for Part 4 of this series, in which we explore how to enable our workflows to process archival data stored in Amazon Simple Storage Service Glacier storage classes.

AWS Architecture Blog