AWS for Industries

New Tools to Accelerate Workflow Migrations to AWS HealthOmics

Although genomics workflow languages are designed to improve the portability and reproducibility of analyses, the migration of workflows from one runtime environment to another can be challenging if the workflow makes strong assumptions about that environment. Here we present two customer-tested tools that help detect workflow issues and migrate required resources, smoothing the move of a workflow to AWS HealthOmics.

AWS HealthOmics helps healthcare and life science organizations store, query, and analyze genomic, transcriptomic, and other omics data at scale. These capabilities make HealthOmics an attractive option for running workflows at scale. Often these workflows were developed in small test environments, obtained from open source code repositories, or are currently run in an on-premises environment.

In all cases, the workflow may have properties that are specific to the infrastructure environment where it was developed or is currently run. For example, workflows developed for the Cromwell WDL engine often use automatic type coercions that are not in the WDL language specification and are not supported by other engines, including the HealthOmics workflow engine, because they are considered potentially harmful. Another common example is workflows that directly modify task inputs. While the language specification doesn’t forbid this, in a distributed, concurrent environment it can cause hard-to-diagnose race conditions, so the HealthOmics workflow engine prevents it. Because genomics workflow languages are interpreted at runtime rather than compiled, incompatible assumptions are often not discovered until the workflow is running. Many genomics workflows also run for several hours, so it is better to detect and correct potential issues before a workflow is even run.

To assist with porting a workflow to run on AWS HealthOmics, we have built two tools. The first is a linter for WDL (Workflow Description Language) scripts. The second automates identifying the container images required by a workflow and replicates those images in private Amazon Elastic Container Registry (Amazon ECR) repositories secured for access by HealthOmics workflow runs.

HealthOmics Linter

In computer science, a linter is a tool that statically analyzes code to spot suspicious constructs and potential bugs in code that is otherwise syntactically valid. Static analysis, as opposed to dynamic analysis, does not require the code to be run. This is highly desirable for genomics workflows, where run times can be many hours. By using a static linter you can spot and correct many potential problems before a run is even started.

The open-source miniwdl package includes an excellent WDL linter that can detect many potential issues in a WDL script. We extended this package with a number of checks and recommendations for WDL scripts that will be run in AWS HealthOmics. In addition to miniwdl's standard WDL and Bash checks, the linter checks for:

  • File resources that are not referenced by s3:// or omics:// URI paths
  • Task runtime keys that are not required or supported by HealthOmics
  • Incorrectly configured GPU resource requests
  • Incorrectly specified memory requests
  • Dependencies on container images that are not in Amazon ECR
  • Missing recommended Bash directives such as set -e

The HealthOmics linter and all its required dependencies are distributed as a publicly available container image in the Amazon ECR Public Gallery. To lint a WDL file called main.wdl, you would run:

docker run -v $PWD:/scripts/ public.ecr.aws/aws-genomics/healthomics-linter main.wdl

The linter will check ./main.wdl and will also attempt to check any imports that can be located relative to the named file, assuming these are also mounted via the docker run -v option. The linter does not require any AWS credentials because all checking is performed locally.
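
For example, if your workflow imports tasks from a subdirectory, mount the whole project directory so the linter can resolve the relative import paths. The workflows/ layout below is hypothetical:

# Hypothetical layout: workflows/main.wdl imports workflows/tasks/align.wdl.
# Mounting the project directory lets the linter follow the relative import.
docker run -v $PWD/workflows:/scripts/ public.ecr.aws/aws-genomics/healthomics-linter main.wdl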

Example

Given the simple WDL file main.wdl:

version 1.1

task hello {
  input {
    File infile
    String pattern
  }

  command <<<
    egrep '~{pattern}' '~{infile}'
  >>>

  runtime {
    container: "my_image:latest"
  }

  output {
    Array[String] matches = read_lines(stdout())
  }
}

workflow wf {
  input {
    File infile
    String pattern
  }

  call hello {
    input: infile, pattern
  }

  output {
    Array[String] matches = hello.matches
  }
}

The linter will produce the following output to STDOUT:

main.wdl
    workflow wf
        call hello
    task hello
        (Ln 9, Col 11) OmicsNoImmediateExitOnError, Command does not contain a 'set -e' directive. Without this only the return code of the last command in the script will determine task success. Consider adding 'set -e' so that any command failure immediately stops the task with a 'FAILED' state.
        (Ln 10, Col 4) CommandShellCheck, SC2196 egrep is non-standard and deprecated. Use grep -E instead.
        (Ln 14, Col 16) OmicsNonEcrContainerImage, A declaration or reference is made to a docker/ container image 'my_image:latest' that is not an ECR image. AWS HealthOmics task containers must be in ECR and in the same account and region as the workflow run.
        (Ln 14, Col 16) OmicsRecommendedRuntimeKeyMissing, Tasks should declare a 'cpu' value in the runtime section. Failure to do so will result in the task container being implicitly restricted to 1 vCPU. A minimum int value of 2 is recommended.
        (Ln 14, Col 16) OmicsRecommendedRuntimeKeyMissing, Tasks should declare a 'memory' value in the runtime section. If no declaration is found the task container will be severely restricted to a minimal hard memory limit.

The linter begins by listing the workflow files and components being traversed. Indentation shows the level of nesting in the workflow syntax tree. In the hello task of the workflow the linter has identified five issues, each reported on a separate line of the output. Each output line starts with the line number and column number in main.wdl where the issue originates. The location is followed by the name of the check, for example OmicsNoImmediateExitOnError, and a descriptive sentence explaining why this might be a problem and what might be done to correct it. Check types that begin with Omics are specific to the HealthOmics workflow service and indicate issues that should be inspected before attempting to run the workflow in this service.

Based on this feedback from the linter we can improve the workflow definition to:

version 1.1

task hello {
  input {
    File infile
    String pattern
  }

  command <<<
    set -e
    grep -E '~{pattern}' '~{infile}'
  >>>

  runtime {
    container: "123456789012.dkr.ecr.us-east-1.amazonaws.com/my_image:latest"
    cpu: 2
    memory: "4 GiB"
  }

  output {
    Array[String] matches = read_lines(stdout())
  }
}

workflow wf {
  input {
    File infile
    String pattern
  }

  call hello {
    input: infile, pattern
  }

  output {
    Array[String] matches = hello.matches
  }
}

HealthOmics Amazon ECR Helper

The HealthOmics Amazon ECR Helper is a simple serverless application that helps automate preparing container images for use with HealthOmics workflows. The helper performs two key functions:

  1. container-puller: Retrieves container images from public registries (such as the Amazon ECR Public Gallery, Quay.io, and Docker Hub) and stages them in private Amazon ECR repositories in your AWS account
  2. container-builder: Builds container images from source bundles staged in Amazon S3 and pushes them to private Amazon ECR repositories. This is most helpful when you are drafting a new workflow, or when you need to create container images from scratch because the original containers are not publicly available.

The tool relies on AWS Step Functions, AWS CodeBuild, and Amazon ECR to automate the process, and the required components can be deployed into your account using an AWS Cloud Development Kit (AWS CDK) application.
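
Deployment follows the usual AWS CDK workflow. The exact commands, directory layout, and stack names are defined by the project’s README, so treat the following as an illustrative sketch rather than the definitive steps:

# Illustrative deployment steps; clone the omx-ecr-helper repository
# (linked at the end of this post), then deploy its CDK app.
cd omx-ecr-helper        # assumed project directory name
npm install              # install the CDK app's dependencies
npx cdk deploy --all     # deploy the container-puller and container-builder resources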

Retrieving container images

The container-puller requires a JSON file that contains a manifest of images. For example:

{
    "manifest": [
        "ubuntu:20.04",
        "us.gcr.io/broad-gatk/gatk:4.4.0.0",
        "quay.io/biocontainers/bcftools:1.16--hfe4b78e_1",
        "public.ecr.aws/docker/library/python:3.9.16-bullseye",
        "quay/biocontainers/bwa-mem2:2.2.1--he513fc3_0",
        "ecr-public/aws-genomics/google/deepvariant:1.4.0"
    ]
}

With the AWS CDK stack deployed in your account and the manifest created, you can run the following AWS Step Functions command using the AWS Command Line Interface (AWS CLI):

aws stepfunctions start-execution \
    --state-machine-arn arn:aws:states:<aws-region>:<aws-account-id>:stateMachine:omx-container-puller \
    --input file://container_image_manifest.json
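
The start-execution call returns an executionArn that you can poll to find out when the images have been staged. This is standard Step Functions usage rather than anything specific to the helper:

# Check the state machine execution; the ARN comes from the
# start-execution response above.
aws stepfunctions describe-execution \
    --execution-arn <execution-arn-from-start-execution> \
    --query 'status'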

The AWS Step Functions state machine automates the migration of container images by re-staging them from public registries into a private Amazon ECR registry. To do this, it first relies on ECR pull-through caching (an example cache rule is shown after the following list), which:

  • allows Docker clients to pull an image URI that looks like it comes from your private ECR registry
  • creates a private image repository based on the public image being pulled
  • pulls the public image and caches it in the private repository
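
For illustration, a pull-through cache rule for Quay.io could be created manually with the AWS CLI as shown below. The helper’s CDK stack is expected to configure the rules it needs, so treat this as a sketch rather than a required step:

# Illustrative only: create a pull-through cache rule so that image pulls
# under the "quay" repository prefix are fetched from quay.io on demand.
aws ecr create-pull-through-cache-rule \
    --ecr-repository-prefix quay \
    --upstream-registry-url quay.io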

Second, an Amazon EventBridge rule is used to detect when an image repository is created. Third, the EventBridge rule triggers an AWS Lambda function that applies the required access policy to the newly created ECR repository.
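
The access policy grants the HealthOmics service principal permission to pull images. The Lambda function applies it for you, but for reference a manually applied policy might look like the sketch below; the repository name my_image is hypothetical, and you should confirm the exact statement against the HealthOmics documentation:

# Sketch only: grant the HealthOmics service principal pull access to a
# private repository. The repository name my_image is a placeholder.
aws ecr set-repository-policy \
    --repository-name my_image \
    --policy-text '{
      "Version": "2012-10-17",
      "Statement": [{
        "Sid": "AllowHealthOmicsPull",
        "Effect": "Allow",
        "Principal": {"Service": "omics.amazonaws.com"},
        "Action": [
          "ecr:GetDownloadUrlForLayer",
          "ecr:BatchGetImage",
          "ecr:BatchCheckLayerAvailability"
        ]
      }]
    }'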

Currently, AWS HealthOmics checks for the existence of ECR repositories and specific image URIs before launching ECS tasks. This pre-check means you need to “prime” container images into your private ECR registry before running a workflow that depends on them for the first time, even if pull-through caching is enabled.

The priming process is automated by submitting a container image manifest to an AWS Step Functions state machine that calls an AWS CodeBuild project to pull the container images listed in the manifest.

The mechanisms above are also generalized to support other public container registries in the following ways:

  • The Amazon ECR CreateRepository API is called when a corresponding repository does not exist. This is only used when retrieving images from public registries that do not support pull-through caching.
  • AwsApiCall events that create Amazon ECR repositories with the tag Key=createdBy,Value=omx-ecr-helper will also trigger the Lambda Function.
  • The CodeBuild project is parameterized to do either pull-through only or pull and push actions.

To save costs, the state machine only runs the CodeBuild project if a requested image URI does not already have a corresponding private Amazon ECR image.
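
You can perform a similar check manually. For example, to see whether the ubuntu:20.04 image from the manifest has already been staged (the repository name assumes the image is re-staged under the same name, so adjust it to match your registry):

# Returns image details if the tag already exists in the private repository,
# or an error if the image still needs to be staged.
aws ecr describe-images \
    --repository-name ubuntu \
    --image-ids imageTag=20.04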

Full details for this tool, as well as the open-source code, are available on GitHub.

Customer Use Case

Recently, a researcher at Biogen Inc. wanted to migrate some of the publicly available WDL Analysis Research Pipelines (WARP) to perform proof of concept tests on AWS HealthOmics. Assisted by these two tools, they were able to successfully complete the migration of these complex workflows.

“The very first one took around 16 hours, I think next pipeline will take no more than an hour or so for me to migrate, since it is very straightforward now”. – Sergey Bakhteiarov, Lead Data Engineer

Conclusion

Developer productivity can be defined by how rapidly a developer proceeds through the build → test → run cycle. Here we have introduced two tools that can improve productivity by automating some of the work of workflow migration and by short-circuiting that cycle, quickly identifying likely errors before a workflow is run. Why not try out both of these tools on your next genomics workflow project to see how they can make you more productive and accelerate time-to-insights?

As with all of our tools we welcome customer feedback on any improvements you would like to see.

Mark Schreiber

Mark is a Senior Genomics Consultant working in the AWS Health artificial intelligence (AI) team. Mark specializes in genomics and life sciences applications and data. He holds a PhD from the University of Otago in New Zealand. Prior to joining AWS, he worked for several years with pharmaceutical and biotech companies. Mark is also a frequent contributor to open-source projects.

Lee Pang

Lee is a Principal Bioinformatics Architect with the Health AI services team at AWS. He has a PhD in Bioengineering and over a decade of hands-on experience as a practicing research scientist and software engineer in bioinformatics, computational systems biology, and data science, developing tools ranging from high-throughput pipelines for *omics data processing to compliant software for clinical data capture and analysis.