Using Cromwell with AWS Batch

Contributed by W. Lee Pang and Emil Lerch, WWPS Professional Services

DNA is often referred to as the “source code of life.” All living cells contain long chains of deoxyribonucleic acid that encode instructions on how they are constructed and behave in their surroundings. Genomics is the study of the structure and function of DNA at the molecular level. It has recently shown immense potential to provide improved detection, diagnosis, and treatment of human diseases.

Continuous improvements in genome sequencing technologies have accelerated genomics research by providing unprecedented speed, accuracy, and quantity of DNA sequence data. In fact, the rate of sequencing efficiency has been shown to outpace Moore’s law. Processing this influx of genomic data is ideally aligned with the power and scalability of cloud computing.

Genomic data processing typically uses a wide assortment of specialized bioinformatics tools, like sequence alignment algorithms, variant callers, and statistical analysis methods. These tools are run in sequence as workflow pipelines that can range from a couple of steps to many long toolchains executing in parallel.

Traditionally, bioinformaticians and genomics scientists relied on Bash, Perl, or Python scripts to orchestrate their pipelines. As pipelines have gotten more complex, and maintainability and reproducibility have become standard requirements in science, the need for specialized orchestration tooling and portable workflow definitions has grown significantly.

What is Cromwell?

The Broad Institute’s Cromwell is purpose-built for this need. It is a workflow execution engine for orchestrating command line and containerized tools. Most importantly, it is the engine that drives the GATK Best Practices genome analysis pipeline.

Workflows for Cromwell are defined using the Workflow Definition Language (WDL – pronounced “widdle”), a flexible meta-scripting language that allows researchers to focus on the pieces of their workflow that matter. That’s the tools for each step and their respective inputs and outputs, and not the plumbing in between.

Genomics data is not small (on the order of TBs-PBs for one experiment!), so processing it usually requires significant computing scale, like HPC clusters and cloud computing. Cromwell has previously enabled this with support for many backends such as Spark, and HPC frameworks like Sun GridEngine and SLURM.

AWS and Cromwell

We are excited to announce that Cromwell now supports AWS! In this post, we go over how to configure Cromwell on AWS and get started running genomics pipelines in the cloud.

In a nutshell, the AWS backend for Cromwell is a layer that communicates with AWS Batch. Why AWS Batch? As stated before, genomics analysis pipelines are composed of many different tools. Each of these tools can have specific computing requirements. Operations like genome alignment can be memory-intensive, whereas joint genotyping may be compute-heavy.

AWS Batch dynamically provisions the optimal quantity and type of compute resources (for example, CPU or memory-optimized instances). Provisioning is based on the volume and specific resource requirements of the batch jobs submitted. This means that each step of a genomics workflow gets the most ideal instance to run on.

The AWS backend translates Cromwell task definitions into batch job definitions and submits them via API calls to a user-specified batch queue. Runtime parameters such as the container image to use, and resources like desired vCPUs and memory are also translated from the WDL task and transmitted to the batch job. A number of environment variables are automatically set on the job to support data localization and de-localization to the job instance. Ultimately, scientists and genomics researchers should be familiar with the backend method to submit jobs to AWS Batch because it uses their existing WDL files and research processes.

Getting started

To get started using Cromwell with AWS, create a custom AMI. This is necessary to ensure that the AMI is private to the account, encrypted, and has tooling specific to genomics workloads and Cromwell.

One feature of this tooling is the automatic creation and attachment of additional Amazon Elastic Block Store (Amazon EBS) capacity as additional data is copied onto the EC2 instance for processing. It also contains an ECS agent that has been customized to the needs of Cromwell, and a Cromwell Docker image responsible for interfacing the Cromwell task with Amazon S3.

After the custom AMI is created, install Cromwell on your workstation or EC2 instance. Configure an S3 bucket to hold Cromwell execution directories. For the purposes of this post, we refer to the bucket as s3-bucket-name. Lastly, go to the AWS Batch console, and create a job queue. Save the ARN of the queue, as this is needed later.

To get up these resources with a single click, this link provides a set of AWS CloudFormation templates that gets all the needed infrastructure running in minutes.

The next step is to configure Cromwell to work with AWS Batch and your newly created S3 bucket. Use the sample hello.wdl and hello.inputs files from the Cromwell AWS backend tutorial. You also need a custom configuration file so that Cromwell can interact with AWS Batch.

The following sample file can be used on an EC2 instance with the appropriate IAM role attached, or on a developer workstation with the AWS CLI configured. Keep in mind that you must replace <s3-bucket-name> in the configuration file with the appropriate bucket name. Also, replace “your ARN here” with the ARN of the job queue that you created earlier.

// aws.conf

include required(classpath("application"))

aws {

    application-name = "cromwell"
    
    auths = [
        {
         name = "default"
         scheme = "default"
        }
    ]
    
    region = "default"
    // uses region from ~/.aws/config set by aws configure command,
    // or us-east-1 by default
}

engine {
     filesystems {
         s3 {
            auth = "default"
         }
    }
}

backend {
     default = "AWSBATCH"
     providers {
         AWSBATCH {
             actor-factory = "cromwell.backend.impl.aws.AwsBatchBackendLifecycleActorFactory"
             config {
                 // Base bucket for workflow executions
                 root = "s3://<s3-bucket-name>/cromwell-execution"
                
                 // A reference to an auth defined in the `aws` stanza at the top. This auth is used to create
                 // Jobs and manipulate auth JSONs.
                 auth = "default"

                 numSubmitAttempts = 3
                 numCreateDefinitionAttempts = 3

                 concurrent-job-limit = 16
                
                 default-runtime-attributes {
                    queueArn: "<your ARN here>"
                 }
                
                 filesystems {
                     s3 {
                         // A reference to a potentially different auth for manipulating files via engine functions.
                         auth = "default"
                     }
                 }
             }
         }
     }
}

Now, you can run your workflow. The following command runs Hello World, and ensures that everything is connected properly:

$ java -Dconfig.file=aws.conf -jar cromwell-34.jar run hello.wdl -i hello.inputs

After the workflow has run, your workflow logs should report the workflow outputs.

[info] SingleWorkflowRunnerActor workflow finished with status 'Succeeded'.
{
 "outputs": {
    "wf_hello.hello.message": "Hello World! Welcome to Cromwell . . . on AWS!"
 },
 "id": "08213b40-bcf5-470d-b8b7-1d1a9dccb10e"
}

You also see your job in the “succeeded” section of the AWS Batch Jobs console.

After the environment is configured properly, other Cromwell WDL files can be used as usual.

Conclusion
With AWS Batch, a customized AMI instance, and Cromwell workflow definitions, AWS provides a simple solution to process genomics data easily. We invite you to incorporate this into your automated pipeline.

AWS Compute Blog

Using Cromwell with AWS Batch

Resources

Follow