AWS HPC Blog

Getting Started with NVIDIA Clara Parabricks on AWS Batch using AWS CloudFormation

This post was contributed by Gary Burnett, Technical Marketing Engineer at NVIDIA, and Olivia Choudhury, PhD, Senior Partner Solutions Architect at AWS.

Genomic sequencing is faster and cheaper than ever. The resulting volume of data has to be processed, and it can quickly exhaust the resources available to traditional CPU-based tools, especially in an on-premises data center.

This is where NVIDIA Parabricks and AWS Batch can help. Parabricks is a GPU-accelerated tool for secondary genomic analysis. It reduces the runtime of variant calling on a 30x human genome from 30 hours to just 30 minutes. AWS Batch creates an interface to easily scale up compute jobs across multiple nodes.

In this blog post, we’ll show how you can run NVIDIA Parabricks on AWS Batch leveraging AWS CloudFormation templates.

Our environment

This guide shows you how to spin up an autoscaling Amazon ECS cluster that you can monitor through the AWS Batch console. For demo purposes, the cluster has two nodes, and you can send one job to each node.

Prerequisites

Before you get started on AWS Batch, you need to modify the Parabricks container to make it compatible with the workflow on AWS. In a traditional installation, the Parabricks Docker container is spun up in the background by the pbrun Linux command. For Batch, we need to expose the full container. To do that, you remove the entry point and untar some files inside the container.

You can use Amazon Elastic Container Registry (ECR) to host the container and make it accessible to AWS Batch. This is a short process, involving just four steps:

  1. Create a container repository on Amazon ECR
  2. Install Parabricks via Docker on a single node
  3. Make modifications to the Parabricks Docker image
  4. Upload the modified Docker image to ECR

Create a container repository on Amazon ECR

Figure 1: Create a private container repository on Amazon ECR.

Navigate to the Amazon ECR Console and click “Create repository” in the upper right corner. Choose the repository to be private and give the repository a name. Then click “Create repository”.
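
If you prefer the command line, the same repository can be created with the AWS CLI. The following is a minimal sketch rather than part of the Quick Start: the function names are made up for illustration, and the region and repository name you pass in are your own choices. It assumes the AWS CLI is installed and configured with your credentials.

```shell
#!/bin/sh
# Sketch only: function names are illustrative, not part of any AWS tooling.

# Build the URI that ECR assigns to a private repository:
#   <account id>.dkr.ecr.<region>.amazonaws.com/<repository name>
ecr_repo_uri() {
  printf '%s.dkr.ecr.%s.amazonaws.com/%s\n' "$1" "$2" "$3"
}

# Create a private repository and print the URI reported by ECR.
# Usage: create_parabricks_repo <region> <repository name>
create_parabricks_repo() {
  aws ecr create-repository \
    --repository-name "$2" \
    --region "$1" \
    --query 'repository.repositoryUri' \
    --output text
}
```

For example, `create_parabricks_repo us-west-2 parabricks` would create a repository named parabricks in us-west-2 and print its URI, which you need in the later steps.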

Install Parabricks via Docker on a single node

This can be done on a local machine or on an Amazon EC2 instance. If you choose to install locally, make sure the machine has the AWS CLI and is configured with your AWS credentials.

If you already have a working Parabricks Docker installation, you can skip this step. Otherwise, fill out the Parabricks Trial License Request form on the NVIDIA website and follow the instructions in your email to perform a Docker installation (this is the default installation). You can verify your Parabricks installation by running the following command:

$ pbrun --version
Please visit https://docs.nvidia.com/clara/#parabricks for detailed documentation
pbrun: 3.7.0-1
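
The version string is needed again when you edit the Dockerfile in the next step. As a small sketch, assuming the output keeps the `pbrun: <version>` line format shown above, you could capture it like this (the function name is hypothetical):

```shell
#!/bin/sh
# Sketch only: assumes `pbrun --version` prints a line of the form "pbrun: <version>".
parabricks_version() {
  # Keep only the text after "pbrun: " on the matching line.
  pbrun --version | awk -F': ' '/^pbrun:/ { print $2 }'
}
```

You could then store the result with `VERSION="$(parabricks_version)"` and reuse it wherever the version number is required.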

Make modifications to the Parabricks Docker image

This step and the following ones should be done on the machine with the Parabricks installation. You will use the Docker image from the standard install as your base image and build on top of it with the following Dockerfile. Make sure to add your Parabricks version number (for example, 3.7.0-1) wherever it says <VERSION NUMBER>.

$ cat Dockerfile
FROM parabricks/release:<VERSION NUMBER>

# Untar this folder to access the pbrun executable
RUN cd /parabricks && tar xzvf release-<VERSION NUMBER>.tar.gz

# Add the pbrun executable to the path
ENV PATH="/parabricks/release-<VERSION NUMBER>:${PATH}"

# Remove the entrypoint from the container to work with AWS Batch
ENTRYPOINT [""]
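
Rather than editing the placeholder by hand, you could fill it in with a short script. This is a sketch under a few assumptions: you have saved the Dockerfile above, with its placeholders intact, under a name of your choosing (Dockerfile.template is used here only as an example), and your version string contains no characters that are special to sed, which holds for versions like 3.7.0-1.

```shell
#!/bin/sh
# Sketch only: render_dockerfile and the template file name are illustrative.
# Usage: render_dockerfile <version> <template file> <output file>
render_dockerfile() {
  # Replace every "<VERSION NUMBER>" placeholder with the given version.
  sed "s/<VERSION NUMBER>/$1/g" "$2" > "$3"
}
```

For example, `render_dockerfile 3.7.0-1 Dockerfile.template Dockerfile` writes a Dockerfile whose first line reads `FROM parabricks/release:3.7.0-1`.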

Now build the Docker image and tag it with the URI for the ECR repository from step 1.

$ docker build -t <URI for your ECR repository> .

Upload the modified Docker image to ECR

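Lastly, upload this modified version of the Parabricks container to the ECR repository. Log in to the registry first, using the region where you created the repository (us-west-2 in this example), then push the image.

$ aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin <URI for your ECR repository>
$ docker push <URI for your ECR repository>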
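
The login command hard-codes us-west-2, but the region has to match the one your repository lives in. Since an ECR repository URI has the form `<account id>.dkr.ecr.<region>.amazonaws.com/<name>`, both the region and the registry host can be derived from it. The sketch below does that and then logs in and pushes; the function names are illustrative, and it assumes the image was already built and tagged with the repository URI.

```shell
#!/bin/sh
# Sketch only: function names are illustrative.

# The region is the fourth dot-separated field of the repository URI.
ecr_region_from_uri() {
  printf '%s\n' "$1" | cut -d. -f4
}

# Log in to the registry that hosts the repository, then push the image.
# Usage: push_to_ecr <repository URI>
push_to_ecr() {
  local repo_uri="$1"
  local registry="${repo_uri%%/*}"   # registry host: everything before the first slash
  local region
  region="$(ecr_region_from_uri "$repo_uri")"
  aws ecr get-login-password --region "$region" \
    | docker login --username AWS --password-stdin "$registry" \
    && docker push "$repo_uri"
}
```

Deriving the region from the URI avoids a mismatch between the login region and the repository region, which would otherwise cause the login to fail.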

Now you are ready to kick off the Batch pipeline.

Using AWS Batch with AWS CloudFormation

AWS CloudFormation allows users to programmatically provision hardware resources. You can read more about it on the CloudFormation homepage. The templates provided in the Parabricks Quick Start Guide are configured to create the following architecture:

Figure 2: Architecture for running Parabricks jobs on AWS leveraging AWS Batch

Here you have two EC2 instances, each in its own availability zone, ready to accept Parabricks jobs. The guide will download sample data and send one job to each EC2 instance.

Visit the deployment guide and click on Get Started. This will open the CloudFormation console with the template loaded into the GUI, ready to be filled in.

First, give this stack a name (for example, “Parabricks-Test”). Stack names can contain only letters, numbers, and hyphens, and must start with a letter.

Figure 3: Enter stack name for AWS CloudFormation.
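
CloudFormation accepts only stack names that start with a letter, contain nothing but letters, digits, and hyphens, and are at most 128 characters long, so a name containing a space is rejected. If you script your deployments, a small check like the following sketch (the function name is made up for illustration) can catch a bad name before the stack is created:

```shell
#!/bin/sh
# Sketch only: is_valid_stack_name is an illustrative helper, not an AWS tool.
# Succeeds if the argument is an acceptable CloudFormation stack name.
is_valid_stack_name() {
  [ "${#1}" -le 128 ] && printf '%s' "$1" | grep -Eq '^[A-Za-z][A-Za-z0-9-]*$'
}
```

For example, `is_valid_stack_name "Parabricks-Test" && echo ok` prints ok, while the same check on a name with a space prints nothing.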

Under Network Configuration, select two availability zones for this project. You can also configure which subnets to use. For most users, the defaults will work.

Figure 4: Enter parameters (availability zone and CIDR) for AWS CloudFormation.

Under Parabricks Quick Start Configuration, enter the URI of the Parabricks ECR repository you created during the prerequisite steps. Next, select a Key Pair from the dropdown so that you can log in to the nodes running your jobs if you want to. Then select your instance type from the dropdown. All other fields can be left with their default values.

Figure 5: Enter configuration for Parabricks Quick Start.

Under AWS Quick Start Configuration, you can leave the defaults as they are.

Figure 6: Use default setting for Amazon S3.

You can skip the Configure stack options section and jump straight to the Review page. Check that the options you have selected are correct, then click Create Stack, which takes you to the stack creation page.

Figure 7: Dashboard to view status of stack creation with AWS CloudFormation.

Once the stack has finished creating, you can see the jobs starting in the Jobs section of the AWS Batch dashboard.
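
The same check can be done from the command line. The sketch below waits for the stack to finish creating and then lists the jobs in a Batch job queue. The function name is illustrative, and the job queue name is an assumption on your side: this post does not state what the template names its queue, so look it up in the AWS Batch console first.

```shell
#!/bin/sh
# Sketch only: show_parabricks_jobs is an illustrative helper.
# Usage: show_parabricks_jobs <stack name> <job queue name> [job status]
show_parabricks_jobs() {
  local stack_name="$1" job_queue="$2" status="${3:-RUNNING}"
  # Block until CloudFormation reports that stack creation has completed.
  aws cloudformation wait stack-create-complete --stack-name "$stack_name" || return 1
  # List the name and status of each job in the queue with the requested status.
  aws batch list-jobs \
    --job-queue "$job_queue" \
    --job-status "$status" \
    --query 'jobSummaryList[].[jobName,status]' \
    --output table
}
```

Passing a different status as the third argument, such as SUCCEEDED or FAILED, shows jobs that have already finished.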

And that’s it. Congratulations! You’ve successfully run Parabricks on AWS Batch using AWS CloudFormation. These templates are available on GitHub and can be adapted to scale up to more nodes and to more complex pipelines.

Conclusion

In this blog post, we demonstrated how to use the Parabricks Quick Start on AWS, leveraging AWS Batch and AWS CloudFormation, for secondary analysis of next-generation sequencing data. Visit the Parabricks Quick Start Deployment Guide and try out the demo for yourself. For more information on Parabricks, check out the homepage and documentation.

The content and opinions in this blog are those of the third-party author and AWS is not responsible for the content or accuracy of this blog.

Gary Burnett

Gary Burnett is a Technical Marketing Engineer at NVIDIA. He helps customers use Clara Parabricks software to accelerate their genomics pipelines. Gary received his bachelor’s degrees in Computer Science and Neuroscience from MIT and is currently working towards a master’s degree in Biomedical Informatics at Stanford University.

Olivia Choudhury

Olivia Choudhury, PhD, is a Senior Partner Solutions Architect at AWS. She helps partners, in the Healthcare and Life Sciences domain, design, develop, and scale state-of-the-art solutions leveraging AWS. She has a background in genomics, healthcare analytics, federated learning, and privacy-preserving machine learning. Outside of work, she plays board games, paints landscapes, and collects manga.