Protein Structure Prediction at Scale using AWS Batch

This post was contributed by Chris Downing, Senior HPC Consultant, AWS Professional Services; Harish Khataukar, Engagement Manager, AWS Professional Services; Per Greisen, Director of Computational Drug Discovery, Novo Nordisk; and Tanmoy Sanyal, Computational Drug Discovery Scientist, Novo Nordisk.

Introduction

Neural networks have been used to predict 3D protein structures from primary sequences for many years with limited success. It was not until CASP13 in 2018 that the DeepMind AlphaFold project showed that deep learning could substantially improve protein structure prediction. By 2020, years ahead of even the most optimistic estimates, AlphaFold was producing predictions that approached the accuracy of experimentally determined structures for globular proteins. This success transformed the field of structural biology and spawned an ecosystem of modified versions of AlphaFold that address related problems, ranging from peptide docking to protein design.

Structure prediction has impacted the Healthcare and Life Sciences (HCLS) industry in multiple areas, from early drug discovery and drug design to testing biological hypotheses. As a result, there is a growing need for scientists working in industry to access and use tools such as AlphaFold.

In November 2021, an AWS blog post was published on how to run AlphaFold on Amazon Elastic Compute Cloud (Amazon EC2) instances. Another post published in early 2022 details a scalable solution implemented with AWS Batch for OpenFold, another breakthrough structure prediction algorithm based on AlphaFold. In this blog post, we describe a solution implemented by Novo Nordisk with the help of AWS Professional Services, which takes the concepts described in the two posts mentioned above and extends them to suit an enterprise environment – where governance requirements for the platform and the data need to be considered in addition to the scientific workflow.

Challenge

Novo Nordisk began its work with AlphaFold in early 2022 by deploying a containerized version of the application to AWS ParallelCluster. This platform gave researchers easy access to a familiar HPC environment, enabling fast experimentation and learning. However, it soon became clear that rapid adoption of AlphaFold by researchers would necessitate adopting a solution where the workflow could be easily templated and managed by the HPC platform team, with users only responsible for the scientific inputs and outputs.

The new solution needed to support multiple Availability Zones and instance types (due to the volume of jobs users wanted to run), including the possibility of using different instance types for different stages of a job. At the same time, the solution needed to be integrated with the pre-existing data lake built by Novo Nordisk on Amazon Simple Storage Service (Amazon S3), so that researchers could collaborate effectively by sharing access to AlphaFold inputs and outputs. The team also wished to implement a simplified user interface, so that researchers would not need to learn any specific details about AWS infrastructure in order to execute containerized AlphaFold jobs using AWS Batch. Finally, maintenance and operations were required to follow common best practices: using Infrastructure-as-Code wherever possible, and deploying the infrastructure through a CI/CD pipeline that includes a development/test environment in a separate account. Changes should be pushed to the production environment only after a manual approval step has been performed.

Solution

Supported by AWS Professional Services, Novo Nordisk chose to deploy a solution based on AWS Batch compute resources, Amazon FSx for Lustre storage, and AWS Step Functions for job orchestration. Users wished to submit jobs via a programmatic interface (e.g. by writing their own Python code which submits many jobs after pre-processing), so Jupyter notebooks were deployed and a simple Python library prepared to make the job submission process easier.

The deployment process used AWS Cloud Development Kit (AWS CDK) Pipelines to create the required infrastructure, orchestrated via an AWS CodeCommit repository and AWS CodeBuild jobs. Implementing the deployment workflow within AWS CodePipeline allows change controls to be applied to both the infrastructure and new container image builds. With this solution, the Novo Nordisk HPC team can trigger the building of containerized applications (AlphaFold and also other tools) or make changes to the AWS Batch components of the platform (i.e. the compute environments, queues and job definitions) simply by pushing updated code to a git repository.

Figure 1 – Architecture diagram for the AlphaFold solution. CI/CD components are hosted in a separate account. Public resources (e.g. the AlphaFold GitHub repository and NVIDIA CUDA base container image) are copied into this account. During the deployment process, the CDK code describing the AWS Batch, FSx for Lustre and Step Functions configurations is executed. The resulting infrastructure is launched into a separate account for testing, followed by redeployment to a production account. Jupyter notebooks are deployed separately on an as-needed basis.

The solution was constructed based on the following core concepts:

Orchestration

The AlphaFold workflow consists of two distinct steps: Multiple Sequence Alignment (MSA) and structure prediction. To allow the compute resources for each step to be optimized separately, the workflow is expressed as a Step Functions state machine containing two separate AWS Batch job submissions. The state machine also includes a preliminary step that validates the formatting of the input data and moves the data from a user-specified location in S3 into the application “scratch” space, and a final step that moves the results back once a job is complete. In the event of a job failure in either step, AWS Lambda functions are executed to perform clean-up tasks, such as the removal of intermediate data from the scratch file-system.
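The shape of such a workflow can be sketched in Amazon States Language, built here as a Python dictionary. This is a minimal illustration under stated assumptions, not the definition used by Novo Nordisk: the state names, the Lambda function names, the `JOB_ID` environment variable and the `job_id` input field are all hypothetical placeholders. Only the two service integration resource strings (`batch:submitJob.sync` and `lambda:invoke`) are standard Step Functions integrations.

```python
import json


def lambda_task(function_name, next_state):
    """Task state that invokes a Lambda function and passes its input through."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": {"FunctionName": function_name, "Payload.$": "$"},
        "ResultPath": None,  # discard the Lambda result, keep the original input
        "Next": next_state,
    }


def batch_task(job_name, job_queue, job_definition, next_state):
    """Task state that submits an AWS Batch job and waits for it to finish."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::batch:submitJob.sync",
        "Parameters": {
            "JobName": job_name,
            "JobQueue": job_queue,
            "JobDefinition": job_definition,
            # Hypothetical convention: the container finds its files via JOB_ID.
            "ContainerOverrides": {
                "Environment": [{"Name": "JOB_ID", "Value.$": "$.job_id"}]
            },
        },
        "ResultPath": None,
        # Any error in the Batch job is routed to the clean-up branch.
        "Catch": [
            {
                "ErrorEquals": ["States.ALL"],
                "ResultPath": "$.error",
                "Next": "CleanUpAfterFailure",
            }
        ],
        "Next": next_state,
    }


def build_definition(cpu_queue, gpu_queue, msa_job_definition, predict_job_definition):
    """Assemble the two-step workflow as an Amazon States Language dict."""
    return {
        "Comment": "Illustrative AlphaFold workflow (all names are placeholders)",
        "StartAt": "ValidateAndStageInputs",
        "States": {
            "ValidateAndStageInputs": lambda_task("stage-inputs", "RunMSA"),
            "RunMSA": batch_task("msa", cpu_queue, msa_job_definition, "RunPrediction"),
            "RunPrediction": batch_task(
                "predict", gpu_queue, predict_job_definition, "StageOutputs"
            ),
            "StageOutputs": lambda_task("stage-outputs", "CleanUpAfterSuccess"),
            "CleanUpAfterSuccess": lambda_task("clean-scratch", "Done"),
            "CleanUpAfterFailure": lambda_task("clean-scratch", "JobFailed"),
            "Done": {"Type": "Succeed"},
            "JobFailed": {"Type": "Fail", "Error": "AlphaFoldJobFailed"},
        },
    }


def dangling_targets(definition):
    """Return transition targets that do not name a state in the definition."""
    states = definition["States"]
    targets = {definition["StartAt"]}
    for state in states.values():
        if "Next" in state:
            targets.add(state["Next"])
        for catcher in state.get("Catch", []):
            targets.add(catcher["Next"])
    return sorted(targets - set(states))


def to_json(definition):
    """Serialize the definition in the form Step Functions accepts."""
    return json.dumps(definition, indent=2)
```

Because each Batch step takes its own job queue, the MSA step can point at a CPU-only queue while the prediction step points at a GPU queue. Calling `dangling_targets` before deployment is a cheap way to catch a misspelled state name.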

Compute

Requirements for scalability and repeatability are addressed using AWS Batch, with compute environments deployed to use all Availability Zones of the target region. Both the MSA and prediction steps are configured to use a range of instance types in order to maximize the available EC2 capacity. Compute environment allocation strategies are configured as “best fit progressive”, meaning that AWS Batch prefers the lowest-cost instance types among the specified lists and falls back to the other listed types when that capacity is not available.
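As a rough illustration of the configuration described above, the following sketch builds the keyword arguments for the AWS Batch `create_compute_environment` API call. The solution itself was deployed with AWS CDK, so this boto3-style request is a stand-in chosen for brevity. The environment names, instance type lists, subnet IDs, security group ID and role ARN are hypothetical and are not taken from the Novo Nordisk deployment.

```python
def compute_environment_request(
    name, instance_types, subnet_ids, security_group_ids, instance_role_arn, max_vcpus
):
    """Build keyword arguments for the AWS Batch create_compute_environment call."""
    return {
        "computeEnvironmentName": name,
        "type": "MANAGED",
        "state": "ENABLED",
        "computeResources": {
            "type": "EC2",
            # Prefer cheaper instance types, fall back to others in the list.
            "allocationStrategy": "BEST_FIT_PROGRESSIVE",
            "minvCpus": 0,  # scale to zero when no jobs are queued
            "maxvCpus": max_vcpus,
            "instanceTypes": list(instance_types),
            # One subnet per Availability Zone spreads capacity across the region.
            "subnets": list(subnet_ids),
            "securityGroupIds": list(security_group_ids),
            "instanceRole": instance_role_arn,
        },
    }


# Hypothetical values for illustration only.
SUBNETS = ["subnet-aaa", "subnet-bbb", "subnet-ccc"]
ROLE = "arn:aws:iam::111122223333:instance-profile/example-batch-instance-role"

msa_environment = compute_environment_request(
    "alphafold-msa-cpu",
    ["c5.4xlarge", "m5.4xlarge", "r5.4xlarge"],
    SUBNETS,
    ["sg-example"],
    ROLE,
    max_vcpus=1024,
)
prediction_environment = compute_environment_request(
    "alphafold-predict-gpu",
    ["g4dn.2xlarge", "g5.2xlarge", "p3.2xlarge"],
    SUBNETS,
    ["sg-example"],
    ROLE,
    max_vcpus=256,
)
```

With boto3 available and credentials configured, each dictionary could be passed as `boto3.client("batch").create_compute_environment(**msa_environment)`. Keeping separate environments for the two steps is what allows the CPU and GPU instance lists to differ.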

As the MSA step configured in this workflow does not receive significant performance gains when executed on a GPU-enabled EC2 instance, using CPU-only instance types for the first step can deliver cost savings of around 40% relative to the same workflow executed end-to-end on a GPU instance.
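The size of this saving depends on two quantities: the fraction of the total runtime spent in the MSA step, and the price ratio between the CPU and GPU instances. The sketch below works through the arithmetic, assuming the MSA step takes the same wall-clock time on either instance type. The runtimes and hourly prices are illustrative inputs only, not measurements or prices from the Novo Nordisk workload.

```python
def relative_saving(msa_hours, predict_hours, cpu_price, gpu_price):
    """Fractional cost saving from running MSA on a CPU instance.

    Compares a job run end-to-end on a GPU instance with a job whose MSA
    step runs on a CPU instance. Prices are per instance-hour.
    """
    gpu_only_cost = (msa_hours + predict_hours) * gpu_price
    split_cost = msa_hours * cpu_price + predict_hours * gpu_price
    return (gpu_only_cost - split_cost) / gpu_only_cost


# Illustrative inputs: MSA takes twice as long as prediction, and the CPU
# instance costs 40% of the GPU instance per hour.
saving = relative_saving(msa_hours=2.0, predict_hours=1.0, cpu_price=0.4, gpu_price=1.0)
print(f"{saving:.0%}")
```

Equivalently, the saving is the MSA share of the runtime multiplied by one minus the CPU-to-GPU price ratio, so the benefit grows as the MSA step dominates the job.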

Storage

The solution includes two usage models for FSx for Lustre. One file-system per Availability Zone is deployed as a static read-only cache of the AlphaFold database, linked to an Amazon S3 bucket via a Data Repository Association (DRA). This setup allows for periodic database updates as new protein structure data becomes available. The second set of file-systems is also deployed one per Availability Zone, and is used as “scratch” space for running jobs. These file-systems are linked to a matching “scratch” S3 bucket via another DRA. Data is moved in and out of the scratch space by populating and clearing the S3 bucket, with changes propagated to FSx for Lustre.

Input files initially reside in an S3 bucket which is part of the enterprise data lake, and are copied to the scratch bucket by a Lambda function. The DRA enables files placed in the S3 bucket to be quickly made available via FSx for Lustre. Once the AlphaFold workflow is completed, the same DRA causes files written to FSx for Lustre to be copied to the scratch bucket. Outputs are subsequently copied from the scratch bucket by another Lambda function to a target output location in another data lake bucket or prefix. Once all outputs are copied to their target destination, a clean-up Lambda function removes redundant files from the scratch S3 bucket, and the DRA causes the same files to also be removed from the Lustre file-system.
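The input staging step can be sketched as a small function of the kind a Lambda handler might call. This is an assumption-laden illustration rather than the deployed code: the `jobs/<job_id>/input/` prefix layout and the function names are hypothetical. The S3 client is passed in as an argument so the logic can be exercised without AWS access. `copy_object` is the standard boto3 S3 call, which handles objects up to 5 GB in a single request.

```python
import posixpath


def scratch_key(job_id, source_key):
    """Map a data lake object key to its location under the job's scratch prefix."""
    return f"jobs/{job_id}/input/{posixpath.basename(source_key)}"


def stage_inputs(s3_client, job_id, source_bucket, source_keys, scratch_bucket):
    """Copy input objects from the data lake bucket into the scratch bucket.

    Returns the list of keys written to the scratch bucket. The Data
    Repository Association then exposes those objects on FSx for Lustre.
    """
    staged = []
    for key in source_keys:
        target = scratch_key(job_id, key)
        s3_client.copy_object(
            Bucket=scratch_bucket,
            Key=target,
            CopySource={"Bucket": source_bucket, "Key": key},
        )
        staged.append(target)
    return staged


# Inside a Lambda handler the client would typically be created once with
# boto3.client("s3") and passed to stage_inputs for each invocation.
```

The output and clean-up steps would follow the same pattern in reverse, copying from the scratch prefix to the data lake and then deleting the scratch objects.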

Management

CI/CD pipelines were implemented using CodePipeline for both the AWS CDK based infrastructure deployment and the container build process. The overall environment consists of separate development and production accounts, with the back-end deployment processes hosted in a third account. The goal of this separation is to minimize the risk of disruptive accidental changes when the platform developers wish to make updates, such as expanding the Step Functions workflow.

Updates to the CDK and container definition repositories result in automated deployments to the development account, and execution of an approvals process requiring manual validation before subsequently deploying to the production account. Container images generated by the build process are stored within an Amazon Elastic Container Registry (Amazon ECR) in the deployment account and shared to the other accounts, with consistent tagging maintained across each repository.

Reproducibility

In order to avoid unplanned changes and configuration drift during the container image development and update process, key components are cached within the tooling account. A clone of the base image is retained within ECR, and copies of the GitHub-hosted repositories and other static resources are stored in an S3 bucket. During the container image build process, local copies of these files are used rather than downloading new copies from the internet.

User interface

Researchers need to be able to submit jobs in a high-throughput fashion, without having a lot of prior knowledge about AWS Batch or Step Functions. Since most users are comfortable working in a Jupyter notebook, the solution includes a small Python client library (developed as part of this engagement) which provides an abstraction layer for AlphaFold submission. The client library effectively provides a simplified AlphaFold interface; most of the relevant AlphaFold parameters (such as file-system paths for database inputs) have defaults which are configured as part of the Step Function definition. As a result, the user is only required to provide a path to one or more input files located in an S3 bucket in order to launch a job. The same client library also provides monitoring functionality, so that users can track the state of jobs after submission.
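A client library of this kind can be very small. The sketch below is a hypothetical illustration, not the library developed during the engagement: the function names, the shape of the execution input and the `alphafold-` naming prefix are all assumptions. The `start_execution` and `describe_execution` calls are the standard boto3 Step Functions client methods, and the client is passed in as an argument so the code can be tested without AWS access.

```python
import json
import time
import uuid


def build_execution_input(input_uris, output_uri):
    """Validate the user's S3 locations and build the state machine input."""
    for uri in list(input_uris) + [output_uri]:
        if not uri.startswith("s3://"):
            raise ValueError(f"expected an s3:// URI, got {uri!r}")
    return {
        "job_id": uuid.uuid4().hex,
        "inputs": list(input_uris),
        "output": output_uri,
    }


def submit(sfn_client, state_machine_arn, input_uris, output_uri):
    """Start one workflow execution and return its execution ARN."""
    payload = build_execution_input(input_uris, output_uri)
    response = sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        name=f"alphafold-{payload['job_id']}",
        input=json.dumps(payload),
    )
    return response["executionArn"]


def wait_for(sfn_client, execution_arn, poll_seconds=30):
    """Poll an execution until it leaves the RUNNING state, then return its status."""
    while True:
        status = sfn_client.describe_execution(executionArn=execution_arn)["status"]
        if status != "RUNNING":
            return status
        time.sleep(poll_seconds)
```

From a notebook, a user would create the client with `boto3.client("stepfunctions")`, call `submit` with one or more `s3://` paths, and pass the returned ARN to `wait_for`. All AlphaFold-specific parameters stay in the state machine definition, which is why the user-facing call needs only input and output locations.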

Outcome

By deploying a complete workflow orchestration solution based on AWS Batch and Step Functions, Novo Nordisk was able to provide a highly scalable compute environment and enable significantly greater job throughput for researchers exploring the use of AlphaFold, while also delivering significant cost savings relative to the previous workflow.

The adoption of Infrastructure-as-Code, CI/CD pipelines and a deployment approval process makes the solution both reproducible and consistent with the business’ operating procedures, while minimizing the ongoing maintenance and development burden. Integration with the enterprise data lake enables AlphaFold outputs to be shared with other researchers across the company in a low-friction manner.

A customized interface library allows users to interact with the solution with very little barrier to entry; only a few lines of code are needed to submit and monitor AlphaFold tasks, and the library can be incorporated into other tools in order to automate the submission and post-processing of a large number of jobs.

Finally, the architecture and design principles adopted mean that Novo Nordisk can rapidly on-board new containerized scientific computing workloads to AWS Batch, with a consistent experience for both users and developers.

If you would like to learn more about using AWS Batch for your HPC workload, you can read more about the Batch service or try it out for yourself by completing a tailored step-by-step workshop. For support with deploying a solution like the one described here, reach out to AWS Professional Services.

AWS HPC Blog