How Evolvere Biosciences performs macromolecule design on AWS

This post was contributed to by George Wicks, Adam Winnifrith, Piotr Jedryszek, and Weronika Slesak, from Evolvere Biosciences, and Joshua Broyde from AWS.

Engineering biology to specify and predict the functions of macromolecules from their sequences of amino acids, nucleic acids, or sugars has the potential to open a world of possibilities in therapeutic design. Over the past few years, a number of deep learning and generative AI approaches have added computational complexity to this essential task. These structure-based workloads come in several types; a few common ones are shown in the figure below.

Figure 1: Four different types of workloads commonly used in protein structure analysis: (1) protein folding, (2) protein design, (3) inverse protein folding, and (4) predicting protein-protein interactions.

However, executing such a workload successfully involves a number of important steps. These can include predicting one or more original protein structures, deep learning-based redesign of an input protein, and further screening, processing, and refinement of the resulting structures. A key problem for pharmaceutical companies and others running such workloads is that it becomes very difficult to orchestrate them so that experiments are consistent, reproducible, and scalable.

In this blog post, we show how Evolvere Biosciences built and deployed its protein design platform on AWS. Critically, the platform takes advantage of AWS compute (AWS Batch) and storage (Amazon S3) services while also using third-party orchestration tools; specifically, Evolvere Biosciences deployed Nextflow as its orchestration tool of choice. We also include sample code, which can be found here.

About Evolvere Biosciences

Computational generative protein design is a core component of Evolvere Biosciences' first-principles approach to redesigning antibacterial medicines for the 21st century. Evolvere Biosciences is building a platform that rapidly generates proteins that can target any chosen bacteria with precision while minimizing the emergence of resistance, causing minimal side effects, and requiring only low-dosing regimens. Evolvere Biosciences wants to open-source the tools it develops to encourage, equip, and inspire other problem solvers. Evolvere Biosciences is a recipient of investment from the BioEscalator part of an AWS Activate program for Startups.

Architecture

Evolvere Biosciences customized the AWS Solutions Library Guidance for Protein Structure Prediction for its scientists to use when deploying protein design workloads. As described in that guidance, this architecture leverages AWS CloudFormation templates to deploy infrastructure for protein prediction and design.

The architecture used by Evolvere Biosciences is shown in the diagram below:

Figure 2: The architecture deployed by Evolvere Biosciences for its antibody design pipeline. Note that steps 5 through 9 diverge from the AWS Solutions Library Guidance for Protein Structure Prediction to accommodate Nextflow as the orchestrator, which in turn submits jobs to AWS Batch.

1. AWS CloudFormation deploys the infrastructure in an AWS account.
2. AWS CodeBuild builds the containers necessary to run algorithms such as AlphaFold and OpenFold.
3. AWS Lambda triggers the download of model artifacts and reference data to an Amazon FSx for Lustre file system.
4. Users define and submit analysis jobs from an Amazon SageMaker Notebook Instance or other Python environment.
5. Evolvere Biosciences scientists create custom images for Nextflow orchestration and macromolecule analysis and push them to Amazon Elastic Container Registry (ECR).
6. Data scientists submit input to Amazon S3. This includes:
   1. Raw data, such as FASTA and PDB files.
   2. A .nf file containing instructions for the Nextflow orchestrator. This can optionally include custom dependencies and other scripts as well.
7. The user then submits the AWS Batch job to run the Nextflow script. Note that while in development this is done by a data scientist, this step can also be triggered by a service such as AWS Lambda or a third-party application.
8. The orchestrator runs the pipeline. To minimize costs, the orchestrator runs on a small, non-accelerated instance type.
9. The Nextflow orchestrator creates new jobs within AWS Batch. These jobs can be computationally complex and involve multiple steps on GPU and CPU nodes.
10. As jobs progress, they write data to intermediate Amazon S3 locations for processing by future steps.
11. The final output is written to an Amazon S3 bucket when the pipeline is finished.
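
The submission step, in which a user submits the AWS Batch job that runs the Nextflow script, might look roughly like the following in Python. This is a minimal sketch, not the code Evolvere Biosciences uses: the bucket, queue, and job definition names are hypothetical, and it assumes the orchestrator image's entrypoint accepts the S3 URI of the .nf file as its command. The only AWS call is boto3's `submit_job` on the Batch client.

```python
"""Sketch: submit the Nextflow orchestrator as an AWS Batch job (all names hypothetical)."""


def build_orchestrator_job(script_s3_uri, job_queue, job_definition,
                           job_name="nextflow-orchestrator"):
    """Build keyword arguments for the AWS Batch SubmitJob API.

    Assumption: the orchestrator container's entrypoint takes the S3 URI
    of the .nf script as its only argument.
    """
    if not script_s3_uri.startswith("s3://"):
        raise ValueError("script_s3_uri must be an s3:// URI")
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "containerOverrides": {"command": [script_s3_uri]},
    }


def submit_orchestrator_job(request, client=None):
    """Submit the request to AWS Batch and return the new job ID."""
    if client is None:
        import boto3  # imported lazily so building a request needs no AWS setup
        client = boto3.client("batch")
    return client.submit_job(**request)["jobId"]


request = build_orchestrator_job(
    script_s3_uri="s3://example-bucket/pipelines/design.nf",  # hypothetical location
    job_queue="orchestrator-queue",                            # hypothetical queue
    job_definition="nextflow-orchestrator",                    # hypothetical definition
)
print(request)
```

Calling `submit_orchestrator_job(request)` from a notebook, an AWS Lambda function, or another application would then start the pipeline, which is why the same step can be triggered by a person during development and by a service later on.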

Orchestration of Macromolecule Design Pipelines

AWS provides a number of managed orchestration tools, such as AWS Step Functions, Amazon MWAA, and Amazon SageMaker Pipelines. Because of Nextflow's popularity within the bioinformatics community, Evolvere Biosciences chose to deploy it as its orchestration tool. Nextflow can run AWS Batch jobs and can also integrate with Amazon Omics Workflows. Workflow orchestration provides a number of critical advantages for Evolvere Biosciences. The key ones are:

Decoupling steps of the pipeline: While some steps in the pipeline involve extremely compute-intensive or GPU-based jobs, others require fewer resources. Decoupling the steps with an orchestrator allows each step to consume only the resources it needs.
Passing pipeline steps as parameters: Different macromolecule experiments require different parameters. Orchestrating the pipeline lets scientists programmatically select which steps to run and which hyperparameters to use for each step.
Integration with other services: By building its protein design pipeline with an orchestration tool, Evolvere Biosciences can connect it to other AWS services, such as AWS Lambda and Amazon EventBridge. This enables other applications to trigger the pipeline automatically, maximizing scalability.
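
The first two advantages can be illustrated with a small Python sketch. It assumes a hypothetical catalogue of pipeline steps, each with its own resource request, and turns a scientist's selection of steps into the `resourceRequirements` format that AWS Batch accepts (types `VCPU`, `MEMORY` in MiB, and `GPU`, with string values). The step names and resource sizes are placeholders, not Evolvere Biosciences' actual settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepSpec:
    """Resources requested by one pipeline step (values are placeholders)."""
    name: str
    vcpus: int
    memory_mib: int
    gpus: int = 0

    def resource_requirements(self):
        """Format the request as an AWS Batch resourceRequirements list."""
        requirements = [
            {"type": "VCPU", "value": str(self.vcpus)},
            {"type": "MEMORY", "value": str(self.memory_mib)},
        ]
        if self.gpus:
            # Only GPU steps ask for an accelerator, so CPU steps stay cheap.
            requirements.append({"type": "GPU", "value": str(self.gpus)})
        return requirements


# Hypothetical step catalogue: GPU steps and CPU steps are sized independently.
CATALOGUE = {
    "design": StepSpec("design", vcpus=8, memory_mib=32768, gpus=1),
    "inverse_fold": StepSpec("inverse_fold", vcpus=4, memory_mib=16384, gpus=1),
    "fold": StepSpec("fold", vcpus=8, memory_mib=32768, gpus=1),
    "score": StepSpec("score", vcpus=2, memory_mib=4096),
}


def plan(selected_steps):
    """Return (step name, resource requirements) pairs for the chosen steps."""
    unknown = [step for step in selected_steps if step not in CATALOGUE]
    if unknown:
        raise KeyError(f"unknown steps: {unknown}")
    return [(step, CATALOGUE[step].resource_requirements()) for step in selected_steps]


# A scientist selects only the steps needed for this experiment.
for step_name, requirements in plan(["design", "score"]):
    print(step_name, requirements)
```

Because each step carries its own request, a lightweight scoring step never occupies a GPU instance, and selecting a different subset of steps is a parameter change rather than a code change.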

To better scale the deployment of Nextflow, Evolvere Biosciences uses the AWS Batch Squared architecture, in which the Nextflow orchestrator itself runs as an AWS Batch job. This orchestrator job has permission to submit new jobs to other AWS Batch queues. As each step of the process finishes, the orchestrator job evaluates the results and executes the next steps. The orchestrator job does not itself perform any of the protein structure predictions.
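
The pattern of an orchestrator job that submits child jobs and waits on them can be sketched as follows. In practice Nextflow's AWS Batch executor handles this internally, so this is only an illustration of the idea, not Nextflow's implementation. The sketch uses the boto3 Batch operations `submit_job` and `describe_jobs`; the `FakeBatchClient` class is a hypothetical stand-in so the example runs without AWS credentials, and all job, queue, and definition names are made up.

```python
import time

TERMINAL_STATUSES = {"SUCCEEDED", "FAILED"}


def run_step(client, job_name, job_queue, job_definition, poll_seconds=30):
    """Submit one child job and block until it reaches a terminal status."""
    job_id = client.submit_job(
        jobName=job_name, jobQueue=job_queue, jobDefinition=job_definition
    )["jobId"]
    while True:
        status = client.describe_jobs(jobs=[job_id])["jobs"][0]["status"]
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)


def run_pipeline(client, steps, poll_seconds=30):
    """Run steps in order; the orchestrator only coordinates, it does no science."""
    completed = []
    for step in steps:
        status = run_step(client, poll_seconds=poll_seconds, **step)
        if status != "SUCCEEDED":
            raise RuntimeError(f"step {step['job_name']} ended with {status}")
        completed.append(step["job_name"])
    return completed


class FakeBatchClient:
    """Hypothetical stand-in for boto3.client('batch'), used only for this demo."""

    def __init__(self):
        self.submitted = []
        self._polls = {}

    def submit_job(self, jobName, jobQueue, jobDefinition):
        job_id = f"job-{len(self.submitted)}"
        self.submitted.append((jobName, jobQueue))
        self._polls[job_id] = 0
        return {"jobId": job_id}

    def describe_jobs(self, jobs):
        job_id = jobs[0]
        self._polls[job_id] += 1
        # Report RUNNING on the first poll and SUCCEEDED afterwards.
        status = "RUNNING" if self._polls[job_id] < 2 else "SUCCEEDED"
        return {"jobs": [{"jobId": job_id, "status": status}]}


client = FakeBatchClient()
steps = [
    {"job_name": "design", "job_queue": "gpu-queue", "job_definition": "design-def"},
    {"job_name": "score", "job_queue": "cpu-queue", "job_definition": "score-def"},
]
completed = run_pipeline(client, steps, poll_seconds=0)
print(completed)
```

Note how each child job goes to its own queue: the orchestrator can stay on a small CPU instance while the GPU queue handles the heavy steps.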

Contents of the Pipeline

Evolvere Biosciences aims to make an open-source, state-of-the-art protein design pipeline that is easy for scientists across the world to use. The pipeline relies on a number of key steps for generating designed proteins. The overall logic of the approach is as follows:

Figure 3: An overview of the steps used in the Evolvere Biosciences Macromolecule Design Pipeline

In the first iteration of their design pipeline, protein design tools such as RFDesign and RFDiffusion are used as the generative component for motif scaffolding.

First, the design criteria are specified in an Amazon SageMaker Notebook.
A protein design tool then generates a specified number of designed proteins that meet the applied constraints.
Inverse folding algorithms, such as ProteinMPNN and ESM-IF, are then used to increase the designability of the protein design tool outputs.
Protein structure prediction algorithms like AlphaFold and ESMFold predict and score the “foldability” of the generated candidates.
Finally, additional open-source models and custom biophysical models are used to evaluate the structures.

This pipeline is not required to be linear; instead, it allows for custom logic in which steps can be repeated as needed. For example, structures may undergo multiple rounds of refinement as required for a specific instantiation of the pipeline. In addition, the orchestrator can either submit jobs serially or perform fan-out operations, where the output of one step kicks off multiple jobs.
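
The control flow described here, a fan-out followed by optional rounds of refinement, can be sketched in plain Python. Every function below is a hypothetical placeholder: the "design", "fold", and "refine" steps return dummy identifiers and scores rather than calling any real model, and a thread pool stands in for the separate AWS Batch jobs that the orchestrator would submit.

```python
from concurrent.futures import ThreadPoolExecutor


def design_backbones(count):
    """Placeholder for a generative design step; returns dummy design IDs."""
    return [f"backbone_{index}" for index in range(count)]


def fold_and_score(design_id):
    """Placeholder for structure prediction; the score is a deterministic dummy."""
    index = int(design_id.rsplit("_", 1)[1])
    return design_id, 60.0 + 10.0 * index


def refine(scored_design):
    """Placeholder refinement round; adds a fixed dummy improvement."""
    design_id, score = scored_design
    return design_id, score + 10.0


def fan_out(designs, max_workers=4):
    """Run one scoring task per design in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fold_and_score, designs))


def run_design_pipeline(count, threshold, max_rounds=2):
    """Fan out over designs, then refine the failing ones for up to max_rounds."""
    scored = fan_out(design_backbones(count))
    for _round in range(max_rounds):
        if all(score >= threshold for _id, score in scored):
            break  # nothing left to refine
        scored = [item if item[1] >= threshold else refine(item) for item in scored]
    return [design_id for design_id, score in scored if score >= threshold]


accepted = run_design_pipeline(count=3, threshold=80.0, max_rounds=1)
print(accepted)
```

The loop is where the pipeline stops being linear: designs that already pass the threshold are left alone, while the rest are sent through another refinement round until they pass or the round limit is reached.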

Sample Code

The repository here contains code that uses Nextflow to orchestrate protein design jobs on AWS Batch. In that example, you deploy a sample pipeline that first runs RFDesign to design novel protein structures and then performs a fan-out operation to run ESMFold on the generated structures.

Conclusion

Scale, ease of use, and reproducibility are among the most pressing needs in computational biology. The increasingly complex nature of orchestrating computational workloads (such as deep learning and generative AI approaches for macromolecule analysis) poses additional challenges as organizations strive for harmonization and standardization. In this blog post, we showed how Evolvere Biosciences deployed a customized architecture using AWS Batch and Nextflow to quickly and easily run its macromolecule design pipeline.

AWS HPC Blog