AWS for Industries

Predicting protein structures at scale using AWS Batch

Proteins are large biomolecules that play an important role in the body. Knowing the physical structure of proteins is key to understanding their function. However, it can be difficult and expensive to determine the structure of many proteins experimentally. One alternative is to predict these structures using machine learning algorithms. Several high-profile research teams have released such algorithms, including AlphaFold2 and RoseTTAFold, among others. Their work was important enough for Science magazine to name it the 2021 Breakthrough of the Year.

Both AlphaFold2 and RoseTTAFold use a multitrack transformer architecture trained on known protein templates to predict the structure of unknown peptide sequences. These predictions are heavily GPU-dependent and take anywhere from minutes to days to complete. The input features for these predictions include multiple sequence alignment (MSA) data. MSA algorithms are CPU-dependent and can themselves require several hours of processing time.

Running both the MSA and structure prediction steps in the same computing environment can be cost-inefficient because the expensive GPU resources required for the prediction sit unused while the MSA step runs. Instead, using a high-performance computing (HPC) service like AWS Batch allows us to run each step as a containerized job with the best fit of CPU, memory, and GPU resources.

In this post, we demonstrate how to provision and use AWS Batch and other services to run AI-driven protein folding algorithms like RoseTTAFold.

Previous blog posts have described how to install and run the AlphaFold2 workload on AWS using Amazon Elastic Compute Cloud (Amazon EC2) instances. This is a great solution for researchers who want to interact with protein-folding algorithms in a long-running, highly customizable environment. However, teams who wish to scale their protein structure predictions may prefer a service-oriented architecture.

This project uses a pair of AWS Batch computing environments to run the end-to-end RoseTTAFold algorithm. The first environment uses c4, m4, and r4 instances based on the vCPU and memory requirements specified in the job parameters. The second environment uses g4dn instances with NVIDIA T4 GPUs to balance performance, availability, and cost.
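
The CloudFormation template provisions both compute environments for you; conceptually, they resemble the following boto3 sketch. The environment names, vCPU limits, subnets, security groups, and instance profile shown here are placeholders rather than the stack's actual values.

import boto3

batch = boto3.client("batch")

# Placeholder networking and IAM values; the stack supplies its own.
common = {
    "subnets": ["subnet-aaaa1111"],
    "securityGroupIds": ["sg-bbbb2222"],
    "instanceRole": "arn:aws:iam::111122223333:instance-profile/ecsInstanceRole",
}

# CPU environment for the MSA and feature-generation jobs
batch.create_compute_environment(
    computeEnvironmentName="rosettafold-cpu",
    type="MANAGED",
    computeResources={
        "type": "EC2",
        "minvCpus": 0,
        "maxvCpus": 256,
        "instanceTypes": ["c4", "m4", "r4"],
        **common,
    },
)

# GPU environment for the structure prediction jobs
batch.create_compute_environment(
    computeEnvironmentName="rosettafold-gpu",
    type="MANAGED",
    computeResources={
        "type": "EC2",
        "minvCpus": 0,
        "maxvCpus": 64,
        "instanceTypes": ["g4dn"],
        **common,
    },
)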

A scientist creates structure prediction jobs using one of the two included Jupyter notebooks. AWS-RoseTTAFold.ipynb demonstrates how to submit a single analysis job and view the results. CASP14-Analysis.ipynb demonstrates how to submit multiple jobs at once using the CASP14 target list. In both of these cases, submitting a sequence for analysis creates two AWS Batch jobs. The first job uses the CPU computing environment to generate the MSA data and other features. The second job uses the GPU computing environment to make the structure prediction.
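
Under the hood, the second job is chained to the first with an AWS Batch job dependency, so the GPU job only starts once the CPU job succeeds. A minimal boto3 sketch of that pattern follows; the queue and job definition names are placeholders rather than the stack's actual resource names, and the notebooks' helper functions wrap this logic for you.

import boto3

batch = boto3.client("batch")

# Job 1: CPU-bound data preparation (MSA and other input features)
prep = batch.submit_job(
    jobName="rosettafold-data-prep",
    jobQueue="cpu-job-queue",                # placeholder
    jobDefinition="cpu-data-prep-job-def",   # placeholder
    containerOverrides={
        "resourceRequirements": [
            {"type": "VCPU", "value": "8"},
            {"type": "MEMORY", "value": "32000"},
        ]
    },
)

# Job 2: GPU-bound structure prediction, gated on successful completion of job 1
predict = batch.submit_job(
    jobName="rosettafold-predict",
    jobQueue="gpu-job-queue",                # placeholder
    jobDefinition="gpu-predict-job-def",     # placeholder
    dependsOn=[{"jobId": prep["jobId"], "type": "SEQUENTIAL"}],
)

print(prep["jobId"], predict["jobId"])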

The data preparation and structure prediction jobs use the same custom Docker image, based on the public NVIDIA CUDA image for Ubuntu 20. It includes the v1.1 release of the public RoseTTAFold repository, as well as additional scripts for integrating with AWS services. AWS CodeBuild automatically downloads this container definition and builds the required image during stack creation. You can make changes to this image by pushing to the AWS CodeCommit repository included in the stack.

Walkthrough

Deploy the infrastructure stack

  1. Choose Launch Stack:
  2. For Stack name enter a value unique to your account and Region.
  3. For Stack Availability Zone choose an Availability Zone.
  4. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  5. Choose Create stack.
  6. Wait approximately 30 minutes for AWS CloudFormation to create the infrastructure stack and AWS CodeBuild to build and publish the AWS-RoseTTAFold container to Amazon Elastic Container Registry (Amazon ECR).
  7. Load model weights and sequence database files.

Option 1: Mount the FSx for Lustre file system to an EC2 instance

  1. Sign in to the AWS Management Console and open the Amazon EC2 console at https://console.aws.amazon.com/ec2.
  2. In the navigation pane, under Instances, select Launch Templates.
  3. Choose the Launch template ID for your stack, such as aws-rosettafold-launch-template-stack-id-suffix.
  4. Choose Actions, Launch instance from template.
  5. Launch a new EC2 instance and connect using either SSH or SSM.
  6. Download and extract the network weights and sequence database files to the attached volume at /fsx/aws-rosettafold-ref-data according to installation steps 3 and 5 from the RoseTTAFold public repository.

Option 2: Load the data from an S3 data repository

  1. Create a new S3 bucket in your Region of interest.
  2. Download and extract the network weights and sequence database files as described above and transfer them to your S3 bucket.
  3. Sign in to the AWS Management Console and open the Amazon FSx for Lustre console at https://console.aws.amazon.com/fsx.
  4. Choose the File System name for your stack, such as aws-rosettafold-fsx-lustre-stack-id-suffix.
  5. On the file system details page, choose Data repository, then Create data repository association.
  6. For File system path enter /aws-rosettafold-ref-data.
  7. For Data repository path enter the S3 URL for your new S3 bucket.
  8. Choose Create.

Creating the data repository association immediately loads the file metadata into the file system. However, the data itself is not transferred until a job requests it, which adds several hours to the duration of the first job you submit. Subsequent jobs will complete much faster.
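
If you prefer to script this step, the same association can be created with the FSx API. A minimal sketch, assuming a hypothetical file system ID and bucket name:

import boto3

fsx = boto3.client("fsx")

# Programmatic equivalent of the console steps above (IDs are placeholders).
response = fsx.create_data_repository_association(
    FileSystemId="fs-0123456789abcdef0",
    FileSystemPath="/aws-rosettafold-ref-data",
    DataRepositoryPath="s3://your-reference-data-bucket",
    BatchImportMetaDataOnCreate=True,  # import file metadata as soon as the association is created
)
print(response["Association"]["AssociationId"])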

Once you have finished loading the model weights and sequence database files, the FSx for Lustre file system will include the following files:

/fsx
└── /aws-rosettafold-ref-data
    ├── /bfd
    │   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata (~1.4 TB)
    │   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex (~2 GB)
    │   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata (~16 GB)
    │   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex (~2 GB)
    │   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata (~300 GB)
    │   └── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex (~120 MB)
    ├── /pdb100_2021Mar03
    │   ├── LICENSE (~20 KB)
    │   ├── pdb100_2021Mar03_a3m.ffdata (~630 GB)
    │   ├── pdb100_2021Mar03_a3m.ffindex (~4 MB)
    │   ├── pdb100_2021Mar03_cs219.ffdata (~40 MB)
    │   ├── pdb100_2021Mar03_cs219.ffindex (~3 MB)
    │   ├── pdb100_2021Mar03_hhm.ffdata (~7 GB)
    │   ├── pdb100_2021Mar03_hhm.ffindex (~3 GB)
    │   ├── pdb100_2021Mar03_pdb.ffdata (~26 GB)
    │   └── pdb100_2021Mar03_pdb.ffindex (~4 MB)
    ├── /UniRef30_2020_06
    │   ├── UniRef30_2020_06_a3m.ffdata (~140 GB)
    │   ├── UniRef30_2020_06_a3m.ffindex (~670 MB)
    │   ├── UniRef30_2020_06_cs219.ffdata (~6 GB)
    │   ├── UniRef30_2020_06_cs219.ffindex (~600 MB)
    │   ├── UniRef30_2020_06_hhm.ffdata (~34 GB)
    │   ├── UniRef30_2020_06_hhm.ffindex (~19 MB)
    │   └── UniRef30_2020_06.md5sums (< 1 KB)
    └── /weights
        ├── RF2t.pt (~126 MB)
        ├── Rosetta-DL_LICENSE.txt (~3 KB)
        ├── RoseTTAFold_e2e.pt (~530 MB)
        └── RoseTTAFold_pyrosetta.pt (~506 MB)

Submit structure prediction jobs from Jupyter

  1. Clone the CodeCommit repository created by CloudFormation to a Jupyter Notebook environment of your choice.
  2. Use the AWS-RoseTTAFold.ipynb and CASP14-Analysis.ipynb notebooks to submit protein sequences for analysis.
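
Both notebooks track job progress for you, but you can also check on a submitted job directly with the AWS Batch API. A short sketch, assuming you have the job IDs returned when the jobs were submitted (the IDs below are placeholders):

import boto3

batch = boto3.client("batch")

def job_status(job_id):
    """Return the current AWS Batch status for a single job ID."""
    job = batch.describe_jobs(jobs=[job_id])["jobs"][0]
    return job["status"]  # e.g., RUNNABLE, RUNNING, SUCCEEDED, or FAILED

# Check the data-prep and prediction jobs for one submitted sequence.
for job_id in ["data-prep-job-id", "predict-job-id"]:
    print(job_id, job_status(job_id))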

Discussion

Data storage requirements

File system performance is one of the key challenges to running MSA algorithms in parallel. To address this, we use an Amazon FSx for Lustre file system to store the necessary sequence databases. When AWS Batch launches a new compute instance, it mounts the FSx file system in seconds. FSx then provides high-throughput access to the necessary data.

Please note that the template linked above creates a file system with 1200 MB/s total throughput, which can support dozens of simultaneous jobs. However, if your use case only requires one or two jobs at a time, you can modify the template to save cost. In this case, we recommend decreasing the throughput per unit of storage on the FSx for Lustre resource from 1000 to 500 MB/s/TiB.

Prediction performance

The RoseTTAFold paper reports requiring around 10 minutes on an RTX2080 GPU to generate structure predictions for proteins with less than 400 residues. We saw similar or better performance using Amazon EC2 G4dn instances with NVIDIA T4 GPUs.

For proteins with more than 400 residues, we recommend running the prediction jobs on instance types without GPUs. You can do this in the linked notebooks by updating the predict_job_definition and predict_queue parameters from this:

two_step_response = rfutils.submit_2_step_job(
    ...
    predict_job_definition=gpu_predict_job_def,
    predict_queue=gpu_queue,
    ...
)

To this:

two_step_response = rfutils.submit_2_step_job(
    ...
    predict_job_definition=cpu_predict_job_def,
    predict_queue=cpu_queue,
    ...
)

Support for other algorithms

Splitting the analysis workload into two separate jobs makes it easier to incorporate other algorithms for generating features and predicting structures. For example, you can replace hhblits with an alternative alignment algorithm like MMseqs2 by updating the run_aws_data_prep_ver.sh script. The ParallelFold project is a good example of how to apply a similar multistep approach to the AlphaFold2 workflow. In this case, the featurization and prediction steps could each be containerized and run as AWS Batch jobs.
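
If an alternative feature-generation script is added to the container image, a data-prep job can also be pointed at it through a container command override rather than a new job definition. The following is only an illustration; the script name, arguments, queue, and job definition are hypothetical:

import boto3

batch = boto3.client("batch")

# Hypothetical override: run an alternative alignment/feature script that has
# been added to the same container image.
batch.submit_job(
    jobName="rosettafold-data-prep-mmseqs2",
    jobQueue="cpu-job-queue",                 # placeholder
    jobDefinition="cpu-data-prep-job-def",    # placeholder
    containerOverrides={
        "command": [
            "bash",
            "run_alternative_data_prep.sh",   # hypothetical script
            "input.fa",
            "/fsx/aws-rosettafold-ref-data",
        ]
    },
)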

Cleaning up

  1. Sign in to the AWS Management Console and open the CloudFormation console at https://console.aws.amazon.com/cloudformation.
  2. Choose the Stack name associated with your stack.
  3. Choose Delete.

Conclusion

In this post, we demonstrated how to use AWS Batch and Amazon FSx for Lustre to improve the performance efficiency and cost of protein folding algorithms like RoseTTAFold. A template for deploying the AlphaFold2 algorithm on AWS Batch is now available at https://github.com/aws-samples/aws-batch-architecture-for-alphafold.

To learn more about how AWS supports life science organizations with high-throughput modeling and screening, visit aws.amazon.com/health/biopharma/solutions/.

Brian Loyal

Brian Loyal is a Senior AI/ML Solutions Architect in the Global Healthcare and Life Sciences team at Amazon Web Services. He has more than 16 years of experience in biotechnology and machine learning and is passionate about helping customers solve genomic and proteomic challenges. In his spare time, he enjoys cooking and eating with his friends and family.

Scott Schreckengaust

Scott has a degree in biomedical engineering and has been inventing devices alongside scientists on the bench since the beginning of his career. He loves science, technology, and engineering, with decades of experience in organizations ranging from startups to large multinational companies in the Healthcare and Life Sciences domain. Scott is comfortable scripting robotic liquid handlers, programming instruments, integrating homegrown systems into enterprise systems, and developing complete software deployments from scratch in regulated environments. Besides helping people out, he thrives on building: he enjoys the journey of hashing out customers' scientific workflows and their issues, then converting those into viable solutions.