
Running protein structure prediction at scale using a web interface for researchers

This post was contributed by Chiaki Ishio, Solutions Architect; Daisuke Miyamoto, Senior Specialist Solutions Architect, Compute/HPC; Shingo Chiyoda, Solutions Architect; and Daiki Kuriyama, Senior Prototyping Engineer, all of AWS Japan.

Introduction

Unraveling a protein’s structure is crucial not only for understanding its function, but also for drug discovery. Protein structure prediction tools like AlphaFold2 and OpenFold have become indispensable to researchers since their introduction in 2021. However, researchers may find it complicated to build an environment to run this software and to manage the underlying hardware, and IT admins may find it laborious to secure and control the computing resources those researchers need.

In this blog post, we present a sample implementation that addresses these problems. It consists of a purpose-built web frontend and a cloud HPC backend. Using this implementation, researchers can run protein structure prediction jobs on the backend HPC cluster through an easy-to-use web frontend.

IT admins can reduce the time and effort required to prepare computing resources by taking advantage of a scalable HPC environment, and they can easily manage permissions by restricting user access to the web interface. In this way, the cloud is an especially productive environment for HPC, because we can combine a variety of interfaces with serious computing horsepower, depending on the purpose and our users’ needs.

Overview of our sample application

In this section, we’ll take a look at a sample application published in AWS Samples (see the link for deployment instructions).

This implementation provides a web application, as shown in Figure 1. Through this web frontend, you can easily run three-dimensional protein structure prediction using AlphaFold2. To run a job, enter FASTA-format text in the field at the top of the screen and click the ‘Create Job’ button. The job execution status is displayed in the list below the text field. You can also visualize the results of completed jobs and download them in PDB format.

Figure 1 – A sample web frontend application. Researchers can enter FASTA format text into a simple form and obtain results of the protein structure prediction following a short wait.

Architecture: HPC cluster integrated with web frontend

What is happening behind the scenes when you use the web frontend above? We’ll illustrate the process using an architectural diagram (Figure 2). Note that the numbers in the following text match the numbers in the diagram.

When you interact with the web frontend, it sends an instruction to the compute environment via an API (1). For example, when you start a job with FASTA-format text, the text is stored as a file in an S3 bucket via Amazon API Gateway and AWS Lambda (2). Then, via AWS Systems Manager, the AlphaFold2 script is executed on the HPC cluster managed by AWS ParallelCluster (3). The databases required to run AlphaFold2 are stored in Amazon FSx for Lustre, a distributed file system that offers rapid access from the HPC cluster (4). Since FSx for Lustre synchronizes bi-directionally with the S3 bucket (5), FASTA files in the S3 bucket can be accessed from the Lustre file system. Finally, we use Slurm to manage jobs in ParallelCluster, and Amazon Aurora Serverless v1 to store the job execution history (6).

Figure 2 — Architectural diagram of a web application for predicting protein structures. When a FASTA file is submitted from the web frontend, the job is submitted to AWS ParallelCluster via Amazon API Gateway/AWS Lambda.
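
To make this flow concrete, here’s a minimal sketch (not the actual sample code) of what the job-submission Lambda function in steps (2) and (3) might look like, written in Python with boto3. The bucket name, head node instance ID, request shape, and wrapper script path are all hypothetical placeholders.

```python
import json
import uuid

import boto3

s3 = boto3.client("s3")
ssm = boto3.client("ssm")

BUCKET = "protein-jobs-bucket"        # assumption: the bucket synced with FSx for Lustre
HEAD_NODE_ID = "i-0123456789abcdef0"  # assumption: the ParallelCluster head node


def handler(event, context):
    # Request body shape is an assumption: {"fasta": "..."}
    fasta_text = json.loads(event["body"])["fasta"]
    job_id = str(uuid.uuid4())

    # (2) Store the FASTA file in the S3 bucket; the S3/Lustre synchronization
    # makes it visible to the cluster under the /fsx mount.
    key = f"inputs/{job_id}.fasta"
    s3.put_object(Bucket=BUCKET, Key=key, Body=fasta_text.encode("utf-8"))

    # (3) Use SSM Run Command to submit the job to Slurm on the head node.
    # "run_alphafold.sh" is a hypothetical wrapper around the AlphaFold2 script.
    ssm.send_command(
        InstanceIds=[HEAD_NODE_ID],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": [f"sbatch /fsx/scripts/run_alphafold.sh /fsx/{key}"]},
    )

    return {"statusCode": 200, "body": json.dumps({"jobId": job_id})}
```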

We’ve walked through the detailed architecture, but application users don’t need to be aware of the mechanisms behind the scenes. If you’re a researcher who wants to try protein structure prediction on the cloud but has no access to the AWS Management Console in your organization, you can still use this web application.

Best of all, if you’re not familiar with HPC operations, you can still run AlphaFold2 jobs via the GUI.

Deep dive on components in the architecture

So far, we’ve explained the architecture at a high level. In this section, we’ll take a closer look at some of the AWS services and components in this application.

AWS CDK: The entire web application is implemented using the AWS Cloud Development Kit (CDK), a tool that allows you to define a cloud application as code. If you’re the IT admin, you can build this almost automatically: first, deploy the backend and frontend with the CDK and connect the two, then launch an HPC cluster and configure a protein structure prediction tool (detailed steps are described in the sample repository) to complete the entire application.
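
As an illustration, the frontend-facing pieces could be wired together with a CDK stack along these lines (a minimal sketch in Python with CDK v2; the sample repository’s actual constructs and naming will differ):

```python
from aws_cdk import Stack, aws_apigateway as apigw, aws_lambda as _lambda, aws_s3 as s3
from constructs import Construct


class ProteinJobStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # S3 bucket for FASTA inputs and results, synchronized with FSx for Lustre
        bucket = s3.Bucket(self, "JobBucket")

        # Lambda function that stores the FASTA file and submits the Slurm job
        submit_fn = _lambda.Function(
            self,
            "SubmitJob",
            runtime=_lambda.Runtime.PYTHON_3_9,
            handler="index.handler",
            code=_lambda.Code.from_asset("lambda"),
            environment={"BUCKET": bucket.bucket_name},
        )
        bucket.grant_write(submit_fn)

        # REST API that the web frontend calls
        apigw.LambdaRestApi(self, "JobApi", handler=submit_fn)
```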

AWS ParallelCluster: Now, we’ll take a closer look at the options for building the HPC cluster environment. Both AWS ParallelCluster and AWS Batch are suitable compute environments for running protein structure prediction software like the one in this blog post. Here’s a brief recap of each service: ParallelCluster makes it easy to deploy and manage HPC clusters on AWS, and is recommended for those who are familiar with HPC cluster environments. Meanwhile, AWS Batch allows you to run batch computing jobs at any scale, automatically provisioning computing resources, and is recommended for users who are comfortable working with containers. Given these characteristics, we chose ParallelCluster for its flexibility: new software beyond AlphaFold2 or ColabFold won’t always support container environments immediately after release, and ParallelCluster handles those cases well.
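
To show what this looks like in practice, here’s a minimal ParallelCluster 3 configuration sketch with a Slurm scheduler, a GPU queue, and an FSx for Lustre file system. The subnet IDs, instance types, and sizes are placeholder assumptions, not the sample’s actual settings.

```yaml
Region: us-east-1
Image:
  Os: alinux2
HeadNode:
  InstanceType: t3.medium
  Networking:
    SubnetId: subnet-0123456789abcdef0   # placeholder
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: gpu
      ComputeResources:
        - Name: g4dn
          InstanceType: g4dn.2xlarge
          MinCount: 0   # scale to zero when no jobs are queued
          MaxCount: 4
      Networking:
        SubnetIds:
          - subnet-0123456789abcdef0     # placeholder
SharedStorage:
  - MountDir: /fsx
    Name: fsx
    StorageType: FsxLustre
    FsxLustreSettings:
      StorageCapacity: 1200      # GiB
      DataCompressionType: LZ4   # the compression setting discussed in the cost section
```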

Cost: Cost is also an important consideration for any architecture. The sample application described above incurs two types of cost: recurring, and per-job execution. The breakdown of each is described below (note that these calculations are based on US East (N. Virginia) Region pricing as of this post’s publication date).

First, the recurring cost is about $290 per month, mostly due to FSx for Lustre. In this sample application, we enabled the data compression setting to minimize the cost of FSx for Lustre. The remainder of the cost is associated with ParallelCluster: an EC2 instance acting as the head node, a NAT gateway, and the Aurora Serverless v1 database that stores the job history. If you choose AWS Batch instead of ParallelCluster for your HPC environment, you could reduce the recurring cost associated with ParallelCluster’s head node.

Next, let’s look at the cost incurred per job execution. For example, a g4dn.2xlarge GPU instance costs $0.752 per hour, and the total compute cost is proportional to the time the job takes to run. Note that unit pricing for AWS services, including EC2 instances, varies by Region, so you may be able to optimize costs by choosing a suitable Region.
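
To make that concrete: at that rate, a single prediction job that keeps a g4dn.2xlarge busy for three hours costs roughly 3 × $0.752 ≈ $2.26 in compute, on top of the recurring costs described above.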

Conclusion

In this blog post, we described a sample implementation of a web application that runs protein structure prediction jobs with AlphaFold2 via a GUI, using AWS ParallelCluster as the backend. The application is designed to be easy for researchers to use and easy for IT admins to manage.

Also, as we discussed, this implementation isn’t limited to AlphaFold2: you can replace the backend software with alternatives like OmegaFold or OpenFold, and you can customize and extend the frontend, for example by offering users a choice of which software to use for their protein structure prediction jobs.

If you’re an IT admin, you might want to add a user-management mechanism with Amazon Cognito or a cost-capping mechanism using AWS Budgets. AWS lets you combine these services to build an HPC environment as a complete solution. We hope you’ll bring your own ideas to this implementation and build an even better research environment.

For more background on protein folding generally, check out some of our other posts, including “Predicting protein structures at scale using AWS Batch” (an architecture for running RoseTTAFold on AWS Batch) and the post on optimizing protein folding costs with OpenFold on AWS Batch (benchmark results).

Acknowledgements

Special thanks go to the following members at AWS Japan for their contributions in building the sample application presented in this post: Saki Ito (Solutions Architect), Tomochika Kato (Solutions Architect), Fuminori Abe (Solutions Architect), Hitoshi Anji (Manager, Solutions Architect), Hokuto Akimoto (Account Manager), Takehiro Suzuki (Prototyping Engineer), and Hiroshi Kobayashi (HPC Specialist Solutions Architect).

Chiaki Ishio

Chiaki Ishio is a Solutions Architect at Amazon Web Services. She helps customers in the pharmaceutical industry solve their business problems using AWS.

Shingo Chiyoda

Shingo Chiyoda is a Solutions Architect at Amazon Web Services. He helps enterprise customers in the retail and CPG industries expand their business using AWS.

Daiki Kuriyama

Daiki Kuriyama is a Senior Prototyping Engineer at Amazon Web Services. He works with customers to build system prototypes using AWS.

Daisuke Miyamoto

Daisuke Miyamoto is a Senior Compute/HPC Specialist Solutions Architect at Amazon Web Services.