Guidance for Protein Folding on AWS
Overview
This Guidance helps researchers run a diverse catalog of protein folding and design algorithms on AWS Batch. Knowing the physical structure of proteins is an important part of the drug discovery process. Machine learning (ML) algorithms significantly reduce the cost and time needed to generate usable protein structures.
These ML systems have also inspired the development of artificial intelligence (AI)-driven algorithms for de novo protein design and protein-ligand interaction analysis. This Guidance lets researchers quickly add support for new protein analysis algorithms while optimizing cost and maintaining performance.
How it works
The architecture diagram illustrates how to use this solution effectively. It shows the key components and their interactions, giving a step-by-step overview of the architecture's structure and functionality.
Well-Architected Pillars
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
Operational Excellence
Customers deploy architecture components using CloudFormation. Solution changes are tested and deployed using GitLab pipelines. Customers can submit jobs and process the results through a Python software development kit (SDK), including jobs from Jupyter notebooks. Jobs write all results and metrics to Amazon S3.
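The job submission step described above can be sketched with boto3 directly. This is a minimal illustration, not the Guidance's own Python SDK: the function names, the environment variable names (`INPUT_FASTA_S3_URI`, `OUTPUT_S3_URI`), and the queue and job definition names are all hypothetical. Only the `submit_job` call and its parameter names come from the AWS Batch API.

```python
def build_folding_job_request(job_name, job_queue, job_definition,
                              fasta_s3_uri, output_s3_uri):
    """Build keyword arguments for the AWS Batch SubmitJob API.

    The environment variable names are assumptions for illustration; a real
    container would define its own input and output conventions.
    """
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "containerOverrides": {
            "environment": [
                {"name": "INPUT_FASTA_S3_URI", "value": fasta_s3_uri},
                {"name": "OUTPUT_S3_URI", "value": output_s3_uri},
            ]
        },
    }


def submit(request):
    """Submit the request to AWS Batch and return the job ID.

    boto3 is imported here so the request builder above can be used and
    tested without AWS credentials.
    """
    import boto3

    batch = boto3.client("batch")
    response = batch.submit_job(**request)
    return response["jobId"]
```

From a Jupyter notebook, a call might look like `submit(build_folding_job_request("fold-demo", "example-queue", "example-job-definition", "s3://example-bucket/in/target.fasta", "s3://example-bucket/out/fold-demo/"))`, with every name replaced by the values from your own deployment.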
Security
All analysis jobs run within private subnets and use minimal AWS Identity and Access Management (IAM) policies to manage access to AWS services. All data is encrypted at rest and in transit. Amazon S3 data transfer occurs through a VPC endpoint.
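As an illustration of what a minimal IAM policy for a job role could look like, the sketch below builds a policy document that allows reading and writing objects only under a single Amazon S3 prefix. The bucket name, prefix, and function name are hypothetical, and the policies shipped with the Guidance may be scoped differently.

```python
import json


def build_job_s3_policy(bucket, prefix):
    """Return an IAM policy document limited to one S3 bucket prefix.

    The bucket and prefix are placeholders; this is a sketch of
    least-privilege scoping, not the Guidance's actual policy.
    """
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Object-level access only under the job prefix.
                "Sid": "ReadWriteJobObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
            {
                # Listing is a bucket-level action, so restrict it by prefix.
                "Sid": "ListJobPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": f"{prefix}/*"}},
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_job_s3_policy("example-folding-bucket", "jobs"), indent=2))
```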
Reliability
Analysis algorithms are split into independent containers and Python classes so that each can be run and updated on its own. AWS Batch automatically provides job retry logic. Job inputs and outputs are stored in Amazon S3. Additionally, the CloudFormation template provisions a data repository attached to the FSx for Lustre file system, so reference data can be restored rapidly.
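The retry logic mentioned above is configured through the `retryStrategy` parameter of an AWS Batch job or job definition. The sketch below shows one commonly used shape: retry only when the host instance was reclaimed or terminated, and exit on any other failure. The function name and the default of three attempts are assumptions, not values taken from the Guidance.

```python
def build_retry_strategy(max_attempts=3):
    """Return an AWS Batch retryStrategy that retries host-level failures only.

    AWS Batch accepts between 1 and 10 attempts. Rules in evaluateOnExit are
    checked in order, so the catch-all EXIT rule goes last.
    """
    if not 1 <= max_attempts <= 10:
        raise ValueError("AWS Batch allows between 1 and 10 attempts")
    return {
        "attempts": max_attempts,
        "evaluateOnExit": [
            # Status reasons starting with "Host EC2" indicate the instance
            # went away (for example, a Spot interruption).
            {"onStatusReason": "Host EC2*", "action": "RETRY"},
            # Anything else is treated as an application failure.
            {"onReason": "*", "action": "EXIT"},
        ],
    }
```

The returned dictionary can be passed as `retryStrategy=build_retry_strategy()` to the boto3 Batch client's `submit_job` or `register_job_definition` calls.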
Performance Efficiency
Protein folding algorithms require large sequence databases for data preparation, and a single job can take from several minutes to several hours to finish. AWS Batch supports FSx for Lustre mounts and extended run times. Both AWS Batch and Amazon FSx for Lustre support high performance computing (HPC) use cases, such as protein folding, that have high input/output (I/O) requirements.
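To make the mount and run-time settings concrete, here is a sketch of the keyword arguments for the AWS Batch `register_job_definition` call. It assumes the FSx for Lustre file system is already mounted on the compute host at `/fsx` (for example, through a launch template) and is exposed to the container as a read-only host volume. The paths, volume name, image, and sizing values are hypothetical.

```python
def build_job_definition(name, image, vcpus, memory_mib, timeout_hours,
                         host_fsx_path="/fsx", container_db_path="/database"):
    """Build kwargs for register_job_definition with a host volume and timeout.

    host_fsx_path is where the Lustre file system is assumed to be mounted on
    the instance; container_db_path is where the job sees the databases.
    """
    timeout_seconds = int(timeout_hours * 3600)
    if timeout_seconds < 60:
        raise ValueError("AWS Batch requires a timeout of at least 60 seconds")
    return {
        "jobDefinitionName": name,
        "type": "container",
        "containerProperties": {
            "image": image,
            # AWS Batch expects resource values as strings.
            "resourceRequirements": [
                {"type": "VCPU", "value": str(vcpus)},
                {"type": "MEMORY", "value": str(memory_mib)},
            ],
            "volumes": [
                {"name": "reference-data", "host": {"sourcePath": host_fsx_path}}
            ],
            "mountPoints": [
                {
                    "sourceVolume": "reference-data",
                    "containerPath": container_db_path,
                    "readOnly": True,
                }
            ],
        },
        # Applies to each attempt, so long-running folding jobs need headroom.
        "timeout": {"attemptDurationSeconds": timeout_seconds},
    }
```

Mounting the reference data read-only is a design choice in this sketch: many jobs can share the same sequence databases without any job being able to modify them.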
Cost Optimization
AWS Batch automatically de-provisions compute resources when jobs are finished. Customers can use Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances, which offer up to a 90% discount compared to On-Demand Instances, and AWS Graviton-enabled instance types for some jobs. AWS Graviton instances are optimized for cloud workloads and can deliver up to 40% better price performance than comparable current-generation x86-based instances.
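The effect of the Spot discount on a single job can be worked through with simple arithmetic. The sketch below uses a made-up hourly rate and the 90% upper bound quoted above; actual Spot prices vary by instance type, Region, and time, so treat the result as a best case rather than an estimate for any real workload.

```python
def estimate_job_cost(hours, on_demand_hourly_usd, spot_discount=0.0):
    """Estimate compute cost for one job.

    spot_discount is the fractional discount relative to On-Demand pricing
    (0.0 means On-Demand, 0.9 means a 90% discount).
    """
    if not 0.0 <= spot_discount <= 1.0:
        raise ValueError("spot_discount must be between 0 and 1")
    return hours * on_demand_hourly_usd * (1.0 - spot_discount)


if __name__ == "__main__":
    # Hypothetical job: 4 hours on an instance priced at $3.00 per hour.
    on_demand = estimate_job_cost(4, 3.00)
    best_case_spot = estimate_job_cost(4, 3.00, spot_discount=0.9)
    print(f"On-Demand: ${on_demand:.2f}, best-case Spot: ${best_case_spot:.2f}")
```

With these illustrative numbers, the job costs $12.00 On-Demand and about $1.20 at the maximum Spot discount. The Graviton figure is a price-performance claim rather than a price discount, so it is not modeled here.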
Sustainability
AWS Batch automatically scales compute resources to handle jobs in a managed queue. This architecture includes benchmarking results and default parameters to minimize hardware resources.