AWS for Industries
Genentech sustainably generates a large dataset on AWS to advance machine learning-based research on a potential new drug modality
Genentech is a biotechnology company that pioneered the biotech industry and revolutionized how some of the world’s most complex health problems are treated. Today, as a member of the Roche Group, the company remains dedicated to pursuing breakthrough research, developing life-changing medicines, unlocking advances in data and technology, and partnering across society to take on systemic issues that stand in the way of better healthcare for all.
Genentech is interested in macrocyclic peptides, an emerging class of therapeutics in drug discovery. They are attractive to researchers because they can bind to difficult-to-drug protein surfaces, offer unique pharmacological properties, and are readily synthesized. Despite these advantages, optimizing the properties of macrocyclic peptides, including cell permeability, remains challenging because those properties are strongly influenced by the peptides’ diverse and highly dynamic three-dimensional structures. Modeling the full set of possible structures, known as the conformational ensemble, is computationally challenging due to the flexibility constraints imposed by cyclization and the large number of rotatable bonds. In machine learning for drug discovery, large computational datasets have become common for small molecules and proteins, but no large-scale dataset previously existed for macrocyclic peptide conformations.
To enable better modeling of macrocycle binding and permeability, Genentech set out to generate a large, diverse, and accurate dataset of macrocyclic peptide conformational ensembles. This novel computational dataset, CREMP (Conformer-Rotamer Ensembles of Macrocyclic Peptides), was generated with CREST (Conformer-Rotamer Ensemble Sampling Tool), an open source package from the Grimme group. CREST combines an iterative metadynamics algorithm based on semi-empirical quantum mechanical calculations with a genetic structure-crossing algorithm to explore diverse geometries, and it provides better geometries and energy estimates than classical force fields. Initial explorations by researchers had demonstrated the potential for CREST to generate diverse macrocycle ensembles that recapitulate key intramolecular hydrogen bonds, and had shown the feasibility of ring interconversion. A dataset with these properties is expected to help in the development of novel therapeutics.
Generating the CREMP dataset was a CPU-intensive task requiring millions of vCPU hours. AWS Batch was chosen for this task because it can run hundreds of thousands of batch computing jobs while optimizing compute resources. AWS Batch dynamically provisions and scales compute resources based on Amazon Elastic Container Service (ECS), with the option to use On-Demand or Spot Instances depending on job requirements. To run as AWS Batch jobs, customizable and scalable code incorporating CREST had to be packaged as a Docker container. Amazon Elastic Container Registry (ECR) was used as the repository to store and serve the CREST container image. Amazon Simple Storage Service (S3), an object storage service, stored the generated data.
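To make the job-submission pattern more concrete, here is a minimal Python sketch of how a workload like this could be split into AWS Batch array jobs. It is an illustration only, not Genentech's actual code: the queue name, job definition name, job names, and the MOLECULE_OFFSET environment variable are all hypothetical, and the one-molecule-per-child-job layout is an assumption. The sketch only builds the request dictionaries, so it runs without AWS credentials.

```python
from typing import Any

# AWS Batch array jobs accept between 2 and 10,000 child jobs each.
MAX_ARRAY_SIZE = 10_000


def build_array_job_requests(
    total_molecules: int,
    job_queue: str = "cremp-graviton-queue",     # hypothetical queue name
    job_definition: str = "crest-arm64-jobdef",  # hypothetical job definition
) -> list[dict[str, Any]]:
    """Split the molecules into array-job requests, one child job per molecule."""
    requests: list[dict[str, Any]] = []
    for offset in range(0, total_molecules, MAX_ARRAY_SIZE):
        size = min(MAX_ARRAY_SIZE, total_molecules - offset)
        request: dict[str, Any] = {
            "jobName": f"crest-conformers-{offset}",
            "jobQueue": job_queue,
            "jobDefinition": job_definition,
            # Hypothetical variable telling the container where its chunk starts.
            "containerOverrides": {
                "environment": [{"name": "MOLECULE_OFFSET", "value": str(offset)}]
            },
        }
        # A single leftover molecule is submitted as a plain job, because
        # array jobs need at least two children.
        if size >= 2:
            request["arrayProperties"] = {"size": size}
        requests.append(request)
    return requests


requests = build_array_job_requests(36_198)
sizes = [r.get("arrayProperties", {}).get("size", 1) for r in requests]
print(len(requests), sizes)  # 4 [10000, 10000, 10000, 6198]
```

Each dictionary could then be passed to the boto3 Batch client as `client.submit_job(**request)`. Inside the container, a child job could add the `AWS_BATCH_JOB_ARRAY_INDEX` environment variable that AWS Batch sets for array children to the offset to find the molecule it should process.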
Genentech has a sustainability goal of reducing its greenhouse gas emissions by 75 percent between 2020 and 2029, relative to 2019, without impacting the pace of innovation in drug discovery. To support this goal, Graviton-based Amazon Elastic Compute Cloud (EC2) instances were used to generate the CREMP dataset. Graviton2-based compute-optimized C6g instances were chosen initially because they provide up to 40 percent better price performance than comparable fifth-generation x86-based instances in Amazon EC2. Because Graviton2 is ARM-based, this decision meant that a new Docker image for the CREST-based code had to be built for that architecture. Genentech later switched to Graviton3-based C7g instances, which offered 25 percent higher performance than C6g instances and consume 60 percent less energy for the same performance than comparable Amazon EC2 instances. The AWS US West (Oregon) Region, which Genentech used to generate the CREMP dataset, is also powered by 100 percent renewable energy. Together, the energy-efficient Graviton2 and Graviton3-based EC2 instances and the choice of the AWS US West (Oregon) Region contributed to Genentech’s sustainability goal.
Using AWS Batch with Graviton2 and Graviton3-based Amazon EC2 instances and Amazon S3 made it possible to scale the calculations to more than 10,000 parallel jobs, which collectively used 3.9 million vCPU hours to generate the CREMP dataset in two weeks. The following diagram shows the architecture used to generate the CREMP dataset. Following guidelines from the security pillar of the AWS Well-Architected Framework, this architecture used basic image scanning, which is enabled by default in Amazon ECR, to scan the operating system (OS) packages in the CREST container image for Common Vulnerabilities and Exposures (CVE), a public list of known security threats. Other security controls included applying the principle of least privilege, disabling public access to Amazon S3 buckets, enabling server-side encryption for data at rest in Amazon S3, and configuring AWS Batch to launch EC2 instances in a private subnet to isolate traffic from the internet.
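As a rough illustration of the Amazon S3 controls listed above, the following Python sketch builds the request payloads that the boto3 S3 calls `put_public_access_block` and `put_bucket_encryption` accept, along with a narrowly scoped IAM policy document. The bucket name and key prefix are hypothetical, and the choice of SSE-S3 (AES256) is an assumption, since the article does not say which server-side encryption option was used.

```python
import json

BUCKET = "example-cremp-output"  # hypothetical bucket name


def public_access_block() -> dict:
    """Payload for s3.put_public_access_block: turn off all public access."""
    return {
        "Bucket": BUCKET,
        "PublicAccessBlockConfiguration": {
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    }


def default_encryption() -> dict:
    """Payload for s3.put_bucket_encryption (SSE-S3 is assumed here)."""
    return {
        "Bucket": BUCKET,
        "ServerSideEncryptionConfiguration": {
            "Rules": [
                {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
            ]
        },
    }


def job_role_policy(prefix: str = "conformers/") -> dict:
    """Least-privilege sketch: jobs may only write objects under one prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": f"arn:aws:s3:::{BUCKET}/{prefix}*",
            }
        ],
    }


print(json.dumps(job_role_policy(), indent=2))
```

With a boto3 S3 client, the first two payloads could be applied as `s3.put_public_access_block(**public_access_block())` and `s3.put_bucket_encryption(**default_encryption())`. The policy document would be attached to the role assumed by the batch jobs.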
This computational workflow generated conformational ensembles for 36,198 macrocyclic peptide molecules, encompassing nearly 31.3 million unique conformers with energy annotations. This first-of-its-kind dataset is expected to pave the way for improved computational macrocyclic peptide design and to play an important part in advancing machine learning-based research to develop novel therapeutics.
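The headline figures can be put in perspective with some back-of-the-envelope arithmetic. The Python sketch below derives averages from the numbers quoted in this post. It assumes that "two weeks" means 14 full days of wall-clock time and that usage was spread evenly, neither of which the post confirms, so the results are rough estimates rather than reported measurements.

```python
VCPU_HOURS = 3.9e6          # total compute reported in this post
WALL_CLOCK_HOURS = 14 * 24  # assumption: "two weeks" = 14 full days
MOLECULES = 36_198          # macrocyclic peptides in CREMP
CONFORMERS = 31.3e6         # unique conformers in CREMP

# Average number of vCPUs that must be busy to finish in the assumed window.
avg_vcpus = VCPU_HOURS / WALL_CLOCK_HOURS

# Average ensemble size and compute cost per molecule.
conformers_per_molecule = CONFORMERS / MOLECULES
vcpu_hours_per_molecule = VCPU_HOURS / MOLECULES

print(f"average vCPUs in use:    {avg_vcpus:,.0f}")
print(f"conformers per molecule: {conformers_per_molecule:,.0f}")
print(f"vCPU hours per molecule: {vcpu_hours_per_molecule:,.0f}")
```

Under these assumptions, the run works out to roughly 11,600 vCPUs busy on average, about 865 conformers per molecule, and about 108 vCPU hours per molecule.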
The following diagram is a simplified representation of the conformer ensemble of one of the cyclic peptides.
Figure: Macrocyclic peptide and a small ensemble generated with CREST
Conclusion
Genentech has made the CREMP dataset publicly available to provide the research community with a valuable data resource to train and develop machine learning models, and anticipates that it will find broad utility for learning models capable of predicting macrocyclic peptide properties and structure, such as binding affinity, stability, and permeability, among others. Genentech is hoping that the insights gained from these models can be used for rational design of novel macrocycles with improved therapeutic potential. It is also anticipated that this work will not only accelerate the development of new macrocyclic therapeutics but also contribute to a deeper understanding of the factors governing their conformational behavior, paving the way for more efficient computational approaches in the future.
Further Reading
Here are some resources to learn more about HPC for Life Sciences research, drug discovery, sustainability, and Graviton.