AWS HPC Blog

Accelerate drug discovery with NVIDIA BioNeMo Framework on Amazon EKS

This post was contributed by Doruk Ozturk and Ankur Srivastava at AWS, and Neel Patel at NVIDIA.

Introduction

Drug discovery is a long and expensive process. Pharmaceutical companies must sift through thousands of compound possibilities to find potential new drugs to treat diseases. This process takes multiple years and costs billions of dollars, with the majority of the candidates failing during clinical trials.

As generative artificial intelligence (generative AI) continues to transform industries, the life sciences sector is leveraging these advanced technologies to accelerate drug discovery. Generative AI tools powered by deep learning models make it possible to analyze massive datasets, identify patterns, and generate insights to aid the search for new drug compounds. However, running these generative AI workloads requires a full-stack approach that combines robust computing infrastructure with optimized domain-specific software that can accelerate time to solution.

In this blog post, we’ll show you how to leverage the NVIDIA BioNeMo platform on Amazon Elastic Kubernetes Service (Amazon EKS) to accelerate drug discovery by using generative AI and other machine learning technologies.

NVIDIA BioNeMo

NVIDIA BioNeMo is a generative AI platform for drug discovery that simplifies and accelerates the training of models using your own data. BioNeMo provides researchers and developers a fast and easy way to build and integrate state-of-the-art generative AI applications across the entire drug discovery pipeline—from target identification to lead optimization—with AI workflows for 3D protein structure prediction, de novo design, virtual screening, docking, and property prediction.

The BioNeMo framework facilitates centralized model training, optimization, fine-tuning, and inferencing for protein and molecular design. Researchers can build and train foundation models from scratch at scale, or use pre-trained model checkpoints provided with the BioNeMo Framework for fine-tuning for downstream tasks. Currently, BioNeMo supports models such as ESM1nv, ESM2nv, ProtT5nv, DNABERT, OpenFold, EquiDock, DiffDock, and MegaMolBART. To read more about BioNeMo, visit the documentation page.

Figure 1: This image shows the workflow for developing models on NVIDIA BioNeMo. The process is divided into phases for model development and customization and then fine-tuning and deployment.

In this post, we'll walk through how to deploy the NVIDIA BioNeMo Framework on Amazon EKS, a fully managed Kubernetes service that makes it simpler to run distributed, containerized generative AI workloads at scale. The process will cover:

  1. Setting up an EKS cluster with NVIDIA GPU nodes
  2. Leveraging Amazon FSx for Lustre for high-performance data storage and sharing
  3. Downloading and ingesting UniRef50 data in a machine-learning-friendly format
  4. Running a distributed pre-training job to train the ESM-1nv model

Architecture

Figure 2: This architecture diagram shows an Amazon EKS cluster with GPU nodes, an Amazon FSx for Lustre filesystem, and BioNeMo containers. GPU nodes are optimized for machine learning workloads. The FSx filesystem enables fast access to data needed for distributed training. Amazon CloudWatch is used for logging and monitoring.

We leveraged the Amazon EKS Blueprints for Terraform and Data on EKS open source projects to build a robust Kubernetes infrastructure on EKS using Infrastructure as Code best practices. These projects enabled us to quickly stand up an EKS cluster while meeting security and operational excellence requirements.

In particular, the Terraform blueprints let us create a production-grade EKS cluster with networking, security groups, node groups, and other critical components out-of-the-box. The Data on EKS repository provided examples for running data-intensive workloads such as Apache Spark and TensorFlow on EKS. Together, these tools allowed us to launch an EKS cluster purpose-built for large-scale data processing in a reproducible and automated fashion. We can easily scale the cluster up to hundreds of nodes to handle compute-intensive jobs. Adopting these community-built modules accelerated our delivery timelines while ensuring the infrastructure remained secure, observable, and operationally robust as it scaled. The flexibility to customize the modules as needed also enabled us to tailor the infrastructure to the specific needs of our data workloads.

The NVIDIA BioNeMo on EKS Blueprint

We published all of the templates we used to deploy BioNeMo on EKS as a new Data on EKS blueprint on GitHub. We'll continue to iterate and improve on this blueprint over time, so bookmark and refer to the official documentation on the Data on EKS website moving forward. However, we'll walk through the key configuration details here for reference.

Step one, as always, is to prepare the input data in a machine-learning-friendly structure. UniRef50 contains over 50 million unique protein sequences clustered from UniProt at 50% identity. This comprehensive set provides a foundation for vital tasks such as gene annotation and protein family prediction.

To leverage UniRef50's scale, we downloaded and organized the data into training, validation, and test partitions. This layout enhances downstream machine learning by enabling effective model building, evaluation, and monitoring of overfitting. An Amazon FSx for Lustre shared filesystem provided the high-throughput storage needed for data sharing across the compute cluster nodes.
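To make the shared filesystem available to pods, Kubernetes workloads typically go through the FSx for Lustre CSI driver. As a minimal sketch of dynamic provisioning (not the blueprint's exact manifest; the subnet ID, security group ID, and resource names below are hypothetical placeholders), an FSx-backed volume can be requested like this:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fsx-lustre-sc
provisioner: fsx.csi.aws.com
parameters:
  subnetId: subnet-0123456789abcdef0       # placeholder subnet ID
  securityGroupIds: sg-0123456789abcdef0   # placeholder security group ID
  deploymentType: SCRATCH_2
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fsx-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: fsx-lustre-sc
  resources:
    requests:
      storage: 1200Gi   # FSx for Lustre capacity starts at 1.2 TiB

Any pod that mounts this claim sees the same high-throughput filesystem, which is what lets every worker node read the partitioned UniRef50 data concurrently.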

After data preparation, we're ready to run the pre-training job. For maximum performance, it's important to use every available NVIDIA GPU. P5 instances powered by NVIDIA H100 Tensor Core GPUs offer the best performance; in this example, however, we used two p3.16xlarge instances with eight GPUs each for demonstration purposes. To utilize all 16 GPUs, we configured the PyTorchJob custom resource to request one GPU per replica:

resources:
  requests:
    nvidia.com/gpu: 1

We set replicas to match the total number of available GPUs across the worker nodes:

pytorchReplicaSpecs:
  Worker:
    replicas: 16

To launch eight processes on each node, one per GPU on an 8-GPU p3.16xlarge instance, we set:

nprocPerNode: "8"

By default, our GPU nodes are tainted so that workloads that don't request GPUs aren't scheduled onto them. To allow the BioNeMo pods to run on these tainted GPU nodes, we added a matching toleration:

tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
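Putting these fragments together, a minimal worker-only PyTorchJob manifest might look like the sketch below. This is illustrative rather than authoritative: the job name, container image and tag, training command, and volume names are placeholders, and the Data on EKS blueprint remains the definitive template.

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: bionemo-esm1nv-pretrain        # placeholder name
spec:
  nprocPerNode: "8"                    # processes per node, as discussed above
  pytorchReplicaSpecs:
    Worker:
      replicas: 16                     # total GPUs across the worker nodes
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch            # the Training Operator expects this container name
              image: nvcr.io/nvidia/clara/bionemo-framework:latest   # placeholder image/tag
              command: ["python", "pretrain.py"]                     # placeholder entrypoint
              resources:
                requests:
                  nvidia.com/gpu: 1
                limits:
                  nvidia.com/gpu: 1
              volumeMounts:
                - name: fsx-storage
                  mountPath: /data     # shared UniRef50 data and checkpoints
          volumes:
            - name: fsx-storage
              persistentVolumeClaim:
                claimName: fsx-claim   # hypothetical FSx-backed PVC from earlier
          tolerations:
            - key: nvidia.com/gpu
              operator: Exists
              effect: NoSchedule

The per-replica GPU request and the toleration mirror the fragments above; the Kubeflow Training Operator handles rendezvous between the worker pods.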

To utilize a larger cluster for BioNeMo analysis, change these values to match the cluster size. For example, to fully leverage a cluster of 8 p3.8xlarge instances, each with 4 GPUs, you would change the following parameters:

  • Set “nprocPerNode” to 4 to indicate the number of GPUs available per node.
  • Set “replicas” to 32, the total number of GPUs across the 8 nodes.

As each p3.8xlarge instance contains 4 GPUs, 8 instances x 4 GPUs per instance = 32 GPUs in the cluster.
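Applied to the manifest fields shown earlier, those two changes look like this:

nprocPerNode: "4"

pytorchReplicaSpecs:
  Worker:
    replicas: 32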

By configuring these parameters to match the resources you've provisioned, you can efficiently parallelize training across all available GPUs in the cluster. For more information on configuring PyTorch and Kubeflow for your specific needs, read the PyTorch distributed module documentation and the Kubeflow Training Operator documentation, respectively.

Conclusion

In summary, technologies such as NVIDIA BioNeMo on Amazon EKS can dramatically accelerate AI-powered drug discovery. By combining the power of generative models with robust infrastructure for distributed training, researchers can rapidly analyze massive protein datasets to uncover new drug compound candidates. Automating infrastructure deployment with Terraform modules, and following best practices for GPU utilization, storage, and monitoring, keeps these workloads secure, scalable, and observable. Subject matter experts remain a key resource for drug discovery, and purpose-built generative AI platforms on cloud-native infrastructure can augment their creativity and intuition. As this technology continues maturing, we may see further breakthroughs in delivering life-saving treatments to patients faster.

Neel Patel

Neel Patel is a drug discovery scientist at NVIDIA who focuses on cheminformatics and computational structural biology. Before joining NVIDIA, Patel was a computational chemist at Takeda Pharmaceuticals. He holds a Ph.D. from the University of Southern California. He lives in San Diego with his family and enjoys hiking and traveling.

Doruk Ozturk

Doruk Ozturk is a Container Solutions Architect at Amazon Web Services, specializing in the Healthcare and Life Sciences industries. With over a decade of software engineering experience, he provides technical guidance to customers leveraging AWS container technologies. Based in New York City, Doruk enjoys traveling with his wife, playing drums with local bands, and playing recreational pool in his free time.

Ankur Srivastava

Ankur Srivastava is a Sr. Solutions Architect on the ML Frameworks team. He focuses on helping customers with self-managed distributed training and inference at scale on AWS. His experience includes industrial predictive maintenance, digital twins, and probabilistic design optimization. He completed his doctoral studies in Mechanical Engineering at Rice University and post-doctoral research at the Massachusetts Institute of Technology.