AWS for Industries

Find the Next Blockbuster with NVIDIA BioNeMo Framework on Amazon SageMaker

The NVIDIA BioNeMo framework accelerates drug discovery by giving researchers access to powerful foundation models (FMs) for biology. By integrating BioNeMo with Amazon SageMaker, researchers can develop cutting-edge AI solutions with the scalability, data security, and operational excellence they expect from AWS.

Introduction

Drug R&D is a long, costly, and risky process. On average, it takes about 12 years and costs over $2 billion to bring a new drug to market. To combat high failure rates and bring differentiated therapeutics to market faster, pharmaceutical companies are looking to artificial intelligence (AI) and machine learning (ML).

Recent advances in generative AI have led to breakthroughs in foundation models (FMs) for proteins, small molecules, and nucleic acids. After training on massive amounts of data, these models develop internal representations of sequences, structures, and evolutionary relationships. Scientists can then adapt them for applications like predicting molecular structures, identifying protein-ligand docking pockets, and screening for chemical properties.

The NVIDIA BioNeMo framework is a generative AI platform designed for drug discovery. It provides versatile capabilities for training and fine-tuning large language models to understand protein, small molecule, and nucleic acid data. At re:Invent 2023, AWS and NVIDIA announced support for BioNeMo on several AWS services, including Amazon SageMaker, AWS ParallelCluster, and Amazon Elastic Kubernetes Service. Life science teams can now use scalable compute resources on AWS with the BioNeMo framework to rapidly build AI solutions for biomolecular data.

In this post, we’ll explore how customers can leverage the BioNeMo framework on Amazon SageMaker to enhance and accelerate drug R&D. You can find model training and inference code examples on GitHub.

Challenges to Applying Generative AI to Drug Discovery

Early-stage drug development involves target discovery and validation, lead generation and screening, lead optimization, and candidate selection for preclinical development. Each of these stages has unique requirements for AI systems. To improve adoption of these tools, some foundation model developers have openly released pre-trained checkpoints for the scientific community. This includes popular models such as ESM and OpenFold. However, the accuracy of these models can decrease when the training and inference data come from different distributions. This can occur, for example, when predicting the characteristics of monoclonal antibodies or proteins designed de novo.

Fine-tuning these models on use case-specific datasets can improve accuracy. However, this requires large amounts of data, such as protein sequences, chemical structures, DNA/RNA sequences, assay results, images, and text. This information is often spread across internal data lakes and external repositories and requires manual curation and cleaning. Instead of building these workloads from scratch, researchers need efficient platforms to pre-train, fine-tune, integrate, and serve various ML models for drug discovery.

Solution: NVIDIA BioNeMo Framework on Amazon SageMaker

The BioNeMo framework on Amazon SageMaker combines the performance and accessibility of BioNeMo framework modules with the flexibility, security, and service integrations of SageMaker.

Figure 1: The workflow for developing models with the NVIDIA BioNeMo framework. The process is divided into phases for model development and customization, followed by fine-tuning and deployment.

The key capabilities include:

Optimized performance: Enhanced hyperparameter and checkpoint management, along with data and model parallelism, enables near-linear scaling efficiency across hundreds of GPUs.

Growing model catalog: The BioNeMo framework contains a growing list of model architectures optimized and maintained by NVIDIA. Examples include models for generating protein and DNA sequence embeddings (ESM1nv, ESM2nv, ProtT5nv, DNABERT), protein structures (OpenFold), novel small molecules (MegaMolBART), and molecular docking simulations (DiffDock, EquiDock).

Pretraining and fine-tuning foundation models: The BioNeMo framework provides container images and configuration files for training foundation models. This includes low-code options for parallelizing computation across multiple GPUs and nodes. Users can share configuration files and training checkpoints across different BioNeMo environments, such as DGX Cloud on AWS and SageMaker, to better manage capacity.

Accelerated compute options: AWS and NVIDIA have been strong partners for over thirteen years. AWS was the first to bring NVIDIA GPUs to the cloud, and the first to offer A100 and H100 GPUs in production. At re:Invent 2023, AWS and NVIDIA introduced P5e instances for large-scale generative AI and G6 instances for fine-tuning and inference workloads. These options ensure that customers can find the right instance types on AWS for their needs.

Fully managed, scalable infrastructure: Amazon SageMaker Model Training reduces the time and cost of training ML models at scale without the need to manage infrastructure. SageMaker can automatically scale training jobs up or down, from one to thousands of GPUs. Since you only pay for what you use, you can manage your training costs more effectively. A sketch of launching a BioNeMo training job this way follows this list.

Security and compliance: With NVIDIA BioNeMo on Amazon SageMaker, customers can deploy BioNeMo training and inference workloads into their existing AWS accounts. SageMaker ensures that customer data are encrypted in transit and at rest. You can store model artifacts and training data in encrypted Amazon Simple Storage Service (Amazon S3) buckets and pass an AWS Key Management Service (AWS KMS) key to SageMaker notebooks, training jobs, and endpoints to encrypt the attached ML storage volume. SageMaker also supports Amazon Virtual Private Cloud (Amazon VPC) and AWS PrivateLink.

MLOps integration: Amazon SageMaker provides purpose-built tools for machine learning operations (MLOps) to automate and standardize processes across the ML lifecycle. This includes SageMaker Experiments, which tracks metrics, datasets, and other artifacts related to training jobs. Teams can also use SageMaker Model Registry to track model versions and metadata. Finally, SageMaker Pipelines and SageMaker Projects bring CI/CD best practices to ML, smoothing the transition from proof of concept to production.
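
As a concrete illustration of the managed training and security capabilities above, the following is a minimal sketch of launching a BioNeMo fine-tuning job with the SageMaker Python SDK. The image URI, S3 paths, AWS KMS key, VPC subnet and security group IDs, and hyperparameter names are placeholders to replace with values from your own account; the example notebooks in our GitHub repository show the exact arguments each BioNeMo training script expects.

```python
# Minimal sketch: launch a BioNeMo fine-tuning job as a SageMaker training job.
# All <angle-bracket> values and the hyperparameter names are placeholders.
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # IAM role with SageMaker, Amazon S3, and Amazon ECR permissions

estimator = Estimator(
    image_uri="<account-id>.dkr.ecr.<region>.amazonaws.com/bionemo-training:latest",  # custom image in Amazon ECR
    role=role,
    instance_count=1,
    instance_type="ml.g5.12xlarge",         # choose a GPU instance type that fits your model
    volume_size=200,                        # GB of attached ML storage
    output_path="s3://<your-bucket>/bionemo/output",
    output_kms_key="<your-kms-key-id>",     # encrypt model artifacts at rest
    volume_kms_key="<your-kms-key-id>",     # encrypt the attached ML storage volume
    subnets=["<subnet-id>"],                # run the job inside your Amazon VPC
    security_group_ids=["<security-group-id>"],
    hyperparameters={"config-name": "<bionemo-config>"},  # hypothetical argument parsed by the training entry point
    sagemaker_session=session,
)

# Each input channel appears under /opt/ml/input/data/<channel> inside the container.
estimator.fit({"training": "s3://<your-bucket>/bionemo/train"})
```

Scaling out is largely a matter of raising instance_count, and SageMaker tears down the cluster when the job completes, so you only pay for the training time you use.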

Getting Started

Figure 2: The architecture for training and deploying AI models using BioNeMo on SageMaker. Engineering teams first adapt the BioNeMo framework image into SageMaker training and inference containers hosted in Amazon ECR. Then, researchers create SageMaker training jobs to fine-tune BioNeMo models on their own data. Finally, they deploy the models to inference endpoints to make predictions.

To get started, sign in or create a free account at NVIDIA NGC. You can then generate an API key to download the BioNeMo framework model weights, data, and other artifacts. This key is a sensitive credential, so we strongly recommend storing it in AWS Secrets Manager!
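
For example, here is a minimal sketch of storing and retrieving the NGC API key with AWS Secrets Manager using boto3. The secret name ngc-api-key is an arbitrary choice.

```python
# Minimal sketch: keep the NGC API key in AWS Secrets Manager instead of hard-coding it.
import boto3

secrets = boto3.client("secretsmanager")

# Store the key once, for example from a setup script or the console.
secrets.create_secret(Name="ngc-api-key", SecretString="<your-ngc-api-key>")

# Retrieve it later inside notebooks, training scripts, or image builds.
ngc_api_key = secrets.get_secret_value(SecretId="ngc-api-key")["SecretString"]
```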

Next, use the example Dockerfiles to build BioNeMo training and inference images with SageMaker-specific dependencies. You will need to push these images to a private Amazon Elastic Container Registry (Amazon ECR) repository to use them in SageMaker.
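
If you prefer to script the registry setup, the following sketch creates a private ECR repository with boto3 and composes the image URI that SageMaker expects. Building and pushing the image itself still happens with Docker using the example Dockerfiles; the repository name bionemo-training is an arbitrary choice.

```python
# Minimal sketch: create a private Amazon ECR repository and compose the image URI for SageMaker.
import boto3

region = boto3.session.Session().region_name
account_id = boto3.client("sts").get_caller_identity()["Account"]
ecr = boto3.client("ecr", region_name=region)

repo_name = "bionemo-training"  # arbitrary repository name
try:
    ecr.create_repository(repositoryName=repo_name)
except ecr.exceptions.RepositoryAlreadyExistsException:
    pass  # reuse the existing repository

# Build and push the image separately with Docker (docker build, docker push) after
# authenticating to ECR, then pass this URI to your SageMaker training job or model.
image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo_name}:latest"
print(image_uri)
```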

Finally, use one of the example training or inference notebooks to fine-tune a protein language model like ESM or generate sequence embeddings. These notebooks include scripts and configuration files you can modify to fit your use case.
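
To give a sense of the inference side, here is a minimal sketch of deploying a fine-tuned model to a SageMaker real-time endpoint and requesting embeddings for a protein sequence. The inference image URI, model artifact path, and JSON payload format are assumptions for illustration; the example inference notebooks define the actual request and response schema.

```python
# Minimal sketch: deploy a fine-tuned BioNeMo model to a real-time endpoint and invoke it.
# The payload format below is hypothetical; see the example notebooks for the real schema.
import sagemaker
from sagemaker.model import Model
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

role = sagemaker.get_execution_role()

model = Model(
    image_uri="<account-id>.dkr.ecr.<region>.amazonaws.com/bionemo-inference:latest",
    model_data="s3://<your-bucket>/bionemo/output/model.tar.gz",  # artifact from the training job
    role=role,
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

# Hypothetical request: embed a single protein sequence.
response = predictor.predict({"sequences": ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"]})
print(response)

# Delete the endpoint when you are done to avoid ongoing charges.
predictor.delete_endpoint()
```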

Conclusion

Advances in generative AI are set to revolutionize the traditionally lengthy and expensive drug discovery process. The BioNeMo framework on Amazon SageMaker allows scientists to train and deploy cutting-edge models with the scalability and security they expect from AWS.

For more information about NVIDIA BioNeMo on Amazon SageMaker, please check out our GitHub repository, visit the BioNeMo framework documentation, or watch our presentation at NVIDIA GTC 2024.

Brian Loyal

Brian Loyal is a Senior AI/ML Solutions Architect in the Global Healthcare and Life Sciences team at Amazon Web Services. He has more than 16 years of experience in biotechnology and machine learning and is passionate about helping customers solve genomic and proteomic challenges. In his spare time, he enjoys cooking and eating with his friends and family.

Neel Patel

Neel Patel is a drug discovery scientist at NVIDIA focusing on cheminformatics and computational structural biology. Before joining NVIDIA, Dr. Patel was a computational chemist at Takeda Pharmaceuticals. He holds a Ph.D. from the University of Southern California.

Xin Yu

Xin Yu is a Senior Machine Learning Solution Architect at NVIDIA. She has a background in computational and wet lab protein engineering. Before NVIDIA, she led antibody discovery groups in pharmaceutical companies in the US and China. She holds a Ph.D. from Duke University. Outside of work, she enjoys gardening and spending time with her family.