AWS for Industries
Accelerating Drug Discovery with AWS HealthOmics and NVIDIA Blueprints
Discovering new life-changing therapies has never been more urgent, yet the drug discovery process remains a complex and time-consuming endeavor. To accelerate breakthroughs, leading biotech and biopharma companies are turning to Biological Foundation Models (BioFMs), a class of advanced generative AI models trained on biological data.
We are excited to announce that NVIDIA’s generative AI Blueprints, foundational models and inference microservices, are now available on Amazon Web Services (AWS) HealthOmics, providing drug researchers with a comprehensive solution to streamline computer-aided virtual screening.
While BioFMs hold immense promise for accelerating drug discovery, researchers face significant challenges in leveraging their full potential. Scientists require sophisticated tools that enable efficient experimentation, allowing them to quickly get started with proven example workflows and adapt every aspect of these recipes to their specific scientific tasks. Additionally, they need the ability to seamlessly transition from discovery and testing to running their customized pipelines at scale on the same platform, ensuring consistency and reproducibility.
Furthermore, collaboration is crucial, as researchers must be able to share their proven workflows with colleagues across the organization, fostering knowledge sharing and enabling everyone to move faster toward breakthroughs. Programs from funding agencies, such as the National Institutes of Health (NIH)-Industry Partnerships initiative of the National Center for Advancing Translational Sciences (NCATS), also create opportunities and mechanisms to support collaborations between biopharma and academic researchers to advance Investigational New Drug (IND) candidates and clinical innovation.
The HealthOmics and NVIDIA Solution
AWS HealthOmics is a fully managed biological data compute and storage service designed to accelerate scientific breakthroughs in clinical diagnostics and drug discovery. By offloading the complexities of building, orchestrating, and managing bioinformatics and drug discovery pipelines, HealthOmics enables researchers to focus on core scientific activities. Leading biopharma companies like Roche, Amgen, and Takeda are already leveraging HealthOmics to significantly accelerate their scientific outcomes and drive faster time-to-insights.
Through this collaboration with NVIDIA, researchers can now access state-of-the-art AI models and infrastructure for their drug discovery workflows. The integrated solution combines NVIDIA BioNeMo foundation models and NVIDIA NIM Agent Blueprints with the robust workflow orchestration, scalability, security, and compliance features of HealthOmics.
This offering addresses the key challenges faced by researchers in leveraging generative AI for reliable and reproducible drug discovery research. It enables scientists to leverage leading BioFMs and high-performance computing resources in a seamless, scalable, and cost-effective manner. Furthermore, pre-packaged technical assets are now available through the HealthOmics drug discovery workflows GitHub repository to help biopharma and biotech researchers get started quickly.
Integrated Drug Discovery Technical Assets
The HealthOmics drug discovery GitHub repository demonstrates the process of building, deploying, and customizing generative AI applications on HealthOmics. Key components of the example virtual screening workflow include:
- A sample drug discovery workflow whose processes use NVIDIA NIM containers, defined in the Nextflow Domain Specific Language (DSL).
- Command-line scripts and Python SDK examples that demonstrate how to orchestrate multi-step workflows on HealthOmics in a scalable, robust, and cost-effective way.
By leveraging these pre-packaged assets, researchers can gain a head start in creating their own generative AI applications for drug discovery, benefiting from NVIDIA’s advanced AI tools and end-to-end discovery experience tailored for this use case. This combined solution streamlines the process of harnessing the power of generative AI, enabling biotech and biopharma companies to accelerate their drug research and discovery efforts and unlock new possibilities for groundbreaking therapies.
To get started, you will first need to create a private workflow in HealthOmics using the example Nextflow DSL main.nf file. This sample workflow includes three core processes: MolMIM, AlphaFold2, and DiffDock.
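As a rough illustration of that first step, the sketch below packages a main.nf file into a ZIP archive and assembles a request for the HealthOmics CreateWorkflow API through boto3. The workflow name and the parameter names smiles_dir and fasta_dir are hypothetical placeholders; the real parameter names come from the example main.nf in the repository. Treat this as a minimal sketch under those assumptions, not as the repository's own deployment script.

```python
import io
import zipfile


def build_definition_zip(main_nf_text: str) -> bytes:
    """Package a Nextflow definition into an in-memory ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("main.nf", main_nf_text)
    return buffer.getvalue()


def build_create_workflow_request(name: str, definition_zip: bytes) -> dict:
    """Assemble keyword arguments for the HealthOmics CreateWorkflow call."""
    return {
        "name": name,
        "engine": "NEXTFLOW",
        "main": "main.nf",
        "definitionZip": definition_zip,
        # Request GPU-capable compute, since the NIM containers are GPU-optimized.
        "accelerators": "GPU",
        # Hypothetical parameter names; replace them with those declared in main.nf.
        "parameterTemplate": {
            "smiles_dir": {"description": "S3 prefix with SMI files", "optional": False},
            "fasta_dir": {"description": "S3 prefix with FASTA files", "optional": False},
        },
    }


def create_workflow(request: dict) -> str:
    """Submit the request and return the workflow ID (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without it

    response = boto3.client("omics").create_workflow(**request)
    return response["id"]
```

With credentials configured, calling create_workflow(build_create_workflow_request("virtual-screening", build_definition_zip(open("main.nf").read()))) would return the ID of the new private workflow.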
MolMIM explores “drug-like” chemical space to generate novel small molecules that are structurally similar to a query given as a SMILES string but have improved values for the desired properties. This transformer-based model can also be used to:
- Embed molecular representations in a high-dimensional numeric space
- Decode embeddings back into SMILES
- Generate hidden/latent states to analyze underlying properties and patterns
- Sample the latent space to generate a novel and diverse set of small molecules for a given seed
- Generate novel small molecules using CMA-ES-guided sampling with desired properties or characteristics
The example presented here uses the CMA-ES-guided generation API function in a MolMIM container. You can configure the criteria for the desired properties (for example, QED and LogP) and the similarity thresholds in the molmim_generate Python script.
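To make the configurable criteria concrete, here is a minimal sketch of how a CMA-ES-guided generation request might be assembled. The field names (smi, algorithm, property_name, minimize, min_similarity, num_molecules) are assumptions modeled on the MolMIM NIM generation endpoint, and the default values are illustrative only; check them against the molmim_generate script and the NIM documentation before relying on them.

```python
import json


def build_molmim_request(seed_smiles, property_name="QED", min_similarity=0.3,
                         num_molecules=10, minimize=False):
    """Build a JSON-serializable payload for CMA-ES-guided generation.

    All field names are assumptions, not confirmed by the source article.
    """
    if not seed_smiles:
        raise ValueError("seed_smiles must be a non-empty SMILES string")
    if not 0.0 <= min_similarity <= 1.0:
        raise ValueError("min_similarity must be between 0 and 1")
    if num_molecules < 1:
        raise ValueError("num_molecules must be at least 1")
    return {
        "smi": seed_smiles,                # query molecule as a SMILES string
        "algorithm": "CMA-ES",             # guided sampling rather than plain sampling
        "property_name": property_name,    # property to optimize, for example QED
        "minimize": minimize,              # False means maximize the property
        "min_similarity": min_similarity,  # similarity threshold to the seed
        "num_molecules": num_molecules,    # number of candidates to return
    }


if __name__ == "__main__":
    # Aspirin is used here only as an illustrative seed molecule.
    payload = build_molmim_request("CC(=O)OC1=CC=CC=C1C(=O)O")
    print(json.dumps(payload, indent=2))
```

Keeping the payload construction in one small function makes it easy to sweep similarity thresholds or switch the optimized property without touching the rest of the workflow.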
AlphaFold2 predicts 3D structures from protein sequences. The prediction has two main steps:
- Build a multiple sequence alignment (MSA) by searching reference databases such as UniRef90, MGnify, and BFD for similar protein sequences.
- Identify co-evolutionary signals from the similarities and differences in the MSA results, and generate a structure hypothesis using an Evoformer neural network.
DiffDock predicts ligand-protein binding poses, given the ligand in SDF format and the protein in PDB format. Once the HealthOmics workflow is successfully created, you can see additional details and create jobs in the AWS Management Console.
You can run a job with a similar example to identify novel small molecules that may bind to the SARS-CoV-2 protease and inhibit its activity. All you need to provide are the SMILES strings in an SMI file and the protein sequence for the SARS-CoV-2 protease in a FASTA file. Key challenges that the workflow has to handle include:
- Downloading the large reference sequence databases needed to run AlphaFold2.
- Retrying failed jobs.
- Running processes in parallel.
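Submitting such a job programmatically can be sketched with the HealthOmics StartRun API in boto3. The parameter names smiles_dir and fasta_dir, the run name, and every ID, ARN, and S3 location passed in are hypothetical; the actual parameter names are whatever the workflow's main.nf declares. This is an illustrative sketch under those assumptions, not the repository's own launch script.

```python
def build_start_run_request(workflow_id, role_arn, smiles_uri, fasta_uri,
                            output_uri, run_name="virtual-screening-run"):
    """Assemble keyword arguments for the HealthOmics StartRun call."""
    for label, uri in (("smiles_uri", smiles_uri),
                       ("fasta_uri", fasta_uri),
                       ("output_uri", output_uri)):
        if not uri.startswith("s3://"):
            raise ValueError(f"{label} must be an s3:// URI, got {uri!r}")
    return {
        "workflowId": workflow_id,
        "roleArn": role_arn,      # IAM role that HealthOmics assumes for the run
        "name": run_name,
        "outputUri": output_uri,  # S3 location where run outputs are written
        # Hypothetical parameter names; they must match those declared in main.nf.
        "parameters": {"smiles_dir": smiles_uri, "fasta_dir": fasta_uri},
    }


def start_run(request):
    """Submit the run and return its ID (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the request builder works without it

    return boto3.client("omics").start_run(**request)["id"]
```

Validating the S3 URIs before submission catches the most common input mistake early, before any compute is provisioned.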
Once the job is submitted, you can track the job status and logs on the HealthOmics console. HealthOmics takes advantage of Nextflow queue channels to run distributed jobs. In this example, each SMI input file triggers a separate MolMIM task, and each FASTA file triggers a separate AlphaFold2 task. In Figure 4, you can see five MolMIM tasks and two AlphaFold2 tasks running in parallel.
The combination of MolMIM and AlphaFold2 outputs then triggered two separate DiffDock tasks. You are charged only for the compute runtime of the selected instance size, which makes costs more predictable.
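Besides the console, the per-task fan-out can be inspected programmatically. The sketch below counts the tasks of a run by process and status using the ListRunTasks paginator in boto3. It assumes task names follow the common Nextflow pattern of a process name followed by an index in parentheses, which may not hold for every workflow.

```python
from collections import Counter


def summarize_tasks(omics_client, run_id):
    """Count the tasks of a run per (process, status) pair.

    omics_client is expected to behave like boto3.client("omics").
    """
    counts = Counter()
    paginator = omics_client.get_paginator("list_run_tasks")
    for page in paginator.paginate(id=run_id):
        for task in page["items"]:
            # Assumed name pattern "PROCESS (n)"; keep only the process part.
            process = task.get("name", "unknown").split(" (")[0]
            counts[(process, task["status"])] += 1
    return counts


def format_summary(counts):
    """Render the counts as sorted, human-readable lines."""
    return [f"{process}: {status} x{n}"
            for (process, status), n in sorted(counts.items())]
```

Calling format_summary(summarize_tasks(boto3.client("omics"), run_id)) while a run is in progress gives a quick text view of how many tasks of each process are pending, running, or completed.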
Additionally, there are no persistent resources, such as storage for the 500+ GB of reference data that AlphaFold2 requires. This means that when work is not being done, your costs scale to $0. Figure 6 is an example of a timeline chart for a run, which includes data staging of the resources. You can generate similar graphs using the open-source AWS HealthOmics Tools.
MolMIM and AlphaFold2 jobs do not depend on each other, so they run in parallel. However, AlphaFold2 first needs to download more than 500 GB of model weights and reference sequences, so the AlphaFold2 prediction itself starts only after the MolMIM job has finished.
Conclusion
Leveraging BioFMs presents significant opportunities and challenges for drug researchers. The collaboration between AWS HealthOmics and NVIDIA Blueprints represents a comprehensive approach to overcoming these hurdles and streamlining the entire drug discovery workflow, from initial experimentation to large-scale production.
Customers can start by building their generative AI drug discovery applications using NVIDIA Blueprints, which provide everything needed to build and deploy customized generative AI applications on GPU-optimized, state-of-the-art AI models. These Blueprints include sample applications, reference code, customization documentation, and deployment tools, enabling researchers to leverage NVIDIA’s advanced AI tools for virtual screening.
Once the initial applications are built and tested, customers can seamlessly transition to scaling up their drug design pipelines using AWS HealthOmics. This combined solution offers robust capabilities for security, compliance, and cost optimization. It offloads the undifferentiated heavy lifting of managing BioFMs at scale and provides access to high-performance computing resources.