AWS HPC Blog

AI-based drug discovery with Atomwise and WEKA Data Platform

This post was contributed by Shailesh Manjrekar – Head of AI and Strategic Alliances at WEKA. 

The Covid-19 pandemic has profoundly changed the world. Remote work has become the norm, and we have started to look differently at personal health and at the way we work, live, play, and do business. The use of AI for drug discovery has accelerated in the post-Covid-19 era.

Today, drug discovery is an expensive proposition, costing $2.6 billion over 10 years with just a 12% success rate. AI promises to significantly improve this. Innovative startups are attempting to change the landscape with AI/ML. At the forefront is Atomwise, with its AtomNet® platform, which has succeeded in finding small-molecule hits for more undruggable targets than any other AI drug discovery platform.

In this blog, we will lay out the challenges of high cost, long cycle time, and low success rate in the drug discovery process, and show how AI/ML startups have stepped up to solve these challenges using best-of-breed technology from Atomwise, AWS, and WEKA.

Best of breed technology stack with the AtomNet® platform

Atomwise’s AtomNet® is built on best-in-class engineering architecture and tools, with WEKA and AWS as key technology partners. The AtomNet® platform enables the massive scale and unprecedented speed needed to create a deep and broad pipeline of drugs to improve human health. The platform uses convolutional neural networks (CNNs), which apply deep learning in three dimensions to the molecular recognition problem. In many ways, it’s the same approach as deep learning for image recognition. Instead of learning low-level image features, the networks learn low-level features of 3D molecular interactions and combine them into higher-order concepts that explain and predict important labels, such as binding affinity to a particular protein. This AI-based approach is then used for drug discovery or for precision medicine to eliminate diseases such as cancer and COVID-19.
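
To make the "deep learning in three dimensions" idea concrete, here is a minimal pure-Python sketch of a single 3D convolution pass over a voxel grid. It is only an illustration of the general operation, not AtomNet®'s actual architecture: the function name `conv3d_valid`, the occupancy-grid input, and the all-ones kernel are assumptions made for this example.

```python
from typing import List

Grid3D = List[List[List[float]]]


def conv3d_valid(volume: Grid3D, kernel: Grid3D) -> Grid3D:
    """Slide `kernel` over `volume` with stride 1 and no padding.

    As in most deep learning frameworks, this is technically a
    cross-correlation: the kernel is not flipped.
    """
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    kd, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out: Grid3D = []
    for z in range(depth - kd + 1):
        plane = []
        for y in range(height - kh + 1):
            row = []
            for x in range(width - kw + 1):
                acc = 0.0
                # Weighted sum over the local 3D neighbourhood.
                for dz in range(kd):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += volume[z + dz][y + dy][x + dx] * kernel[dz][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Hypothetical input: a 3x3x3 occupancy grid with a single occupied voxel
# in the centre, standing in for a voxelized molecular structure.
volume = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
volume[1][1][1] = 1.0

# Hypothetical 2x2x2 kernel that simply counts occupied voxels in its window.
kernel = [[[1.0] * 2 for _ in range(2)] for _ in range(2)]

features = conv3d_valid(volume, kernel)
```

In a trained network the kernel weights are learned rather than fixed, and many such layers are stacked so that later layers respond to higher-order patterns.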

The Data Challenge

The small-molecule drug discovery process is very data intensive. It takes around 4,000 different protein structures and over 3 million small-molecule compounds, and runs over 15 million experiments. This equates to importing data from 15 million source databases and running ETL (Extract, Transform, Load) to generate around 30 million small files used for training convolutional neural network (CNN) models. As described above, these models learn low-level features of 3D molecular interactions and associate them into higher-order concepts. These concepts explain and predict important labels like binding affinity to a particular protein, which can then be used to treat a disease.

Figure 1: AtomNet® platform has different storage requirements for each of the above phases, resulting in storage silos and delayed insights.

To put this in perspective, each model involves roughly:

  • six P2, P3, or P4 GPU instances on AWS, 5M weights, and 30–50 epochs with epoch times of 0.5–4 days
  • around 10,000 such development instances running as Spot Instances at any given time, orchestrated using Amazon EKS (Elastic Kubernetes Service)
  • 1–2M random-access file lookups as a result
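
As a rough back-of-envelope check on these figures, the short sketch below multiplies out the quoted ranges. It assumes the epochs of a single model run back to back and takes the numbers in this post at face value. The constant and function names are made up for this example.

```python
# Figures quoted in this post (ranges are low/high bounds).
EPOCH_DAYS = (0.5, 4.0)        # days per epoch
EPOCHS_PER_MODEL = (30, 50)    # epochs per model
SMALL_FILES = 30_000_000       # files produced by ETL
EXPERIMENTS = 15_000_000       # experiments run


def training_days_range(epoch_days, epochs):
    """Return (best case, worst case) total training days for one model,
    assuming epochs run sequentially."""
    return epoch_days[0] * epochs[0], epoch_days[1] * epochs[1]


low_days, high_days = training_days_range(EPOCH_DAYS, EPOCHS_PER_MODEL)
files_per_experiment = SMALL_FILES / EXPERIMENTS

print(f"Training time per model: {low_days:.0f} to {high_days:.0f} days")
print(f"Average small files per experiment: {files_per_experiment:.0f}")
```

Under those assumptions, any reduction in per-epoch data loading time is multiplied across every epoch of every model.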

Why WEKA on AWS for AtomNet® Platform?

Taking all of these protein structures and sampling them against the molecule compounds presents a daunting data challenge. It requires a distributed filesystem that can deliver both strong metadata performance and mixed read/write I/O performance.

Atomwise evaluated several storage solutions against its data requirements: a local filesystem on a multi-core server, Amazon EBS with an NFS head, Amazon EFS, and an in-memory Redis database server. It ultimately found WekaFS to be the best fit for its performance targets, for handling the entire data pipeline, and in particular for random access to lots of small files (LOSF).

The WEKA shared storage solution is built on Amazon EC2 instances and Amazon S3. WekaFS is a parallel distributed filesystem that presents a high-performance tier on NVMe flash drives in EC2 instances and an Amazon S3 bucket as the capacity tier, all in a single global namespace. This eliminates storage silos across Atomwise’s entire data pipeline: ingest, ETL, training, inference, and lifecycle management. WEKA also provides built-in data protection and provisioning using snap2object functionality, and works with Amazon Elastic Kubernetes Service (Amazon EKS) for job scheduling and orchestration.

Figure 2: WEKA cluster on Amazon EC2, serving the Amazon P2, P3, P4 GPU instances orchestrated by Amazon EKS

WEKA Performance results with AtomNet® platform

The following results compare the time the AtomNet® platform took for small-file metadata operations on NFS-based storage versus WEKA. WEKA showed 39x and 25x better performance than NFS on Amazon EC2. Reading and writing of small files improved by 77x and 168x, respectively.

Table 1: AtomNet® platform performance on small file metadata operations for NFS based storage vs. WEKA
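
For readers who want to run a similar comparison on their own mounts, the following is a minimal sketch of a small-file micro-benchmark. It is not the benchmark Atomwise or WEKA used; the function name `bench_small_files`, the file count, and the 4 KiB file size are assumptions for this example. Note that reads issued right after writes are likely served from the operating system's page cache, so a fair comparison would need a cold cache or separate client nodes.

```python
import os
import random
import tempfile
import time


def bench_small_files(root: str, n_files: int = 1000, size: int = 4096, seed: int = 0) -> dict:
    """Time writing, stat-ing, and randomly reading many small files under `root`."""
    rng = random.Random(seed)
    payload = os.urandom(size)
    paths = [os.path.join(root, f"f{i:07d}.bin") for i in range(n_files)]

    # Create and write each small file.
    start = time.perf_counter()
    for path in paths:
        with open(path, "wb") as fh:
            fh.write(payload)
    write_s = time.perf_counter() - start

    # Visit the files in random order to mimic random-access lookups.
    order = paths[:]
    rng.shuffle(order)

    # Metadata-only pass.
    start = time.perf_counter()
    for path in order:
        os.stat(path)
    stat_s = time.perf_counter() - start

    # Full read pass.
    start = time.perf_counter()
    bytes_read = 0
    for path in order:
        with open(path, "rb") as fh:
            bytes_read += len(fh.read())
    read_s = time.perf_counter() - start

    # Clean up the test files.
    for path in paths:
        os.remove(path)

    return {
        "files": n_files,
        "bytes_read": bytes_read,
        "write_s": write_s,
        "stat_s": stat_s,
        "read_s": read_s,
    }


if __name__ == "__main__":
    # Replace the temporary directory with a path on the mount under test.
    with tempfile.TemporaryDirectory() as tmp:
        print(bench_small_files(tmp, n_files=200))
```

Running the same function against two different mount points and dividing the timings gives speedup ratios of the kind reported here.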

Business outcomes with Atomwise, AWS and WEKA solution

The WEKA and Atomwise solution on AWS provided the best results for spiky workloads like computational drug discovery and delivered the following key performance indicators (KPIs):

Experimentation time improved from 12 weeks (3 months) to 1 week, resulting in faster time-to-insight, while epoch times (CNN model training times) improved by 2x.

WEKA scaled well for these workloads, with 10,000 EC2 instances accessing the WEKA cluster and excellent metadata performance across 30 million small and large files (LOSF, lots of small files).

Conclusion

The WEKA file solution on AWS is well suited to customers implementing HPC and life sciences use cases, and is well integrated with the AWS HPC and life sciences solutions.

Customers working in computational chemistry, structural biology, genomics, and bioimaging are often running modeling and simulation workloads. These workloads can benefit from the scalability, compliance, and ability to launch hybrid workflows offered by WEKA.

WekaFS is available in AWS Marketplace and provides the best performance and economics. Customers can try out WEKA here.

Watch the joint AWS webinar here.

The content and opinions in this blog are those of the third-party author and AWS is not responsible for the content or accuracy of this blog.