AWS Machine Learning Blog
Build an online compound solubility prediction workflow with AWS Batch and Amazon SageMaker
Machine learning (ML) methods for the field of computational chemistry are growing at an accelerated rate. Easy access to open-source frameworks (such as TensorFlow and Apache MXNet), toolkits (such as the RDKit cheminformatics software), and open scientific initiatives (such as DeepChem) makes it straightforward to use these tools in daily research. In the field of chemical informatics, many ensemble computational chemistry workflows require the ability to consume a large number of compounds and profile various descriptor properties.
This blog post describes a two-stage workflow. In the first stage, you take approximately 1,100 candidate molecules and use AWS Batch to calculate various 2D molecular descriptors with a Dockerized RDKit image. The original dataset from MoleculeNet.ai – ESOLV includes the measured logSolubility (mol/L) for each compound. In the second stage, you use Amazon SageMaker with Apache MXNet to develop a linear regression prediction model. The ML model performs a 70/30 split into training and validation sets and reaches an RMSE of 0.925 and a goodness of fit (R²) of 0.9 after 30 epochs.
In this blog post, you create the workflow to process the simplified molecular-input line-entry system (SMILES) input, which you then feed into Amazon SageMaker to create a model that predicts the logSolubility.
Overview
Start by storing the SMILES structures in an Amazon S3 bucket. Then build a Docker image containing a Python-based execution script and its supporting libraries, push it to Amazon Elastic Container Registry (Amazon ECR), and execute it on AWS Batch. The output of the calculations is stored in another S3 bucket, which serves as the input to Amazon SageMaker.
Prerequisites
To follow the procedures in this blog post, you need an AWS account. You will create a Docker image on your local computer, so install and set up Docker. You also need to install the AWS command line interface (AWS CLI).
Stage 1: Using AWS Batch
AWS Batch is a managed service for running jobs from containerized applications stored in the Amazon ECR registry. Pull an “amazonlinux:latest” image, which contains the base operating system (OS) on which you install the tools needed to execute the workflow, from the public Docker registry. After you install Docker, open a command line shell and run the following command:
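For the base image used in this post, that is:

    docker pull amazonlinux:latest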
After pulling the image layers, start an interactive session to prepare the image for the descriptor calculations in AWS Batch. Install the AWS CLI, RDKit, and Boto3 framework packages in the image. In the running Docker container, run these commands:
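In this sketch, the container name rdkit-builder and the yum package name are assumptions that may need adjusting for your Amazon Linux version; RDKit itself is installed in the next step:

    # Start an interactive session from the base image
    docker run -it --name rdkit-builder amazonlinux:latest /bin/bash

    # Inside the container: install pip, then the AWS CLI and Boto3
    yum install -y python-pip
    pip install awscli boto3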
To install RDKit, first enable the EPEL 7 repository.
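One way to enable it inside the container is to install the standard EPEL 7 release package:

    # Enable the EPEL 7 repository
    yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm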
After enabling the repo, install RDKit by running the following commands:
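Here the EPEL package name (python2-rdkit) is an assumption, and the second command is just a quick check that the library imports:

    yum install -y python2-rdkit
    python -c "from rdkit import Chem; print(Chem.MolToSmiles(Chem.MolFromSmiles('c1ccccc1')))"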
Add a Python descriptor-calculation script to the image, saved as /data/mp_calculate_descriptors.py.
This code is the main engine for calculating the descriptors. The script reads the input SMILES file from the S3 bucket, calculates a series of descriptors, and then stores the results in another S3 bucket.
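A minimal sketch of such a script is shown here; the descriptor list, the CSV layout, and the assumption that the INPUT_SMILES_S3 and OUTPUT_SMILES_S3 environment variables hold full s3://bucket/key paths are illustrative choices:

    # mp_calculate_descriptors.py -- a minimal sketch, not the exact script used in this post.
    # Reads a SMILES file from S3 (one SMILES per line), computes a set of 2D descriptors
    # in parallel with RDKit, and writes a CSV of results back to S3.
    import os
    from multiprocessing import Pool

    import boto3
    from rdkit import Chem
    from rdkit.Chem import Descriptors

    # Illustrative subset of RDKit 2D descriptors
    DESCRIPTOR_NAMES = ["MolWt", "MolLogP", "TPSA", "NumHDonors", "NumHAcceptors",
                        "NumRotatableBonds", "RingCount", "HeavyAtomCount"]


    def calc_descriptors(smiles):
        """Return [smiles, descriptor values...] or None if the SMILES cannot be parsed."""
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            return None
        return [smiles] + [getattr(Descriptors, name)(mol) for name in DESCRIPTOR_NAMES]


    def split_s3_path(path):
        """Split 's3://bucket/key' into (bucket, key)."""
        bucket, _, key = path.replace("s3://", "").partition("/")
        return bucket, key


    def main():
        s3 = boto3.client("s3")
        in_bucket, in_key = split_s3_path(os.environ["INPUT_SMILES_S3"])
        out_bucket, out_key = split_s3_path(os.environ["OUTPUT_SMILES_S3"])

        body = s3.get_object(Bucket=in_bucket, Key=in_key)["Body"].read().decode("utf-8")
        smiles_list = [line.strip() for line in body.splitlines() if line.strip()]

        # Fan the descriptor calculations out across all available cores
        pool = Pool()
        rows = [r for r in pool.map(calc_descriptors, smiles_list) if r is not None]
        pool.close()
        pool.join()

        lines = [",".join(["smiles"] + DESCRIPTOR_NAMES)]
        lines += [",".join(str(value) for value in row) for row in rows]
        s3.put_object(Bucket=out_bucket, Key=out_key, Body="\n".join(lines).encode("utf-8"))


    if __name__ == "__main__":
        main()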
Open the Amazon S3 console and create buckets called rdkit-input-<initials> and rdkit-processed-<initials> with access limited to your AWS account.
Next, commit your Docker image to the Amazon ECR registry in your AWS account. Create a new repository by opening the Amazon ECS console, choosing Repositories in the left panel, and then choosing Create repository. You get an endpoint similar to the following:
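ECR repository URIs take this general form (the repository name rdkit is a placeholder):

    <aws_account_id>.dkr.ecr.<region>.amazonaws.com/rdkit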
To make the image accessible to AWS Batch, push your Docker image to this endpoint:
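With AWS CLI version 1, a sequence along these lines commits the prepared container to an image, authenticates Docker with Amazon ECR, and pushes it (the account ID, Region, container name, and repository name are placeholders):

    # Commit the container prepared earlier to a local image
    docker commit rdkit-builder rdkit:latest

    # Authenticate Docker with your ECR registry
    $(aws ecr get-login --no-include-email --region us-east-1)

    # Tag and push the image to the repository endpoint
    docker tag rdkit:latest <aws_account_id>.dkr.ecr.us-east-1.amazonaws.com/rdkit:latest
    docker push <aws_account_id>.dkr.ecr.us-east-1.amazonaws.com/rdkit:latest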
AWS Batch requires that you set up a job with a job definition, which is then submitted to a job queue that is executed on a compute environment. The JSON job definition provides the input parameters for the RDKit job. The following is an example for this workflow:
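In this sketch, the vCPU and memory values, role name, and S3 paths are placeholders to adapt to your account:

    {
        "jobDefinitionName": "rdkit-descriptors",
        "type": "container",
        "containerProperties": {
            "image": "<aws_account_id>.dkr.ecr.us-east-1.amazonaws.com/rdkit:latest",
            "vcpus": 4,
            "memory": 8000,
            "command": ["python", "/data/mp_calculate_descriptors.py"],
            "jobRoleArn": "arn:aws:iam::<aws_account_id>:role/rdkit-batch-s3-role",
            "environment": [
                {"name": "INPUT_SMILES_S3", "value": "s3://rdkit-input-<initials>/smiles.txt"},
                {"name": "OUTPUT_SMILES_S3", "value": "s3://rdkit-processed-<initials>/esolv_smiles_result.csv"}
            ]
        }
    }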
In the job definition, define the INPUT_SMILES_S3 and OUTPUT_SMILES_S3 environment variables. INPUT_SMILES_S3 is the path to the SMILES file uploaded to Amazon S3, and OUTPUT_SMILES_S3 is the destination for the results; both are passed into the Python script in the container. To ensure you have the correct permissions, define a jobRole (set in the IAM console) that has read and write access to Amazon S3. The Python script is natively parallelized, so the larger the instance, the greater the level of parallelization available to process the SMILES file. The following table profiles the dataset using the c4 and m4 families of EC2 instances.
Run the AWS Batch job; one way to submit it from the AWS CLI is shown below. When the job completes, you should see a file (*_smiles_result.csv) in the rdkit-processed bucket with one row of calculated descriptors per compound.
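Assuming you have already created a compute environment and a job queue (the names here are placeholders), the submission looks like this:

    aws batch submit-job \
        --job-name rdkit-descriptors-run \
        --job-queue rdkit-queue \
        --job-definition rdkit-descriptors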
The original input file includes a column of measured logSolubility (mol/L). To prepare for the Amazon SageMaker stage, which uses the SMILES as a primary key, append this column to the result file. You can do this by downloading the CSV from S3 and uploading it again after appending the measured logSolubility values, for example with a short pandas script like the one that follows.
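This sketch assumes the file names, bucket name, and ESOL column header shown; adjust them to match your data:

    # Join the measured logSolubility onto the descriptor results, keyed on SMILES,
    # then upload the combined file back to the processed bucket.
    import boto3
    import pandas as pd

    s3 = boto3.client('s3')
    bucket = 'rdkit-processed-<initials>'
    s3.download_file(bucket, 'esolv_smiles_result.csv', 'esolv_smiles_result.csv')

    descriptors = pd.read_csv('esolv_smiles_result.csv')   # output of the AWS Batch job
    esol = pd.read_csv('delaney-processed.csv')            # original MoleculeNet ESOL file

    merged = descriptors.merge(
        esol[['smiles', 'measured log solubility in mols per litre']],
        on='smiles', how='inner')
    merged.to_csv('smiles_result_with_solubility.csv', index=False)
    s3.upload_file('smiles_result_with_solubility.csv', bucket,
                   'smiles_result_with_solubility.csv')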
Stage 2: Using Amazon SageMaker
In the Amazon SageMaker console, under Dashboard, choose Create notebook instance, and then fill in the details of the interface as shown in the following screenshot. If this is your first time in the console, Amazon SageMaker asks you to create an IAM role, which needs access to Amazon S3. Setting the VPC and subnet is optional.
After you have completed these tasks, choose Create notebook instance. It takes a few minutes for the instance to be created. After your instance has started, choose Open, and you are redirected to the Jupyter notebook interface on the instance.
Create a new Jupyter notebook in the instance using the conda_mxnet_p27 environment. We also provide TensorFlow environments, as well as base Python 2 and Python 3 environments. Alternatively, you can simply download the entire notebook here.
Let’s create the notebook for training on the candidate compounds. First, we need to define some S3 bucket variables pointing to the location where we stored the result file from the first part of the workflow, after appending the solubility values.
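For example (the bucket and key names are placeholders that match the earlier sketches):

    bucket = 'rdkit-processed-<initials>'
    result_key = 'smiles_result_with_solubility.csv'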
Next we need to import the RDKit libraries into the environment:
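In a notebook cell, a conda install from the rdkit channel is the usual approach (channel and package name assumed):

    !conda install -y -c rdkit rdkit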
Several dependencies will be installed. Next we will import the modules we will use in this exercise.
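A plausible import cell for the rest of this walkthrough (the exact set in the original notebook may differ):

    import boto3
    import mxnet as mx
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    from rdkit import Chem
    from rdkit.Chem import Draw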
Next we will read in the file from the deepchem GitHub.
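For example, reading the ESOL (Delaney) CSV directly from the deepchem repository (the raw GitHub URL is an assumption):

    esol_url = ('https://raw.githubusercontent.com/deepchem/deepchem/'
                'master/datasets/delaney-processed.csv')
    esol = pd.read_csv(esol_url)
    esol.head()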
Now we will read in the SMILES from the file and parse the structures.
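A sketch of that step, dropping anything RDKit cannot parse:

    smiles_list = esol['smiles'].tolist()
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    mols = [m for m in mols if m is not None]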
We can visualize a few of our structures in the deepchem set (optional).
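For example, drawing the first dozen structures as a grid:

    Draw.MolsToGridImage(mols[:12], molsPerRow=4, subImgSize=(200, 200))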
The output is a grid of the rendered 2D structures.
Next we will import our results from the AWS Batch workflow described earlier.
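A sketch of loading the processed file from S3 into a pandas DataFrame, using the variables defined earlier:

    s3 = boto3.client('s3')
    s3.download_file(bucket, result_key, 'smiles_result_with_solubility.csv')
    data = pd.read_csv('smiles_result_with_solubility.csv')
    data.head()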
We will split our data into 70/30 training and validation sets with shuffling and prepare it for modeling. The output prints the data shapes for the training and validation sets.
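A sketch of the split; the column names follow the descriptor script sketched earlier, with the measured logSolubility as the label:

    label_col = 'measured log solubility in mols per litre'
    feature_cols = [c for c in data.columns if c not in ('smiles', label_col)]
    X = data[feature_cols].values.astype('float32')
    y = data[label_col].values.astype('float32')

    np.random.seed(0)
    idx = np.random.permutation(len(X))
    split = int(0.7 * len(X))
    train_idx, val_idx = idx[:split], idx[split:]

    X_train, y_train = X[train_idx], y[train_idx]
    X_val, y_val = X[val_idx], y[val_idx]
    print(X_train.shape, X_val.shape)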
Now we define our linear modeling parameters. This is a relatively straightforward linear regression coupling batch normalization of the 2D descriptor set with a hyperbolic tangent activation. The output is a visual representation of the neural network.
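A sketch of such a network with the MXNet Symbol API; the hidden-layer width and batch size are assumptions:

    batch_size = 32

    data_sym = mx.sym.Variable('data')
    label_sym = mx.sym.Variable('lin_reg_label')

    net = mx.sym.BatchNorm(data=data_sym, name='bn1')                   # normalize the descriptors
    net = mx.sym.FullyConnected(data=net, num_hidden=64, name='fc1')
    net = mx.sym.Activation(data=net, act_type='tanh', name='tanh1')    # hyperbolic tangent
    net = mx.sym.FullyConnected(data=net, num_hidden=1, name='fc2')
    net = mx.sym.LinearRegressionOutput(data=net, label=label_sym, name='lin_reg')

    mx.viz.plot_network(net)   # visual representation of the network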
At this point we can train the model and check our validation score against the test.
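A sketch of training with the MXNet Module API; the optimizer, learning rate, and batch size are assumptions, while the 30 epochs match the figure quoted earlier:

    import logging
    logging.getLogger().setLevel(logging.INFO)   # so fit() reports per-epoch progress

    train_iter = mx.io.NDArrayIter(X_train, y_train, batch_size,
                                   shuffle=True, label_name='lin_reg_label')
    val_iter = mx.io.NDArrayIter(X_val, y_val, batch_size, label_name='lin_reg_label')

    model = mx.mod.Module(symbol=net, data_names=['data'],
                          label_names=['lin_reg_label'], context=mx.cpu())
    model.fit(train_iter, eval_data=val_iter,
              optimizer='sgd',
              optimizer_params={'learning_rate': 0.01},
              eval_metric='rmse',
              num_epoch=30)
    print(model.score(val_iter, 'rmse'))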
You should get the debug log of the training with an approximate speed of 20k samples/sec. Finally, we can plot our result score and use the model to predict the entire dataset.
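A sketch of that step; the plot layout only approximates the panels described next:

    full_iter = mx.io.NDArrayIter(X, y, batch_size, label_name='lin_reg_label')
    preds = model.predict(full_iter).asnumpy().ravel()[:len(y)]   # trim padding from the last batch

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
    ax1.scatter(y, preds, alpha=0.4, label='full dataset')
    ax1.scatter(y[val_idx], preds[val_idx], alpha=0.6, label='validation subset')
    ax1.set_xlabel('measured logSolubility (mol/L)')
    ax1.set_ylabel('predicted logSolubility (mol/L)')
    ax1.legend()
    ax2.hist(preds - y, bins=40)
    ax2.set_xlabel('prediction error')
    plt.show()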
The light purple data points (left panel) comprise the entire dataset. The green subset is the compounds that have been selected for independent validation. The evaluation set adequately represents the diversity of the entire deepchem set with a validation score above 90 percent. The panel on the right represents the element-wise prediction error distribution. At this point, with the model built, you can create an endpoint and deploy it. You can follow along here for an example on how to deploy and create an endpoint for the model.
Summary
In this blog post, you have successfully created a container-based RDKit platform on Amazon ECS and AWS Batch, processed a collection of compounds for molecular descriptor calculations, and developed an Apache MXNet ML model in Amazon SageMaker to predict the solubility.
I invite you to play around with the various model.fit() parameters in the Amazon SageMaker notebook. You can modify the optimizer, learning rates, and epochs. If you are able to improve the validation score, please respond in the comments section of the blog.
If you have any questions, please leave them in the comments.
Next Steps
Connect with machine learning experts from across Amazon with the Amazon ML Solutions Lab.
About the Author
Amr Ragab is a High Performance Computing Professional Services Consultant for AWS, devoted to helping customers run computational workloads at scale. In his spare time he likes traveling and finds ways to integrate technology into daily life.