AWS internal use-case: Evaluating and adopting Amazon SageMaker within AWS Marketing

We’re the AWS Marketing Data Science team. We use advanced analytical and machine learning (ML) techniques so we can share insights into business problems across the AWS customer lifecycle, such as ML-driven scoring of sales leads, ML-based targeting segments, and econometric models for downstream impact measurement.

Within Amazon, each team operates independently and decides on its own technology stack, including which AWS services to adopt, using those services as they are available to our customers. This allows each team to manage its own roadmap. It also means that our path to evaluating and adopting a service is similar to the one followed by customers who are in the initial phases of their ML journey. This blog post documents the first step of our journey in evaluating Amazon SageMaker, and we hope it will help others going through the same process. We will continue to provide updates on our journey and how we use AWS services to meet our business objectives and scale our ML models.

In this blog post, we share our experience conducting a proof of concept (POC) for the use of Amazon SageMaker to replace our own ML training/hosting infrastructure. Our existing ML infrastructure consists of an Amazon internal workflow tool for managing the data processing pipeline and Amazon EC2 instances to build, train, and host models. We spend a lot of time managing this infrastructure, and that’s what motivated us to explore the Amazon SageMaker service.

This POC explores Amazon SageMaker features and capabilities that can help us minimize infrastructure work and operational complexity. We identified three key work streams:

Identifying and implementing data security and controls in partnership with the AWS IT security team
Migration and hosting of one of our existing ML models in Amazon SageMaker using the “Bring your own Algorithm” feature (use case one)
Training and hosting the same ML model using a built-in Amazon SageMaker algorithm (use case two)

We discuss all three work streams in detail in this blog post.

Identifying and implementing data security and controls in partnership with AWS IT security team

We worked with AWS IT Security to identify different threats and implemented necessary security controls for the dev environment before building and training our models with real world data. Here are the key security controls that we implemented to secure approvals from the AWS IT security team:

Setting up user, role, and encryption key: For training the models, we needed to keep data in Amazon S3. We implemented server-side encryption using an AWS KMS managed key (SSE-KMS) for data encryption in our Amazon S3 bucket.

These are the setup steps:

Create an IAM user (poc-sagemaker-iam-user2) that has the AmazonSageMakerFullAccess policy (among others).
Create an IAM role (poc-sagemaker-iam-role) based on AmazonSageMakerFullAccess.
Create an encryption key named s3-smcmk and allow both the IAM user and the role to use the key. This allows Amazon SageMaker training jobs that run with poc-sagemaker-iam-role to use the key to read the training data.
Create an Amazon SageMaker notebook instance with the same IAM role and encryption key, so that the notebook instance has read access to the data and also encrypts its local storage.

Setting up the IAM user, role, and encryption key ensures that only the user/role is able to read the training data using the encryption key regardless of the S3 bucket access policy setting.
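As an illustration of the key setup described above, here is a minimal sketch of a KMS key policy that grants key usage to the IAM user and role. The account ID is a hypothetical placeholder, and the exact statements and actions are an assumption modeled on the standard key-user policy, not the policy we actually deployed.

```python
import json

# Hypothetical account ID; the user and role names come from the setup steps.
ACCOUNT_ID = "111122223333"
KEY_USERS = [
    f"arn:aws:iam::{ACCOUNT_ID}:user/poc-sagemaker-iam-user2",
    f"arn:aws:iam::{ACCOUNT_ID}:role/poc-sagemaker-iam-role",
]


def build_key_policy(account_id, key_user_arns):
    """Return a KMS key policy document that lets the listed principals use the key."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Keeps the key administrable by the owning account.
                "Sid": "EnableIAMUserPermissions",
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
                "Action": "kms:*",
                "Resource": "*",
            },
            {
                # Lets the POC user and role encrypt and decrypt with the key.
                "Sid": "AllowUseOfTheKey",
                "Effect": "Allow",
                "Principal": {"AWS": list(key_user_arns)},
                "Action": [
                    "kms:Encrypt",
                    "kms:Decrypt",
                    "kms:ReEncrypt*",
                    "kms:GenerateDataKey*",
                    "kms:DescribeKey",
                ],
                "Resource": "*",
            },
        ],
    }


policy = build_key_policy(ACCOUNT_ID, KEY_USERS)
print(json.dumps(policy, indent=2))
```

The resulting JSON document could be supplied as the key policy when the key is created, for example in the console or through the KMS CreateKey API.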

Securing credentials: We stored all security credentials, such as the AWS access key ID, secret access key, and passwords, in a secrets vault. Credentials should not be stored in any configuration file or in the source code. You can use the recently launched AWS Secrets Manager service to store your credentials and keys.
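If you choose AWS Secrets Manager, reading the credentials at runtime might look like the following sketch. It assumes the secret is stored as a JSON string with the hypothetical field names access_key_id and secret_access_key; the secret name and region are also placeholders.

```python
import json


def parse_credentials(secret_string):
    """Parse a JSON secret into an (access_key_id, secret_access_key) tuple.

    The field names are assumptions about how the secret was stored.
    """
    secret = json.loads(secret_string)
    return secret["access_key_id"], secret["secret_access_key"]


def fetch_credentials(secret_id, region_name):
    """Fetch the secret from AWS Secrets Manager and parse it.

    Requires boto3 and AWS access, so it is not called in this sketch.
    """
    import boto3  # imported lazily so parse_credentials works without boto3

    client = boto3.client("secretsmanager", region_name=region_name)
    response = client.get_secret_value(SecretId=secret_id)
    return parse_credentials(response["SecretString"])


# Hypothetical usage:
# access_key_id, secret_access_key = fetch_credentials("poc/sagemaker/unload", "us-east-1")
```

Keeping the parsing separate from the network call makes the credential handling easy to test without touching a real secret.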

Data preparation

For model training in Amazon SageMaker, we decided to make the pre-processed data available in Amazon S3. We created an S3 bucket using poc-sagemaker-iam-user2 user with the SSE-KMS encryption option and used the s3-smcmk encryption key created during the setup of security controls. We configured s3-smcmk to be used by both poc-sagemaker-iam-user2 and poc-sagemaker-iam-role because it provides a central location to check/control who has access to the data.

We pre-processed the model training data in an Amazon Redshift data warehouse (DW) cluster and created three DW tables with training, validation, and test datasets. Amazon Redshift can execute a SQL statement and save the resulting data into an S3 bucket. To accomplish this, we used the UNLOAD command to unload data into our pre-configured S3 bucket. More details on the UNLOAD command can be found here, and details on the authorization parameters to access data are here. Here is the SQL statement that we used to unload the pre-processed datasets to S3:

UNLOAD('<put your select statement here>') 
to '<put the full location path of Amazon S3 bucket for the output file>'
ACCESS_KEY_ID '<we used poc-sagemaker-iam-user2 Access Key here>'
SECRET_ACCESS_KEY '<we used poc-sagemaker-iam-user2 Secret Key here>'
KMS_KEY_ID '<we used keyid of s3-smcmk here>'
DELIMITER AS ',' NULL AS '' ENCRYPTED ALLOWOVERWRITE PARALLEL OFF

This SQL command is for illustrative purposes only. It uses the poc-sagemaker-iam-user2 privileges while unloading the data and the s3-smcmk key to encrypt the unloaded data in Amazon S3. For security reasons, we didn’t hardcode the access and secret keys in our code or read them from a plain-text configuration file. Instead, we stored those keys in a secrets vault. You can use AWS Secrets Manager to accomplish this. We accessed these secrets programmatically and generated the SQL command at runtime. While generating the SQL command, we ensured that the first column of the resulting dataset is the dependent variable and the rest of the columns are independent variables. This is the specific format required by the built-in SageMaker XGBoost algorithm used for the evaluation.

If your data volume is small, an alternate approach can be to use an Amazon SageMaker notebook instance to load the data into memory directly from the data source, split the data into train, validation, and test datasets, and then store all three datasets in an S3 bucket. In this case, you need to select a notebook instance type that has the required resources (CPU, disk, memory) to handle your data volumes.
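Generating the UNLOAD statement at runtime, with the dependent variable placed first, could be sketched as follows. The function names, table name, and column names are hypothetical; the credentials are passed in as arguments so that they never appear in source code.

```python
def quote_literal(value):
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_select(table, label_column, feature_columns):
    """Build a SELECT that puts the dependent variable in the first column.

    Table and column names are interpolated as-is, so they must come from
    trusted code, not from user input.
    """
    columns = ", ".join([label_column] + list(feature_columns))
    return f"SELECT {columns} FROM {table}"


def build_unload_sql(select_sql, s3_path, access_key_id, secret_access_key, kms_key_id):
    """Assemble the UNLOAD statement with the same options as the example above."""
    return (
        f"UNLOAD ({quote_literal(select_sql)})\n"
        f"TO {quote_literal(s3_path)}\n"
        f"ACCESS_KEY_ID {quote_literal(access_key_id)}\n"
        f"SECRET_ACCESS_KEY {quote_literal(secret_access_key)}\n"
        f"KMS_KEY_ID {quote_literal(kms_key_id)}\n"
        "DELIMITER AS ',' NULL AS '' ENCRYPTED ALLOWOVERWRITE PARALLEL OFF"
    )


# Hypothetical table and columns; real credentials would come from the secrets vault.
select_sql = build_select("poc_training_data", "label", ["feature_1", "feature_2"])
unload_sql = build_unload_sql(
    select_sql, "s3://example-bucket/train/", "<access key>", "<secret key>", "<key id>"
)
```

Doubling single quotes inside the SELECT matters because UNLOAD takes the query as a quoted string, so any literal in the query would otherwise end the string early.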

Amazon SageMaker notebook Instance launch

A common first step for exploring either “Bring your own algorithm” (use case one) or using built-in algorithms (use case two) is to create an Amazon SageMaker notebook instance using these steps for model building, training, and hosting. We used the IAM role poc-sagemaker-iam-role and the s3-smcmk encryption key when we created a notebook instance.
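For readers who prefer the API over the console, the notebook instance settings described above map onto a CreateNotebookInstance request roughly as in this sketch. The instance name, instance type, account ID, and key ID are hypothetical placeholders.

```python
def notebook_instance_request(name, instance_type, role_arn, kms_key_id):
    """Build the arguments for the SageMaker CreateNotebookInstance API."""
    return {
        "NotebookInstanceName": name,
        "InstanceType": instance_type,
        # The role the notebook assumes, which gives it access to the training data.
        "RoleArn": role_arn,
        # The key used to encrypt the notebook's attached storage volume.
        "KmsKeyId": kms_key_id,
    }


def create_notebook_instance(request, region_name):
    """Send the request to SageMaker. Requires boto3 and AWS access."""
    import boto3  # imported lazily so the request builder works without boto3

    client = boto3.client("sagemaker", region_name=region_name)
    return client.create_notebook_instance(**request)


# Hypothetical values for illustration only.
request = notebook_instance_request(
    "poc-notebook",
    "ml.t2.medium",
    "arn:aws:iam::111122223333:role/poc-sagemaker-iam-role",
    "<key id of s3-smcmk>",
)
```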

We then wrote and executed the code for both POC use-cases for training and hosting the ML models in an Amazon SageMaker-managed infrastructure. More documentation on the training steps is available here.

POC use case one: Using the Amazon SageMaker “Bring-Your-Own-Algorithm” feature

For this use-case, we re-used our R code for ML model training and inferences. We migrated that code to Amazon SageMaker using the Bring-Your-Own-Algorithm feature. We followed these steps to complete the migration:

We prepared a Docker image with the R environment and installed the required R library for our model. A sample Dockerfile to build a Docker image with an R environment is available here. We customized the Dockerfile to install the latest version of R, related dependent packages, and the XGBoost library from the CRAN repository instead of using the default Ubuntu repository. This was necessary because our model uses the XGBoost library that required a later version of R than the one available in the Ubuntu repository. We added the following two lines of code to the Dockerfile to accomplish this:
```
RUN echo "deb http://cloud.r-project.org/bin/linux/ubuntu xenial/" >> /etc/apt/sources.list
RUN R -e "install.packages(c('xgboost','caret','e1071','plumber'),repos='https://cloud.r-project.org')"
```
The first line of code needs to be the first RUN command in the Dockerfile. This ensures that subsequent RUN commands install the R environment from the CRAN repository instead of the Ubuntu repository.
We then integrated our R code into the Docker image. You can refer to this example that shows the steps required. To summarize, we updated the train function in mars.R with our training code and the invocations function in Plumber.R with our inference code.

We tested the Docker image on a local machine. We first created a folder structure (Fig 1) that Amazon SageMaker makes available to the Docker container during training. We populated those folders with the training and validation datasets and the hyperparameter configuration files.

Figure 1: Folder structure for local Docker image testing. Refer here for more details.

We then started the Docker container and mounted this folder structure into the Docker filesystem. We used the following command to start the Docker container in training mode for local testing:
```
docker run -v $(pwd)/ml:/opt/ml -t {image name} train
```
We also changed our R code to read the training/validation datasets and the hyperparameter configuration from the mounted folder structure. This allowed us to simulate the filesystem that the Docker container would have when it is invoked within Amazon SageMaker.
After the Docker image worked as intended, we built and pushed the Docker image to an Amazon Elastic Container Registry (Amazon ECR). Refer to the sample script available here for the detailed commands.
We then used an Amazon SageMaker notebook instance to start a training job that was configured to use the published Docker image and the training/validation datasets from the Amazon S3 bucket. After training was completed, we created a model, an endpoint configuration, and finally an endpoint to host the model. We used the endpoint for generating inference on the test dataset. You can refer to an example available here for detailed steps.
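The local folder layout used for testing the Docker image (Figure 1) can be recreated with a short script. The sketch below follows the standard /opt/ml layout that Amazon SageMaker presents to a training container; the channel names, file names, and hyperparameter are assumptions for illustration.

```python
import json
import shutil
from pathlib import Path


def create_local_ml_tree(root, hyperparameters, channel_files):
    """Create a local copy of the /opt/ml layout for container testing.

    channel_files maps a channel name (for example "train") to a data file
    that is copied into input/data/<channel>/.
    """
    root = Path(root)
    config_dir = root / "input" / "config"
    config_dir.mkdir(parents=True, exist_ok=True)

    # SageMaker passes every hyperparameter value to the container as a string.
    as_strings = {key: str(value) for key, value in hyperparameters.items()}
    (config_dir / "hyperparameters.json").write_text(json.dumps(as_strings))

    for channel, source in channel_files.items():
        channel_dir = root / "input" / "data" / channel
        channel_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy(source, channel_dir)

    # The container writes the trained model and any failure report here.
    (root / "model").mkdir(exist_ok=True)
    (root / "output").mkdir(exist_ok=True)
    return root


# Hypothetical usage, run from the directory that is mounted with
# "docker run -v $(pwd)/ml:/opt/ml ...":
# create_local_ml_tree("ml", {"nrounds": 100},
#                      {"train": "train.csv", "validation": "validation.csv"})
```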

The following table lists the pros and cons of using SageMaker for the first use-case:

Pros

Able to reuse existing code for model training and inference.
Good documentation and sample code are available to make the migration process easier.

Cons

Creating and maintaining your own Docker image in Amazon ECR introduces operational complexity.
The Amazon SageMaker hosted model endpoint can accept a maximum of 5 MB per invocation for real-time inference. It needs additional code for batch inference (discussed in a later section).
The team might need to learn some platform skills, such as building Docker images, storing them in Amazon ECR, and testing them locally.

POC use-case two: Using a built-in Amazon SageMaker algorithm

For this use-case, our motivation was to remove the burden of building and managing our own Docker image. Therefore, instead of using our self-built Docker image (as outlined in the first POC use-case), we used the Amazon SageMaker-managed Docker image for the XGBoost algorithm to train and host our ML model.

To use the built-in SageMaker algorithm, we simply used the SageMaker-managed Docker image for XGBoost during the training and inference processes. You can refer to this example that shows the steps required. The built-in SageMaker XGBoost algorithm can be trained on a single instance or on multiple instances. We changed the “ResourceConfig:InstanceCount” value in the training parameters to enable distributed training. The following is a configuration snippet for distributed training.
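A minimal sketch of such a configuration, expressed as a CreateTrainingJob request in Python. The job name, image URI, role ARN, S3 paths, instance type, volume size, runtime limit, and hyperparameter values are hypothetical placeholders and assumptions, not the values from our POC; the point of interest is ResourceConfig.InstanceCount.

```python
def training_job_request(job_name, image_uri, role_arn, train_s3_uri,
                         validation_s3_uri, output_s3_uri, instance_count):
    """Build the arguments for the SageMaker CreateTrainingJob API."""

    def channel(name, s3_uri):
        # One input channel that reads CSV files under an S3 prefix.
        return {
            "ChannelName": name,
            "ContentType": "text/csv",
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": s3_uri}
            },
        }

    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,  # the SageMaker-managed XGBoost image
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [
            channel("train", train_s3_uri),
            channel("validation", validation_s3_uri),
        ],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": "ml.m4.xlarge",    # assumed instance type
            "InstanceCount": instance_count,   # a value above 1 enables distributed training
            "VolumeSizeInGB": 10,
        },
        # Hyperparameter values must be strings; these are illustrative only.
        "HyperParameters": {"objective": "binary:logistic", "num_round": "100"},
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }


request = training_job_request(
    "poc-xgboost-distributed",
    "<xgboost image URI for your Region>",
    "arn:aws:iam::111122223333:role/poc-sagemaker-iam-role",
    "s3://example-bucket/train/",
    "s3://example-bucket/validation/",
    "s3://example-bucket/output/",
    instance_count=2,
)
# With boto3 and AWS access:
# boto3.client("sagemaker").create_training_job(**request)
```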

The following table lists the pros and cons of using SageMaker for the second use-case:

Pros

Reduces the undifferentiated effort of maintaining training and inference Docker images. This reduces the operational complexity of deploying and maintaining models.
The built-in Amazon SageMaker XGBoost algorithm allows training in a distributed environment.

Cons

The SageMaker hosted model endpoint can accept a maximum of 5 MB per invocation for real-time inference. It needs additional code for batch inference (discussed in a later section).

Using Amazon SageMaker endpoints for batch inferences

Amazon SageMaker has an API action called “InvokeEndpoint” that is used to get inferences from a model that is hosted in SageMaker. We invoked this API through an HTTP POST call and provided the data to run inference on in the HTTP body. At most 5 MB of data can be sent for inference per invocation, and the invocation call itself is blocking. Our use case was batch inference with data volumes of more than 5 MB, so we could not send all the data in a single InvokeEndpoint call. To get batch inferences, we read the data for inference into memory from Amazon S3, split it into chunks that were smaller than 5 MB, and called InvokeEndpoint iteratively. We used the following code to obtain chunk-wise inference:

import io

import boto3
import pandas as pd

# Set bucket, data_fn, and endpoint_name for your environment before running.
s3 = boto3.client('s3')
runtime = boto3.client('sagemaker-runtime')
content_type = 'text/csv'
chunksize = 3000

# Read the complete file from S3
body_raw = s3.get_object(Bucket=bucket, Key=data_fn)['Body'].read()

# Alternatively, if your data file is large, read by chunks
# body_raw = s3.get_object(Bucket=bucket, Key=data_fn, Range="bytes=0-500")['Body'].read()

# Wrap the raw bytes in a file-like object that pandas can read
data = io.BytesIO(body_raw)

# Split the data into chunks of chunksize rows. For our dataset, chunksize=3000
# ensured that the HTTP POST body size stays below 5 MB. The correct chunksize
# depends on the dataset.
results = []
for chunk_df in pd.read_csv(data, chunksize=chunksize, header=None):
    # chunk_df contains a chunk of data; convert the chunk into CSV format
    csv_str = chunk_df.to_csv(None, header=False, index=False)

    # Send the CSV chunk to the SageMaker hosted model endpoint (endpoint_name)
    response = runtime.invoke_endpoint(EndpointName=endpoint_name,
                                       ContentType=content_type,
                                       Body=csv_str)

    # The response is available in JSON format. Read and decode it.
    result = response['Body'].read().decode('ascii')
    results.append(result)
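Because the right row count per chunk depends on the dataset, one alternative is to split by payload size instead. The sketch below is not what we ran in the POC; it is a hypothetical helper that groups CSV rows so that each request body stays within a byte limit.

```python
MAX_BODY_BYTES = 5 * 1024 * 1024  # the per-invocation limit discussed above


def iter_csv_chunks(lines, max_bytes=MAX_BODY_BYTES):
    """Yield newline-joined groups of CSV rows, each at most max_bytes when UTF-8 encoded."""
    chunk, size = [], 0
    for line in lines:
        row = line.rstrip("\r\n")
        if not row:
            continue  # skip blank lines
        row_size = len(row.encode("utf-8")) + 1  # +1 for the newline separator
        if row_size > max_bytes:
            raise ValueError("a single row exceeds the request size limit")
        if chunk and size + row_size > max_bytes:
            # Adding this row would overflow the current chunk, so emit it first.
            yield "\n".join(chunk)
            chunk, size = [], 0
        chunk.append(row)
        size += row_size
    if chunk:
        yield "\n".join(chunk)


# Hypothetical usage with the file contents read from S3:
# for csv_str in iter_csv_chunks(body_raw.decode("utf-8").splitlines()):
#     response = runtime.invoke_endpoint(EndpointName=endpoint_name,
#                                        ContentType="text/csv", Body=csv_str)
```

Counting bytes rather than rows keeps every request under the limit even when row widths vary, at the cost of one extra pass over the encoded text.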

Summary

In this blog post, we shared our experience exploring features and capabilities of Amazon SageMaker for our ML platform needs through three work streams. Here are our learnings:

Amazon SageMaker significantly reduces the effort of managing an ML platform and thereby allows the team to spend more time building and testing models rather than performing undifferentiated platform management.
We found the distributed training capability of Amazon SageMaker to be a key feature that enables our team to train ML models at scale.
Amazon SageMaker allows easier integration of models into a service-oriented architecture because the models are hosted behind an HTTP endpoint.

In the end, based on our experience, we decided to migrate to Amazon SageMaker in the near future for all our current and future ML platform needs.

Resources

Getting started with AWS Secrets Manager

Amazon Redshift UNLOAD command documentation

Amazon Redshift authorization parameters to access data

Create an Amazon SageMaker notebook Instance

Create a training job in Amazon SageMaker

Building and testing your own algorithm container

A sample DockerFile to build a Docker image with R environment

A sample script to build a Docker image and push it to Amazon ECR

Implement Bring Your Own R Algorithm in Amazon SageMaker

Using Amazon SageMaker XGBoost algorithm

About the Authors

Azman Sami is a Data Scientist in the AWS Marketing organization where he focuses on building scalable platforms for real-time analytics and machine learning. Prior to this, he worked as an IT enterprise architect and software developer for 10 years.

Alam Khan is a Sr. Analytics Manager in the AWS Marketing organization where he manages the data and ML platform strategy for the data science team. He has over 9 years of industry experience in building data platforms and automation.

Neelesh Gattani is a Sr. Manager, Data Science in the AWS Marketing organization where he manages program measurement strategy, machine learning initiatives for targeting and sales lead scoring, and the internal ML platform.

AWS Machine Learning Blog
