Integrating with Amazon SageMaker: Using Built-In Algorithms from External Applications

By Pratap Ramamurthy, Partner Solutions Architect at AWS
By Jose Noriega, Sr. Partner Development Manager at AWS

Amazon SageMaker is a fully-managed service that lets data scientists and developers build, train, and deploy machine learning models quickly and easily. The service provides a full end-to-end workflow and is built in a modular way so that customers can transfer the results of workflow stages in and out of Amazon SageMaker.

At Amazon Web Services (AWS), we are often asked how to integrate software with Amazon SageMaker and use the service’s built-in machine learning algorithms. In this post, we will discuss how to use the training capabilities of Amazon SageMaker to leverage its built-in algorithms.

The types of applications that can integrate with Amazon SageMaker are data science platforms, business intelligence tools, or any application that needs to use machine learning behind the scenes. In these integrations, developers may abstract away some of the features of Amazon SageMaker, or choose to expose a feature of Amazon SageMaker to end-users.

We’ll help you with the design decision by detailing four steps to using built-in algorithms from external applications.

About the Built-In Algorithms

Amazon SageMaker’s algorithms are highly scalable in the amount of data they can learn from, especially compared to some of the open source algorithms. Our algorithms have been tuned to optimize the data transfer between instances and to utilize GPUs effectively whenever applicable.

There are Amazon SageMaker algorithms for many kinds of data: structured data (Linear Learner, Factorization Machines, XGBoost, K-Means, PCA, Random Cut Forest), image (image classification algorithm), natural language (Sequence2Sequence, Latent Dirichlet Allocation (LDA), Neural Topic Modeling, BlazingText), and time series data (DeepAR).

In addition to the algorithms, Amazon SageMaker has three functionalities (the managed notebook and built-in algorithms, training, and model hosting) that are loosely coupled, so customers and AWS Partner Network (APN) Partners can take advantage of this modularity and use only the parts they need.

SageMaker Functionalities

Here are the four steps we are going to follow to use the Amazon SageMaker algorithms from external applications.

Step 1: Installing Amazon SageMaker SDK and Boto3

From your application, you can make calls to Amazon SageMaker using the AWS SDK for Python (Boto3) or the Amazon SageMaker SDK (a high-level Python library). The Amazon SageMaker SDK is more concise because it abstracts some of the details, whereas Boto3 is more flexible because it’s the Python SDK for all of AWS.

To make calls to Amazon SageMaker from your own Jupyter notebook or your development environment, you will have to install the correct libraries. The Amazon SageMaker Python SDK can be installed using pip:

pip install sagemaker

You can then import the Amazon SageMaker package from within a Python application (or even an external notebook) either at the root or for a specific algorithm:

import sagemaker
from sagemaker import KMeans
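
An external application may want to confirm that both libraries are present before it tries to import them. The following is a minimal sketch using only the Python standard library; the helper name missing_packages is an assumption made for illustration and is not part of either SDK.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "sagemaker" and "boto3" are the two libraries discussed in this step.
    missing = missing_packages(["sagemaker", "boto3"])
    if missing:
        print("Install first: pip install " + " ".join(missing))
    else:
        print("Both SDKs are available.")
```

A check like this lets the application report a clear installation hint instead of failing with an ImportError at start-up.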

In this post, we will focus on the Amazon SageMaker SDK and ask readers to refer to GitHub for Boto3 examples.

Step 2: Creating IAM Roles

To access Amazon SageMaker, as with any AWS service, users must be authenticated and authorized. AWS Identity and Access Management (IAM) lets you create the users and roles used for authentication and authorization. You can create an IAM user through the console, but when you are building a platform that caters to multiple end users, you can automate the creation of IAM users and access keys. Once the keys are created, they can be loaded in memory or made available to end users so they can make AWS API calls.

Once the user has access to the Amazon SageMaker SDK, the next step is to create a training job. A training job uses additional resources that Amazon SageMaker runs on the user’s behalf, so Amazon SageMaker requires an IAM execution role that it can assume to perform these tasks. Attach the appropriate policies to this role; these usually include an Amazon SageMaker execution policy and an Amazon Simple Storage Service (Amazon S3) access policy.
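
For platforms that automate this step, role creation can be scripted. The sketch below is one possible approach, not a prescribed one: it assumes an object that behaves like boto3.client("iam") is passed in, uses the AWS managed policies AmazonSageMakerFullAccess and AmazonS3FullAccess as stand-ins for "the appropriate policies", and takes a hypothetical role name. A production setup would likely attach a narrower, bucket-scoped S3 policy.

```python
import json

# AWS managed policies used here as placeholders for "the appropriate policies".
SAGEMAKER_POLICY_ARN = "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess"
S3_POLICY_ARN = "arn:aws:iam::aws:policy/AmazonS3FullAccess"


def sagemaker_trust_policy():
    """Trust policy that lets the SageMaker service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "sagemaker.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_execution_role(iam_client, role_name):
    """Create the role, attach both policies, and return the role ARN.

    `iam_client` is assumed to behave like boto3.client("iam").
    """
    response = iam_client.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(sagemaker_trust_policy()),
    )
    for policy_arn in (SAGEMAKER_POLICY_ARN, S3_POLICY_ARN):
        iam_client.attach_role_policy(RoleName=role_name, PolicyArn=policy_arn)
    return response["Role"]["Arn"]
```

Passing the client in as a parameter keeps the function easy to test with a stub and leaves credential handling to the calling application.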

Once the role is created, its Amazon Resource Name (ARN) or its name must be passed as an argument in the API calls. In the example below, we are calling the built-in K-Means algorithm and passing the ARN of the IAM role as the first argument.

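# Parameter names follow version 1 of the SageMaker Python SDK; version 2 renamed
# train_instance_count and train_instance_type to instance_count and instance_type.
# output_location and data_location are S3 URIs (s3://...) defined by the application.
kmeans = KMeans(role="arn:aws:iam::000000000000:role/service-role/AmazonSageMaker-ExecutionRole-00000000000000",
                train_instance_count=2,
                train_instance_type='ml.c4.8xlarge',
                output_path=output_location,
                k=10,
                data_location=data_location)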
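
When role ARNs come from end users or configuration files, an application may want to sanity-check them before calling the API. The following is a rough sketch under the assumption that a simple pattern check is enough; parse_role_arn is a hypothetical helper, and the pattern only covers the general shape of an IAM role ARN.

```python
import re

# General shape: arn:<partition>:iam::<12-digit account id>:role/<optional path/><name>
_ROLE_ARN = re.compile(r"^arn:(aws[a-z-]*):iam::(\d{12}):role/(.+)$")


def parse_role_arn(arn):
    """Split an IAM role ARN into partition, account ID, and role name."""
    match = _ROLE_ARN.match(arn)
    if match is None:
        raise ValueError("not an IAM role ARN: {!r}".format(arn))
    partition, account_id, path_and_name = match.groups()
    return {
        "partition": partition,
        "account_id": account_id,
        # The role name is the last segment; anything before it is the path.
        "role_name": path_and_name.rsplit("/", 1)[-1],
    }
```

Rejecting a malformed ARN early gives the end user a clearer message than an error returned later by the training job request.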

Step 3: Make Sure the Data is in the Right Format and Bucket

Amazon SageMaker uses Amazon S3 as the primary storage for training data. Before end users start training, the data must be stored in an Amazon S3 bucket, and the IAM role created in Step 2 must have access to this bucket. In the S3 bucket, the algorithm expects a training channel and a test channel, which are usually folders within the bucket with appropriate names. The data type and file format depend on the specific algorithm, and this information is provided in the algorithm’s documentation. For example, Latent Dirichlet Allocation (LDA) is an algorithm used for topic modeling in Natural Language Processing (NLP).

LDA expects data to be provided on the train channel and optionally supports a test channel, which is scored by the final model. LDA supports both recordIO-wrapped-protobuf (dense and sparse) and CSV file formats. For CSV, the data must be dense—the dimension being the product of the number of records and vocabulary size. For inference, text/csv, application/json, and application/x-recordio-protobuf content types are supported. Sparse data can also be passed for application/json and application/x-recordio-protobuf. LDA inference returns application/json or application/x-recordio-protobuf predictions, which include the topic_mixture vector for each observation.
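
To make the dense CSV layout concrete, the sketch below turns tokenized documents into one row of word counts per document and builds the S3 folder names for the train and test channels. It uses only the standard library. The function names, the bucket name, and the prefix are assumptions for illustration, and the algorithm documentation remains the authority on details such as headers.

```python
import csv
import io


def build_vocabulary(documents):
    """Assign a column index to every distinct token, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in doc})
    return {tok: i for i, tok in enumerate(tokens)}


def to_dense_csv(documents, vocabulary):
    """Return CSV text: one row per document, one count per vocabulary term."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for doc in documents:
        row = [0] * len(vocabulary)  # dense: zeros are written explicitly
        for tok in doc:
            row[vocabulary[tok]] += 1
        writer.writerow(row)
    return buffer.getvalue()


def channel_uris(bucket, prefix, channels=("train", "test")):
    """Map each channel name to the S3 folder that would hold its data."""
    parts = [bucket] + [p for p in prefix.split("/") if p]
    base = "s3://" + "/".join(parts)
    return {name: "{}/{}/".format(base, name) for name in channels}


if __name__ == "__main__":
    docs = [["cat", "dog", "cat"], ["dog", "fish"]]
    vocab = build_vocabulary(docs)
    print(to_dense_csv(docs, vocab))
    # "example-bucket" and "lda" are hypothetical names.
    print(channel_uris("example-bucket", "lda"))
```

Every row has exactly as many columns as the vocabulary has terms, which is why the dense format grows with the product of the number of records and the vocabulary size.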

Please see the example notebooks in GitHub for more details on training and inference formats for the built-in algorithms.

Step 4: Invoking the Amazon SageMaker API

Now that we have installed the Amazon SageMaker library, created the right permissions in IAM, and stored the data in Amazon S3, it’s time to make the API calls.

Every Amazon SageMaker built-in algorithm has a registry path of its Docker image (which contains the training algorithm and metadata) that needs to be specified in the API parameters. Other key API parameters are the algorithm-specific hyperparameters and infrastructure resources, such as instance types, number of instances, and data volume.
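
For readers using Boto3 rather than the Amazon SageMaker SDK, these parameters map onto the request passed to the CreateTrainingJob API. The sketch below only assembles that request as a dictionary. The helper name and its default values are assumptions, and the image registry path is left as an argument because it depends on the algorithm and Region.

```python
def build_training_job_request(job_name, image_uri, role_arn, train_s3_uri,
                               output_s3_uri, hyperparameters,
                               instance_type="ml.c4.8xlarge", instance_count=2,
                               volume_size_gb=50, max_runtime_seconds=3600):
    """Assemble keyword arguments for the SageMaker CreateTrainingJob API."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,  # registry path of the algorithm image
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [
            {
                "ChannelName": "train",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": train_s3_uri,
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": volume_size_gb,
        },
        # The API expects hyperparameter values as strings.
        "HyperParameters": {k: str(v) for k, v in hyperparameters.items()},
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_seconds},
    }
```

With Boto3 installed and credentials configured, the result could be submitted with boto3.client("sagemaker").create_training_job(**request). For K-Means, the hyperparameters would include k and feature_dim.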

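# Same estimator as in Step 2 (SageMaker Python SDK version 1 parameter names).
kmeans = KMeans(role="arn:aws:iam::000000000000:role/service-role/AmazonSageMaker-ExecutionRole-00000000000000",
                train_instance_count=2,
                train_instance_type='ml.c4.8xlarge',
                output_path=output_location,
                k=10,
                data_location=data_location)
# record_set() expects a NumPy array of features; train_set[0] is assumed to hold them.
# fit() starts the training job and waits for it to finish.
kmeans.fit(kmeans.record_set(train_set[0]))
# deploy() creates a real-time endpoint and returns a predictor for it.
kmeans_predictor = kmeans.deploy(initial_instance_count=1,
                                 instance_type='ml.m4.xlarge')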

In this example, we used the “ml.c4.8xlarge” instance type for training. The instance type and number of instances determine the duration of the training. Most built-in algorithms have been engineered to take advantage of GPU computing for training. However, there are exceptions, because some algorithms inherently cannot utilize GPUs and can only run on CPU instances. Here is the full list of algorithms and the CPU/GPU recommendations to use for training.

Additionally, training using some algorithms can be scaled by providing more than one instance. In this case, the algorithm automatically distributes the load between the instances and parallelizes the learning. Once the training job is completed, the instances that were used for training are automatically terminated and the model artifacts are saved in Amazon S3. The model can then be deployed for batch prediction or real-time prediction.

For deployment, we have used the “ml.m4.xlarge” instance type. The instance type and instance count determine how many inference requests the endpoint can serve. Moreover, since the deployment runs continuously, its cost structure differs from the cost incurred during training. See the full list of available instance types that can be used for machine learning.
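
The difference in cost structure can be seen with simple arithmetic: training is billed for the duration of the job, while a hosted endpoint accrues charges for as long as it stays up. The sketch below uses hypothetical hourly rates chosen only to keep the numbers easy to follow; actual prices depend on instance type and Region.

```python
def training_cost(hourly_rate, instance_count, hours):
    """Cost of a training job that runs once and then releases its instances."""
    return hourly_rate * instance_count * hours


def hosting_cost(hourly_rate, instance_count, days):
    """Cost of an endpoint that runs continuously for the given number of days."""
    return hourly_rate * instance_count * 24 * days


if __name__ == "__main__":
    # Hypothetical rates, not actual AWS prices.
    print(training_cost(hourly_rate=2.0, instance_count=2, hours=0.5))
    print(hosting_cost(hourly_rate=0.25, instance_count=1, days=30))
```

Even with a lower hourly rate, an endpoint that is left running can cost more over a month than the training job that produced the model.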

Summary

We have explained how to train models with Amazon SageMaker built-in algorithms by making API calls from external applications. We described the four necessary steps to achieve this: installing the Amazon SageMaker SDK and Boto3, creating IAM roles, transforming the data into the right format and placing it in the right Amazon S3 buckets, and making the API calls to train the models.

For additional information about creating a training job through the Amazon SageMaker API, please see the developer guide.

SageMaker WorkFlow

We hope you try this and are able to leverage Amazon SageMaker built-in algorithms. We will follow up with other Amazon SageMaker integration ideas in future posts, such as bringing your own algorithm for training or hosting a model for inference that has been trained outside of Amazon SageMaker.

Let us know about your experience in the comments section.

APN Partner Success

For some inspiration, please see what some APN Partners are doing to integrate with Amazon SageMaker:

DataRobot, an APN Advanced Technology Partner with the AWS Machine Learning Competency, has developed sample notebooks to show how to use the DataRobot modeling engine on Amazon SageMaker to automatically build and evaluate custom machine learning models.
SigOpt, an APN Advanced Technology Partner with the AWS Machine Learning Competency, provides starter code to help integrate their hyperparameter tuning service into Amazon SageMaker. Customers can use SigOpt to optimize an MXNet model trained in Amazon SageMaker for accuracy.
PipelineAI, an APN Standard Technology Partner, has created a wrapper on top of the AWS Python libraries to push TensorFlow models into an Amazon Elastic Container Registry (Amazon ECR) repository for model hosting on Amazon SageMaker. See their PipelineAI – Amazon SageMaker integration guide.