Introducing Amazon SageMaker Operators for Kubernetes

AWS is excited to introduce Amazon SageMaker Operators for Kubernetes in general availability. This new feature makes it easier for developers and data scientists that use Kubernetes to train, tune, and deploy machine learning (ML) models in Amazon SageMaker. You can install these operators on your Kubernetes cluster to create Amazon SageMaker jobs natively using the Kubernetes API and command line Kubernetes tools, such as kubectl. For more information, see Whitepaper – Machine Learning on Amazon SageMaker and Kubernetes.

Many AWS customers use Kubernetes, an open-source, general-purpose container orchestration system, to deploy and manage containerized applications. Amazon EKS provides a managed service to deploy Kubernetes. Data scientists and developers can set up repeatable ML pipelines on Kubernetes and maintain greater control over training and inference workloads. However, to support ML workloads, you still need to write custom code to optimize the underlying ML infrastructure, provide high availability and reliability, provide data science productivity tools, and comply with appropriate security and regulatory requirements. For example, if you are a Kubernetes customer using GPUs for training and inference, you often need to change how Kubernetes schedules and scales GPU workloads to increase utilization, throughput, and availability. Similarly, to deploy trained models to production for inference, you have to spend additional time setting up and optimizing your autoscaling clusters across multiple Availability Zones.

Amazon SageMaker Operators for Kubernetes bridges this gap and spares you the heavy lifting of integrating your Amazon SageMaker and Kubernetes workflows. As of this writing, you can make a simple call from kubectl to Amazon SageMaker, a modular and fully managed service that makes it easier to build, train, and deploy ML models at scale. With workflows in Amazon SageMaker, compute resources are pre-configured and optimized, only provisioned when requested, scaled as needed, and shut down automatically when jobs complete, offering near 100% utilization. With Amazon SageMaker Operators for Kubernetes, you can continue to enjoy the portability and standardization benefits of Kubernetes and EKS while integrating the many additional benefits that come out of the box with Amazon SageMaker, with no custom code required.

Amazon SageMaker and Kubernetes

Machine learning is more than just the model. The ML workflow consists of sourcing and preparing data, building ML models, training and evaluating these models, deploying them to production, and ongoing post-production monitoring. Amazon SageMaker helps you build, train, deploy, and maintain models more quickly.

However, the workflows related to building a model are often one part of a bigger pipeline that spans multiple engineering teams and services that support an overarching application. Kubernetes users, including EKS customers, deploy workloads by writing configuration files, which Kubernetes matches with available compute resources in your Kubernetes cluster. While Kubernetes gives you control and portability, running ML workloads on a Kubernetes cluster brings unique challenges. For example, the underlying infrastructure requires additional management, such as optimizing for utilization, cost, and performance; complying with appropriate security and regulatory requirements; and ensuring high availability and reliability. This undifferentiated heavy lifting takes away valuable time and resources from bringing new ML applications to market. You want to control orchestration and pipelines without having to manage the underlying ML infrastructure and services in your cluster.

Amazon SageMaker Operators for Kubernetes addresses this need by bringing Amazon SageMaker and Kubernetes together. From Kubernetes, you get a fully managed service that is designed and optimized specifically for ML workflows. Infrastructure and platform teams retain control and portability by orchestrating workloads in Kubernetes, without having to manage the underlying ML infrastructure and services. To add new capabilities to Kubernetes, you can extend the Kubernetes API by creating a custom resource that contains your application-specific or domain-specific logic and components. Operators in Kubernetes allow you to invoke these custom resources and automate associated workflows natively. You can add Amazon SageMaker as a custom resource in Kubernetes by installing SageMaker Operators for Kubernetes on your Kubernetes cluster. You can then use the following Amazon SageMaker operators:

Train – Train ML models in Amazon SageMaker, including Managed Spot Training, to save up to 90% in training costs, and distributed training to reduce training time by scaling to multiple GPU nodes. You pay for the duration of your job, offering near 100% utilization.
Tune – Tune model hyperparameters in Amazon SageMaker, including with Amazon EC2 Spot Instances, to save up to 90% in cost. Amazon SageMaker Automatic Model Tuning performs hyperparameter optimization to search the hyperparameter range for more accurate models, saving you days or weeks improving model accuracy.
Real-time inference – Deploy trained models in Amazon SageMaker to fully managed autoscaling clusters, spread across multiple Availability Zones, to deliver high performance and availability in real time.
Batch transform – Create a managed execution of a model on large datasets. You can use this for either preprocessing in preparation for training or inference of an existing trained model within Amazon SageMaker.

Amazon SageMaker Operators for Kubernetes provides you with a native Kubernetes experience for creating and interacting with your jobs, either with the Kubernetes API or with Kubernetes command line utilities such as kubectl. Engineering teams can use these operators to build automation, tooling, and custom interfaces for data scientists in Kubernetes, all without building, maintaining, or optimizing ML infrastructure. Data scientists and developers familiar with Kubernetes can compose and interact with Amazon SageMaker training, tuning, and inference jobs natively, as they would with Kubernetes jobs executing locally. Logs from Amazon SageMaker jobs stream back to Kubernetes, allowing you to natively view logs for your model training, tuning, and prediction jobs in your command line.

Using Amazon SageMaker Operators for Kubernetes with XGBoost

This post demonstrates training a gradient-boosting model on the Modified National Institute of Standards and Technology (MNIST) dataset using the training operator. The MNIST dataset contains images of handwritten digits from 0 to 9, split into 60,000 training images and 10,000 test images. Classifying these digits is a popular ML problem.

This post performs the following steps:

Install Amazon SageMaker Operators for Kubernetes on an EKS cluster
Create a YAML config for this training job
Train the model in Amazon SageMaker using the Amazon SageMaker operator

Prerequisites

For this post, you need an existing Kubernetes cluster in EKS. For information about creating a new cluster in EKS, see Getting Started with Amazon EKS. It is recommended to use a Fargate-only cluster. For this tutorial, you will need a cluster with Kubernetes control plane version 1.13 or higher. You also need the following on a machine you can use to control the Kubernetes cluster (for example, an EC2 instance).

Kubectl (Version 1.13 or higher) – Use a kubectl version that is within one minor version of your Kubernetes cluster’s control plane. For example, a 1.13 kubectl client works with Kubernetes 1.12, 1.13, and 1.14 clusters. For more information, see Installing kubectl.
AWS CLI (Version 1.16.232 or higher) – Your credentials should be configured as part of the setup. For more information, see Installing the AWS CLI version 1.
AWS IAM Authenticator for Kubernetes – For more information, see Installing aws-iam-authenticator.
Access keys or permissions – Either existing IAM access keys for the operator to use or IAM permissions to create users, attach policies to users, and create access keys.
eksctl – The OIDC provider and Fargate profile steps in this post use eksctl commands.

As of this writing, smlogs only supports Linux, so you should run these steps on a Linux machine, such as an EC2 instance running Ubuntu.
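
The one-minor-version rule for kubectl can be checked in a script before you start. The following is a minimal sketch rather than part of the original setup: the function name within_one_minor is hypothetical, and it compares only the major and minor components of two version strings that you supply yourself (for example, the client and server versions that kubectl version reports).

```shell
# Hypothetical helper: succeeds when two "major.minor[.patch]" version strings
# share a major version and differ by at most one minor version.
within_one_minor() {
  local client="${1#v}" server="${2#v}"          # drop an optional leading "v"
  local c_major="${client%%.*}" s_major="${server%%.*}"
  local c_rest="${client#*.}" s_rest="${server#*.}"
  local c_minor="${c_rest%%.*}" s_minor="${s_rest%%.*}"
  # Strip any non-digit suffix from the minor component (for example "13+").
  c_minor="${c_minor%%[!0-9]*}"
  s_minor="${s_minor%%[!0-9]*}"
  [ "$c_major" = "$s_major" ] || return 1
  local diff=$(( c_minor - s_minor ))
  [ "${diff#-}" -le 1 ]                          # absolute difference <= 1
}
```

For example, within_one_minor 1.13.0 v1.14.8 succeeds, while within_one_minor 1.13.0 1.15.0 returns a non-zero status.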

Setting up IAM roles and permissions

For the operator to access your SageMaker resources, you first need to configure a Kubernetes service account with an OIDC authenticated role that has the proper permissions. For more information, see Enabling IAM Roles for Service Accounts on your Cluster.

Complete the following steps:

Associate an IAM OpenID Connect (OIDC) provider with your EKS cluster for authentication with AWS resources. See the following code:

# Set the AWS region and EKS cluster name
export CLUSTER_NAME="<your cluster name>"
export AWS_REGION="<your region>"
eksctl utils associate-iam-oidc-provider --cluster "${CLUSTER_NAME}" \
    --region "${AWS_REGION}" --approve

Your output should look like the following:

[ℹ]  using region us-east-1
[ℹ]  will create IAM Open ID Connect provider for cluster "my-cluster" in "us-east-1"
[✔]  created IAM Open ID Connect provider for cluster "my-cluster" in "us-east-1"

Now that your Kubernetes cluster in EKS has an OIDC identity provider, you can create a role and give it permissions.

Obtain the OIDC issuer URL with the following code:

aws eks describe-cluster --name ${CLUSTER_NAME} --region ${AWS_REGION} \
    --query cluster.identity.oidc.issuer --output text

This command returns a URL like the following:

https://oidc.eks.${AWS_REGION}.amazonaws.com/id/{Your OIDC ID}

If the output is None, make sure your AWS CLI version meets the minimum listed in Prerequisites.

You use the OIDC ID returned by the previous command to create your role.
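
The OIDC ID is the part of the issuer URL after /id/. As a minimal sketch, assuming the URL has the shape shown above, a small function can extract it and reject unexpected values such as None; the function name oidc_id_from_issuer is hypothetical.

```shell
# Hypothetical helper: print the OIDC ID, that is, everything after the
# final "/id/" in the issuer URL. Fails if the URL has no such part.
oidc_id_from_issuer() {
  local issuer="$1"
  case "$issuer" in
    */id/?*) printf '%s\n' "${issuer##*/id/}" ;;
    *) echo "unexpected issuer URL: $issuer" >&2; return 1 ;;
  esac
}

# Example wiring (needs the AWS CLI and the variables exported earlier):
# ISSUER="$(aws eks describe-cluster --name "${CLUSTER_NAME}" --region "${AWS_REGION}" \
#     --query cluster.identity.oidc.issuer --output text)"
# OIDC_ID="$(oidc_id_from_issuer "$ISSUER")"
```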

Create a new file named trust.json with the following code:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::<AWS account number>:oidc-provider/oidc.eks.<EKS cluster Region>.amazonaws.com/id/<OIDC ID>"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "oidc.eks.<EKS cluster Region>.amazonaws.com/id/<OIDC ID>:aud": "sts.amazonaws.com",
        "oidc.eks.<EKS cluster Region>.amazonaws.com/id/<OIDC ID>:sub": "system:serviceaccount:sagemaker-k8s-operator-system:sagemaker-k8s-operator-default"
      }
    }
  }]
}

Update the placeholders with your OIDC ID, AWS account number, and EKS cluster Region.
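
Editing three placeholders in several places by hand is error-prone, so the file could instead be generated from variables. The following is a sketch under that assumption: the function name render_trust_policy and its argument order are hypothetical, and the policy body mirrors the trust.json shown above.

```shell
# Hypothetical helper: print a trust policy for the operator's service account.
# Arguments: <AWS account number> <EKS cluster Region> <OIDC ID>
render_trust_policy() {
  local account="$1" region="$2" oidc_id="$3"
  local provider="oidc.eks.${region}.amazonaws.com/id/${oidc_id}"
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::${account}:oidc-provider/${provider}"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "${provider}:aud": "sts.amazonaws.com",
        "${provider}:sub": "system:serviceaccount:sagemaker-k8s-operator-system:sagemaker-k8s-operator-default"
      }
    }
  }]
}
EOF
}

# Example: render_trust_policy 123456789012 us-east-1 "<OIDC ID>" > trust.json
```

If you do not have the account number at hand, aws sts get-caller-identity --query Account --output text prints it for the credentials in use.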

Create a new IAM role that can be assumed by the cluster service accounts. See the following code:
```
aws iam create-role --role-name <role name> --assume-role-policy-document file://trust.json --output=text
```
The output contains your role ARN, which you give to the operator so that it can securely invoke Amazon SageMaker from the Kubernetes cluster. The output should look like the following:

ROLE    arn:aws:iam::123456789012:role/my-role 2019-11-22T21:46:10Z    /       ABCDEFSFODNN7EXAMPLE   my-role
ASSUMEROLEPOLICYDOCUMENT        2012-10-17
STATEMENT       sts:AssumeRoleWithWebIdentity   Allow
STRINGEQUALS    sts.amazonaws.com       system:serviceaccount:sagemaker-k8s-operator-system:sagemaker-k8s-operator-default
PRINCIPAL       arn:aws:iam::123456789012:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/
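
To reuse the ARN in later steps without copying it by hand, you can pull it out of this text output. The sketch below assumes the layout shown above, where the ARN is the second field of the line that starts with ROLE; the function name role_arn_from_text is hypothetical.

```shell
# Hypothetical helper: read the --output=text result of `aws iam create-role`
# on stdin and print the second field of the ROLE line (the role ARN).
role_arn_from_text() {
  awk '$1 == "ROLE" { print $2; exit }'
}

# Example: ROLE_ARN="$(role_arn_from_text < create-role-output.txt)"
```

If you have not created the role yet, adding --query Role.Arn --output text to the create-role command prints only the ARN and avoids the parsing step.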

Give this new role access to Amazon SageMaker by attaching the AmazonSageMakerFullAccess policy. See the following code:

aws iam attach-role-policy --role-name <role name> --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess

Setting up the operator on your Kubernetes cluster

To set up the operator on your Kubernetes cluster, complete the following steps:

If using Fargate (recommended), configure a namespace for the operator with this bash code:

eksctl create fargateprofile --cluster ${CLUSTER_NAME} --namespace sagemaker-k8s-operator-system

Install Amazon SageMaker Operators for Kubernetes from the GitHub repo by downloading a YAML configuration file that configures your Kubernetes cluster with the custom resource definitions and operator controller service. See the following code:
```
wget https://raw.githubusercontent.com/aws/amazon-sagemaker-operator-for-k8s/master/release/rolebased/installer.yaml
```
In the installer.yaml file, update the eks.amazonaws.com/role-arn with the ARN from your OIDC-based role from the previous step.
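
This edit can also be scripted. The following is a minimal sketch that assumes the annotation appears on a single line of the form eks.amazonaws.com/role-arn: followed by a value; check the downloaded file before relying on it. The function name set_operator_role_arn is hypothetical.

```shell
# Hypothetical helper: replace the value of every eks.amazonaws.com/role-arn
# line in the given file with the given role ARN. A backup copy of the
# original is kept next to it with a .bak suffix.
set_operator_role_arn() {
  local file="$1" role_arn="$2"
  sed -i.bak -E "s|(eks\.amazonaws\.com/role-arn:).*|\1 ${role_arn}|" "$file"
}

# Example: set_operator_role_arn installer.yaml arn:aws:iam::123456789012:role/my-role
```
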
On your Kubernetes cluster, install the Amazon SageMaker CRDs and set up your operators. See the following code:
```
kubectl apply -f installer.yaml
```

Verify that Amazon SageMaker operators are available in your Kubernetes cluster. See the following code:

$ kubectl get crd | grep sagemaker
batchtransformjobs.sagemaker.aws.amazon.com         2020-01-01T12:34:56Z
endpointconfigs.sagemaker.aws.amazon.com            2020-01-01T12:34:56Z
hostingdeployments.sagemaker.aws.amazon.com         2020-01-01T12:34:56Z
hyperparametertuningjobs.sagemaker.aws.amazon.com   2020-01-01T12:34:56Z
models.sagemaker.aws.amazon.com                     2020-01-01T12:34:56Z
trainingjobs.sagemaker.aws.amazon.com               2020-01-01T12:34:56Z

With these operators, all of Amazon SageMaker’s managed and secured ML infrastructure and software optimization at scale is now available as a custom resource in your Kubernetes cluster.
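
In automation, you may prefer a check that fails loudly when a custom resource definition is missing rather than reading the list by eye. The following is a sketch under that assumption; the function name missing_sagemaker_crds is hypothetical, and the expected names are the six shown in the output above.

```shell
# Hypothetical helper: read `kubectl get crd` output on stdin and report any
# of the six SageMaker CRDs listed above that is absent.
# Returns non-zero if at least one is missing.
missing_sagemaker_crds() {
  local found crd missing=0
  found="$(awk '{ print $1 }')"     # first column holds the CRD names
  for crd in batchtransformjobs endpointconfigs hostingdeployments \
             hyperparametertuningjobs models trainingjobs; do
    if ! printf '%s\n' "$found" | grep -qxF "${crd}.sagemaker.aws.amazon.com"; then
      echo "missing: ${crd}.sagemaker.aws.amazon.com"
      missing=1
    fi
  done
  return "$missing"
}

# Example: kubectl get crd | missing_sagemaker_crds && echo "all SageMaker CRDs installed"
```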

To view logs from Amazon SageMaker in your command line using kubectl, install the following client:

export os="linux"

wget https://amazon-sagemaker-operator-for-k8s-us-east-1.s3.amazonaws.com/kubectl-smlogs-plugin/latest/${os}.amd64.tar.gz
tar xvzf ${os}.amd64.tar.gz

# Move binaries to a directory in your homedir.
mkdir ~/sagemaker-k8s-bin
cp ./kubectl-smlogs.${os}.amd64/kubectl-smlogs ~/sagemaker-k8s-bin/.

# This line will add the binaries to your PATH in your .bashrc.
echo 'export PATH=$PATH:~/sagemaker-k8s-bin' >> ~/.bashrc

# Source your .bashrc to update environment variables:
source ~/.bashrc

Generating your training data

After you install the operator, you can begin training. This post uses a SageMaker prebuilt container to train an XGBoost model on the MNIST dataset. The SageMaker Operators for Kubernetes GitHub repo provides a script that uploads the MNIST dataset to an S3 bucket in the format that the XGBoost prebuilt container expects.

To generate your training data, complete the following steps:

Create an S3 bucket. This post uses the us-east-1 Region.

Download and run the upload_xgboost_mnist_dataset script. See the following code:

wget https://raw.githubusercontent.com/aws/amazon-sagemaker-operator-for-k8s/master/scripts/upload_xgboost_mnist_dataset/upload_xgboost_mnist_dataset
chmod +x upload_xgboost_mnist_dataset
./upload_xgboost_mnist_dataset --s3-bucket BUCKET_NAME --s3-prefix xgboost-mnist

Make sure to replace BUCKET_NAME with the name of the S3 bucket you created. This script requires Python 3 with the boto3 and numpy packages installed; argparse, which it also uses, ships with the Python 3 standard library.

Verify that the data was successfully uploaded. The command output should look like the following:

./upload_xgboost_mnist_dataset --s3-bucket BUCKET_NAME --s3-prefix xgboost-mnist
Downloading dataset from http://deeplearning.net/data/mnist/mnist.pkl.gz
train: (50000, 784) (50000,)
Uploading 981250000 bytes to s3://BUCKET_NAME/xgboost-mnist/train/examples
validation: (10000, 784) (10000,)
Uploading 196250000 bytes to s3://BUCKET_NAME/xgboost-mnist/validation/examples
test: (10000, 784) (10000,)
Uploading 196000000 bytes to s3://BUCKET_NAME/xgboost-mnist/test/examples

The data is now uploaded to your S3 bucket.
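
The byte counts in the output above can be sanity-checked with shell arithmetic. This is only an observation about the numbers shown, not something the script documents: the totals are consistent with each row holding 784 pixel values plus one label for the train and validation sets, pixel values only for the test set, and 25 bytes of text per value. The 25-byte figure is an assumption inferred from the totals, and the function name expected_bytes is hypothetical.

```shell
# Hypothetical helper: expected size in bytes of a text file with the given
# number of rows and columns, assuming a fixed width per value
# (25 bytes by default, an inferred assumption).
expected_bytes() {
  local rows="$1" cols="$2" bytes_per_value="${3:-25}"
  echo $(( rows * cols * bytes_per_value ))
}

expected_bytes 50000 785   # train: 784 pixels + 1 label per row
expected_bytes 10000 785   # validation: 784 pixels + 1 label per row
expected_bytes 10000 784   # test: pixels only
```

The three calls print 981250000, 196250000, and 196000000, matching the sizes the script reports.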

Creating an IAM Role for SageMaker

SageMaker assumes an execution role when training. This role gives SageMaker permission to read from and write to S3, manage EC2 instances, and so on. It should be different from the role attached to the OIDC provider. If you do not have a SageMaker execution role, create one with the following bash commands:

export assume_role_policy_document='{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Service": "sagemaker.amazonaws.com"
    },
    "Action": "sts:AssumeRole"
  }]
}'
aws iam create-role --role-name <execution role name> --assume-role-policy-document file://<(echo "$assume_role_policy_document")
aws iam attach-role-policy --role-name <execution role name> --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess

These commands create an IAM role that SageMaker can assume and give the role access to resources that SageMaker usually needs, like S3 and EC2. Save the created role ARN for when you prepare the training job.

Preparing your training job

Create a train.yaml YAML configuration file to start training. Specify TrainingJob as the kind to train your model on Amazon SageMaker, which is now a custom resource in your Kubernetes cluster.

Replace the following placeholders with their values:

Replace BUCKET_NAME with the name of the S3 bucket you created
Replace SAGEMAKER_EXECUTION_ROLE_ARN with the ARN of the execution role created in the previous step

See the following code:

apiVersion: sagemaker.aws.amazon.com/v1
kind: TrainingJob
metadata:
  name: xgboost-mnist
spec:
  roleArn: SAGEMAKER_EXECUTION_ROLE_ARN
  region: us-east-1
  algorithmSpecification:
    trainingImage: 811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest
    trainingInputMode: File
  outputDataConfig:
    s3OutputPath: s3://BUCKET_NAME/xgboost-mnist/models/
  inputDataConfig:
    - channelName: train
      dataSource:
        s3DataSource:
          s3DataType: S3Prefix
          s3Uri: s3://BUCKET_NAME/xgboost-mnist/train/
          s3DataDistributionType: FullyReplicated
      contentType: text/csv
      compressionType: None
    - channelName: validation
      dataSource:
        s3DataSource:
          s3DataType: S3Prefix
          s3Uri: s3://BUCKET_NAME/xgboost-mnist/validation/
          s3DataDistributionType: FullyReplicated
      contentType: text/csv
      compressionType: None
  resourceConfig:
    instanceCount: 1
    instanceType: ml.m4.xlarge
    volumeSizeInGB: 5
  hyperParameters:
    - name: max_depth
      value: "5"
    - name: eta
      value: "0.2"
    - name: gamma
      value: "4"
    - name: min_child_weight
      value: "6"
    - name: silent
      value: "0"
    - name: objective
      value: multi:softmax
    - name: num_class
      value: "10"
    - name: num_round
      value: "10"
  stoppingCondition:
    maxRuntimeInSeconds: 86400

The S3 data and the ECR repo must be in the same Region. If your data is in a Region other than us-east-1, update the training image location with the image URI for your Region. For more information, see Common Parameters for Built-In Algorithms.
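
If you keep the YAML above as a template with its placeholders intact, the substitution can be scripted instead of edited by hand. The following is a minimal sketch under that assumption; the function name render_training_job and the template file name in the example are hypothetical.

```shell
# Hypothetical helper: fill the BUCKET_NAME and SAGEMAKER_EXECUTION_ROLE_ARN
# placeholders in a template file and print the result to stdout.
# Arguments: <template file> <bucket name> <execution role ARN>
render_training_job() {
  local template="$1" bucket="$2" role_arn="$3"
  sed -e "s|BUCKET_NAME|${bucket}|g" \
      -e "s|SAGEMAKER_EXECUTION_ROLE_ARN|${role_arn}|g" "$template"
}

# Example:
# render_training_job train.template.yaml my-bucket \
#     arn:aws:iam::123456789012:role/my-execution-role > train.yaml
```

Writing to stdout leaves the template untouched, so the same file can be reused for other buckets or roles.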

Training the model

You can now start your training job by entering the following code:

$ kubectl apply -f train.yaml
trainingjob.sagemaker.aws.amazon.com/xgboost-mnist created

The operator creates a training job in Amazon SageMaker that uses the specifications you provided in train.yaml. You can interact with this training job as you normally would in Kubernetes. See the following code:

$ kubectl describe trainingjob xgboost-mnist
$ kubectl get trainingjob xgboost-mnist

After your training job has started, and the status shows as InProgress, you can use the smlogs plugin to read the Amazon CloudWatch logs for the job. See the following code:

$ kubectl smlogs trainingjob xgboost-mnist
xgboost-mnist-f52d88dd423411eaa1270a350733ba06/algo-1-1580260714 2020-01-28 17:19:43.244 -0800 PST [2020-01-29:01:19:40:INFO] Running standalone xgboost training.

Alternatively, you can see the job progress and information within the SageMaker console.

After your training job is complete, any compute instances that you provisioned in Amazon SageMaker for this training job are terminated.
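
In a script, you might want to block until the job reaches a terminal status instead of checking it by hand. The following is a sketch, not part of the operator: the function name wait_for_training_job is hypothetical, and it assumes the TrainingJob resource exposes the SageMaker job status at .status.trainingJobStatus. Confirm the field name with kubectl describe trainingjob before relying on it.

```shell
# Hypothetical helper: poll a TrainingJob until it completes, fails, or stops.
# Assumes the status is exposed at .status.trainingJobStatus (an assumption).
# Arguments: <job name> [poll interval in seconds, default 30]
wait_for_training_job() {
  local name="$1" interval="${2:-30}" status
  while true; do
    status="$(kubectl get trainingjob "$name" \
      -o jsonpath='{.status.trainingJobStatus}')"
    case "$status" in
      Completed) echo "$name: Completed"; return 0 ;;
      Failed|Stopped) echo "$name: $status" >&2; return 1 ;;
      *) sleep "$interval" ;;   # InProgress, Stopping, or not yet reported
    esac
  done
}

# Example: wait_for_training_job xgboost-mnist 60 && echo "model artifacts are in S3"
```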

For additional examples, see the GitHub repo.

Conclusion

Amazon SageMaker Operators for Kubernetes is generally available as of this writing in US East (Ohio), US East (N. Virginia), US West (Oregon), and EU (Ireland) AWS Regions. For more information and step-by-step tutorials, see Amazon SageMaker Operators for Kubernetes.

As always, please share your experience and feedback, or submit additional example YAML specs or operator improvements. Let us know how you’re using Amazon SageMaker Operators for Kubernetes by posting on the AWS forum for Amazon SageMaker, creating issues in the GitHub repo, or sending feedback through your usual AWS contacts.

About the Authors

Cade Daniel is a Software Development Engineer with AWS Deep Learning. He develops products that make training and serving DL/ML models more efficient and easy for customers. Outside of work, he enjoys practicing his Spanish and learning new hobbies.

Nicholas Thomson is a Software Development Engineer with AWS Deep Learning. He helps build the open-source deep learning infrastructure projects that power Amazon AI. In his free time, he enjoys playing pool or building proof of concept websites.

Aditya Bindal is a Senior Product Manager for AWS Deep Learning. He works on products that make it easier for customers to use deep learning engines. In his spare time, he enjoys playing tennis, reading historical fiction, and traveling.

Alex Chung is a Senior Product Manager with AWS in enterprise machine learning systems. His role is to make AWS MLOps products more accessible for Kubernetes machine learning custom environments. He’s passionate about accelerating ML adoption for a large body of users to solve global economic and societal problems. Outside machine learning, he is also a board member at Cocatalyst.org, a Silicon Valley nonprofit for donating stock to charity that optimizes donor tax benefits, similar to donor advised funds.