AWS Machine Learning Blog

Introducing Amazon SageMaker Operators for Kubernetes

AWS is excited to introduce Amazon SageMaker Operators for Kubernetes, a new capability that makes it easier for developers and data scientists using Kubernetes to train, tune, and deploy machine learning (ML) models in Amazon SageMaker. Customers can install these Amazon SageMaker Operators on their Kubernetes cluster to create Amazon SageMaker jobs natively using the Kubernetes API and command-line Kubernetes tools such as ‘kubectl’. In addition to this blog post, we also published a whitepaper about how to do machine learning on Amazon SageMaker and Kubernetes.

Many AWS customers use Kubernetes, an open-source general-purpose container orchestration system, to deploy and manage containerized applications, often via a managed service such as Amazon Elastic Kubernetes Service (EKS). This enables data scientists and developers, for example, to set up repeatable ML pipelines and maintain greater control over their training and inference workloads. However, to support ML workloads these customers still need to write custom code to optimize the underlying ML infrastructure, ensure high availability and reliability, provide data science productivity tools, and comply with appropriate security and regulatory requirements. For example, when Kubernetes customers use GPUs for training and inference, they often need to change how Kubernetes schedules and scales GPU workloads in order to increase utilization, throughput, and availability. Similarly, for deploying trained models to production for inference, Kubernetes customers have to spend additional time in setting up and optimizing their auto-scaling clusters across multiple Availability Zones.

Amazon SageMaker Operators for Kubernetes bridges this gap, and customers are now spared all the heavy lifting of integrating their Amazon SageMaker and Kubernetes workflows. Starting today, customers using Kubernetes can make a simple call to Amazon SageMaker, a modular and fully-managed service that makes it easier to build, train, and deploy machine learning (ML) models at scale. With workflows in Amazon SageMaker, compute resources are pre-configured and optimized, only provisioned when requested, scaled as needed, and shut down automatically when jobs complete, offering near 100% utilization. Now with Amazon SageMaker Operators for Kubernetes, customers can continue to enjoy the portability and standardization benefits of Kubernetes and EKS, along with integrating the many additional benefits that come out-of-the-box with Amazon SageMaker, no custom code required.

Amazon SageMaker and Kubernetes

Machine learning is more than just the model. The ML workflow consists of sourcing and preparing data, building machine learning models, training and evaluating these models, deploying them to production, and ongoing post-production monitoring. Amazon SageMaker is a modular and fully managed service that helps data scientists and developers more quickly accomplish the tasks to build, train, deploy, and maintain models.

But the workflows related to building a model are often one part of a bigger pipeline that spans multiple engineering teams and services that support an overarching application. Kubernetes users, including Amazon EKS customers, deploy workloads by writing configuration files, which Kubernetes matches with available compute resources in the user’s Kubernetes cluster. While Kubernetes gives customers control and portability, running ML workloads on a Kubernetes cluster brings unique challenges. For example, the underlying infrastructure requires additional management such as optimizing for utilization, cost and performance; complying with appropriate security and regulatory requirements; and ensuring high availability and reliability. All of this undifferentiated heavy lifting takes away valuable time and resources from bringing new ML applications to market. Kubernetes customers want to control orchestration and pipelines without having to manage the underlying ML infrastructure and services in their cluster.

Amazon SageMaker Operators for Kubernetes addresses this need by bringing Amazon SageMaker and Kubernetes together. From Kubernetes, data scientists and developers get to use a fully managed service that is designed and optimized specifically for ML workflows. Infrastructure and platform teams retain control and portability by orchestrating workloads in Kubernetes, without having to manage the underlying ML infrastructure and services. To add new capabilities to Kubernetes, developers can extend the Kubernetes API by creating a custom resource that contains their application-specific or domain-specific logic and components. ‘Operators’ in Kubernetes allow users to natively invoke these custom resources and automate associated workflows. By installing SageMaker Operators for Kubernetes on your Kubernetes cluster, you can now add Amazon SageMaker as a ‘custom resource’ in Kubernetes. You can then use the following Amazon SageMaker Operators:

  • Train – Train ML models in Amazon SageMaker, including Managed Spot Training, to save up to 90% in training costs, and distributed training to reduce training time by scaling to multiple GPU nodes. You pay for the duration of your job, offering near 100% utilization.
  • Tune – Tune model hyperparameters in Amazon SageMaker, including with Amazon EC2 Spot Instances, to save up to 90% in cost. Amazon SageMaker Automatic Model Tuning performs hyperparameter optimization to search the hyperparameter range for more accurate models, saving you days or weeks of time spent improving model accuracy.
  • Inference – Deploy trained models in Amazon SageMaker to fully managed autoscaling clusters, spread across multiple Availability Zones, to deliver high performance and availability for real-time or batch prediction.

Each Amazon SageMaker Operator for Kubernetes provides you with a native Kubernetes experience for creating and interacting with your jobs, either with the Kubernetes API or with Kubernetes command-line utilities such as kubectl. Engineering teams can build automation, tooling, and custom interfaces for data scientists in Kubernetes by using these operators—all without building, maintaining, or optimizing ML infrastructure. Data scientists and developers familiar with Kubernetes can compose and interact with Amazon SageMaker training, tuning, and inference jobs natively, as you would with Kubernetes jobs executing locally. Logs from Amazon SageMaker jobs stream back to Kubernetes, allowing you to natively view logs for your model training, tuning, and prediction jobs in your command line.

Using Amazon SageMaker Operators for Kubernetes with TensorFlow

This post demonstrates training a simple convolutional neural network model on the Modified National Institute of Standards and Technology (MNIST) dataset using the Amazon SageMaker Training Operators for Kubernetes. The MNIST dataset contains images of handwritten digits from 0 to 9 and is a popular ML problem. The MNIST dataset contains 60,000 training images and 10,000 test images.

This post performs the following steps:

  • Install Amazon SageMaker Operators for Kubernetes on a Kubernetes cluster
  • Create a YAML config for training
  • Train the model using the Amazon SageMaker Operator

Prerequisites

For this post, you need an existing Kubernetes cluster in EKS. For information about creating a new cluster in Amazon EKS, see Getting Started with Amazon EKS. You also need the following on the machine you use to control the Kubernetes cluster (for example, your laptop or an EC2 instance):

  • Kubectl (Version >=1.13). Use a kubectl version that is within one minor version of your Kubernetes cluster’s control plane. For example, a 1.13 kubectl client works with Kubernetes 1.13 and 1.14 clusters. For more information, see Installing kubectl.
  • AWS CLI (Version >=1.16.232). For more information, see Installing the AWS CLI version 1.
  • AWS IAM Authenticator for Kubernetes. For more information, see Installing aws-iam-authenticator.
  • Either existing IAM access keys for the operator to use or IAM permissions to create users, attach policies to users, and create access keys.

Setting up IAM roles and permissions

Before you deploy the operator to your Kubernetes cluster, associate an IAM role with an OpenID Connect (OIDC) provider for authentication. See the following code:

# Set the Region and cluster
export CLUSTER_NAME="<your cluster name>"
export AWS_REGION="<your region>"
eksctl utils associate-iam-oidc-provider --cluster ${CLUSTER_NAME} \
	--region ${AWS_REGION} --approve

Your output should look like the following:

    [_]  eksctl version 0.10.1
    [_]  using region us-east-1
    [_]  IAM OpenID Connect provider is associated with cluster "my-cluster" in "us-east-1"

Now that your Kubernetes cluster in EKS has an OIDC identity provider, you can create a role and give it permissions. Obtain the OIDC issuer URL with the following command:

aws eks describe-cluster --name ${CLUSTER_NAME} --region ${AWS_REGION} \
	--query cluster.identity.oidc.issuer --output text

This command will return a URL like the following:

https://oidc.eks.${AWS_REGION}.amazonaws.com/id/{Your OIDC ID}

Use the OIDC ID returned by the previous command to create your role. Create a new file named ‘trust.json’ with the following code block. Be sure to update the placeholders with your OIDC ID, AWS Account Number, and EKS Cluster Region.

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Federated": "arn:aws:iam::<AWS account number>:oidc-provider/oidc.eks.<EKS Cluster region>.amazonaws.com/id/<OIDC ID>"
          },
          "Action": "sts:AssumeRoleWithWebIdentity",
          "Condition": {
            "StringEquals": {
              "oidc.eks.<EKS Cluster region>.amazonaws.com/id/<OIDC ID>:aud": "sts.amazonaws.com",
              "oidc.eks.<EKS Cluster region>.amazonaws.com/id/<OIDC ID>:sub": "system:serviceaccount:sagemaker-k8s-operator-system:sagemaker-k8s-operator-default"
            }
          }
        }
      ]
    }

Now create a new IAM role:

aws iam create-role --role-name <role name> --assume-role-policy-document file://trust.json --output=text

The output will return your ‘ROLE ARN’ that you pass to the operator for securely invoking Amazon SageMaker from the Kubernetes cluster.

    ROLE    arn:aws:iam::123456789012:role/my-role 2019-11-22T21:46:10Z    /       ABCDEFSFODNN7EXAMPLE   my-role
    ASSUMEROLEPOLICYDOCUMENT        2012-10-17
    STATEMENT       sts:AssumeRoleWithWebIdentity   Allow
    STRINGEQUALS    sts.amazonaws.com       system:serviceaccount:sagemaker-k8s-operator-system:sagemaker-k8s-operator-default
    PRINCIPAL       arn:aws:iam::123456789012:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/

Finally, give this new role access to Amazon SageMaker and attach the AmazonSageMakerFullAccess policy.

aws iam attach-role-policy --role-name <role name>  --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess

Setting up the operator on your Kubernetes cluster

Use the Amazon SageMaker Operators from the GitHub repo by downloading a YAML configuration file that installs the operator for you.

wget https://raw.githubusercontent.com/aws/amazon-sagemaker-operator-for-k8s/master/release/rolebased/installer.yaml

In the installer.yaml file, update the eks.amazonaws.com/role-arn with the ARN from your OIDC-based role from the previous step.

Now on your Kubernetes cluster, install the Amazon SageMaker CRD and set up your operators.

kubectl -f apply installer.yaml

Verify that Amazon SageMaker Operators are available in your Kubernetes cluster. See the following code:

$ kubectl get crd | grep sagemaker
batchtransformjobs.sagemaker.aws.amazon.com         2019-11-20T17:12:34Z
endpointconfigs.sagemaker.aws.amazon.com            2019-11-20T17:12:34Z
hostingdeployments.sagemaker.aws.amazon.com         2019-11-20T17:12:34Z
hyperparametertuningjobs.sagemaker.aws.amazon.com   2019-11-20T17:12:34Z
models.sagemaker.aws.amazon.com                     2019-11-20T17:12:34Z
trainingjobs.sagemaker.aws.amazon.com               2019-11-20T17:12:34Z

With these operators, all of Amazon SageMaker’s managed and secured ML infrastructure and software optimization at scale is now available as a custom resource in your Kubernetes cluster.

To view logs from Amazon SageMaker in our command line using kubetl, we will install the following client:

export os="linux"

wget https://amazon-sagemaker-operator-for-k8s-us-east-1.s3.amazonaws.com/kubectl-smlogs-plugin/latest/${os}.amd64.tar.gz
tar xvzf ${os}.amd64.tar.gz

# Move binaries to a directory in your homedir.
mkdir ~/sagemaker-k8s-bin
cp ./kubectl-smlogs.${os}.amd64/kubectl-smlogs ~/sagemaker-k8s-bin/.

# This line will add the binaries to your PATH in your .bashrc. 

echo 'export PATH=$PATH:~/sagemaker-k8s-bin' >> ~/.bashrc

# Source your .bashrc to update environment variables:
source ~/.bashrc

Preparing your training job

Before you create a YAML config for your Amazon SageMaker training job, create a container that includes your Python training code, which is available in the tensorflow_distributed_mnist GitHub repo. Use TensorFlow GPU images provided by AWS Deep Learning Containers to create your Dockerfile. See the following code:

# Start with AWS Deep Learning Container Image 
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:1.13-gpu-py3
 
## Add Training Script
COPY train.py /opt/ml/code/train.py
 
ENV SAGEMAKER_PROGRAM train.py
ENV SM_MODEL_DIR=/opt/ml/model

For this post, you uploaded the MNIST training dataset to an S3 bucket. Create a train.yaml YAML configuration file to start training. Specify TrainingJob as a custom resource to train your model on Amazon SageMaker, which is now a custom resource on your Kubernetes cluster.

apiVersion: sagemaker.aws.amazon.com/v1
kind: TrainingJob
metadata:
  name: tf-mnist
spec:
    algorithmSpecification:
        trainingImage: 578276202366.dkr.ecr.us-west-2.amazonaws.com/mnist-demo:latest
        trainingInputMode: File
    roleArn: {YOUR ROLE:ARN}
    region: us-west-2
    outputDataConfig:
        s3OutputPath: s3://{YOUR OUTPUT PATH}
    resourceConfig:
        instanceCount: 1
        instanceType: ml.p2.8xlarge
        volumeSizeInGB: 30
    stoppingCondition:
        maxRuntimeInSeconds: 86400
    inputDataConfig:
        - channelName: train
          dataSource:
            s3DataSource:
                s3DataType: S3Prefix
                s3Uri: s3://sagemaker-us-west-2-578276202366/data/DEMO-mnist
                s3DataDistributionType: FullyReplicated
          contentType: text/csv
          compressionType: None

Training the model

You can now start your training job by entering the following:

$ kubectl apply -f training.yaml
trainingjob.sagemaker.aws.amazon.com/tf-mnist created

Amazon SageMaker Operator creates a training job in Amazon SageMaker using the specifications you provided in train.yaml. You can interact with this training job as you normally would in Kubernetes. See the following code:

$ kubectl describe trainingjob tf-mnist
$ kubectl get trainingjob tf-mnist

$ kubectl smlogs trainingjob tf-mnist
tf-mnist-fa964bf80e4b11ea89d40e3ef2d156bc/algo-1-1574553166 1574553266153 2019-11-23 23:54:25,480 sagemaker-containers INFO     Reporting training SUCCESS

After your training job is complete, any compute instances that were provisioned in Amazon SageMaker for this training job are terminated.

For additional examples, see the GitHub repo.

New Amazon SageMaker capabilities are now generally available

Amazon SageMaker Operators for Kubernetes are generally available as of this writing in US East (Ohio), US East (N. Virginia), US West (Oregon), and EU (Ireland) AWS Regions. For more information and step-by-step tutorials, see our user guide.

As always, please share your experience and feedback, or submit additional example YAML specs or Operator improvements. Let us know how you’re using Amazon SageMaker Operators for Kubernetes by posting on the AWS forum for Amazon SageMaker, creating issues in the GitHub repo, or sending it through your usual AWS contacts.

 


About the Author

Aditya Bindal is a Senior Product Manager for AWS Deep Learning. He works on products that make it easier for customers to use deep learning engines. In his spare time, he enjoys playing tennis, reading historical fiction, and traveling.