Introducing Amazon SageMaker Operators for Kubernetes
AWS is excited to introduce Amazon SageMaker Operators for Kubernetes in general availability. This new feature makes it easier for developers and data scientists who use Kubernetes to train, tune, and deploy machine learning (ML) models in Amazon SageMaker. You can install these operators on your Kubernetes cluster to create Amazon SageMaker jobs natively using the Kubernetes API and command line Kubernetes tools, such as kubectl. For more information, see the whitepaper Machine Learning on Amazon SageMaker and Kubernetes.
Many AWS customers use Kubernetes, an open-source, general-purpose container orchestration system, to deploy and manage containerized applications. Amazon EKS provides a managed service to deploy Kubernetes. Data scientists and developers can set up repeatable ML pipelines on Kubernetes and maintain greater control over training and inference workloads. However, to support ML workloads, you still need to write custom code to optimize the underlying ML infrastructure, provide high availability and reliability, provide data science productivity tools, and comply with appropriate security and regulatory requirements. For example, if you are a Kubernetes customer using GPUs for training and inference, you often need to change how Kubernetes schedules and scales GPU workloads to increase utilization, throughput, and availability. Similarly, for deploying trained models to production for inference, you have to spend additional time setting up and optimizing your autoscaling clusters across multiple Availability Zones.
Amazon SageMaker Operators for Kubernetes bridges this gap and spares you the heavy lifting of integrating your Amazon SageMaker and Kubernetes workflows. As of this writing, you can make a simple call from kubectl to Amazon SageMaker, a modular and fully managed service that makes it easier to build, train, and deploy ML models at scale. With workflows in Amazon SageMaker, compute resources are pre-configured and optimized, provisioned only when requested, scaled as needed, and shut down automatically when jobs complete, offering near 100% utilization. With Amazon SageMaker Operators for Kubernetes, you can continue to enjoy the portability and standardization benefits of Kubernetes and EKS, along with the many additional benefits that come out of the box with Amazon SageMaker, with no custom code required.
Amazon SageMaker and Kubernetes
Machine learning is more than just the model. The ML workflow consists of sourcing and preparing data, building ML models, training and evaluating these models, deploying them to production, and ongoing post-production monitoring. Amazon SageMaker helps you build, train, deploy, and maintain models more quickly.
However, the workflows related to building a model are often one part of a bigger pipeline that spans multiple engineering teams and services that support an overarching application. Kubernetes users, including EKS customers, deploy workloads by writing configuration files, which Kubernetes matches with available compute resources in your Kubernetes cluster. While Kubernetes gives you control and portability, running ML workloads on a Kubernetes cluster brings unique challenges. For example, the underlying infrastructure requires additional management, such as optimizing for utilization, cost, and performance; complying with appropriate security and regulatory requirements; and ensuring high availability and reliability. This undifferentiated heavy lifting takes away valuable time and resources from bringing new ML applications to market. You want to control orchestration and pipelines without having to manage the underlying ML infrastructure and services in your cluster.
Amazon SageMaker Operators for Kubernetes addresses this need by bringing Amazon SageMaker and Kubernetes together. From Kubernetes, you get a fully managed service that is designed and optimized specifically for ML workflows. Infrastructure and platform teams retain control and portability by orchestrating workloads in Kubernetes, without having to manage the underlying ML infrastructure and services. To add new capabilities to Kubernetes, you can extend the Kubernetes API by creating a custom resource that contains your application-specific or domain-specific logic and components. Operators in Kubernetes allow you to invoke these custom resources and automate associated workflows natively. You can add Amazon SageMaker as a custom resource in Kubernetes by installing SageMaker Operators for Kubernetes on your Kubernetes cluster. You can then use the following Amazon SageMaker operators:
- Train – Train ML models in Amazon SageMaker, including with Managed Spot Training to save up to 90% in training costs, and with distributed training to reduce training time by scaling to multiple GPU nodes. You pay only for the duration of your job, offering near 100% utilization.
- Tune – Tune model hyperparameters in Amazon SageMaker, including with Amazon EC2 Spot Instances, to save up to 90% in cost. Amazon SageMaker Automatic Model Tuning performs hyperparameter optimization to search the hyperparameter range for more accurate models, saving you the days or weeks you would otherwise spend improving model accuracy.
- Real-time inference – Deploy trained models in Amazon SageMaker to fully managed autoscaling clusters, spread across multiple Availability Zones, to deliver high performance and availability in real time.
- Batch transform – Create a managed execution of a model on large datasets. You can use this for either preprocessing in preparation for training or inference of an existing trained model within Amazon SageMaker.
Amazon SageMaker Operators for Kubernetes provides you with a native Kubernetes experience for creating and interacting with your jobs, either with the Kubernetes API or with Kubernetes command line utilities such as kubectl. Engineering teams can build automation, tooling, and custom interfaces for data scientists in Kubernetes by using these operators, all without building, maintaining, or optimizing ML infrastructure. Data scientists and developers familiar with Kubernetes can compose and interact with Amazon SageMaker training, tuning, and inference jobs natively, as they would with Kubernetes jobs executing locally. Logs from Amazon SageMaker jobs stream back to Kubernetes, allowing you to natively view logs for your model training, tuning, and prediction jobs in your command line.
Using Amazon SageMaker Operators for Kubernetes with XGBoost
This post demonstrates training a gradient-boosting model on the Modified National Institute of Standards and Technology (MNIST) dataset using the training operator. The MNIST dataset, a popular ML benchmark, contains images of handwritten digits from 0 to 9, with 60,000 training images and 10,000 test images.
This post performs the following steps:
- Install Amazon SageMaker Operators for Kubernetes on an EKS cluster
- Create a YAML config for this training job
- Train the model in Amazon SageMaker using the Amazon SageMaker operator
For this post, you need an existing Kubernetes cluster in EKS. For information about creating a new cluster in EKS, see Getting Started with Amazon EKS. A Fargate-only cluster is recommended. For this tutorial, you need a cluster with Kubernetes control plane version 1.13 or higher. You also need the following on a machine you can use to control the Kubernetes cluster (for example, an EC2 instance):
- kubectl (version 1.13 or higher) – Use a kubectl version that is within one minor version of your Kubernetes cluster’s control plane. For example, a 1.13 kubectl client works with Kubernetes 1.13 and 1.14 clusters. For more information, see Installing kubectl.
- AWS CLI (Version 1.16.232 or higher) – Your credentials should be configured as part of the setup. For more information, see Installing the AWS CLI version 1.
- AWS IAM Authenticator for Kubernetes – For more information, see Installing aws-iam-authenticator.
- Access keys or permissions – Either existing IAM access keys for the operator to use or IAM permissions to create users, attach policies to users, and create access keys.
As of this writing, smlogs only supports Linux, so you should deploy this on an EC2 instance running a Linux distribution such as Ubuntu.
Setting up IAM roles and permissions
For the operator to access your SageMaker resources, you first need to configure a Kubernetes service account with an OIDC authenticated role that has the proper permissions. For more information, see Enabling IAM Roles for Service Accounts on your Cluster.
Complete the following steps:
- Associate an IAM OpenID Connect (OIDC) provider with your EKS cluster for authentication with AWS resources. See the following code:
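A sketch of this association using eksctl; the cluster name and Region below are placeholders for your own values:

```shell
# Associate an IAM OIDC provider with the EKS cluster
eksctl utils associate-iam-oidc-provider \
    --cluster my-eks-cluster \
    --region us-east-1 \
    --approve
```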
Your output should look like the following:
Now that your Kubernetes cluster in EKS has an OIDC identity provider, you can create a role and give it permissions.
- Obtain the OIDC issuer URL with the following code:
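One way to retrieve the issuer URL with the AWS CLI; the cluster name and Region are placeholders:

```shell
# Query the cluster's OIDC issuer URL
aws eks describe-cluster \
    --name my-eks-cluster \
    --region us-east-1 \
    --query cluster.identity.oidc.issuer \
    --output text
```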
This command returns a URL like the following:
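The URL has roughly this shape; the ID at the end is an illustrative placeholder:

```
https://oidc.eks.us-east-1.amazonaws.com/id/D48675F2EXAMPLE
```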
If the output is None, make sure your AWS CLI is version 1.16.232 or higher, as listed in the prerequisites.
You use the OIDC ID returned by the previous command to create your role.
- Create a new file named trust.json with the following code:
Update the placeholders with your OIDC ID, AWS account number, and EKS cluster Region.
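A sketch of the trust policy, where OIDC_ID, AWS_ACCOUNT_NUMBER, and EKS_CLUSTER_REGION are the placeholders to update:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::AWS_ACCOUNT_NUMBER:oidc-provider/oidc.eks.EKS_CLUSTER_REGION.amazonaws.com/id/OIDC_ID"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.EKS_CLUSTER_REGION.amazonaws.com/id/OIDC_ID:aud": "sts.amazonaws.com"
        }
      }
    }
  ]
}
```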
- Create a new IAM role that can be assumed by the cluster service accounts. See the following code:
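For example, with a placeholder role name:

```shell
# Create the role from the trust policy and print its ARN
aws iam create-role \
    --role-name eks-sagemaker-operator-role \
    --assume-role-policy-document file://trust.json \
    --query 'Role.Arn' \
    --output text
```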
The output will contain your role ARN.
- Give the ARN to the operator for securely invoking Amazon SageMaker from the Kubernetes cluster. See the following code:
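As a hedged sketch, you can retrieve the role ARN to supply to the operator (you use it in a later step when you edit installer.yaml); the role name is a placeholder:

```shell
# Look up the ARN of the role created in the previous step
aws iam get-role \
    --role-name eks-sagemaker-operator-role \
    --query Role.Arn \
    --output text
```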
- Give this new role access to Amazon SageMaker by attaching the AmazonSageMakerFullAccess policy. See the following code:
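For example, with a placeholder role name:

```shell
# Attach the managed SageMaker policy to the operator's role
aws iam attach-role-policy \
    --role-name eks-sagemaker-operator-role \
    --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
```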
Setting up the operator on your Kubernetes cluster
To set up the operator on your Kubernetes cluster, complete the following steps:
- If using Fargate (recommended), configure a namespace for the operator with this bash code:
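A sketch using eksctl, assuming the operator installs into the sagemaker-k8s-operator-system namespace; verify the namespace against the installer file you download:

```shell
# Create a Fargate profile so the operator's pods can be scheduled
eksctl create fargateprofile \
    --cluster my-eks-cluster \
    --region us-east-1 \
    --name sagemaker-operator \
    --namespace sagemaker-k8s-operator-system
```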
- Install Amazon SageMaker Operators for Kubernetes from the GitHub repo by downloading a YAML configuration file that configures your Kubernetes cluster with the custom resource definitions and operator controller service. See the following code:
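For example; check the GitHub repo for the current path of the installer file:

```shell
# Download the role-based installer manifest
wget https://raw.githubusercontent.com/aws/amazon-sagemaker-operator-for-k8s/master/release/rolebased/installer.yaml
```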
- In the installer.yaml file, update the eks.amazonaws.com/role-arn annotation with the ARN from your OIDC-based role from the previous step.
- On your Kubernetes cluster, install the Amazon SageMaker CRDs and set up your operators. See the following code:
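For example:

```shell
# Install the CRDs and the operator controller service
kubectl apply -f installer.yaml
```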
- Verify that Amazon SageMaker operators are available in your Kubernetes cluster. See the following code:
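One way to verify is to list the installed custom resource definitions and filter for SageMaker:

```shell
kubectl get crd | grep sagemaker
```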
With these operators, all of Amazon SageMaker’s managed and secured ML infrastructure and software optimizations at scale are now available as custom resources in your Kubernetes cluster.
To view logs from Amazon SageMaker in your command line using kubectl, install the following client:
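A sketch of installing the smlogs kubectl plugin; the download URL is an assumption based on the project’s release bucket, so check the GitHub repo for the current location:

```shell
# Download and unpack the kubectl-smlogs plugin binary (Linux amd64)
export os="linux"
wget https://amazon-sagemaker-operator-for-k8s-us-east-1.s3.amazonaws.com/kubectl-smlogs-plugin/v1/${os}.amd64.tar.gz
tar xvzf ${os}.amd64.tar.gz

# Move the binary onto your PATH so kubectl can discover the plugin
sudo mv ./kubectl-smlogs.${os}.amd64/kubectl-smlogs /usr/local/bin
```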
Generating your training data
After you install the operator, you can begin training. This post uses an Amazon SageMaker prebuilt container to train an XGBoost model on the MNIST dataset, and provides a script in the SageMaker Operators for Kubernetes GitHub repo that uploads the MNIST dataset to an S3 bucket in the format that the XGBoost prebuilt container expects.
To generate your training data, complete the following steps:
- Create an S3 bucket. This post uses the
- Download and run the upload_xgboost_mnist_dataset script. See the following code:
Make sure to replace BUCKET_NAME with the name of the S3 bucket you created. This script requires you to install Python 3, boto3, numpy, and argparse.
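A sketch of downloading and running the script; the raw URL is an assumption, so check the repo if the script has moved:

```shell
# Fetch the upload script from the operator's GitHub repo
wget https://raw.githubusercontent.com/aws/amazon-sagemaker-operator-for-k8s/master/scripts/upload_xgboost_mnist_dataset/upload_xgboost_mnist_dataset
chmod +x upload_xgboost_mnist_dataset

# Upload the MNIST dataset in the format the XGBoost container expects
./upload_xgboost_mnist_dataset --s3-bucket BUCKET_NAME --s3-prefix xgboost-mnist
```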
- Verify that the data was successfully uploaded. The command output should look like the following:
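One way to check is to list the objects under the prefix the script wrote to:

```shell
aws s3 ls s3://BUCKET_NAME/xgboost-mnist/ --recursive
```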
The data is now uploaded to your S3 bucket.
Creating an IAM Role for SageMaker
SageMaker assumes an execution role when training. This role gives SageMaker permission to read from and write to S3, manage EC2 instances, and so on. It should be different from the role attached to the OIDC provider. If you do not have a SageMaker execution role, create one with the following code:
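A sketch of creating the execution role; the role name is a placeholder:

```shell
# Trust policy that lets the SageMaker service assume the role
export assume_role_policy_document='{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "sagemaker.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}'

# Create the role and grant it the managed SageMaker policy
aws iam create-role --role-name sagemaker-execution-role \
    --assume-role-policy-document "$assume_role_policy_document"
aws iam attach-role-policy --role-name sagemaker-execution-role \
    --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
```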
These commands create an IAM role that SageMaker can assume and give the role access to resources that SageMaker usually needs, like S3 and EC2. Save the created role ARN for when you prepare the training job.
Preparing your training job
Create a train.yaml YAML configuration file to start training. Specify TrainingJob as the kind to train your model in Amazon SageMaker, which is now a custom resource in your Kubernetes cluster.
Replace the following placeholders with their values:
- Replace BUCKET_NAME with the name of the S3 bucket you created
- Replace SAGEMAKER_EXECUTION_ROLE_ARN with the ARN of the execution role created in the previous step
See the following code:
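A sketch of train.yaml; the apiVersion and field names follow the operator’s TrainingJob CRD as of this writing, so verify them against the sample specs in the GitHub repo:

```yaml
apiVersion: sagemaker.aws.amazon.com/v1
kind: TrainingJob
metadata:
  name: xgboost-mnist
spec:
  roleArn: SAGEMAKER_EXECUTION_ROLE_ARN
  region: us-east-1
  algorithmSpecification:
    # Built-in XGBoost image for us-east-1; see Common Parameters
    # for Built-In Algorithms for other Regions
    trainingImage: 811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:1
    trainingInputMode: File
  outputDataConfig:
    s3OutputPath: s3://BUCKET_NAME/xgboost-mnist/models/
  inputDataConfig:
    - channelName: train
      contentType: text/libsvm
      compressionType: None
      dataSource:
        s3DataSource:
          s3DataType: S3Prefix
          s3Uri: s3://BUCKET_NAME/xgboost-mnist/train/
          s3DataDistributionType: FullyReplicated
  resourceConfig:
    instanceCount: 1
    instanceType: ml.m4.xlarge
    volumeSizeInGB: 5
  hyperParameters:
    - name: objective
      value: "multi:softmax"
    - name: num_class
      value: "10"
    - name: num_round
      value: "10"
  stoppingCondition:
    maxRuntimeInSeconds: 86400
```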
The S3 data and ECR repo should be in the same Region. If your data is in a Region other than us-east-1, update the training image location with the alternative image URI for your Region. For more information, see Common Parameters for Built-In Algorithms.
Training the model
You can now start your training job by entering the following code:
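For example:

```shell
kubectl apply -f train.yaml
```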
The operator creates a training job in Amazon SageMaker that uses the specifications you provided in train.yaml. You can interact with this training job as you normally would in Kubernetes. See the following code:
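For example, using standard kubectl verbs against the new custom resource; the job name matches the metadata name in train.yaml:

```shell
kubectl get trainingjob
kubectl describe trainingjob xgboost-mnist
```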
After your training job has started and the status shows as InProgress, you can use the smlogs plugin to read the Amazon CloudWatch logs for the job. See the following code:
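For example, using the job name from train.yaml:

```shell
kubectl smlogs trainingjob xgboost-mnist
```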
Alternatively, you can see the job progress and information within the SageMaker console.
After your training job is complete, any compute instances that were provisioned in Amazon SageMaker for this training job are terminated automatically.
For additional examples, see the GitHub repo.
Amazon SageMaker Operators for Kubernetes is generally available as of this writing in US East (Ohio), US East (N. Virginia), US West (Oregon), and EU (Ireland) AWS Regions. For more information and step-by-step tutorials, see Amazon SageMaker Operators for Kubernetes.
As always, please share your experience and feedback, or submit additional example YAML specs or operator improvements. Let us know how you’re using Amazon SageMaker Operators for Kubernetes by posting on the AWS forum for Amazon SageMaker, creating issues in the GitHub repo, or reaching out through your usual AWS contacts.
About the Authors
Cade Daniel is a Software Development Engineer with AWS Deep Learning. He develops products that make training and serving DL/ML models more efficient and easy for customers. Outside of work, he enjoys practicing his Spanish and learning new hobbies.
Nicholas Thomson is a Software Development Engineer with AWS Deep Learning. He helps build the open-source deep learning infrastructure projects that power Amazon AI. In his free time, he enjoys playing pool or building proof of concept websites.
Aditya Bindal is a Senior Product Manager for AWS Deep Learning. He works on products that make it easier for customers to use deep learning engines. In his spare time, he enjoys playing tennis, reading historical fiction, and traveling.
Alex Chung is a Senior Product Manager with AWS in enterprise machine learning systems. His role is to make AWS MLOps products more accessible for custom Kubernetes machine learning environments. He’s passionate about accelerating ML adoption for a large body of users to solve global economic and societal problems. Outside machine learning, he is also a board member at Cocatalyst.org, a Silicon Valley nonprofit for donating stock to charity that optimizes donor tax benefits, similar to donor-advised funds.