AWS Machine Learning Blog

Introducing Amazon SageMaker Components for Kubeflow Pipelines

Today we’re announcing Amazon SageMaker Components for Kubeflow Pipelines. This post shows how to build your first Kubeflow pipeline with Amazon SageMaker components using the Kubeflow Pipelines SDK.

Kubeflow is a popular open-source machine learning (ML) toolkit for Kubernetes users who want to build custom ML pipelines.  Kubeflow Pipelines is an add-on to Kubeflow that lets you build and deploy portable and scalable end-to-end ML workflows. However, when using Kubeflow Pipelines, data scientists still need to implement additional productivity tools such as data-labeling workflows and model-tuning tools.

Additionally, with Kubeflow Pipelines, ML ops teams need to manage a Kubernetes cluster with CPU and GPU instances, and keep its utilization high at all times to provide the best return on investment. Maximizing the utilization of a cluster across data science teams is challenging and adds operational overhead to the ML ops teams. For example, you should restrict GPU instances to demanding tasks such as deep learning training and inference, and use CPU instances for less demanding tasks such as data preprocessing and running essential services such as the Kubeflow Pipelines control plane.

As an alternative, with Amazon SageMaker Components for Kubeflow Pipelines, you can take advantage of powerful Amazon SageMaker features such as fully managed services, including data labeling, large-scale hyperparameter tuning and distributed training jobs, one-click secure and scalable model deployment, and cost-effective training through Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances.

Amazon SageMaker Components for Kubeflow Pipelines

You can use SageMaker Components in your Kubeflow Pipelines to invoke SageMaker jobs for steps of your ML workflow, without having to worry about how it runs under the hood. As a data scientist or ML developer, you focus on constructing and running ML experiments in a pipeline that’s portable and scalable. A pipeline component has reusable code within a containerized application, which can be shared to increase productivity.

Let’s say you want to take advantage of hyperparameter tuning in Amazon SageMaker. You can use the hyperparameter optimization component, and when your pipeline job runs, the hyperparameter tuning step runs on the fully managed infrastructure of Amazon SageMaker. The following section takes a closer look at how Amazon SageMaker Components work.

In a typical Kubeflow pipeline, each component encapsulates your logic in a container image. As a developer or data scientist, you bring in your training, data preprocessing, model serving, or other logic wrapped in a Kubeflow Pipelines ContainerOp function, which builds your code into a new container. Alternatively, you can put the code into a custom container image and push it to a container registry such as Amazon Elastic Container Registry (Amazon ECR). When the pipeline runs, the component’s container is instantiated on one of the worker nodes on the Kubernetes cluster running Kubeflow, and your logic is executed. Pipeline components can read outputs from the previous components and create outputs that the next component in the pipeline can consume. The following diagram illustrates this pipeline workflow.
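For example, a typical custom step might be expressed with the Kubeflow Pipelines SDK as shown in the following minimal sketch. This is illustrative only and not code from the example repository: the preprocessing step, the Amazon ECR image URI, the script name, and the S3 path are all placeholders.

from kfp import dsl

# Hypothetical preprocessing step packaged as a ContainerOp; the ECR image,
# script, and paths below are placeholders for your own assets.
def preprocess_op(s3_input: str):
    return dsl.ContainerOp(
        name='preprocess-data',
        image='<account-id>.dkr.ecr.us-west-2.amazonaws.com/preprocess:latest',
        command=['python', 'preprocess.py'],
        arguments=['--input', s3_input, '--output', '/tmp/processed.txt'],
        # The file the container writes becomes this step's output, which
        # downstream components in the pipeline can consume
        file_outputs={'processed_data': '/tmp/processed.txt'},
    )

@dsl.pipeline(
    name='custom-container-pipeline',
    description='Example pipeline with a single custom containerized step'
)
def example_pipeline(s3_input='s3://<your-bucket>/raw-data'):
    preprocess = preprocess_op(s3_input)
    # A later step would reference preprocess.outputs['processed_data']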

When you use Amazon SageMaker Components in your Kubeflow pipeline, rather than encapsulating your logic in a custom container, you simply load the components and describe your pipeline using the Kubeflow Pipelines SDK. When the pipeline runs, your instructions are translated into an Amazon SageMaker job or deployment. This workload then runs on the fully managed infrastructure of Amazon SageMaker. You also get all the benefits of a typical Amazon SageMaker capability, including managed spot training, automatic scaling of endpoints, and more. The following diagram illustrates this enhanced pipeline.

At launch, the supported Amazon SageMaker capabilities include hyperparameter tuning, training, model creation, and model deployment to hosted endpoints, all of which this post uses.

AWS will continue to add more capabilities over time, so consider bookmarking the Amazon SageMaker Components directory in the official Kubeflow Pipelines GitHub repo.

Getting started with Amazon SageMaker Components for Kubeflow Pipelines

To illustrate how you can use Amazon SageMaker Components in Kubeflow, let’s build a small pipeline with the steps shown in the following figure.

Real-world pipelines are typically more advanced, and include additional steps such as data ingestion, data preprocessing, data transformation, and more.

You can also find a video walkthrough, Scaling Machine Learning on Kubernetes and Kubeflow with SageMaker, on YouTube.

Let’s take a closer look at what each step in the pipeline does:

Component 1: Hyperparameter tuning job

The first component runs an Amazon SageMaker hyperparameter tuning job to optimize the following hyperparameters:

  • learning-rate – [0.0001, 0.1] log scale
  • optimizer – [sgd, adam]
  • batch-size – [32, 128, 256]
  • model-type – [resnet, custom model]

Input: N/A

Output: Best hyperparameters

Component 2: Selecting the best hyperparameters

During the hyperparameter search in the previous step, models are only trained for 10 epochs to determine well-performing hyperparameters. In the second step, the best hyperparameters are taken and the epochs are updated to 80 to give the best hyperparameters an opportunity to deliver higher accuracy in the next step.

Input: Best hyperparameters
Output: Best hyperparameters with updated epochs (80)

Component 3: Training job with the best hyperparameters

The third component runs an Amazon SageMaker training job using the best hyperparameters and for higher epochs.

Input: Best hyperparameters with updated epochs (80)
Output: Training job name

Component 4: Creating a model for deployment

The fourth component creates an Amazon SageMaker model artifact.

Input: Training job name
Output: Model artifact name

Component 5: Deploying the inference endpoint

The final component deploys a model with Amazon SageMaker deployment.

Input: Model artifact name
Output: N/A

Prerequisites

To run the following use case, you need the following:

  • Kubernetes cluster – You can use your existing cluster or create a new one. The fastest way to get one up and running is to launch an Amazon Elastic Kubernetes Service (Amazon EKS) cluster using the eksctl tool. For instructions, see Getting started with eksctl. Create a simple cluster with two CPU nodes to run this example. We tested this example on a cluster with two c5.xlarge instances. You only need enough node resources to run the Amazon SageMaker Component containers and Kubeflow; training and deployments run on the Amazon SageMaker managed infrastructure.
  • Kubeflow Pipelines – Install Kubeflow Pipelines on your cluster. For instructions, see Step 1 in Deploying Kubeflow Pipelines. Your Kubeflow Pipelines version must be 0.5.0 or above. Optionally, you can install all of Kubeflow, which includes Kubeflow Pipelines.
  • Amazon SageMaker Components prerequisites – For instructions on setting up AWS Identity and Access Management (IAM) roles and permissions, see Amazon SageMaker Components for Kubeflow Pipelines. You need two IAM roles for the following:
    • Kubeflow pipeline pods to access Amazon SageMaker and launch jobs and deployments.
    • Amazon SageMaker to access other AWS resources such as Amazon Simple Storage Service (Amazon S3) and Amazon ECR.

Choosing a gateway instance to launch an Amazon EKS cluster

You can launch an Amazon EKS cluster from your laptop, desktop, EC2 instance, or Amazon SageMaker notebook instance; the choice is yours. This instance is typically called a gateway instance. Because Amazon EKS offers a fully managed control plane, you only use the gateway instance to interact with the Kubernetes API and the worker nodes. We tested this example using a c5.xlarge EC2 instance as the gateway instance.

The code, configuration files, Jupyter notebooks, and Dockerfiles used in this post are available on GitHub. The following walkthrough is provided to explain the key concepts. Rather than copying code from these steps, we recommend running the prepared Jupyter notebooks on GitHub.

Step 1: Cloning the example repository

Open a terminal and SSH to the Amazon EC2 gateway instance that you used to create your Amazon EKS cluster. After you log in, clone the example repository to access the example Jupyter notebook. See the following code:

cd
git clone https://github.com/shashankprasanna/kubeflow-pipelines-sagemaker-examples.git
cd kubeflow-pipelines-sagemaker-examples

Step 2: Opening the example Jupyter notebook on your gateway instance

To open the Jupyter notebook on your gateway instance, complete the following steps:

  1. Launch JupyterLab on your gateway instance and access it on your local machine with the following code:
    jupyter-lab

    If you’re running the JupyterLab server on an EC2 instance, set up an SSH tunnel to the instance so you can access JupyterLab from your local laptop or desktop. See the following code:

    ssh -N -L 0.0.0.0:8081:localhost:8081 -L 0.0.0.0:8888:localhost:8888 -i ~/.ssh/<key_pair>.pem ubuntu@<IP_ADDRESS>

    If you’re using Amazon Linux instead of Ubuntu, you have to use ec2-user as the username. Update the IP address of the EC2 instance and use the appropriate key pair.

You can now access JupyterLab at http://localhost:8888 on your local machine.

  2. Access the Kubeflow dashboard by running the following on your gateway instance:
    kubectl port-forward svc/istio-ingressgateway -n istio-system 8081:80
    

You can now access the Kubeflow dashboard at http://localhost:8081.

  3. Open the example Jupyter notebook.

Amazon SageMaker supports two modes for training jobs (the GitHub repo includes one Jupyter notebook for each approach):

    1. Bring your own Docker container image – In this mode, you can provide your own Docker container for training. Build your container with your training scripts and push it to Amazon ECR, which is a container registry. Amazon SageMaker pulls your container image, instantiates it, and runs training. The kfp-sagemaker-custom-container.ipynb Jupyter notebook implements this approach.
    2. Bring your own training script (also known as script mode) – In this mode, you don’t have to deal with Docker containers. Simply bring your ML training scripts written in popular frameworks such as TensorFlow, PyTorch, MXNet, or XGBoost and upload them to Amazon S3. Amazon SageMaker automatically pulls the appropriate container, downloads your training scripts, and runs them. The kfp-sagemaker-script-mode.ipynb Jupyter notebook implements this approach.

The following example takes a closer look at the first approach (bringing your own Docker container image) and walks through the important steps in the kfp-sagemaker-custom-container.ipynb Jupyter notebook. Having the notebook open makes it easier to follow along.

The following screenshot shows the kfp-sagemaker-custom-container.ipynb notebook.

Step 3: Installing Kubeflow Pipelines SDK and loading Amazon SageMaker pipeline components

To install the SDK and load the pipeline components, complete the following steps:

  1. Install the Kubeflow Pipelines SDK with the following code:
    pip install kfp --upgrade
  2. Import Kubeflow Pipeline packages in Python with the following code:
    import kfp
    from kfp import components
    from kfp.components import func_to_container_op
    from kfp import dsl
    
  3. Load Amazon SageMaker Components in Python with the following code:
    sagemaker_hpo_op = components.load_component_from_url('https://raw.githubusercontent.com/kubeflow/pipelines/master/components/aws/sagemaker/hyperparameter_tuning/component.yaml')
    sagemaker_train_op = components.load_component_from_url('https://raw.githubusercontent.com/kubeflow/pipelines/master/components/aws/sagemaker/train/component.yaml')
    sagemaker_model_op = components.load_component_from_url('https://raw.githubusercontent.com/kubeflow/pipelines/master/components/aws/sagemaker/model/component.yaml')
    sagemaker_deploy_op = components.load_component_from_url('https://raw.githubusercontent.com/kubeflow/pipelines/master/components/aws/sagemaker/deploy/component.yaml')

For production workflows, we recommend pinning the components to a specific commit instead of using master. This makes sure that any compatibility-breaking changes in the future don’t affect your production pipelines.

For example, to use a specific commit version, replace master with the commit hash as follows in each of the load components:

sagemaker_train_op = components.load_component_from_url('https://raw.githubusercontent.com/kubeflow/pipelines/cb36f87b727df0578f4c1e3fe9c24a30bb59e5a2/components/aws/sagemaker/train/component.yaml')

Step 4: Preparing training datasets and uploading to Amazon S3

To prepare and upload the datasets, enter the following code:

import sagemaker
import boto3

sess = boto3.Session()
account = boto3.client('sts').get_caller_identity().get('Account')
sm = sess.client('sagemaker')
role = sagemaker.get_execution_role()
sagemaker_session = sagemaker.Session(boto_session=sess)

bucket_name = sagemaker_session.default_bucket()
job_folder      = 'jobs'
dataset_folder  = 'datasets'
local_dataset = 'cifar10'

!python generate_cifar10_tfrecords.py --data-dir {local_dataset}
datasets = sagemaker_session.upload_data(path='cifar10', key_prefix='datasets/cifar10-dataset')

In the preceding code, you first import the sagemaker and boto3 packages, which give you access to the current machine’s IAM role and the default S3 bucket. The generate_cifar10_tfrecords.py Python script uses TensorFlow to download the CIFAR-10 dataset and convert it into TFRecord format, and sagemaker_session.upload_data uploads the dataset to Amazon S3.
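The conversion itself follows the standard TensorFlow TFRecord pattern. The following is a minimal sketch of that idea only, not the repo’s script; generate_cifar10_tfrecords.py additionally handles the train, validation, and eval splits expected by the training code.

import os
import tensorflow as tf

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

# Download CIFAR-10 and write the training split as a single TFRecord file
(x_train, y_train), _ = tf.keras.datasets.cifar10.load_data()
os.makedirs('cifar10/train', exist_ok=True)
with tf.io.TFRecordWriter('cifar10/train/train.tfrecords') as writer:
    for image, label in zip(x_train, y_train):
        example = tf.train.Example(features=tf.train.Features(feature={
            'image': _bytes_feature(image.tobytes()),
            'label': _int64_feature(int(label)),
        }))
        writer.write(example.SerializeToString())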

Step 5: Building your Docker container and pushing it to Amazon ECR

The build_docker_push_to_ecr.ipynb Jupyter notebook provides all the steps to build and push a container to Amazon ECR.

The Docker directory also includes the Dockerfile, training and inference Python scripts, and their dependencies in a requirements file. The following screenshot shows the contents of the Docker folder, the Dockerfile, and the build_docker_push_to_ecr.ipynb Jupyter notebook.

For more information about how Amazon SageMaker runs custom containers, see Use Your Own Algorithms or Models with Amazon SageMaker.

If you don’t want to build your own container, you can always have Amazon SageMaker manage it for you. In this approach, you upload your training scripts to Amazon S3 as shown in the kfp-sagemaker-script-mode.ipynb notebook.
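For example, packaging and uploading a training script could look like the following minimal sketch; the script name here is a placeholder, and the exact packaging and the way the training component references this S3 location are shown in the kfp-sagemaker-script-mode.ipynb notebook.

import tarfile

# Package the training script and upload the archive to the default bucket
# (sagemaker_session was created in Step 4)
with tarfile.open('sourcedir.tar.gz', 'w:gz') as tar:
    tar.add('cifar10-training-script.py')

source_s3 = sagemaker_session.upload_data(path='sourcedir.tar.gz',
                                          key_prefix='training-scripts')
print(source_s3)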

Step 6: Creating a Kubeflow pipeline using Amazon SageMaker Components

A Kubeflow pipeline can be expressed as a function decorated with @dsl.pipeline as shown in the following code and in kfp-sagemaker-custom-container.ipynb. For more information, see Overview of Kubeflow Pipelines.

@dsl.pipeline(
    name='cifar10 hpo train deploy pipeline',
    description='cifar10 hpo train deploy pipeline using sagemaker'
)
def cifar10_hpo_train_deploy(region='us-west-2',
                           training_input_mode='File',
                           train_image=f'{account}.dkr.ecr.us-west-2.amazonaws.com/sagemaker-kubernetes:latest',
                           serving_image='763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-inference:1.15.2-cpu',
                           volume_size='50',
                           max_run_time='86400',
                           instance_type='ml.p3.2xlarge',
                           network_isolation='False',
                           traffic_encryption='False',
                           ... 

In this code example, you create a new function called cifar10_hpo_train_deploy() and define arguments that are common to all the steps in the pipeline. Within the function, you then define your five pipeline components.

Component 1: Amazon SageMaker hyperparameter tuning job

This component describes the options for a hyperparameter tuning job. See the following code:

hpo = sagemaker_hpo_op(
        region=region,
        image=train_image,
        training_input_mode=training_input_mode,
        strategy='Bayesian',
        metric_name='val_acc',
        metric_definitions='{"val_acc": "val_acc: ([0-9\\\\.]+)"}',
        metric_type='Maximize',
        static_parameters='{ \
            "epochs": "1", \
            "momentum": "0.9", \
            "weight-decay": "0.0002", \
            "model_dir":"s3://'+bucket_name+'/jobs", \
            "sagemaker_region": "us-west-2" \
        }',
        continuous_parameters='[ \
            {"Name": "learning-rate", "MinValue": "0.0001", "MaxValue": "0.1", "ScalingType": "Logarithmic"} \
        ]',
        categorical_parameters='[ \
            {"Name": "optimizer", "Values": ["sgd", "adam"]}, \
            {"Name": "batch-size", "Values": ["32", "128", "256"]}, \
            {"Name": "model-type", "Values": ["resnet", "custom"]} \
        ]',
        channels=channels,
        output_location=f's3://{bucket_name}/jobs',
        instance_type=instance_type,
        instance_count='1',
        volume_size=volume_size,
        max_num_jobs='16',
        max_parallel_jobs='4'
 
...

These options include the tuning strategy (Bayesian), the metric to optimize (validation accuracy), continuous hyperparameters (learning rate), categorical hyperparameters (optimizer, batch size, and model type), and static hyperparameters that remain unchanged, such as the number of epochs (10).

You also set the maximum number of training jobs to 16 and the maximum number of parallel jobs to 4, so Amazon SageMaker runs 16 training jobs in total, up to 4 at a time, each on its own GPU instance, for the hyperparameter tuning job.

Component 2: Custom component to update the number of epochs

The output of the first component is captured in the hpo variable. This component takes the best hyperparameters and updates the number of epochs. See the following code:

training_hyp = get_best_hyp_op(hpo.outputs['best_hyperparameters'])

To do that, you define a custom function that takes this output and updates the number of epochs to 80. This allows you to run the next training job on the best set of hyperparameters, but for much longer. See the following code:

def update_best_model_hyperparams(hpo_results, best_model_epoch = "80") -> str:
    import json
    r = json.loads(str(hpo_results))
    return json.dumps(dict(r,epochs=best_model_epoch))

get_best_hyp_op = func_to_container_op(update_best_model_hyperparams)

Component 3: Amazon SageMaker training job

This component describes an Amazon SageMaker training job using the best hyperparameters and updated number of epochs from the previous steps. See the following code:

    training = sagemaker_train_op(
        region=region,
        image=train_image,
        training_input_mode=training_input_mode,
        hyperparameters=training_hyp.output,
        channels=channels,
        instance_type=instance_type,
        instance_count='1',
        volume_size=volume_size,
        max_run_time=max_run_time,
        model_artifact_path=f's3://{bucket_name}/jobs',
        network_isolation=network_isolation,
        traffic_encryption=traffic_encryption,
        spot_instance=spot_instance,
        role=role,
    )

Component 4: Amazon SageMaker model creation

This component creates an Amazon SageMaker model artifact that you can deploy and host as an inference endpoint. See the following code:

    create_model = sagemaker_model_op(
        region=region,
        model_name=training.outputs['job_name'],
        image=serving_image,
        model_artifact_url=training.outputs['model_artifact_url'],
        network_isolation=network_isolation,
        role=role
    )

Component 5: Amazon SageMaker model deployment

Finally, this component deploys the model. See the following code:

    prediction = sagemaker_deploy_op(
        region=region,
        model_name_1=create_model.output,
        instance_type_1='ml.m5.large'
    )

Compiling and running your pipeline

Using the Kubeflow pipeline compiler, you compile the pipeline, create an experiment, and run the pipeline. See the following code:

import time

kfp.compiler.Compiler().compile(cifar10_hpo_train_deploy,'sm-hpo-train-deploy-pipeline.zip')
client = kfp.Client()
aws_experiment = client.create_experiment(name='sm-kfp-experiment')
 
exp_name    = f'cifar10-hpo-train-deploy-kfp-{time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())}'
my_run = client.run_pipeline(aws_experiment.id, exp_name, 'sm-hpo-train-deploy-pipeline.zip')
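After run_pipeline returns, you can follow the run in the Kubeflow Pipelines dashboard. Optionally, you can also block until the run finishes from the same notebook; the following is a minimal sketch using the kfp client created above (the timeout is in seconds).

# Wait for the pipeline run to complete and print its final status
run_result = client.wait_for_run_completion(my_run.id, timeout=4*60*60)
print(run_result.run.status)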

Step 7: Results

The following is an annotated screenshot of a Kubeflow pipeline after it’s finished executing. All the steps except “Custom function to update epochs” are Amazon SageMaker capabilities running as part of a Kubeflow pipeline.

You can also monitor the progress of each step on the Amazon SageMaker console. The following screenshot shows the Amazon SageMaker console for the hyperparameter tuning job from the first step in the preceding pipeline.

Testing the endpoint

When the entire pipeline is finished running, your model is hosted as an inference endpoint, as shown in the following screenshot.

To test the endpoint, copy the name of the endpoint and use the boto3 SDK to get predictions:

import json, boto3, numpy as np
client = boto3.client('runtime.sagemaker')
 
file_name = '1000_dog.png'
with open(file_name, 'rb') as f:
    payload = f.read()
 
response = client.invoke_endpoint(EndpointName='Endpoint-20200522021801-DR5P', 
                                   ContentType='application/x-image', 
                                   Body=payload)
pred = json.loads(response['Body'].read())['predictions']
labels = ['airplane','automobile','bird','cat','deer','dog','frog','horse','ship','truck']
for l,p in zip(labels, pred[0]):
    print(l,"{:.4f}".format(p*100))

You should get the following output for the picture of a dog:

airplane 0.0000
automobile 0.0000
bird 0.0001
cat 0.0115
deer 0.0000
dog 99.9883
frog 0.0000
horse 0.0000
ship 0.0000
truck 0.0000
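If you prefer not to copy the endpoint name from the Amazon SageMaker console, you can also look it up programmatically. The following is a small sketch that lists in-service endpoints with boto3, assuming your default AWS credentials and Region.

import boto3

# List the names of all in-service SageMaker endpoints in the current Region
sm_client = boto3.client('sagemaker')
for endpoint in sm_client.list_endpoints(StatusEquals='InService')['Endpoints']:
    print(endpoint['EndpointName'])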

Conclusion

This post showed how you can configure Kubeflow Pipelines to run ML jobs with Amazon SageMaker. Kubeflow Pipelines is an open-source ML orchestration platform that is popular among developers who want to build and manage custom ML workflows on Kubernetes. However, many developers and ML ops teams find Kubeflow Pipelines challenging to operate because they also have to manage ML-optimized Kubernetes clusters, which impacts the total return on investment and drives a higher total cost of ownership.

With the SageMaker Components for Kubeflow Pipelines, you can continue to manage your pipelines in Kubeflow Pipelines but rely on the managed capabilities of Amazon SageMaker for the ML tasks in your ML pipelines. Data scientists and ML developers can also take advantage of the latest innovations in Amazon SageMaker, such as fully-managed hyperparameter tuning, distributed training, managed spot training, autoscaling, and more. We also presented an end-to-end demo of creating and running a Kubeflow pipeline using Amazon SageMaker Components. The complete example is available on GitHub.

If you prefer learning by watching, the following video on YouTube, Scaling Machine Learning on Kubernetes and Kubeflow with SageMaker, provides an overview of the Amazon SageMaker Components for Kubeflow Pipelines and a walkthrough of the use case this post discussed.

If you have questions or comments about Amazon SageMaker Components or this post, please leave a comment or create an issue on the Kubeflow Pipelines GitHub repo.


About the Authors

Shashank Prasanna is an AI & Machine Learning Technical Evangelist at Amazon Web Services (AWS) where he focuses on helping engineers, developers and data scientists solve challenging problems with machine learning. Prior to joining AWS, he worked at NVIDIA, MathWorks (makers of MATLAB & Simulink) and Oracle in product marketing, product management, and software development roles.

 

 

Alex Chung is a Senior Product Manager with AWS in enterprise machine learning systems. His role is to make AWS MLOps products more accessible for Kubernetes machine learning custom environments. He’s passionate about accelerating ML adoption for a large body of users to solve global economic and societal problems. Outside machine learning, he is also a board member at Cocatalyst.org, a Silicon Valley nonprofit for donating stock to charity that optimizes donor tax benefits, similar to donor-advised funds.