AWS Open Source Blog

Kubeflow on Amazon EKS

NOTE: Since this blog post was written, much about Kubeflow has changed. While we are leaving it up for historical reference, more accurate information about Kubeflow on AWS can be found here.

The Kubeflow project is designed to simplify the deployment of machine learning projects like TensorFlow on Kubernetes. There are also plans to add support for additional frameworks such as MXNet, PyTorch, Chainer, and more. These frameworks can leverage GPUs in the Kubernetes cluster for machine learning tasks.

Recently, we announced support for P2 and P3 GPU worker instances on Amazon EKS. While it’s possible to run machine learning workloads on CPU instances, GPU instances have thousands of CUDA cores, which significantly improve performance when training deep neural networks and processing large data sets. This post will demonstrate how to deploy Kubeflow on an Amazon EKS cluster with P3 worker instances. We will then show how you can use Kubeflow to easily perform machine learning tasks like training and model serving on Kubernetes. We will be using a Jupyter notebook for our training, based on the TensorFlow framework. A Jupyter notebook is an open source web application that allows us to create and share machine learning documents in various programming languages such as Python, Scala, and R; a Python notebook is used in our example.

Prerequisites

Follow the instructions to create an EKS cluster with GPU instances. Alternatively, use the eksctl command line tool from Weaveworks to spin up an EKS cluster. For example, the following command will spin up a cluster with two worker nodes of p3.8xlarge instances in the us-west-2 region:

$ eksctl create cluster eks-kubeflow --node-type=p3.8xlarge --nodes 2 --region us-west-2 --timeout=40m
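
Once eksctl finishes, a quick optional check confirms that kubectl can reach the new cluster and that both worker nodes show a Ready status:

$ kubectl get nodes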

Amazon EKS Cluster Validation

Run this command to apply the Nvidia Kubernetes device plugin as a daemonset on each worker node:

$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.10/nvidia-device-plugin.yml

You can issue these commands to check the status of nvidia-device-plugin daemonsets and the corresponding pods:

$ kubectl get daemonset -n kube-system

NAME                             DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
aws-node                         2         2         2         2            2           <none>          2d
kube-proxy                       2         2         2         2            2           <none>          2d
nvidia-device-plugin-daemonset   2         2         2         2            2           <none>          2d

 

$ kubectl get pods -n kube-system -o wide | grep nvidia

nvidia-device-plugin-daemonset-7842r 1/1 Running 0 2d 192.168.118.128 ip-192-168-111-8.us-west-2.compute.internal 
nvidia-device-plugin-daemonset-7cnnd 1/1 Running 0 2d 192.168.179.50 ip-192-168-153-27.us-west-2.compute.internal

Once the nvidia-device-plugin daemonsets are running, the next command confirms that there are four GPUs in each worker node:

$ kubectl get nodes \
 "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu,EC2:.metadata.labels.beta\.kubernetes\.io/instance-type,AZ:.metadata.labels.failure-domain\.beta\.kubernetes\.io/zone"
 
NAME                                           GPU    EC2          AZ
ip-192-168-177-96.us-west-2.compute.internal   4      p3.8xlarge   us-west-2a
ip-192-168-246-95.us-west-2.compute.internal   4      p3.8xlarge   us-west-2c

Storage Class for Persistent Volume

Kubeflow requires a default storage class to spawn Jupyter notebooks with attached persistent volumes. A StorageClass in Kubernetes provides a way to describe the type of storage (e.g., types of EBS volume: io1, gp2, sc1, st1) that an application can request for its persistent storage. The following command creates a Kubernetes default storage class for dynamic provisioning of persistent volumes backed by Amazon Elastic Block Store (EBS) with the general-purpose SSD volume type (gp2).

$ cat <<EOF | kubectl create -f -
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: gp2
  annotations:
    storageclass.beta.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
reclaimPolicy: Delete
mountOptions:
  - debug
EOF

Validate that the default StorageClass is created using the command below:

$ kubectl get storageclass

NAME            PROVISIONER             AGE
gp2 (default)   kubernetes.io/aws-ebs   2d

Install Kubeflow

Kubeflow uses ksonnet, a command line tool that simplifies the configuration and deployment of applications across multiple Kubernetes environments. Ksonnet abstracts Kubernetes resources as prototypes; filling in a prototype’s parameters generates components, which are rendered as Kubernetes YAML files tuned for a specific implementation. A different set of parameters can be used for each Kubernetes environment.
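
As a quick illustration of this flow, the hypothetical commands below generate a component from ksonnet’s built-in deployed-service prototype, override one of its parameters, and apply it to an environment (the component name and parameters are generic examples, not part of the Kubeflow deployment):

# illustrative only: prototype -> component -> parameter override -> apply
$ ks generate deployed-service my-service --image=nginx --type=ClusterIP
$ ks param set my-service replicas 3
$ ks apply default -c my-service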

Download the ksonnet CLI. On macOS, you can also install it with brew install ksonnet/tap/ks.

Validate that you have version 0.12.0 of ksonnet:

$ ks version
 
ksonnet version: 0.12.0
jsonnet version: v0.11.2
client-go version: kubernetes-1.10.4

Install Kubeflow on Amazon EKS

First, create a new Kubernetes namespace for the Kubeflow deployment:

$ export NAMESPACE=kubeflow
$ kubectl create namespace ${NAMESPACE}

Next, download the current version of the Kubeflow deployment script; it will clone the Kubeflow repository from GitHub.

$ export KUBEFLOW_VERSION=0.2.5
$ export KUBEFLOW_DEPLOY=false
$ curl https://raw.githubusercontent.com/kubeflow/kubeflow/v${KUBEFLOW_VERSION}/scripts/deploy.sh | bash

The following commands will set the namespace in the ksonnet default environment to kubeflow and deploy Kubeflow on Amazon EKS.

$ cd kubeflow_ks_app/
$ ks env set default --namespace ${NAMESPACE}
$ ks apply default

Take note of the following:

  • Setting the KUBEFLOW_DEPLOY flag to false prevents the deploy.sh script from automatically deploying Kubeflow before we have configured our ksonnet environment.
  • Kubeflow by default will enable anonymous usage reporting. If you do not want to provide usage reporting, execute ks param set kubeflow-core reportUsage false before you run ks apply default.
  • Ksonnet uses GitHub to pull Kubeflow scripts. If you encounter GitHub API rate limiting, you can fix it by creating a GitHub API token and exporting it as shown below. Refer to the Kubeflow troubleshooting guide for more details.
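
Ksonnet reads a GitHub personal access token from the GITHUB_TOKEN environment variable, so after creating a token, export it before running the ks commands:

$ export GITHUB_TOKEN=<your-github-api-token>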

To check the status of Kubeflow’s deployment, list out the pods created in the kubeflow namespace:

$ kubectl get pod -n ${NAMESPACE}

You should get output like this:

NAME                                       READY     STATUS    RESTARTS   AGE
ambassador-849fb9c8c5-dglsc                 2/2      Running     0        1m
ambassador-849fb9c8c5-jh8vk                 2/2      Running     0        1m
ambassador-849fb9c8c5-vxvkg                 2/2      Running     0        1m
centraldashboard-7d7744cccb-97r4v           1/1      Running     0        1m
tf-hub-0                                    1/1      Running     0        1m
tf-job-dashboard-bfc9bc6bc-6zzns            1/1      Running     0        1m
tf-job-operator-v1alpha2-756cf9cb97-rdrjj   1/1      Running     0        1m

The roles of these pods in Kubeflow are as follows:

tf-hub-0: JupyterHub web application that spawns and manages Jupyter notebooks.
tf-job-operator, tf-job-dashboard: Run and monitor TensorFlow jobs in Kubeflow.
ambassador: The Ambassador API gateway that routes requests to Kubeflow services.
centraldashboard: Kubeflow central dashboard UI.

A Data Scientist’s Workflow Using Kubeflow

Let’s walk through a simple tutorial provided in the Kubeflow examples repository.

We will use the github_issue_summarization example, which applies a sequence-to-sequence model to summarize text found in GitHub issues. Sequence-to-sequence (seq2seq) is a supervised learning model where an input is a sequence of tokens (in this example, a long string of words in a GitHub issue), and the output generated is another sequence of tokens (a predicted shorter string that is a summary of the GitHub issue). Other use cases of seq2seq include machine translation of languages and speech-to-text.

First, we will use a Jupyter notebook to download the GitHub issues dataset and train the seq2seq model. Our Jupyter notebook will run as a Kubernetes pod with a GPU attached to speed up the training process. Once we have our trained model, we will serve it with a simple Python microservice using Seldon Core. Seldon Core allows us to deploy our machine learning models on Kubernetes and expose them via REST and gRPC automatically.

The detailed steps are depicted in the following diagram:

Diagram: A Data Scientist’s Workflow Using Kubeflow
The steps we’ll be following are:

  1. Build a Docker image for a Jupyter notebook with GPU support, and push that image to the Amazon Elastic Container Registry (Amazon ECR).
  2. Launch the Jupyter notebook through JupyterHub.
  3. Perform machine learning in the Jupyter notebook and generate a trained model.
  4. Build a Docker image for model serving microservices using the Seldon Core Python wrapper and our trained model.
  5. Launch the prediction microservice using Seldon Core behind an Ambassador API gateway.
  6. Use curl CLI to generate a prediction of a summary for a given GitHub issue.

Build the Docker Image for a Jupyter Notebook

Execute the following commands to build a Docker image of the Jupyter notebook with GPU support. It will also include the necessary files for sequence-to-sequence training. The Docker image will be hosted on the Amazon Elastic Container Registry (Amazon ECR). We will use this image to perform our model training.

# Login to ECR, create an image repository
$ ACCOUNTID=`aws iam get-user|grep Arn|cut -f6 -d:`
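# note (added for clarity): aws iam get-user fails when running with an assumed
# role; in that case, an alternative is:
# $ ACCOUNTID=`aws sts get-caller-identity --query Account --output text`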
$ `aws ecr get-login --no-include-email --region us-west-2`
$ aws ecr create-repository --repository-name tensorflow-notebook-gpu --region us-west-2
 
$ curl -o train.py https://raw.githubusercontent.com/kubeflow/examples/master/github_issue_summarization/notebooks/train.py
$ curl -o seq2seq_utils.py https://raw.githubusercontent.com/kubeflow/examples/master/github_issue_summarization/notebooks/seq2seq_utils.py
 
# Build, tag and push Jupyter notebook docker image to ECR
$ docker build -t $ACCOUNTID.dkr.ecr.us-west-2.amazonaws.com/tensorflow-notebook-gpu:0.1 . -f-<<EOF
FROM gcr.io/kubeflow-images-public/tensorflow-1.8.0-notebook-gpu
RUN pip install ktext annoy sklearn h5py nltk pydot
COPY train.py /workdir/train.py
COPY seq2seq_utils.py /workdir/seq2seq_utils.py
EOF
 
$ docker push $ACCOUNTID.dkr.ecr.us-west-2.amazonaws.com/tensorflow-notebook-gpu:0.1
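
To double-check that the push succeeded (an optional step), list the images in the new repository:

$ aws ecr describe-images --repository-name tensorflow-notebook-gpu --region us-west-2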

Launch the Jupyter Notebook

Connect to the JupyterHub and spin up a new Jupyter notebook. JupyterHub can be accessed at http://localhost:8080 with a browser by port-forwarding the tf-hub-lb service.

$ kubectl port-forward svc/tf-hub-lb -n ${NAMESPACE} 8080:80 

Sign in with any username; there is no password needed.

Enter the following in the Spawner Options:

Image: 123456789012.dkr.ecr.us-west-2.amazonaws.com/tensorflow-notebook-gpu:0.1

(This is the docker image that was built in the previous step. Replace 123456789012 with your $ACCOUNTID value.)

Extra Resource Limits: {"nvidia.com/gpu": "1"}
(This setting will configure one GPU to the Jupyter notebook.)

Spawner Options

A Jupyter notebook pod named jupyter-${username} is spawned with a persistent volume and one GPU resource. Run the following command to confirm that the pod is running:

$ kubectl get pod -n ${NAMESPACE}
 
NAME                                      READY     STATUS    RESTARTS   AGE
ambassador-585dd7b87-4fz2l                2/2       Running   0          26m
ambassador-585dd7b87-dlh9j                2/2       Running   0          26m
ambassador-585dd7b87-v45t2                2/2       Running   0          26m
centraldashboard-7d7744cccb-q5mzl         1/1       Running   0          26m
jupyter-enghwa                            1/1       Running   1          4m
tf-hub-0                                  1/1       Running   0          26m
tf-job-dashboard-bfc9bc6bc-xhpn2          1/1       Running   0          26m
tf-job-operator-v1alpha2-756cf9cb97-9pp4z 1/1       Running   0          26m
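
As an optional check, you can confirm that the spawned pod was allocated one GPU; the pod name below assumes the username enghwa from the output above:

$ kubectl get pod jupyter-enghwa -n ${NAMESPACE} \
  -o jsonpath="{.spec.containers[0].resources.limits['nvidia\.com/gpu']}"

This should print 1.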

Perform Machine Learning to Train Our Model

Once the Jupyter notebook is ready, launch a Terminal inside the Jupyter notebook (Files → New → Terminal) and clone the kubeflow example repository:

git clone https://github.com/kubeflow/examples

cloning a repository in Jupyter

The examples folder will now show up in the Jupyter notebook. Launch the Training.ipynb notebook in the examples/github_issue_summarization/notebooks folder.

Jupyter training notebook

This notebook will download the GitHub issues dataset and perform sequence-to-sequence training. At the end of the training, a Keras model seq2seq_model_tutorial.h5 will be produced. The GPU will be used to speed up the training (training one million rows takes about 15 minutes instead of a few hours, as it would on a standard CPU).

Before we run the notebook, make the following two changes:

  • Cell 3: Change the DATA_DIR to /home/jovyan/github-issues-data
  • Cell 7: Change training_data_size from 2000 to 1000000. This increased training data size will improve the prediction result. You can also use the full dataset (~4.8M rows), which will take about 1 hour to train.

Jupyter edit

Start the training in the Jupyter notebook with Cell -> Run All.

Once the training is completed, the model is saved in the Jupyter notebook’s pod. (Note: You can safely ignore the error in the BLEU Score evaluation). To serve this model as a microservice over a REST API, the following steps are needed:

  1. Create a model-serving microservice image called “github-issue-summarization” with the Python code in IssueSummarization.py, using Seldon Core’s Python wrapper.
  2. Copy the model files from the Jupyter notebook’s pod to this model-serving microservice image.
  3. Run this model-serving microservice image with Seldon Core.

Build the Seldon Core Microservice Image

To build the model-serving microservice image, we will clone the github_issue_summarization example from the Kubeflow examples repository. The steps are as follows:

  1. Clone the Kubeflow examples repository to get the Python files in the “github_issue_summarization/notebooks” directory that are needed to serve the model.
  2. Execute Seldon Core’s Python wrapper script to prepare a Docker build directory for the microservice image.
  3. Copy the trained model’s files from the Jupyter notebook’s pod to the build directory so that the Docker build can package these files into the microservice image.
  4. Build the microservice image and push it to Amazon ECR.

The following commands accomplish these steps.

$ git clone https://github.com/kubeflow/examples serve/
$ cd serve/github_issue_summarization/notebooks
 
$ docker run -v $(pwd):/my_model seldonio/core-python-wrapper:0.7 /my_model IssueSummarization 0.1 gcr.io --base-image=python:3.6 --image-name=gcr-repository-name/issue-summarization
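# note (added for clarity): the wrapper's positional arguments above are
# <model-folder> <model-class-name> <version> <registry>; the gcr.io registry
# and image name only end up in the generated build scripts, since we tag and
# push the image to Amazon ECR ourselves below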
 
$ cd build/
# fix directory permission 
$ sudo chown `id -u` . 
$ PODNAME=`kubectl get pods --namespace=${NAMESPACE} --selector="app=jupyterhub" --output=template --template="{{with index .items 0}}{{.metadata.name}}{{end}}"`
 
$ kubectl --namespace=${NAMESPACE} cp ${PODNAME}:/home/jovyan/examples/github_issue_summarization/notebooks/seq2seq_model_tutorial.h5 .
$ kubectl --namespace=${NAMESPACE} cp ${PODNAME}:/home/jovyan/examples/github_issue_summarization/notebooks/body_pp.dpkl .
$ kubectl --namespace=${NAMESPACE} cp ${PODNAME}:/home/jovyan/examples/github_issue_summarization/notebooks/title_pp.dpkl .
 
# build and push microservice image to Amazon ECR
$ aws ecr create-repository --repository-name github-issue-summarization --region us-west-2
$ docker build --force-rm=true -t $ACCOUNTID.dkr.ecr.us-west-2.amazonaws.com/github-issue-summarization:0.1 .
$ docker push $ACCOUNTID.dkr.ecr.us-west-2.amazonaws.com/github-issue-summarization:0.1
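
Optionally, you can smoke-test the image locally before deploying it. The Seldon Core Python wrapper used here exposes the model as a REST service on port 5000 with a /predict endpoint (the sample payload below is illustrative):

$ docker run -d --rm -p 5000:5000 $ACCOUNTID.dkr.ecr.us-west-2.amazonaws.com/github-issue-summarization:0.1
$ curl -s http://localhost:5000/predict -d 'json={"data":{"ndarray":[["some github issue text"]]}}'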

Serve Our Prediction as a Seldon Core Microservice

Install Seldon Core

A Seldon Core prototype is shipped with Kubeflow. Execute the following ksonnet commands inside the kubeflow_ks_app directory to generate the Seldon Core component and deploy Seldon Core:

$ ks generate seldon seldon --name=seldon
$ ks apply default -c seldon

Verify that Seldon Core is running with kubectl get pods -n ${NAMESPACE}. You should see a pod named seldon-cluster-manager-*.

Kubeflow includes a component to serve Seldon Core microservices. Using the following ksonnet commands, the github-issue-summarization microservice image created previously will be deployed as a Kubernetes Deployment with two replicas.

$ ks generate seldon-serve-simple issue-summarization-model-serving \
--name=issue-summarization \
--image=$ACCOUNTID.dkr.ecr.us-west-2.amazonaws.com/github-issue-summarization:0.1 \
--replicas=2
$ ks apply default -c issue-summarization-model-serving
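
As an optional check, pods whose names include issue-summarization should now appear in the namespace:

$ kubectl get pods -n ${NAMESPACE} | grep issue-summarization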

Testing the Prediction REST API

Seldon Core uses the ambassador API gateway to route requests to the microservice. Run these commands to port-forward the ambassador service to localhost:8081 and test the summary prediction REST API.

$ kubectl port-forward svc/ambassador -n ${NAMESPACE} 8081:80

Let’s generate a summary prediction of a sample GitHub issue by using curl to POST to the REST API. As shown below, our model predicts the summary of the long GitHub issue text to be “example of how to use it”.

$ curl -X POST -H 'Content-Type: application/json' -d '{"data":{"ndarray":[["There is lots of interest in serving with GPUs but we do not have a good example showing how to do this. I think it would be nice to have one. A simple example might be inception with a simple front end that allows people to upload images for classification."]]}}' http://localhost:8081/seldon/issue-summarization/api/v0.1/predictions
 
{
  "meta": {
    "puid": "2f9qdrbkro67lh93audeve9p60",
    "tags": {
    },
    "routing": {
    }
  },
  "data": {
    "names": ["t:0"],
    "ndarray": [["example of how to use it"]]
  }
}

Summary

In this post, we first deployed Kubeflow on Amazon EKS with GPU worker nodes. We then walked through a typical data scientist’s workflow of training a machine learning model using a Jupyter notebook and then serving it as a microservice on Kubernetes.

To clean up, run kubectl delete namespace ${NAMESPACE} to delete all the resources created under the kubeflow namespace.
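
If you created the cluster with eksctl and no longer need it, you can also delete the cluster itself to stop incurring charges:

$ eksctl delete cluster eks-kubeflow --region us-west-2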

You can continue your exploration of Kubeflow on EKS in our open source Kubernetes and Machine Learning workshop.

Eng-Hwa Tan

Eng-Hwa is a Solutions Architect at Amazon Web Services. He likes to share his passion for microservices, containers and serverless technologies with developers and help them build innovative solutions. Prior to AWS, Eng-Hwa spent over 16 years in various roles such as system engineer, architect and consultant. He can be reached on Twitter @code4kopi.

Arun Gupta

Arun Gupta is a former Principal Open Source Technologist at Amazon Web Services. He has built and led developer communities for 12+ years at Sun, Oracle, Red Hat, and Couchbase. He has extensive speaking experience in more than 40 countries on myriad topics and is a JavaOne Rock Star for four years in a row. Gupta also founded the Devoxx4Kids chapter in the US and continues to promote technology education among children. A prolific blogger, author of several books, an avid runner, a globe trotter, a Docker Captain, a Java Champion, a JUG leader, and a NetBeans Dream Team member, he is easily accessible at @arungupta.