AWS Storage Blog
Machine Learning with Kubeflow on Amazon EKS with Amazon EFS
Training machine learning models involves multiple steps, and it becomes more complex and time consuming when the training dataset is in the range of hundreds of GBs. Data scientists run through a large number of experiments, testing and training many models along the way. Kubeflow provides various ML capabilities to accelerate the training process and run simple, portable, and scalable machine learning workloads on Kubernetes.
Model parallelism is a distributed training method in which a deep learning model is partitioned across multiple devices, within or across instances. When data scientists adopt model parallelism, they also need to share large datasets across machine learning models.
In part 1 of this two-part blog series, we covered persistent storage for Kubernetes and an example workload that used Amazon Elastic Kubernetes Service (EKS) with Amazon Elastic File System (EFS) as persistent storage.
In this blog, we walk through how you can use Kubeflow on Amazon EKS to implement model parallelism and use Amazon EFS as persistent storage to share datasets. You can use Kubeflow on top of Amazon EKS to build, train, tune, and deploy ML models for a wide variety of use cases, including computer vision, natural language processing, speech translation, and financial modeling. With Amazon EFS as the backing storage, you can also improve performance for your model training and inference.
Solution overview
The architecture uses Amazon EKS as a compute layer, wherein we create different `pods` to perform our ML training jobs and use Amazon EFS as the storage layer to store our training datasets. We use Amazon ECR as the image repository that Amazon EKS uses to store the ML training container images.
Figure 1: Architecture of Kubeflow on Amazon EKS with Amazon EFS
As part of this architecture, we will do the following:
- Install and configure Kubeflow on Amazon EKS.
- Set up Amazon EFS as persistent storage with Kubeflow.
- Create a Jupyter notebook on Kubeflow.
- Perform a machine learning training using an ML (TensorFlow) image from Amazon ECR.
Prerequisites
Complete the following steps to create the EKS cluster and install the necessary tools:
- Complete the initial AWS Cloud9 setup as per the tutorial.
- Install the prerequisite tools.
- Create the Kubernetes cluster using Amazon EKS by following the instructions here.
Verify your EKS cluster by running the following command:
$ kubectl get nodes -o=wide
Install and configure Kubeflow
We use the Kustomize tool to install Kubeflow. Use the following commands to install Kustomize (version 3.2.0):
$ wget -O kustomize https://github.com/kubernetes-sigs/kustomize/releases/download/v3.2.0/kustomize_3.2.0_linux_amd64
$ chmod +x kustomize
$ sudo mv -v kustomize /usr/local/bin
Verify that Kustomize is installed properly:
$ kustomize version
Use the following commands to set up Kubeflow to deliver an end-to-end workflow for training ML models:
$ git clone https://github.com/aws-samples/amazon-efs-developer-zone.git
$ cd amazon-efs-developer-zone/application-integration/container/eks/kubeflow/manifests
$ while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done
Verify the installation by confirming that the pods are in the Running state:
$ kubectl get pods -n cert-manager
$ kubectl get pods -n istio-system
$ kubectl get pods -n auth
$ kubectl get pods -n knative-eventing
$ kubectl get pods -n knative-serving
$ kubectl get pods -n kubeflow
$ kubectl get pods -n kubeflow-user-example-com
Set up Amazon EFS as persistent storage with Kubeflow
Let’s look at the detailed steps involved in setting up the persistent storage.
1. Create an OIDC provider for the cluster
$ export CLUSTER_NAME=efsworkshop-eksctl
$ eksctl utils associate-iam-oidc-provider --cluster $CLUSTER_NAME --approve
2. Set up EFS using the automated script auto-efs-setup.py. The script applies default values for the file system name and performance mode, and does the following:
- Installs the EFS CSI driver
- Creates the IAM policy for the CSI driver
- Creates an EFS file system
- Creates a storage class for the cluster
3. Run the auto-efs-setup.py script:
$ cd ml/efs
$ pip install -r requirements.txt
$ python auto-efs-setup.py --region $AWS_REGION --cluster $CLUSTER_NAME --efs_file_system_name myEFS1
================================================================
EFS Setup
================================================================
Prerequisites Verification
================================================================
Verifying OIDC provider...
OIDC provider found
Verifying eksctl is installed...
eksctl found!
...
...
Setting up dynamic provisioning...
Editing storage class with appropriate values...
Creating storage class...
storageclass.storage.k8s.io/efs-sc created
Storage class created!
Dynamic provisioning setup done!
================================================================
EFS Setup Complete
================================================================
4. Verify the storage class in the Kubernetes cluster:
$ kubectl get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
efs-sc efs.csi.aws.com Delete WaitForFirstConsumer true 96s
gp2 (default) kubernetes.io/aws-ebs Delete WaitForFirstConsumer false 148m
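The efs-sc storage class that the script creates is roughly equivalent to a manifest like the following sketch. This is for illustration only: the script fills in the ID of the file system it provisions, so the `fileSystemId` shown here is a placeholder, and the parameter values are assumed defaults of the EFS CSI driver's dynamic provisioning mode.

```yaml
# Sketch of the storage class created by auto-efs-setup.py (illustrative only).
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap            # dynamic provisioning via EFS access points
  fileSystemId: fs-0123456789abcdef0  # placeholder; the script uses the real ID
  directoryPerms: "700"               # permissions for the access point root directory
```

With `provisioningMode: efs-ap`, each PVC bound to this class gets its own EFS access point, which is why you later see an access point appear in the Amazon EFS console.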
Creating a Jupyter Notebook on Kubeflow
1. Run the following to port-forward Istio’s Ingress Gateway to local port 8080:
$ kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080
...
...
2. In the Cloud9 console, select Tools > Preview > Preview Running Application to access the dashboard. You can select the pop-out window button to maximize the browser into a new tab.
Figure 2: AWS Cloud9 cloud-based IDE
- Keep the current terminal running so you don’t lose access to the UI page.
3. Log in to the Kubeflow dashboard with the default user credentials: the default email address is user@example.com and the default password is 12341234.
Figure 3: Kubeflow dashboard
4. Create a Jupyter notebook by selecting Notebooks, then New Server.
5. Name the notebook “notebook1”, keep the rest of the settings at their defaults, scroll down, and select LAUNCH.
Figure 4: Kubeflow Dashboard (Notebook Creation)
At this point, the EFS CSI driver creates an access point: the new notebook on Kubeflow internally creates a pod, which in turn creates a PVC that dynamically provisions storage from the efs-sc storage class we set up for this EKS cluster. Wait for the notebook to reach the Ready state.
Figure 5: Kubeflow dashboard (Notebook server)
6. Now we can use `kubectl` to check the PV (Persistent Volume) and PVC (Persistent Volume Claim) that were created under the hood.
$ kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-3d1806bc-984c-404d-9c2a-489408279bad 20Gi RWO Delete Bound kubeflow/minio-pvc gp2 52m
pvc-8f638f2c-7493-461c-aee8-984760e233c2 10Gi RWO Delete Bound kubeflow-user-example-com/workspace-nootbook1 efs-sc 5m16s
pvc-940c8ebf-5632-4024-a413-284d5d288592 10Gi RWO Delete Bound kubeflow/katib-mysql gp2 52m
pvc-a8f5e29f-d29d-4d61-90a8-02beeb2c638c 20Gi RWO Delete Bound kubeflow/mysql-pv-claim gp2 52m
pvc-af81feba-6fd6-43ad-90e4-270727e6113e 10Gi RWO Delete Bound istio-system/authservice-pvc gp2 52m
$ kubectl get pvc -n kubeflow-user-example-com
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
workspace-nootbook1 Bound pvc-8f638f2c-7493-461c-aee8-984760e233c2 10Gi RWO efs-sc 5m59s
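Kubeflow generates the workspace PVC automatically, so you never write it by hand, but it corresponds to a manifest roughly like the following sketch. The name follows the pattern `workspace-<notebook name>`, and the capacity matches the notebook server's default volume size.

```yaml
# Sketch of the PVC Kubeflow creates for the notebook workspace (illustrative only).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: workspace-notebook1            # Kubeflow names it workspace-<notebook name>
  namespace: kubeflow-user-example-com
spec:
  accessModes:
    - ReadWriteOnce                    # a workspace volume serves a single notebook pod
  storageClassName: efs-sc             # triggers dynamic provisioning by the EFS CSI driver
  resources:
    requests:
      storage: 10Gi
```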
Finally, you should be able to see the access point in the AWS console.
Figure 6: Amazon EFS console (for Access Point)
So, at this point you can make use of this Jupyter Notebook.
Perform a Machine Learning training
Next, let’s create a PVC for our machine learning training dataset in ReadWriteMany mode. In the Kubeflow dashboard, go to Volumes → New Volume and create a new volume named dataset with efs-sc as the storage class.
Figure 7: Creating a new volume from the Kubeflow dashboard
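Creating the volume through the dashboard is equivalent to applying a PVC manifest like this sketch (the capacity is an assumed example value):

```yaml
# Sketch of a PVC equivalent to the "dataset" volume created in the UI.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dataset
  namespace: kubeflow-user-example-com
spec:
  accessModes:
    - ReadWriteMany        # multiple training pods can mount the volume concurrently
  storageClassName: efs-sc # EFS supports ReadWriteMany, unlike EBS-backed gp2
  resources:
    requests:
      storage: 10Gi        # assumed example size
```

ReadWriteMany is the key property here: it is what lets every worker in a distributed training job read the same shared dataset.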
Now, you can follow the steps in the GitHub repository to use this persistent volume to store the training dataset and perform an ML training.
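As an illustration of how the shared volume plugs into a distributed training job, the following is a minimal, hypothetical TFJob sketch that mounts the dataset PVC into each worker. The image URI, training script, and replica count are placeholders, not values from the repository; substitute your own ECR image and entry point.

```yaml
# Hypothetical TFJob mounting the shared "dataset" PVC (illustrative only).
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: training-job
  namespace: kubeflow-user-example-com
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2                  # placeholder; both workers share the same EFS volume
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow     # the TFJob controller expects this container name
              image: <account-id>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>  # placeholder
              command: ["python", "/opt/train.py"]   # placeholder entry point
              volumeMounts:
                - name: dataset
                  mountPath: /data # training code reads the dataset from /data
          volumes:
            - name: dataset
              persistentVolumeClaim:
                claimName: dataset # the ReadWriteMany PVC created above
```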
Cleaning up
To avoid incurring unwanted future charges, delete the Amazon EKS cluster by completing the steps covered in this documentation.
Conclusion
In this blog post, we walked you through how to set up Kubeflow for your machine learning workflow on Amazon EKS. We also covered how you can use Amazon EFS as a shared persistent file system to store your training datasets. We highlighted the value that Kubeflow on AWS provides through native AWS-managed service integrations for secure, scalable, and enterprise-ready AI and ML workloads. To get started with Kubeflow on AWS, refer to the available AWS-integrated deployment options in Kubeflow on AWS and the documentation mentioned below.
If you are new to Kubernetes and storage integration with Kubernetes, refer to part 1 of this two-part blog series.
For more information, visit the following resources: