AWS Storage Blog
Machine Learning with Kubeflow on Amazon EKS with Amazon EFS
Training machine learning models involves multiple steps, and it becomes more complex and time consuming when the training dataset is in the range of hundreds of GBs. Data scientists run through a large number of experiments, testing and training many models along the way. Kubeflow provides various ML capabilities to accelerate the training process and run simple, portable, and scalable machine learning workloads on Kubernetes.
Model parallelism is a distributed training method in which a deep learning model is partitioned across multiple devices, within or across instances. When data scientists adopt model parallelism, they also need to share large datasets across machine learning models.
In part 1 of this two-part blog series, we covered persistent storage for Kubernetes and an example workload that used Amazon Elastic Kubernetes Service (EKS) with Amazon Elastic File System (EFS) as persistent storage.
In this blog, we walk through how you can use Kubeflow on Amazon EKS to implement model parallelism and use Amazon EFS as persistent storage to share datasets. You can use Kubeflow on top of Amazon EKS to build, train, tune, and deploy ML models for a wide variety of use cases, including computer vision, natural language processing, speech translation, and financial modeling. With Amazon EFS as the backing storage, you can also improve performance for your model training and inference.
Solution overview
The architecture uses Amazon EKS as a compute layer, wherein we create different `pods` to perform our ML training jobs and use Amazon EFS as the storage layer to store our training datasets. We use Amazon ECR as the image repository that Amazon EKS uses to store the ML training container images.
Figure 1: Architecture of Kubeflow on Amazon EKS with Amazon EFS
As part of this architecture, we will do the following:
- Install and configure Kubeflow on Amazon EKS.
- Set up Amazon EFS as persistent storage with Kubeflow.
- Create a Jupyter notebook on Kubeflow.
- Perform a machine learning training using an ML (TensorFlow) image from Amazon ECR.
Prerequisites
Complete the following steps to create the EKS cluster and install the necessary tools:
- Complete the initial AWS Cloud9 setup as per the tutorial.
- Install the prerequisite tools.
- Create the Kubernetes cluster using Amazon EKS by following the instructions here.
Verify your EKS cluster by running the following command:
$ kubectl get nodes -o=wide
Install and configure Kubeflow
We use the Kustomize tool to install Kubeflow. Use the following commands to install Kustomize (version 3.2.0):
$ wget -O kustomize https://github.com/kubernetes-sigs/kustomize/releases/download/v3.2.0/kustomize_3.2.0_linux_amd64
$ chmod +x kustomize
$ sudo mv -v kustomize /usr/local/bin
Verify that Kustomize is installed properly:
$ kustomize version
Use the following commands to set up Kubeflow to deliver an end-to-end workflow for training ML models:
$ git clone https://github.com/aws-samples/amazon-efs-developer-zone.git
$ cd amazon-efs-developer-zone/application-integration/container/eks/kubeflow/manifests
$ while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done
Verify the installation by confirming that the pods are in the Running state:
$ kubectl get pods -n cert-manager
$ kubectl get pods -n istio-system
$ kubectl get pods -n auth
$ kubectl get pods -n knative-eventing
$ kubectl get pods -n knative-serving
$ kubectl get pods -n kubeflow
$ kubectl get pods -n kubeflow-user-example-com
Set up Amazon EFS as persistent storage with Kubeflow
Let’s look at the detailed steps involved in setting up the persistent storage.
1. Create an OIDC provider for the cluster
$ export CLUSTER_NAME=efsworkshop-eksctl
$ eksctl utils associate-iam-oidc-provider --cluster $CLUSTER_NAME --approve
2. Set up EFS using the automated script auto-efs-setup.py. The script applies default values for the file system name and performance mode, and does the following:
- Installs the EFS CSI driver
- Creates the IAM policy for the CSI driver
- Creates an EFS file system
- Creates a storage class for the cluster
3. Run the auto-efs-setup.py script:
$ cd ml/efs
$ pip install -r requirements.txt
$ python auto-efs-setup.py --region $AWS_REGION --cluster $CLUSTER_NAME --efs_file_system_name myEFS1
================================================================
EFS Setup
================================================================
Prerequisites Verification
================================================================
Verifying OIDC provider...
OIDC provider found
Verifying eksctl is installed...
eksctl found!
...
...
Setting up dynamic provisioning...
Editing storage class with appropriate values...
Creating storage class...
storageclass.storage.k8s.io/efs-sc created
Storage class created!
Dynamic provisioning setup done!
================================================================
EFS Setup Complete
================================================================
4. Verify the storage class in the Kubernetes cluster:
$ kubectl get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
efs-sc efs.csi.aws.com Delete WaitForFirstConsumer true 96s
gp2 (default) kubernetes.io/aws-ebs Delete WaitForFirstConsumer false 148m
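The efs-sc storage class that the script creates is roughly equivalent to a manifest like the following sketch. This is for illustration only: the script fills in the ID of the file system it provisions, so the `fileSystemId` shown here is a placeholder, and the parameter values are assumed defaults of the EFS CSI driver's dynamic provisioning mode.

```yaml
# Sketch of the storage class created by auto-efs-setup.py (illustrative only).
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap            # dynamic provisioning via EFS access points
  fileSystemId: fs-0123456789abcdef0  # placeholder; the script uses the real ID
  directoryPerms: "700"               # permissions for the access point root directory
```

With `provisioningMode: efs-ap`, each PVC bound to this class gets its own EFS access point, which is why you later see an access point appear in the Amazon EFS console.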
Creating a Jupyter Notebook on Kubeflow
1. Run the following to port-forward Istio’s Ingress Gateway to local port 8080:
$ kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080
...
...
2. In the Cloud9 console, select Tools > Preview > Preview Running Application to access the dashboard. You can select the pop-out window button to maximize the browser into a new tab.
Figure 2: AWS Cloud9 cloud-based IDE
- Keep the current terminal running so you don’t lose access to the UI page.
3. Log in to the Kubeflow dashboard with the default user credentials: the default email address is user@example.com and the default password is 12341234.
Figure 3: Kubeflow dashboard
4. Create a Jupyter notebook by selecting Notebooks, then New Server.
5. Name the notebook “notebook1”, keep the rest of the settings at their defaults, scroll down, and select LAUNCH.
Figure 4: Kubeflow Dashboard (Notebook Creation)
At this point, the EFS CSI driver creates an access point: the new notebook on Kubeflow internally creates a pod, which in turn creates a PVC that dynamically provisions storage from the efs-sc storage class we set up for this EKS cluster. Wait for the notebook to reach the Ready state.
Figure 5: Kubeflow dashboard (Notebook server)
6. Now we can use `kubectl` to check the PV (Persistent Volume) and PVC (Persistent Volume Claim) that were created under the hood.
$ kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-3d1806bc-984c-404d-9c2a-489408279bad 20Gi RWO Delete Bound kubeflow/minio-pvc gp2 52m
pvc-8f638f2c-7493-461c-aee8-984760e233c2 10Gi RWO Delete Bound kubeflow-user-example-com/workspace-nootbook1 efs-sc 5m16s
pvc-940c8ebf-5632-4024-a413-284d5d288592 10Gi RWO Delete Bound kubeflow/katib-mysql gp2 52m
pvc-a8f5e29f-d29d-4d61-90a8-02beeb2c638c 20Gi RWO Delete Bound kubeflow/mysql-pv-claim gp2 52m
pvc-af81feba-6fd6-43ad-90e4-270727e6113e 10Gi RWO Delete Bound istio-system/authservice-pvc gp2 52m
$ kubectl get pvc -n kubeflow-user-example-com
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
workspace-nootbook1 Bound pvc-8f638f2c-7493-461c-aee8-984760e233c2 10Gi RWO efs-sc 5m59s
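Kubeflow generates the workspace PVC automatically, so you never write it by hand, but it corresponds to a manifest roughly like the following sketch. The name follows the pattern `workspace-<notebook name>`, and the capacity matches the notebook server's default volume size.

```yaml
# Sketch of the PVC Kubeflow creates for the notebook workspace (illustrative only).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: workspace-notebook1            # Kubeflow names it workspace-<notebook name>
  namespace: kubeflow-user-example-com
spec:
  accessModes:
    - ReadWriteOnce                    # a workspace volume serves a single notebook pod
  storageClassName: efs-sc             # triggers dynamic provisioning by the EFS CSI driver
  resources:
    requests:
      storage: 10Gi
```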
Finally, you should be able to see the access point in the AWS console.
Figure 6: Amazon EFS console (for Access Point)
So, at this point you can make use of this Jupyter Notebook.
Perform a Machine Learning training
Next, let’s create a PVC for our machine learning training dataset in ReadWriteMany mode. In the Kubeflow dashboard, go to Volumes → New Volume and create a new volume named dataset with efs-sc as the storage class.
Figure 7: Creating a new volume from the Kubeflow dashboard
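Creating the volume through the dashboard is equivalent to applying a PVC manifest like this sketch (the capacity is an assumed example value):

```yaml
# Sketch of a PVC equivalent to the "dataset" volume created in the UI.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dataset
  namespace: kubeflow-user-example-com
spec:
  accessModes:
    - ReadWriteMany        # multiple training pods can mount the volume concurrently
  storageClassName: efs-sc # EFS supports ReadWriteMany, unlike EBS-backed gp2
  resources:
    requests:
      storage: 10Gi        # assumed example size
```

ReadWriteMany is the key property here: it is what lets every worker in a distributed training job read the same shared dataset.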
Now, you can follow the steps in the GitHub repository to use this persistent volume to store the training dataset and perform an ML training.
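As an illustration of how the shared volume plugs into a distributed training job, the following is a minimal, hypothetical TFJob sketch that mounts the dataset PVC into each worker. The image URI, training script, and replica count are placeholders, not values from the repository; substitute your own ECR image and entry point.

```yaml
# Hypothetical TFJob mounting the shared "dataset" PVC (illustrative only).
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: training-job
  namespace: kubeflow-user-example-com
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2                  # placeholder; both workers share the same EFS volume
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow     # the TFJob controller expects this container name
              image: <account-id>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>  # placeholder
              command: ["python", "/opt/train.py"]   # placeholder entry point
              volumeMounts:
                - name: dataset
                  mountPath: /data # training code reads the dataset from /data
          volumes:
            - name: dataset
              persistentVolumeClaim:
                claimName: dataset # the ReadWriteMany PVC created above
```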
Cleaning up
To avoid incurring unwanted future charges, delete the Amazon EKS cluster by completing the steps covered in this documentation.
Conclusion
In this blog post, we walked you through how to set up Kubeflow for your machine learning workflow on Amazon EKS. We also covered how you can use Amazon EFS as a shared persistent file system to store your training datasets. We highlighted the value that Kubeflow on AWS provides through native AWS-managed service integrations for secure, scalable, and enterprise-ready AI and ML workloads. To get started with Kubeflow on AWS, refer to the available AWS-integrated deployment options in Kubeflow on AWS and the documentation mentioned below.
If you are new to Kubernetes and storage integration with Kubernetes, refer to part 1 of this two-part blog series.
For more information, visit the following resources: