AWS Storage Blog

Using high-performance storage for machine learning workloads on Kubernetes

Organizations are modernizing their applications by adopting containers and microservices-based architectures. Many customers are deploying high-performance workloads on containers to power microservices architecture, and require access to low latency and high throughput shared storage from these containers. Because containers are transient in nature, these long-running applications require data to be stored in durable storage.

Amazon FSx for Lustre (FSx for Lustre) provides the world’s most popular high-performance file system, fully managed and integrated with Amazon S3. It offers a POSIX-compliant, fast parallel file system to enable peak performance and highly durable storage for your Kubernetes workloads. FSx for Lustre allows you to spin up a high-performance file system in minutes, getting rid of the traditional complexity of setting up and managing Lustre file systems. FSx for Lustre provides sub-millisecond latencies, up to hundreds of gigabytes per second of throughput, and millions of IOPS. Customers use FSx for Lustre for workloads where speed matters, such as machine learning (ML), high performance computing (HPC), video processing, and financial modeling.

Kubernetes is an open-source container-orchestration system for automating the deployment, scaling, and management of containerized applications. With our managed service, Amazon Elastic Kubernetes Service (Amazon EKS), you can run Kubernetes on AWS without needing to install and operate your own Kubernetes control plane or worker nodes. Amazon EKS runs Kubernetes control plane instances across multiple Availability Zones to ensure high availability. It automatically detects and replaces unhealthy control plane instances, and it provides automated version upgrades and patching for them.

In this post, I introduce my GitHub tutorial, which covers how to provision an FSx for Lustre persistent file system for an Amazon EKS cluster and how to accelerate your ML training using FSx for Lustre and Amazon SageMaker. While I focus on an ML use case in this post, FSx for Lustre persistent file systems can be used with any high-performance workload on Amazon EKS clusters where applications need access to a shared, persistent, and high-performance POSIX-compliant file system.

Basic components of Kubernetes containers

First, let’s review some basic components of a Kubernetes cluster and why we need shared persistent storage. A Pod is the basic execution unit of a Kubernetes application. It comprises one or more containers with shared storage and network resources, and a specification for how to run the containers. A Pod always runs on a Node, and each Node is managed by the Kubernetes control plane. A Node is a worker machine in Kubernetes and may be either a virtual or a physical machine. A Node can host multiple Pods, and the control plane automatically handles scheduling the Pods across the Nodes in the cluster. This set of Nodes, together with the components that make up the control plane, forms a Kubernetes cluster.

A Pod can use two types of volumes to store data: regular and persistent volumes. Regular volumes on Kubernetes clusters are deleted when the Pod that uses them shuts down, so they are useful for storing temporary data that does not need to exist outside of the Pod’s lifetime. A persistent volume is a cluster-wide resource that you can use to store data beyond the lifetime of a Pod. Its lifecycle is independent of any individual Pod that uses it, so it can remain available for as long as necessary for ongoing operations. A Pod can specify a set of shared storage volumes. All containers in the Pod can access the shared volumes, allowing those containers to share data. AWS offers customers a choice to provision persistent volumes using Container Storage Interface (CSI) drivers for Amazon EBS, Amazon EFS, and FSx for Lustre.
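The difference between the two volume types can be sketched as a Pod manifest. The following is a minimal illustration rather than part of the tutorial: the Pod name, container names, the `busybox` image, the mount paths, and the claim name `fsx-claim` are all hypothetical placeholders. It builds the manifest as a Python dictionary and prints it as JSON, which `kubectl` accepts alongside YAML.

```python
import json


def build_pod_manifest(claim_name: str) -> dict:
    """Build a Pod manifest whose two containers share two volumes."""
    # Both containers mount the same volumes, so they can exchange files.
    mounts = [
        {"name": "scratch", "mountPath": "/scratch"},
        {"name": "datasets", "mountPath": "/data"},
    ]
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "volume-demo"},  # hypothetical name
        "spec": {
            "containers": [
                {
                    "name": "writer",
                    "image": "busybox",
                    "command": ["sh", "-c", "date > /data/started.txt; sleep 3600"],
                    "volumeMounts": mounts,
                },
                {
                    "name": "reader",
                    "image": "busybox",
                    "command": ["sh", "-c", "sleep 5; cat /data/started.txt; sleep 3600"],
                    "volumeMounts": mounts,
                },
            ],
            "volumes": [
                # Regular (ephemeral) volume: removed together with the Pod.
                {"name": "scratch", "emptyDir": {}},
                # Persistent volume: referenced through a claim that outlives the Pod.
                {"name": "datasets", "persistentVolumeClaim": {"claimName": claim_name}},
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_pod_manifest("fsx-claim"), indent=2))
```

Anything written to `/scratch` disappears when the Pod is deleted, while files under `/data` stay on the storage behind the claim and are visible to the next Pod that mounts it.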

FSx for Lustre persistent file systems

Earlier this year, we announced the availability of the persistent file system deployment option for FSx for Lustre. The persistent file system option provides highly available and durable storage for workloads that run for extended, or indefinite, periods of time and are sensitive to disruptions.

FSx for Lustre stores data across multiple network file servers to maximize performance and reduce bottlenecks. These file servers have multiple disks. If a file server becomes unavailable on a persistent file system, it is replaced automatically within minutes of failure. During that time, client requests for data on that server transparently retry and eventually succeed after the file server is replaced. Data on persistent file systems is replicated on disks and any failed disks are automatically replaced, transparently.

We recommend using the FSx for Lustre persistent file system option to provision persistent storage for your Kubernetes clusters. The FSx for Lustre CSI driver provides an interface that allows Amazon EKS clusters to manage the lifecycle of FSx for Lustre file systems.
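As a rough sketch of what this looks like in practice, the snippet below builds a PersistentVolume and a matching PersistentVolumeClaim for an existing file system (static provisioning). The driver name `fsx.csi.aws.com` and the `dnsname` and `mountname` volume attributes follow the FSx for Lustre CSI driver's published examples as I understand them, so check them against the driver version you install. The file system ID, DNS name, mount name, object names, and capacity are hypothetical placeholders.

```python
import json

FSX_CSI_DRIVER = "fsx.csi.aws.com"


def build_fsx_volume_manifests(file_system_id, dns_name, mount_name, capacity_gib=1200):
    """Return (PersistentVolume, PersistentVolumeClaim) for an existing file system."""
    capacity = f"{capacity_gib}Gi"
    pv = {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": "fsx-pv"},  # hypothetical name
        "spec": {
            "capacity": {"storage": capacity},
            "volumeMode": "Filesystem",
            # Lustre is a shared file system, so many Pods may mount it read-write.
            "accessModes": ["ReadWriteMany"],
            "persistentVolumeReclaimPolicy": "Retain",
            "csi": {
                "driver": FSX_CSI_DRIVER,
                "volumeHandle": file_system_id,
                "volumeAttributes": {"dnsname": dns_name, "mountname": mount_name},
            },
        },
    }
    pvc = {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": "fsx-claim"},  # hypothetical name
        "spec": {
            "accessModes": ["ReadWriteMany"],
            # An empty storage class plus volumeName binds the claim to the PV above.
            "storageClassName": "",
            "resources": {"requests": {"storage": capacity}},
            "volumeName": pv["metadata"]["name"],
        },
    }
    return pv, pvc


if __name__ == "__main__":
    pv, pvc = build_fsx_volume_manifests(
        file_system_id="fs-0123456789abcdef0",  # placeholder
        dns_name="fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com",  # placeholder
        mount_name="examplemount",  # placeholder; use the value FSx reports
    )
    # A v1 List lets both objects be applied from a single file.
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": [pv, pvc]}, indent=2))
```

A Pod then consumes the file system by referencing the claim name, without needing to know any FSx-specific details.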

SageMaker Operator for Kubernetes

Next, let’s review how you can run ML workloads using Amazon SageMaker Operators for Kubernetes. Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy ML models quickly. SageMaker removes the heavy lifting from each step of the ML process to make it easier to develop high-quality models.

Amazon SageMaker Operators for Kubernetes makes it easier for developers and data scientists using Kubernetes to train, tune, and deploy ML models in Amazon SageMaker. You can install SageMaker Operators on your Kubernetes cluster in Amazon EKS to create SageMaker jobs natively using the Kubernetes API and command-line Kubernetes tools such as ‘kubectl’. The following diagram shows the high-level architecture for the use case I cover in my GitHub tutorial.

Figure: High-level architecture for the use case covered in the GitHub tutorial

GitHub tutorial

Finally, let’s bring these concepts to life in a detailed step-by-step tutorial that covers:

  • Deploying an Amazon EKS cluster and configuring Amazon SageMaker Operators for Kubernetes on this EKS cluster.
  • Installing the FSx for Lustre CSI driver for Kubernetes and provisioning a persistent FSx for Lustre high-performance file system.
  • Creating an Amazon SageMaker training job using an FSx for Lustre persistent file system as your input data source.
  • Training a gradient-boosting model in Amazon SageMaker using the Amazon SageMaker operator.
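To give a feel for the last two steps, here is a hedged sketch of a training job resource that reads its input from FSx for Lustre. It is not copied from the tutorial. The field names assume the operator's `TrainingJob` resource mirrors the SageMaker `CreateTrainingJob` API in lowerCamelCase, so verify them against the operator version you install. The XGBoost hyperparameters stand in for "a gradient-boosting model", and every ARN, ID, image URI, path, and instance type is a hypothetical placeholder.

```python
import json


def build_training_job(name, region, role_arn, training_image, file_system_id,
                       directory_path, subnet_id, security_group_id, output_s3_uri):
    """Build a TrainingJob manifest that uses FSx for Lustre as its input channel."""
    return {
        "apiVersion": "sagemaker.aws.amazon.com/v1",
        "kind": "TrainingJob",
        "metadata": {"name": name},
        "spec": {
            "region": region,
            "roleArn": role_arn,
            "algorithmSpecification": {
                "trainingImage": training_image,
                "trainingInputMode": "File",
            },
            # Assumed XGBoost settings; values are passed as strings.
            "hyperParameters": [
                {"name": "objective", "value": "reg:squarederror"},
                {"name": "num_round", "value": "100"},
                {"name": "max_depth", "value": "5"},
            ],
            "inputDataConfig": [{
                "channelName": "train",
                "contentType": "text/csv",
                "dataSource": {
                    "fileSystemDataSource": {
                        "fileSystemId": file_system_id,
                        "fileSystemType": "FSxLustre",
                        "fileSystemAccessMode": "ro",  # training only reads the data
                        "directoryPath": directory_path,
                    }
                },
            }],
            "outputDataConfig": {"s3OutputPath": output_s3_uri},
            "resourceConfig": {
                "instanceCount": 1,
                "instanceType": "ml.m5.xlarge",
                "volumeSizeInGB": 5,
            },
            "stoppingCondition": {"maxRuntimeInSeconds": 3600},
            # The job must run in a VPC that can reach the file system.
            "vpcConfig": {
                "subnets": [subnet_id],
                "securityGroupIds": [security_group_id],
            },
        },
    }


if __name__ == "__main__":
    job = build_training_job(
        name="xgboost-fsx-demo",
        region="us-east-1",
        role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
        training_image="<xgboost-training-image-uri>",
        file_system_id="fs-0123456789abcdef0",
        directory_path="/examplemount/training-data",
        subnet_id="subnet-0123456789abcdef0",
        security_group_id="sg-0123456789abcdef0",
        output_s3_uri="s3://example-bucket/output",
    )
    print(json.dumps(job, indent=2))
```

Saving the output to a file and applying it with `kubectl` would hand the job to the operator, which then creates the corresponding training job in SageMaker.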

To get started, launch the GitHub tutorial here: Using a high-performance persistent storage for machine learning workloads on Kubernetes!

Summary

In this post, I introduced my GitHub tutorial, which shows how to use an Amazon FSx for Lustre persistent file system with Amazon SageMaker to train an ML model from an Amazon EKS cluster. In the tutorial, you first set up SageMaker Operators on your Kubernetes cluster. Next, you configure an FSx for Lustre persistent file system as a persistent volume using the CSI driver on your Amazon EKS cluster. Then, you configure the training job to use FSx for Lustre as its input data source, and start training a gradient-boosting model using the Amazon SageMaker training operator.

Using FSx for Lustre accelerates your training jobs by enabling faster download of large datasets. Subsequent training jobs can use the dataset already available on the FSx for Lustre file system and avoid the cost of repeated Amazon S3 requests. FSx for Lustre persistent file systems can be used with any high-performance workload on Amazon EKS clusters when applications need access to a shared, persistent, and high-performance POSIX-compliant file system.

Thank you for reading this blog post. Please leave a comment if you have any questions or feedback.