AWS Open Source Blog

Enterprise-ready Kubeflow: Securing and scaling AI and machine learning pipelines with AWS

NOTE: Since this blog post was written, much about Kubeflow has changed. While we are leaving it up for historical reference, more accurate information about Kubeflow on AWS can be found here.

Many AWS customers are building AI and machine learning pipelines on top of Amazon Elastic Kubernetes Service (Amazon EKS) using Kubeflow across many use cases, including computer vision, natural language understanding, speech translation, and financial modeling. In this post, we will describe AWS contributions to the Kubeflow project, which provide enterprise readiness for Kubeflow deployments.

Originally open sourced in December 2017, the Kubeflow project reached its 1.0 milestone in March 2020. With this release, Kubeflow has graduated key components of the build, train, optimize, and deploy user journey for machine learning. These components include the Kubeflow dashboard UI, multi-user Jupyter Notebooks, Kubeflow Pipelines, and KFServing, as well as distributed training operators for TensorFlow, PyTorch, and XGBoost.

Our contributions to Kubeflow help to democratize machine learning, streamline data science tasks, and allow customers to leverage the highly optimized, cloud-native, enterprise-ready AWS services with Kubeflow. Customers have a clear path to use Kubeflow with Amazon EKS for managed Kubernetes clusters, Amazon Simple Storage Service (Amazon S3) for object storage, Amazon Relational Database Service (Amazon RDS) for pipeline metadata, Amazon Elastic File System (Amazon EFS) for shared file access, Amazon FSx for Lustre for increased training performance, Amazon CloudWatch for logging/metrics, and Amazon SageMaker for AI and machine learning integration.

[Figure: Kubeflow logo surrounded by AWS logos]

Security

Because security is a top priority at AWS, we have integrated the Kubeflow security model with AWS security services, in line with the AWS shared responsibility model. Integrations include IAM Roles for Service Accounts for fine-grained access control at the Kubernetes Pod level, Application Load Balancer (ALB) for external traffic management and authentication, AWS Shield for DDoS protection, AWS Certificate Manager (ACM) for in-transit encryption, AWS Key Management Service (AWS KMS) for at-rest encryption, and Amazon Cognito for user management.

Kubeflow users can configure an Application Load Balancer to securely authenticate users either through Amazon Cognito or through an identity provider (IdP) that is OpenID Connect (OIDC) compliant. When you create a Kubeflow profile using Kubeflow’s profile controller, an AWS Identity and Access Management (IAM) role binds to a Kubernetes service account in the user’s namespace. This binding grants the user’s workloads the AWS permissions attached to that role. Additionally, Istio and Kubernetes role-based access control (RBAC) policies are created along with the profile. RBAC authorizes users for specific Kubernetes resources and isolates them from one another. By deploying Kubeflow on Amazon EKS, customers can enable private cluster-endpoint access to keep traffic within their Virtual Private Cloud (VPC) and completely disable public access from the internet.
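To make the role-to-service-account binding more concrete, here is a minimal sketch of the kind of IAM trust policy that IAM Roles for Service Accounts relies on. The helper function is hypothetical (it is not part of Kubeflow or any AWS SDK), and the account ID, OIDC provider ID, namespace, and service account name are placeholders we chose for illustration. The profile controller may generate something different in practice.

```python
import json


def irsa_trust_policy(account_id, oidc_provider, namespace, service_account):
    """Build a trust policy that lets one Kubernetes service account assume a role.

    oidc_provider is the cluster's OIDC issuer without the https:// prefix.
    This helper is illustrative only.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The cluster's OIDC provider is the federated principal.
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                # Restrict the role to a single service account in a single namespace.
                "Condition": {
                    "StringEquals": {
                        f"{oidc_provider}:sub": (
                            f"system:serviceaccount:{namespace}:{service_account}"
                        )
                    }
                },
            }
        ],
    }


# All values below are placeholders, not real identifiers.
policy = irsa_trust_policy(
    account_id="111122223333",
    oidc_provider="oidc.eks.us-west-2.amazonaws.com/id/EXAMPLED539D4633E53DE1B716D3041E",
    namespace="alice",
    service_account="default-editor",
)
print(json.dumps(policy, indent=2))
```

The condition on the `sub` claim is what scopes the role to one namespace: a Pod running under a different service account cannot assume it.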

Compute, autoscaling, and Spot Instances

Amazon CloudWatch logs and metrics allow customers to easily create dashboards and alerts to monitor Kubeflow resources, such as the health of Kubeflow Pipelines and performance of TensorFlow/PyTorch/MXNet models. Customers can choose from a variety of CPU and GPU instance types available on Amazon Elastic Compute Cloud (Amazon EC2) to power their Kubeflow workloads depending on their business needs. Kubeflow running on Amazon EKS will automatically detect GPUs and install the appropriate GPU device plugin on each instance.

Additionally, Amazon EKS supports Spot Instances and cluster autoscaling with Kubeflow. Using Spot Instances, customers can save up to 90% compared with On-Demand Instances. Cluster autoscaling will dynamically increase or decrease the number of nodes in your Kubeflow cluster based on resource utilization. AWS has contributed a number of improvements to the GPU autoscaling and Spot Instance user experience in Cluster Autoscaler.
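As a rough illustration of how Spot savings are calculated, the sketch below compares the cost of a training job on On-Demand and Spot capacity. The hourly prices, job duration, and node count are made-up numbers, not real AWS pricing. Actual Spot prices vary by instance type, Region, and time.

```python
def spot_savings(on_demand_hourly, spot_hourly, hours, nodes):
    """Return (on-demand cost, spot cost, percent saved) for a fixed-size job."""
    on_demand_cost = on_demand_hourly * hours * nodes
    spot_cost = spot_hourly * hours * nodes
    percent_saved = (on_demand_cost - spot_cost) / on_demand_cost * 100
    return on_demand_cost, spot_cost, percent_saved


# Hypothetical prices in USD per instance-hour, for illustration only.
on_demand, spot, pct = spot_savings(
    on_demand_hourly=3.00, spot_hourly=0.90, hours=100, nodes=4
)
print(f"On-Demand: ${on_demand:.2f}, Spot: ${spot:.2f}, savings: {pct:.0f}%")
```

This simple model ignores Spot interruptions. A job that is interrupted and must repeat work would see lower effective savings, which is one reason checkpointing matters for training on Spot capacity.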

AI and machine learning pipelines

Kubeflow Pipelines can automate complex AI and machine learning pipelines using custom components available for many AWS services, including Amazon Athena, Ground Truth, Amazon EMR, and Amazon SageMaker. For example, a typical pipeline might include data ingestion with Amazon Athena, data labeling with Ground Truth, feature engineering with Apache Spark on Amazon EMR, and model training/deploying with SageMaker.
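The example pipeline above is essentially a dependency graph in which each step runs after the steps it depends on. The sketch below models that ordering with the Python standard library only. It deliberately does not use the Kubeflow Pipelines SDK or the AWS components, and the step names are labels we invented for illustration.

```python
from graphlib import TopologicalSorter

# Each key is a pipeline step; its value is the set of steps it depends on.
pipeline = {
    "ingest_with_athena": set(),
    "label_with_ground_truth": {"ingest_with_athena"},
    "feature_engineering_on_emr": {"label_with_ground_truth"},
    "train_with_sagemaker": {"feature_engineering_on_emr"},
    "deploy_with_sagemaker": {"train_with_sagemaker"},
}

# A topological sort yields an order in which every step follows its dependencies.
execution_order = list(TopologicalSorter(pipeline).static_order())

for position, step in enumerate(execution_order, start=1):
    print(f"{position}. {step}")
```

In a real pipeline, each step would be a containerized component, and the orchestrator would derive these dependencies from the way one step's outputs feed into the next step's inputs.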

In June 2020, we open sourced SageMaker Components for Kubeflow Pipelines to help customers create best-of-breed AI/ML pipelines to train, tune, and deploy models with Kubeflow and Amazon SageMaker.

Storage and distributed file systems

Kubeflow builds upon Kubernetes to provide a solid infrastructure for large-scale, distributed data processing, including AI/ML model training and tuning. Because distributed processing often requires a distributed file system, AWS provides multiple high-performance, cloud-native, distributed file systems, including Amazon Elastic File System (Amazon EFS) and Amazon FSx for Lustre. These POSIX-compliant file systems are optimized for large-scale and compute-intensive workloads, including high-performance computing (HPC), AI, and machine learning. Kubeflow leverages the Kubernetes-native Container Storage Interface (CSI) drivers for Amazon EFS and Amazon FSx.
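To show how a shared file system might be requested through the CSI driver, here is a minimal sketch that builds a StorageClass and a PersistentVolumeClaim as Python dictionaries and prints them as JSON. The object names, namespace, and requested size are placeholders we chose for illustration. A working setup would also need either a PersistentVolume that points at an existing file system or StorageClass parameters for dynamic provisioning, neither of which is shown here.

```python
import json

# StorageClass backed by the Amazon EFS CSI driver.
storage_class = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "StorageClass",
    "metadata": {"name": "efs-sc"},  # placeholder name
    "provisioner": "efs.csi.aws.com",
}

# Claim that many training Pods can mount at the same time.
claim = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "training-data", "namespace": "alice"},  # placeholders
    "spec": {
        # ReadWriteMany lets Pods on different nodes share the volume.
        "accessModes": ["ReadWriteMany"],
        "storageClassName": storage_class["metadata"]["name"],
        # The PVC schema requires a size; the value here is arbitrary.
        "resources": {"requests": {"storage": "5Gi"}},
    },
}

for manifest in (storage_class, claim):
    print(json.dumps(manifest, indent=2))
```

The `ReadWriteMany` access mode is the key detail for distributed training, because workers on different nodes need to read the same dataset and write checkpoints to the same location.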

When running Kubeflow experiments and hyperparameter tuning jobs, customers can now store metadata in Amazon RDS and artifacts in Amazon S3 object storage for better performance, scalability, and durability. Storing data in Amazon RDS and Amazon S3 adds stability to your Kubeflow cluster, even across Kubeflow version upgrades. You can find more information in the AWS Storage Options, Configuring RDS, and Using S3 for Pipeline Artifacts sections of the Kubeflow documentation.

What’s next

We have a full roadmap of improvements to the Kubeflow experience on AWS. Highlights planned for future releases include:

  • Provide simple Kubeflow installation and management using AWS CloudFormation.
  • Streamline the end-to-end experience for building, training, tuning, and deploying AI/ML models.
  • Integrate Feast feature store with Kubeflow on Amazon EKS.
  • Graduate MXNet Operator to 1.0 and add more production-grade features to the TensorFlow and PyTorch operators.
  • Build more data-processing components to integrate with additional AWS services.

Summary

In this post, we highlighted how AWS customers can use Kubeflow with native AWS-managed services for secure, scalable, and enterprise-ready AI/ML workloads. We encourage you to set up Kubeflow on Amazon EKS by following our EKS + Kubeflow workshop and sample notebooks and pipelines.

For more information on Kubeflow on AWS, check out the Kubeflow documentation. For more information on Kubeflow and Amazon SageMaker, review the SageMaker Components for Kubeflow Pipelines documentation. You can also find us on the Kubeflow #AWS Slack Channel, and we welcome your feedback there because it helps us prioritize the next features to contribute to the Kubeflow project. Lastly, please join us for the monthly Kubeflow and AWS community events online.

Chris Fregly

Chris is a Principal Specialist Solution Architect for AI and machine learning at Amazon Web Services (AWS) based in San Francisco, California. He is co-author of the O'Reilly book "Data Science on AWS." Chris is also the founder of many global meetups focused on Apache Spark, TensorFlow, Ray, and Kubeflow. He regularly speaks at AI and machine learning conferences across the world, including O’Reilly AI, Open Data Science Conference, and Big Data Spain.

Jiaxin Shan

Jiaxin Shan is a Software Engineer for Amazon EKS, leading initiatives for big data and machine learning adoption on Kubernetes. He is an active Kubernetes and Kubeflow contributor and spends most of his time in sig-autoscaling, ug-bigdata, wg-machine-learning, and sig-scheduling.