AWS Open Source Blog

Scaling AI and Machine Learning Workloads with Ray on AWS

Logos for Ray, Amazon SageMaker, Amazon EC2, Amazon EKS, and Amazon EMR

Many AWS customers are running open source Ray-based AI and machine learning workloads on AWS using Amazon SageMaker, Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Kubernetes Service (Amazon EKS), and Amazon EMR across many use cases including data analytics, natural language understanding, computer vision, and time-series modeling.

In this post, we will describe AWS contributions to the Ray community to enable enterprise-scale AI and machine learning deployments with Ray on AWS. In addition, we will highlight various AWS services that help improve security, performance, and operational efficiency when running Ray on AWS.

This month, the Ray project reached its 2.0 milestone. With this release, the Ray community has improved many key components of the interactive data science user journey including model training, tuning, and serving. Ray 2.x improves stability, usability, and performance across many Ray features including the Jobs API, Dataset API, AI Runtime (AIR), and Ray Dashboard UI as shown here.

Ray dashboard

Amazon.com and AWS have both worked with the Ray community to improve the scalability of Ray and to integrate Ray with many AWS services, including AWS Identity and Access Management (IAM) for fine-grained access control, AWS Certificate Manager (ACM) for SSL/TLS in-transit encryption, AWS Key Management Service (AWS KMS) for at-rest encryption, Amazon Simple Storage Service (Amazon S3) for object storage, Amazon Elastic File System (Amazon EFS) for distributed file access, and Amazon CloudWatch for observability and autoscaling.

These contributions and service integrations allow AWS customers to scale their Ray-based workloads using secure, cost-efficient, and enterprise-ready AWS services across the complete end-to-end AI and machine learning pipeline with both CPUs and GPUs, as shown in this heterogeneous Ray cluster configuration for Amazon EC2:

cluster_name: cluster

provider:
    type: aws
    region: us-east-1
    availability_zone: us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1e

available_node_types:
    # CPU memory-optimized instance type for the leader node
    ray.head.default:
        node_config:
            InstanceType: r5dn.4xlarge
            ...
    # GPU compute-optimized instance type for the worker nodes            
    ray.worker.default:
        node_config:
            InstanceType: g5.48xlarge
            ...

Storage and distributed file systems

Ray’s highly distributed processing engine can utilize any of the high-performance, cloud-native, distributed file systems on AWS, including Amazon EFS and Amazon FSx for Lustre. These POSIX-compliant file systems are optimized for large-scale and compute-intensive workloads, including high-performance computing (HPC), AI, and machine learning. Earlier this year, we worked with the Ray community to increase the I/O performance of Parquet-file reads from both local disk and cloud storage. We also contributed path-partition filtering support, which improves I/O performance by reading only the data that matches a given filter.
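As a minimal sketch of this filter pushdown, the example below reads a partitioned Parquet dataset from Amazon S3 with Ray Datasets and loads only the rows that match a PyArrow filter expression, which Ray forwards to the underlying reader. The bucket path, column names, and partition column are hypothetical placeholders:

import pyarrow.dataset as pads
import ray

ray.init()

# Read only the data that matches the filter instead of scanning
# the entire dataset (paths and column names are placeholders)
dataset = ray.data.read_parquet(
    "s3://<s3_bucket>/parquet-data/",
    columns=["review_body", "star_rating"],
    filter=(pads.field("product_category") == "Books"),
)
print(dataset.count())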

Observability and autoscaling

Our contributions integrating Ray with Amazon CloudWatch logs and metrics allow customers to easily create dashboards and monitor the memory and CPU/GPU utilization of their Ray clusters as shown here:

CPU and GPU utilization on Ray

Using resource-utilization data from Amazon CloudWatch, Ray can dynamically increase or decrease the number of compute resources in your cluster – including scale-to-zero to minimize cost when the cluster is not being utilized. Here is an example of a Ray autoscaling configuration on AWS, including the minimum/maximum number of workers as well as a scaling factor (upscaling_speed) that controls how aggressively the cluster scales up:

cluster_name: ray_cluster 
upscaling_speed: 1.2
...
available_node_types:
    ray.worker.default:
        node_config:
            InstanceType: g5.48xlarge
        min_workers: 5
        max_workers: 10
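Because the autoscaler acts on the same utilization data that the CloudWatch agent publishes, you can also query those metrics programmatically. Here is a hedged sketch using boto3; the namespace, metric name, and any dimensions are placeholders that depend on how the CloudWatch agent is configured for your Ray cluster:

from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Namespace and metric name below are hypothetical; use the values
# configured for the CloudWatch agent on your Ray cluster, and add
# Dimensions if your metrics are published with them
response = cloudwatch.get_metric_statistics(
    Namespace="<ray_cluster_namespace>",
    MetricName="cpu_usage_active",
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)

for datapoint in sorted(response["Datapoints"], key=lambda d: d["Timestamp"]):
    print(datapoint["Timestamp"], round(datapoint["Average"], 2))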

End-to-end AI and machine learning pipeline

Ray on AWS is used across all steps of the machine learning workflow, including data analysis with Pandas on Ray (Modin) using Amazon EC2, feature engineering with Apache Spark on Ray (RayDP) using Amazon EMR, and model training/tuning with Hugging Face, PyTorch, and TensorFlow using Amazon SageMaker and GPUs, as shown here with model checkpointing to Amazon S3:

import ray
from ray.train.torch import TorchTrainer
from ray.air.config import ScalingConfig, RunConfig
from ray.tune import SyncConfig

# Define the training loop that runs on each Ray worker
def train_func():
    # These packages are installed on the workers via runtime_env below
    from transformers import (
        AutoConfig,
        AutoTokenizer,
        AutoModelForSequenceClassification,
    )
    ...
    # Start with a pre-trained RoBERTa model
    model_name_or_path = "roberta-base"

    # Number of predicted classes
    # (i.e., "Negative", "Neutral", "Positive")
    num_labels = 3
    
    config = AutoConfig.from_pretrained(
        model_name_or_path, num_labels=num_labels, 
    )
    tokenizer = AutoTokenizer.from_pretrained(
        model_name_or_path, use_fast=True
    )
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name_or_path,
        config=config,
    )
    ...

# Connect to the remote Ray cluster
ray.init(address="ray://<hostname>:10001",
         runtime_env={
                        "pip": [
                            "torch", 
                            "scikit-learn",
                            "transformers",
                            "datasets"
                        ]
                     },
         # Retrieve input data from S3
         working_dir="s3://<s3_bucket>/data")
        
# Create Trainer
trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    train_loop_config={
                        "batch_size": 64,
                        "epochs": 10,
                        "use_gpu": True
                      },
    # Increase num_workers to scale out across more GPU workers
    scaling_config=ScalingConfig(num_workers=20, use_gpu=True),
    run_config=RunConfig(
        sync_config=SyncConfig(
            # Store checkpoints to S3
            upload_dir="s3://<s3_bucket>/checkpoint"
        )
    )
)

# Launch training job and print results
results = trainer.fit()
print(results.metrics)

With Ray on AWS, customers can orchestrate their Ray-based machine learning workflows using Amazon SageMaker Pipelines, AWS Step Functions, Apache Airflow, or Ray Workflows. Customers can also track experiments using Amazon SageMaker Experiments or MLflow as shown here:

MLflow
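As a minimal sketch of experiment tracking with MLflow, the metrics returned by trainer.fit() in the training example above can be logged to a tracking server; the tracking URI, experiment name, and logged parameters are placeholders:

import mlflow

# Hypothetical MLflow tracking server; replace with your own endpoint
mlflow.set_tracking_uri("http://<mlflow_host>:5000")
mlflow.set_experiment("ray-train-text-classification")

with mlflow.start_run():
    # Log the training configuration used in the example above
    mlflow.log_params({"num_workers": 20, "batch_size": 64, "epochs": 10})

    # Log the numeric metrics reported by trainer.fit()
    numeric_metrics = {
        k: v
        for k, v in results.metrics.items()
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    }
    mlflow.log_metrics(numeric_metrics)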

Kubernetes and the KubeRay project

Amazon EKS supports Ray on Kubernetes through the KubeRay EKS Blueprint, contributed by the Amazon EKS team, which quickly deploys a scalable and observable Ray cluster on your Amazon EKS cluster. As compute demand increases or decreases, Ray works with the Kubernetes-native autoscaler to resize the Amazon EKS cluster as needed. Here is an example of a Grafana dashboard from a two-node Ray cluster created with the KubeRay EKS Blueprint:

Grafana dashboard for Ray
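Once the Ray cluster is running on Amazon EKS, you can submit workloads to it remotely with the Ray Jobs API mentioned earlier. The sketch below assumes the Ray head (dashboard) service has been exposed to the client, for example through Kubernetes port forwarding; the service address and entrypoint script are placeholders:

from ray.job_submission import JobSubmissionClient

# Hypothetical address of the Ray head/dashboard service exposed from Amazon EKS
client = JobSubmissionClient("http://<ray-head-service>:8265")

job_id = client.submit_job(
    # Entrypoint script is a placeholder; it must exist in working_dir
    entrypoint="python train.py",
    runtime_env={
        "working_dir": "./",
        "pip": ["torch", "transformers"],
    },
)

print(client.get_job_status(job_id))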

Summary

In this post, we highlighted AWS contributions to the scalability and operational efficiency of Ray on AWS. We also showed how AWS customers use Ray with AWS-managed services for secure, scalable, and enterprise-ready workloads across the entire data processing and AI/ML pipeline. Going forward, we will continue to work closely with the Ray community to improve Ray’s resilience and large-scale data processing – as well as integrate more AWS services for enhanced networking, data streaming, job queuing, machine learning, and much more!

We encourage you to set up Ray on AWS by following our samples for Amazon SageMaker, Amazon EC2, Amazon EKS, and Amazon EMR. For more information on Ray on AWS, check out the Ray-AWS documentation. If you are interested in improving Ray on AWS, please join the Ray community and send us a message on Slack. We welcome your feedback because it helps us prioritize the next features to contribute to the Ray community. And lastly, please join us for the monthly Ray and AWS community events online.

Chris Fregly

Chris is a Principal Specialist Solutions Architect for AI and machine learning at Amazon Web Services (AWS) based in San Francisco, California. He is co-author of the O'Reilly book "Data Science on AWS." Chris is also the founder of many global meetups focused on Apache Spark, TensorFlow, Ray, and Kubeflow. He regularly speaks at AI and machine learning conferences around the world, including O’Reilly AI, Open Data Science Conference, and Big Data Spain.

Antje Barth

Antje Barth is a Principal Developer Advocate for generative AI at AWS. She is co-author of the O’Reilly books Generative AI on AWS and Data Science on AWS. Antje frequently speaks at AI/ML conferences, events, and meetups around the world. She also co-founded the Düsseldorf chapter of Women in Big Data.

Daniel Yeo

Daniel Yeo is a Senior Technical Program Manager at Amazon. He is passionate about advancing technologies to make machine learning scale seamlessly. His team is actively contributing improvements and novel ideas to open source Ray so that customers can realize the full potential of Ray.

Apoorva Kulkarni

Apoorva is a Sr. Specialist Solutions Architect, Containers, at AWS, where he helps customers build modern ML platforms on AWS container services.

Patrick Ames

Patrick Ames is a Principal Engineer working on data management and optimization for big data technologies at Amazon.

Simon Zamarin

Simon Zamarin is an AI/ML Solutions Architect whose main focus is helping customers extract value from their data assets. In his spare time, Simon enjoys spending time with family, reading sci-fi, and working on various DIY house projects.

Yiqin (Miranda) Zhu

Miranda is a Software Development Engineer on the Ray team at Amazon. She is passionate about developing open source Ray and integrating Ray with Amazon technologies.