AWS Open Source Blog

Scaling AI and Machine Learning Workloads with Ray on AWS

Many AWS customers are running open source Ray-based AI and machine learning workloads on AWS using Amazon SageMaker, Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Kubernetes Service (Amazon EKS), and Amazon EMR across many use cases including data analytics, natural language understanding, computer vision, and time-series modeling.

In this post, we will describe AWS contributions to the Ray community to enable enterprise-scale AI and machine learning deployments with Ray on AWS. In addition, we will highlight various AWS services that help improve security, performance, and operational efficiency when running Ray on AWS.

This month, the Ray project reached its 2.0 milestone. With this release, the Ray community has improved many key components of the interactive data science user journey including model training, tuning, and serving. Ray 2.x improves stability, usability, and performance across many Ray features including the Jobs API, Dataset API, AI Runtime (AIR), and Ray Dashboard UI as shown here.

Ray dashboard

Amazon.com and AWS have both worked with the Ray community to improve the scalability of Ray and integrate Ray with many AWS services including AWS Identity and Access Management (IAM) for fine-grained access control, AWS Certificate Manager (ACM) for SSL/TLS in-transit encryption, AWS Key Management Service (AWS KMS) for at-rest encryption, Amazon Simple Storage Service (Amazon S3) for object storage, Amazon Elastic File System (Amazon EFS) for distributed-file access, and Amazon CloudWatch for observability and autoscaling.

These contributions and service integrations allow AWS customers to scale their Ray-based workloads using secure, cost-efficient, and enterprise-ready AWS services across the complete, end-to-end AI and machine learning pipeline with both CPUs and GPUs, as shown in this heterogeneous Ray cluster configuration for Amazon EC2:

cluster_name: cluster

provider:
    type: aws
    region: us-east-1
    availability_zone: us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1e

available_node_types:
    # CPU memory-optimized instance type for the leader node
    ray.head.default:
        node_config:
            InstanceType: r5dn.4xlarge
            ...
    # GPU compute-optimized instance type for the worker nodes            
    ray.worker.default:
        node_config:
            InstanceType: g5.48xlarge
            ...

Storage and distributed file systems

Ray’s highly-distributed processing engine can utilize any of the AWS high-performance, cloud-native, distributed file systems including Amazon EFS and Amazon FSx for Lustre. These POSIX-compliant file systems are optimized for large-scale and compute-intensive workloads including high-performance computing (HPC), AI, and machine learning. Earlier this year, we worked with the Ray community to increase I/O performance of Parquet-file reads from both local disk and cloud storage. We also contributed path-partition filtering support to improve I/O performance by only reading data that matches a given filter.
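
As a minimal sketch of these two capabilities, the snippet below reads Parquet data from Amazon S3 with Ray Datasets and applies a path-partition filter to a Hive-partitioned dataset so that only matching partitions are read. The bucket paths and the year partition key are placeholders, and the exact partition-filter helper names may differ slightly across Ray 2.x releases:

import ray
from ray.data.datasource.partitioning import PathPartitionFilter

# Parallel Parquet read from Amazon S3 across the Ray cluster
parquet_ds = ray.data.read_parquet("s3://<s3_bucket>/parquet/")
print(parquet_ds.schema())

# Path-partition filtering: only read Hive-style partitions
# (for example .../year=2022/...) whose keys match the filter function
filtered_ds = ray.data.read_csv(
    "s3://<s3_bucket>/csv/",
    partition_filter=PathPartitionFilter.of(
        lambda partitions: partitions["year"] == "2022"
    ),
)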

Observability and autoscaling

Our contributions to Ray for Amazon CloudWatch logs and metrics allow customers to easily create dashboards and monitor the memory and CPU/GPU utilization of Ray clusters as shown here:

CPU and GPU utilization on Ray
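
You can also pull the same utilization data programmatically with boto3. The following is a minimal sketch that queries the built-in AWS/EC2 CPUUtilization metric for one of the cluster's instances; the instance ID is a placeholder, and any memory or GPU metrics published by your Ray cluster's CloudWatch agent configuration will live under whatever custom namespace you define there:

import datetime
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Average CPU utilization over the last hour for one Ray worker instance
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "<instance_id>"}],
    StartTime=datetime.datetime.utcnow() - datetime.timedelta(hours=1),
    EndTime=datetime.datetime.utcnow(),
    Period=300,  # 5-minute datapoints
    Statistics=["Average"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 2), "%")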

Using resource-utilization data from Amazon CloudWatch, Ray can dynamically increase or decrease the number of compute resources in your cluster, including scaling to zero to minimize cost when the cluster is not being utilized. Here is an example of a Ray autoscaling configuration on AWS, including the minimum/maximum number of workers as well as a scaling factor (upscaling_speed) that controls how aggressively Ray adds nodes when scaling up:

cluster_name: ray_cluster 
upscaling_speed: 1.2
...
available_node_types:
    ray.worker.default:
        node_config:
            InstanceType: g5.48xlarge
        min_workers: 5
        max_workers: 10

End-to-end AI and machine learning pipeline

Ray on AWS is used across all steps of the machine learning workflow, including data analysis with Pandas on Ray (Modin) using Amazon EC2, feature engineering with Apache Spark on Ray (RayDP) using Amazon EMR, and model training and tuning with Hugging Face, PyTorch, and TensorFlow using Amazon SageMaker and GPUs, as shown here with model checkpointing to Amazon S3:

import ray
from ray.train.torch import TorchTrainer
from ray.air.config import ScalingConfig, RunConfig
from ray.tune import SyncConfig

# Define the per-worker training loop; "config" receives train_loop_config
def train_func(config):
    # Hugging Face imports run on each Ray worker
    from transformers import (
        AutoConfig,
        AutoModelForSequenceClassification,
        AutoTokenizer,
    )
    ...
    # Start with a pre-trained RoBERTa model
    model_name_or_path = "roberta-base"

    # Number of predicted classes 
    # (ie. "Negative", "Neutral", "Positive")
    num_labels = 3 
    
    config = AutoConfig.from_pretrained(
        model_name_or_path, num_labels=num_labels, 
    )
    tokenizer = AutoTokenizer.from_pretrained(
        model_name_or_path, use_fast=True
    )
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name_or_path,
        config=config,
    )
    ...

# Initialize the Ray session on the remote cluster
ray.init(
    address="ray://<hostname>:10001",
    runtime_env={
        "pip": [
            "torch",
            "scikit-learn",
            "transformers",
            "datasets"
        ],
        # Retrieve input data from S3 (working_dir is a runtime_env field)
        "working_dir": "s3://<s3_bucket>/data"
    }
)
        
# Create Trainer
trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    train_loop_config={
                        "batch_size": 64,
                        "epochs": 10,
                        "use_gpu": True
                      },
    # Increase num_workers to scale out the cluster
    scaling_config=ScalingConfig(num_workers=20),
    run_config=RunConfig(
        sync_config=SyncConfig(
            # Store checkpoints to S3
            upload_dir="s3://<s3_bucket>/checkpoint"
        )
    )
)

# Launch training job and print results
results = trainer.fit()
print(results.metrics)
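
The data-analysis step mentioned above is just as concise. The following is a minimal sketch, assuming the Ray cluster is already running, that s3fs is available for reading from Amazon S3, and that a CSV file exists at the placeholder path:

import ray
import modin.pandas as pd

# Connect to the running Ray cluster
ray.init(address="auto", ignore_reinit_error=True)

# Modin mirrors the pandas API while partitioning the work across Ray
df = pd.read_csv("s3://<s3_bucket>/data/reviews.csv")
print(df.describe())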

With Ray on AWS, customers can orchestrate their Ray-based machine learning workflows using Amazon SageMaker Pipelines, AWS Step Functions, Apache Airflow, or Ray Workflows. Customers can also track experiments using Amazon SageMaker Experiments or MLflow, as shown here:

MLflow
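
As a minimal sketch of the MLflow option, you can log the metrics returned by trainer.fit() in the training example above to an MLflow tracking server; the tracking URI and experiment name are placeholders:

import mlflow

# Point MLflow at your tracking server and experiment (placeholders)
mlflow.set_tracking_uri("http://<mlflow_tracking_server>:5000")
mlflow.set_experiment("ray-sentiment-classifier")

with mlflow.start_run():
    # Log the hyperparameters used by the Ray training job above
    mlflow.log_params({"batch_size": 64, "epochs": 10, "num_workers": 20})
    # "results" is the object returned by trainer.fit() in the training example
    mlflow.log_metrics({
        key: value
        for key, value in results.metrics.items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    })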

Kubernetes and the KubeRay project

Amazon EKS supports Ray on Kubernetes through the KubeRay EKS Blueprint, contributed by the Amazon EKS team, which quickly deploys a scalable and observable Ray cluster on your Amazon EKS cluster. As compute demand increases or decreases, Ray works with the Kubernetes-native autoscaler to resize the Amazon EKS cluster as needed. Here is an example of a Grafana dashboard from a two-node Ray cluster created with the KubeRay EKS Blueprint:

Grafana dashboard for Ray
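
Once the KubeRay EKS Blueprint has deployed the cluster, you can submit work to it through the Ray Jobs API. In this sketch, the head-service hostname is a placeholder that depends on how you expose the Ray head's dashboard port (8265) from your Amazon EKS cluster, and train.py is a hypothetical training script:

from ray.job_submission import JobSubmissionClient

# Connect to the Ray head service exposed from the Amazon EKS cluster
# (placeholder hostname; the Ray dashboard/Jobs API listens on port 8265)
client = JobSubmissionClient("http://<ray_head_service>:8265")

# Submit the training script as a Ray job with its pip dependencies
job_id = client.submit_job(
    entrypoint="python train.py",
    runtime_env={"pip": ["torch", "transformers", "datasets"]},
)
print(client.get_job_status(job_id))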

Summary

In this post, we highlighted AWS contributions to the scalability and operational efficiency of Ray on AWS. We also showed how AWS customers use Ray with AWS-managed services for secure, scalable, and enterprise-ready workloads across the entire data processing and AI/ML pipeline. Going forward, we will continue to work closely with the Ray community to improve Ray’s resilience and large-scale data processing – as well as integrate more AWS services for enhanced networking, data streaming, job queuing, machine learning, and much more!

We encourage you to set up Ray on AWS by following our samples for Amazon SageMaker, Amazon EC2, Amazon EKS, and Amazon EMR. For more information on Ray on AWS, check out the Ray-AWS documentation. If you are interested in improving Ray on AWS, please join the Ray community and send us a message on Slack. We welcome your feedback because it helps us prioritize the next features to contribute to the Ray community. And lastly, please join us for our monthly Ray and AWS community events online.

Chris Fregly

Chris is a Principal Specialist Solution Architect for AI and machine learning at Amazon Web Services (AWS) based in San Francisco, California. He is co-author of the O'Reilly Book, "Data Science on AWS." Chris is also the Founder of many global meetups focused on Apache Spark, TensorFlow, Ray, and KubeFlow. He regularly speaks at AI and machine learning conferences across the world including O’Reilly AI, Open Data Science Conference, and Big Data Spain.

Antje Barth

Antje Barth is a Principal Developer Advocate for generative AI at AWS. She is co-author of the O’Reilly books Generative AI on AWS and Data Science on AWS. Antje frequently speaks at AI/ML conferences, events, and meetups around the world. She also co-founded the Düsseldorf chapter of Women in Big Data.

Daniel Yeo

Daniel Yeo is a Senior Technical Program Manager at Amazon. He is passionate about advancing technologies to make machine learning scale seamlessly. His team actively contributes improvements and new ideas to open source Ray so that customers can realize its full potential.

Apoorva Kulkarni

Apoorva is a Sr. Specialist Solutions Architect, Containers, at AWS where he helps customers who are building modern application platforms on AWS container services.

Patrick Ames

Patrick Ames is a Principal Engineer working on data management and optimization for big data technologies at Amazon.

Simon Zamarin

Simon Zamarin is an AI/ML Solutions Architect whose main focus is helping customers extract value from their data assets. In his spare time, Simon enjoys spending time with family, reading sci-fi, and working on various DIY house projects.

Yiqin (Miranda) Zhu

Miranda is a Software Development Engineer on the Ray team at Amazon. She is passionate about developing open source Ray and integrating it with Amazon technologies.