AWS Partner Network (APN) Blog

Building Reliable and Scalable Generative AI Infrastructure on AWS with Ray and Anyscale

By Phi Nguyen, Technical GTM Lead – Anyscale
By Vedant Jain, Sr. AI/ML Specialist Partner Solutions Architect – AWS


Generative artificial intelligence (AI) has lowered the barriers to using foundation models (FMs) to transform products and experiences across industries. However, generative AI poses several challenges for organizations looking to operationalize those capabilities.

Overcoming these challenges is crucial to unlocking the full potential of generative AI: cost efficiency; data privacy concerns, where organizations are reluctant to share their data with AI infrastructure operating outside of their cloud environment; and future-proofing in a rapidly evolving field. Doing so enables widespread adoption across applications, from chatbots and virtual assistants to content generation and language translation.

Ray is an open source, unified compute framework that makes it easy to scale AI and Python workloads. It's a flexible, Python-native distributed computing framework that lets you parallelize existing AI and Python applications on a laptop, then scale them to a cluster in the cloud or on premises with no code changes.

There are many ways to deploy a Ray cluster on Amazon Web Services (AWS). If you choose a self-managed deployment, you can deploy a Ray cluster on Amazon Elastic Compute Cloud (Amazon EC2) or on Amazon Elastic Kubernetes Service (Amazon EKS) using the KubeRay operator.

In this post, we will dive into some of the challenges we’re seeing at the workload level and showcase how Ray and Anyscale can help. Ray simplifies distributed computing, while Anyscale is an AWS Specialization Partner with the Machine Learning Competency that provides a fully managed service allowing organizations to accelerate building generative AI applications on AWS.

Open Source Large Language Model Stack

Developers can leverage the power of AWS to build their large language model (LLM) applications by utilizing various services and components across multiple layers:

  • Compute layer: Anyscale manages and optimizes the lifecycle of the different EC2 instance types powering the Ray cluster.
  • Foundational model and fine tuning layers: Anyscale endpoints and Anyscale workspaces help users leverage pre-trained models or open-source models from Hugging Face.
  • Orchestration layer: Developers can use open-source frameworks like LangChain and LlamaIndex for orchestration of LLM applications.
  • Deployment layer: Users can use Ray Serve, a flexible, framework-agnostic library, for efficient real-time model serving.
  • In-context learning layer: The open-source community provides various vector databases for grounding the answer and mitigating hallucinations in the in-context learning layer.

Recently, open-source Ray integration with AWS machine learning (ML) accelerators was announced; specifically, AWS Trainium and AWS Inferentia2, which provide an excellent price-performance ratio and reduce training and inference costs when working with large language models.

In the next section, we’ll dive into some of the challenges at the workload level and see how Ray and Anyscale can help.

Distributed Training

Training a large language model can be a daunting task, especially as you scale the number of parameters into the billions. Hardware failures, managing large cluster dependencies, scheduling jobs, and optimizing for graphics processing units (GPUs) are all challenges when training an LLM.

To run training for LLMs efficiently, developers need to partition the neural network across its computation graph. Based on the GPU cluster available, ML practitioners must choose a strategy that optimizes across different parallelization dimensions to enable efficient training.

Currently, optimizing training across different parallelization dimensions (data, model, and pipeline) can be a difficult and manual process. Existing partitioning strategies for an LLM fall into the following categories:

  • Inter-operator parallelism: Partition the full computation graph to discrete subgraphs. Each device computes its assigned subgraph and communicates with other devices upon finishing.
  • Intra-operator parallelism: Partition matrices for a given operator into submatrices. Each device computes its assigned submatrices and communicates with other devices when multiplication or addition takes place.
  • Combined: Both strategies can be applied to the same computation graph.

Benchmarks show near-linear scaling of training throughput as you add more GPU nodes, with no code changes.


Figure 1 – Distributed training throughput with Alpa on Ray.

Leading organizations are using Ray to train their largest models, and benchmarks using Ray and the Alpa project have shown how to efficiently scale beyond 1,000 GPUs. For more information, check out this Anyscale blog post on training 175B parameter language models at 1,000 GPU scale with Alpa and Ray.


Fine-Tuning

Fine-tuning is a method where users take a pre-trained model and adapt it to perform specific tasks or cater to a particular domain of interest, without having to train an LLM from scratch. Many ML practitioners look to fine-tune their models using data for a given task.

Handling and loading these LLMs cost-efficiently for fine-tuning can still be a challenge, and often requires data or model parallelism as well as distributed data loading.

Depending on your goal and product requirements, optimizing for cost or for overall training time may require a different distributed compute strategy. For example, if you're trying to provide a near real-time DreamBooth experience, scaling out the number of GPUs will decrease the overall time, as shown in Figure 2.

To learn more about this benchmark, see this Anyscale blog post on faster stable diffusion fine-tuning with Ray AIR.


Figure 2 – DreamBooth training times decrease linearly as you add more GPUs.

If you’re looking to optimize cost, using 32 x g4dn.4xl provides the best cost performance when fine-tuning a GPT-J model as shown in Figure 3. To find out more, read this Anyscale blog post on how to fine-tune and serve LLMs simply, quickly and cost effectively using Ray + DeepSpeed + Hugging Face.


Figure 3 – Cost and time comparison for fine-tuning a GPT-J model.

Scaling Embeddings

Generative AI-powered chatbots are becoming increasingly popular and require embedding a large corpus of data as part of a question-answering system. Embeddings are essential in LLM workloads, as they are used to encode the input and to embed an entire corpus of data for information retrieval as part of in-context learning, in a popular approach called retrieval-augmented generation (RAG).

Embedding text can be computationally expensive. Generating embeddings for a vast corpus may require substantial memory and processing power, making it challenging to scale and efficiently handle large-scale LLM workloads.

Ray can parallelize this process easily and reduce the time it takes to compute embeddings; for example, Ray can turbocharge LangChain and process embeddings 20x faster. Moreover, streaming and pipelining across input/output (IO), CPUs, and GPUs has become a key requirement for batch inference as well as for distributed training and fine-tuning. Ray Data allows you to stream a pipeline across a cluster of CPUs and GPUs efficiently.

Model Serving

Serving generative AI models can be complex and expensive for the following reasons:

  • Computational resources: LLMs can be large, and serving them cost-efficiently often requires sharding or quantizing models.
  • Latency and response time: LLMs can have high inference times due to their large parameter counts and complex architectures.
  • Real-time inference pipeline: LLMs embedded into products may require multiple steps, business logic, multi-model inference, and information retrieval. Architecting an efficient microservices architecture can create friction between the data scientists, ML engineers, and engineers responsible for last-mile deployment.
  • Autoscaling: This can be a challenge if you have a complex real-time inference pipeline.

Just like with training, serving a trained LLM requires the neural network to be partitioned across its computation graph.

Ray allows you to flexibly author real-time deployment patterns by combining multiple models, ensembles, and business logic, all within Python, and to run them on your laptop, in the cloud, and in production with no code changes. Developers don't need to rely on a single container or stitch together multiple microservices to deploy models.

With Ray Serve, you can deploy your real-time pipeline behind a unified endpoint using FastAPI, leveraging fine-grained resource allocation and autoscaling on a heterogeneous infrastructure. This allows for dynamic resource management, cost efficiency, and maximum availability while still maintaining high-quality results.


Figure 4 – Creating a real-time inference graph with Ray Serve deployment graph API.

By leveraging Ray’s capabilities, developers can overcome the infrastructure challenges associated with building LLM solutions. It provides the necessary tools and abstractions to handle distributed computing, resource management, data processing, and model serving, enabling efficient and scalable deployment for generative AI.

How Anyscale on AWS Addresses These Challenges

Anyscale has a three-part mission: 1) simplifying development of AI applications; 2) optimizing AI application runtime; and 3) providing a complete end-to-end AI application platform as part of its SaaS offerings.

Anyscale Endpoints

Anyscale Endpoints is a serverless service designed to be the fastest and most cost-effective way to scale LLM applications. Currently, it supports the Llama-2 family of models, with plans to expand in the future. Additionally, Anyscale Endpoints enables fine-tuning and serving of models for even greater customization.

Performance is a major focus, with the aim of maximizing throughput and minimizing latency. The underlying infrastructure includes Anyscale Cloud, Ray Core, and the Ray Serve library.

Anyscale Private Endpoints

Anyscale Private Endpoints are deployed in your private AWS account, offering the same capabilities as Anyscale Endpoints but with more control over infrastructure and governance. It's designed for organizations with specific privacy or security requirements.

Anyscale Platform

The Anyscale Platform offers developers and AI teams a seamless user experience to speed development and deployment of AI/ML workloads at scale. Companies using Anyscale benefit from rapid time-to-market and faster iterations across the entire AI lifecycle, as well as other key benefits:

  • Fully managed service: Anyscale operates Ray clusters on Amazon EC2 so users don’t have to. ML practitioners get access to an interactive, scalable compute environment that allows you to speed up data preprocessing, training, tuning, and serving/inference.
  • Seamless transition from development to production: Anyscale workspaces offer a fully featured development environment, including dependency management, persistent cluster storage, and integrated development environment (IDE), Jupyter, and Git integration. From there, applications can be deployed to Anyscale jobs for batch processing and training workloads or Anyscale services for highly available production serving.
  • Bring your own cloud: Built from the ground up with customer data security in mind, Anyscale runs on users’ infrastructure in their AWS account and inherits their security posture.
  • Observability: Iterate on workloads faster during development and deploy production ML workloads and services with confidence using Anyscale’s observability and dashboard features, as well as third-party observability integrations.
  • Optimize compute costs: Reduce workload compute costs with Anyscale's EC2 integration for autoscaling, auto-suspend features, and Spot Instance support. Anyscale allows users to leverage existing agreements with AWS, such as Reserved Instances and Savings Plans.
  • Governance and compliance: Anyscale provides user access controls for projects, workspaces, and clusters, as well as enterprise security and cost-tracking features.
  • Support from Ray’s creators: With Anyscale, users get dedicated support provided by Ray and Anyscale engineers, which can help you develop and move applications to production faster.

Architecture Overview

Anyscale’s data plane resides in the customer AWS account, and no data leaves the customer virtual private cloud (VPC) to ensure every customer has full control of their data.

  • No ingress from the Anyscale control plane is required.
  • Customer ingress is via customer-defined networking to the private IP address of the head node.
  • Egress to the internet must be established.


Figure 5 – Anyscale on AWS network diagram.

The Anyscale platform has been designed to deliver the highest performance in the cloud. ML practitioners can choose between various EC2 instance types depending on the time/cost sensitivity and size of their LLM training jobs.

The G5, G4, and C5 instance types are designed to accelerate model inferencing, a necessary requirement for hosting LLMs. To further reduce costs and get better price/performance ratio for generative AI workloads, AWS offers purpose-built ML accelerators.

AWS Trainium is purpose-built for high-performance deep learning training of generative AI models, including LLMs. Similarly, AWS Inferentia2 is a purpose-built accelerator for inference that delivers 3x higher compute performance, up to 4x higher throughput, and up to 10x lower latency compared to first-generation AWS Inferentia.


Conclusion

In this post, we discussed challenges organizations are facing in bringing generative AI workloads to their enterprise. We introduced the Anyscale suite of services running on AWS and how it addresses those challenges.

We also provided a brief overview of Ray, an open-source unified framework to scale your machine learning and Python applications, and showcased the open-source LLM stack that Anyscale on AWS uses to power generative AI applications.

Users can use a Pythonic and developer-friendly API to scale across multiple Amazon EC2 instances seamlessly, giving developers flexibility and better utilization of their compute resources on AWS. Using the Anyscale platform on AWS, developers can accelerate their generative AI practice and build solutions from within their data plane in their own cloud environment, providing organizations better control over their data.

Furthermore, we discussed how organizations can use Ray's flexible APIs to future-proof their ML application design, as transformative applications such as generative AI and LLMs demand rethinking and reorganization of AI/ML design primitives. Browse the many LLM examples in the Ray documentation and try it out.


Anyscale – AWS Partner Spotlight

Anyscale is an AWS Partner that provides a fully managed service allowing organizations of all sizes to accelerate building generative AI applications on AWS.

Contact Anyscale | Partner Overview