AWS Open Source Blog

Ray Integration for AWS Trainium and AWS Inferentia is Now Available

AWS Trainium and AWS Inferentia are now integrated with Ray on Amazon Elastic Compute Cloud (Amazon EC2). Ray is an open source unified compute framework that makes it easy to build and scale machine learning applications. Ray now automatically detects the availability of AWS Trainium and Inferentia accelerators to better support high-performance, low-cost scaling of machine learning and generative artificial intelligence (AI) workloads. This means that users can further accelerate model training and serving on AWS by sharding generative AI models, such as large language models (LLMs), using tensor parallelism across Trainium and Inferentia accelerators.
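As a minimal sketch of what this autodetection looks like in practice: on a Trn1 or Inf2 instance with Ray 2.7.0 or later installed, Ray registers the detected NeuronCores as a custom `neuron_cores` resource, which you can inspect from the cluster's resource map. The helper below is illustrative; the `ray` calls require a running Ray installation, and a nonzero count requires actual Neuron hardware.

```python
def neuron_cores_available(resources: dict) -> float:
    """Return the number of NeuronCores reported in a Ray resource map
    (the dict shape returned by ray.available_resources()); 0.0 if the
    cluster has no detected Trainium/Inferentia accelerators."""
    return resources.get("neuron_cores", 0.0)


def check_cluster() -> float:
    # Import inside the function: requires Ray >= 2.7 installed, and
    # Neuron hardware for detection to report a nonzero count.
    import ray

    ray.init(ignore_reinit_error=True)
    return neuron_cores_available(ray.available_resources())
```

On an Inf2 or Trn1 instance, `check_cluster()` would report the NeuronCores Ray detected; elsewhere it returns 0.0.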

Ray on Amazon EC2 Trn1 instances (powered by AWS's purpose-built Trainium chips) offers excellent price performance for distributed training and fine-tuning of PyTorch models on AWS. Similarly, Amazon EC2 Inf2 instances, powered by AWS Inferentia, are purpose-built by AWS to provide high performance and reduce inferencing costs.

Machine learning models on AWS AI accelerators are deployed to containers using the AWS Neuron software development kit (SDK), which optimizes machine learning performance on Trainium- and Inferentia-based instances. With the Ray integration, users can build low-latency, low-cost inference pipelines on Inferentia via tensor parallelism through the Ray Serve API.

This feature is available as part of the Ray 2.7.0 release and is made possible by integrating Ray with Transformers NeuronX (transformers-neuronx), an open source software package that enables users to perform large language model inference on second-generation Neuron hardware. Visit the list of supported models available in Transformers NeuronX.

In the Hugging Face repository, you'll find an example that compiles the OpenLLaMA 3B large language model (LLM) and deploys it on an AWS Inferentia (Inf2) instance using Ray Serve. The example uses transformers-neuronx to shard the model across devices/NeuronCores via tensor parallelism.
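The core of such a deployment can be sketched as follows. This is a hedged illustration, not the repository's exact code: the model ID (`openlm-research/open_llama_3b`), the `tp_degree` of 2, and the deployment wiring are assumptions chosen for clarity. The `transformers-neuronx` compilation calls and Ray's `neuron_cores` resource requirement require an Inf2 instance with the Neuron SDK, so those imports are kept inside the builder function.

```python
def neuron_config(tp_degree: int = 2, n_positions: int = 2048, batch_size: int = 1) -> dict:
    """Compilation settings passed to transformers-neuronx when sharding
    a model across NeuronCores via tensor parallelism."""
    return {
        "tp_degree": tp_degree,      # number of NeuronCores to shard across
        "n_positions": n_positions,  # maximum sequence length to compile for
        "batch_size": batch_size,
    }


def build_app(model_id: str = "openlm-research/open_llama_3b"):
    # These imports require Ray, transformers, and transformers-neuronx,
    # and the model compilation below requires Inf2 hardware.
    from ray import serve
    from transformers import AutoTokenizer
    from transformers_neuronx.llama.model import LlamaForSampling

    @serve.deployment(ray_actor_options={"resources": {"neuron_cores": 2}})
    class LlamaDeployment:
        def __init__(self, model_id: str):
            cfg = neuron_config()
            self.tokenizer = AutoTokenizer.from_pretrained(model_id)
            self.model = LlamaForSampling.from_pretrained(model_id, **cfg)
            self.model.to_neuron()  # compile and load onto NeuronCores

        async def __call__(self, request):
            prompt = (await request.json())["prompt"]
            input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids
            output = self.model.sample(input_ids, sequence_length=256)
            return {"text": self.tokenizer.decode(output[0])}

    return LlamaDeployment.bind(model_id)
```

Running `serve.run(build_app())` on an Inf2 instance would then expose an HTTP endpoint whose replicas each reserve two NeuronCores, with the model weights sharded across them.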

Integration of AWS Trainium with the high-level Ray Train API is currently in progress, and the latest updates can be tracked via this link.

Get started today by cloning this repo and running the example from your local machine!

Maheedhar Reddy Chappidi

Maheedhar Reddy Chappidi is a Sr. Software Development Engineer on the AWS Glue team. He is passionate about building fault-tolerant and reliable distributed systems at scale. Outside of work, Maheedhar enjoys hiking Mission Peak and listening to podcasts.

Jianying Lang

Jianying Lang is a Principal Solutions Architect at the AWS Worldwide Specialist Organization (WWSO). She has over 15 years of working experience in the HPC and AI fields. At AWS, she focuses on helping customers deploy, optimize, and scale their AI/ML workloads on accelerated computing instances. She is passionate about combining techniques from the HPC and AI fields. Jianying holds a PhD in Computational Physics from the University of Colorado at Boulder.

Scott Perry

Scott Perry is a Solutions Architect on the Annapurna ML accelerator team at AWS. Based in Canada, he helps customers deploy and optimize deep learning training and inference workloads using AWS Inferentia and AWS Trainium. His interests include large language models, deep reinforcement learning, IoT, and genomics.

Vedant Jain

Vedant Jain is a Sr. AI/ML Specialist working on strategic generative AI initiatives. Prior to joining AWS, Vedant held ML/Data Science specialty positions at companies including Databricks, Hortonworks (now Cloudera), and JP Morgan Chase. Outside of work, Vedant is passionate about making music, using science to lead a meaningful life, and exploring cuisines from around the world.