Amazon EC2 UltraClusters

Run HPC and ML applications at scale

Why Amazon EC2 UltraClusters?

Amazon Elastic Compute Cloud (Amazon EC2) UltraClusters can help you scale to thousands of GPUs or purpose-built ML accelerators, such as AWS Trainium, to get on-demand access to a supercomputer. They democratize access to supercomputing-class performance for machine learning (ML), generative AI, and high performance computing (HPC) developers through a simple pay-as-you-go usage model without any setup or maintenance costs. Amazon EC2 P5 instances, Amazon EC2 P4d instances, and Amazon EC2 Trn1 instances are all deployed in Amazon EC2 UltraClusters.

EC2 UltraClusters consist of thousands of accelerated EC2 instances that are co-located in a given AWS Availability Zone and interconnected using Elastic Fabric Adapter (EFA) networking in a petabit-scale nonblocking network. EC2 UltraClusters also provide access to Amazon FSx for Lustre, fully managed shared storage built on the most popular high-performance parallel file system, so you can quickly process massive datasets on demand and at scale with sub-millisecond latencies. EC2 UltraClusters provide scale-out capabilities for distributed ML training and tightly coupled HPC workloads.
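
To make the architecture concrete, the following sketch shows one way such a cluster could be provisioned through the EC2 API using boto3: a cluster placement group keeps instances co-located, and each instance requests an EFA network interface. The AMI, subnet, security group, and instance count are placeholder values, and real P5 capacity is typically obtained through additional mechanisms (such as capacity reservations) that are omitted here.

```python
# Minimal boto3 sketch (placeholder IDs): co-locate accelerated instances in a
# cluster placement group and attach an EFA interface to each one.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# A cluster placement group packs instances close together for low-latency networking.
ec2.create_placement_group(GroupName="ultracluster-demo", Strategy="cluster")

ec2.run_instances(
    ImageId="ami-0123456789abcdef0",             # placeholder Deep Learning AMI
    InstanceType="p5.48xlarge",
    MinCount=2,
    MaxCount=2,
    Placement={"GroupName": "ultracluster-demo"},
    NetworkInterfaces=[{
        "DeviceIndex": 0,
        "SubnetId": "subnet-0123456789abcdef0",  # placeholder subnet in one AZ
        "Groups": ["sg-0123456789abcdef0"],      # placeholder security group
        "InterfaceType": "efa",                  # request an Elastic Fabric Adapter
    }],
)
```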

Amazon EC2 P5 and Trn1 instances use a second-generation EC2 UltraClusters architecture that provides a network fabric to enable fewer hops across the cluster, lower latency, and greater scale.

Benefits

EC2 UltraClusters help you reduce training times and time-to-solution from weeks to just a few days. This helps you iterate at a faster pace and get your deep learning (DL), generative AI, and HPC applications to market more quickly.

P5 instances are deployed in EC2 UltraClusters with up to 20,000 H100 GPUs to deliver over 20 exaflops of aggregate compute capability. Similarly, Trn1 instances can scale to 30,000 Trainium accelerators, and P4d instances can scale to 10,000 A100 GPUs, to deliver exascale compute on demand.

EC2 UltraClusters are supported on a growing list of EC2 instances and give you the flexibility to choose the right compute option to maximize performance while keeping costs under control for your workload.

Features

High-performance networking

EC2 instances deployed in EC2 UltraClusters are interconnected with EFA networking to improve performance for distributed training workloads and tightly coupled HPC workloads. P5 instances deliver up to 3,200 Gbps; Trn1 instances deliver up to 1,600 Gbps; and P4d instances deliver up to 400 Gbps of EFA networking. EFA is also coupled with NVIDIA GPUDirect RDMA (P5, P4d) and NeuronLink (Trn1) to enable low-latency accelerator-to-accelerator communication between servers with operating system bypass.
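
As a rough illustration of how applications use this fabric on the GPU-based instances, the sketch below initializes PyTorch distributed training with the NCCL backend, which can run over EFA via the Libfabric provider on EFA-enabled images. The environment variable names shown are typical for such images but are assumptions here and may vary by AMI and library version; rank and rendezvous settings are expected to come from the job launcher.

```python
# Sketch: PyTorch distributed setup with NCCL, which can use EFA (and GPUDirect
# RDMA where supported) on EFA-enabled GPU instances. Env var names are typical
# but may differ across AMI and library versions.
import os
import torch
import torch.distributed as dist

os.environ.setdefault("FI_PROVIDER", "efa")            # use the EFA Libfabric provider
os.environ.setdefault("FI_EFA_USE_DEVICE_RDMA", "1")   # enable GPUDirect RDMA where available
os.environ.setdefault("NCCL_DEBUG", "INFO")            # log which transport NCCL selects

# RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT are assumed to be set by the
# launcher (for example, torchrun or a cluster scheduler).
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Trivial all-reduce to confirm inter-node communication over the fabric.
t = torch.ones(1, device="cuda")
dist.all_reduce(t)
print(f"rank {dist.get_rank()}: all-reduce result = {t.item()}")
```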

High-performance storage

EC2 UltraClusters use FSx for Lustre, fully managed shared storage built on the most popular high-performance parallel file system. With FSx for Lustre, you can quickly process massive datasets on demand and at scale with sub-millisecond latencies. The low-latency and high-throughput characteristics of FSx for Lustre are optimized for DL, generative AI, and HPC workloads on EC2 UltraClusters. FSx for Lustre keeps the GPUs and ML accelerators in EC2 UltraClusters fed with data, accelerating the most demanding workloads. These workloads include large language model (LLM) training, generative AI inferencing, DL, genomics, and financial risk modeling. You can also access virtually unlimited, cost-effective storage with Amazon Simple Storage Service (Amazon S3).
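
As an illustration of the storage side, the sketch below creates an FSx for Lustre file system linked to an S3 bucket with boto3, so data staged in S3 can be read by training jobs at file-system speed. The bucket name, subnet, storage capacity, and deployment type are placeholder assumptions; the right configuration depends on the workload.

```python
# Minimal boto3 sketch (placeholder values): create an FSx for Lustre file
# system that imports data from an S3 bucket for high-throughput access.
import boto3

fsx = boto3.client("fsx", region_name="us-east-1")

response = fsx.create_file_system(
    FileSystemType="LUSTRE",
    StorageCapacity=1200,                            # GiB; size to the dataset
    SubnetIds=["subnet-0123456789abcdef0"],          # placeholder subnet
    LustreConfiguration={
        "DeploymentType": "SCRATCH_2",               # scratch deployment for transient data
        "ImportPath": "s3://example-training-data",  # placeholder S3 bucket to import from
    },
)

# The DNS name is what compute nodes use to mount the file system.
print(response["FileSystem"]["DNSName"])
```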

Instances supported

Powered by NVIDIA H100 Tensor Core GPUs, P5 instances provide the highest performance in Amazon EC2 for ML training and HPC applications.

Powered by NVIDIA A100 Tensor Core GPUs, P4d instances provide high performance for ML training and HPC applications.

Powered by AWS Trainium accelerators, Trn1 instances are purpose-built for high-performance ML training. They offer up to 50% cost-to-train savings over comparable EC2 instances.
