Amazon EC2 UltraClusters

Run HPC and ML applications at scale

Why Amazon EC2 UltraClusters?

Amazon Elastic Compute Cloud (Amazon EC2) UltraClusters can help you scale to thousands of GPUs or purpose-built ML accelerators, such as AWS Trainium chips, to get on-demand access to a supercomputer. They democratize access to supercomputing-class performance for machine learning (ML), generative AI, and high performance computing (HPC) developers through a simple pay-as-you-go usage model, without any setup or maintenance costs. Amazon EC2 instances deployed in EC2 UltraClusters include P5en, P5e, P5, P4d, Trn2, and Trn1 instances.

EC2 UltraClusters consist of thousands of accelerated EC2 instances that are co-located in a given AWS Availability Zone and interconnected using Elastic Fabric Adapter (EFA) networking in a petabit-scale, nonblocking network. EC2 UltraClusters also provide access to Amazon FSx for Lustre, fully managed shared storage built on the most popular high-performance parallel file system, so you can quickly process massive datasets on demand and at scale with sub-millisecond latencies. EC2 UltraClusters provide scale-out capabilities for both distributed ML training and tightly coupled HPC workloads.
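The co-location and EFA interconnect described above map onto standard EC2 building blocks: a cluster placement group plus EFA-enabled network interfaces at launch. The sketch below builds the keyword arguments a boto3 `ec2.run_instances` call would take; the AMI ID, subnet, and placement-group name are illustrative placeholders, not values from this page.

```python
# Sketch: launch parameters for EFA-enabled instances in a cluster
# placement group, the topology EC2 UltraClusters rely on.
# AMI ID, subnet ID, and placement-group name are placeholders.

def build_run_instances_request(instance_type="p4d.24xlarge",
                                ami_id="ami-0123456789abcdef0",
                                placement_group="ml-ultracluster-pg",
                                subnet_id="subnet-0abc",
                                count=2):
    """Return keyword arguments for boto3's ec2.run_instances."""
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": count,
        "MaxCount": count,
        # A cluster placement group packs instances close together
        # for low-latency, high-bandwidth networking.
        "Placement": {"GroupName": placement_group},
        # EFA is requested per network interface via InterfaceType="efa".
        "NetworkInterfaces": [{
            "DeviceIndex": 0,
            "SubnetId": subnet_id,
            "InterfaceType": "efa",
        }],
    }

# With AWS credentials configured, the request would be submitted as:
#   import boto3
#   ec2 = boto3.client("ec2")
#   ec2.run_instances(**build_run_instances_request())
```

In practice, launching all instances into the same Availability Zone and placement group is what keeps them within the nonblocking network fabric.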

Features

High-performance networking

EC2 instances deployed in EC2 UltraClusters are interconnected with EFA networking to improve performance for distributed training workloads and tightly coupled HPC workloads. P5en, P5e, P5, and Trn2 instances deliver up to 3,200 Gbps; Trn1 instances deliver up to 1,600 Gbps; and P4d instances deliver up to 400 Gbps of EFA networking. EFA is also coupled with NVIDIA GPUDirect RDMA (P5en, P5e, P5, P4d) and NeuronLink (Trn2, Trn1) to enable low-latency accelerator-to-accelerator communication between servers with operating system bypass.
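To see what the quoted bandwidths mean for a training job, the sketch below estimates line-rate transfer time for a gradient exchange on each instance family. The instance-type names are assumed current full-size variants, and the figure ignores protocol overhead, all-reduce algorithm factors, and compute overlap; it is a back-of-the-envelope estimate only.

```python
# Aggregate EFA bandwidth per instance (Gbps), from the figures above.
EFA_GBPS = {
    "p5en.48xlarge": 3200,
    "p5.48xlarge": 3200,
    "trn1.32xlarge": 1600,
    "p4d.24xlarge": 400,
}

def transfer_time_ms(payload_gb, instance_type):
    """Milliseconds to move payload_gb gigabytes at the instance's
    line rate (GB -> gigabits, divided by Gbps, converted to ms)."""
    gbps = EFA_GBPS[instance_type]
    return payload_gb * 8 / gbps * 1000

# Example: a 10 GB gradient exchange at line rate takes roughly
# 200 ms on P4d (400 Gbps) but only 25 ms on P5 (3,200 Gbps).
```

The 8x bandwidth gap between P4d and P5en/P5e/P5 translates directly into shorter communication phases for distributed training.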

High-performance storage

EC2 UltraClusters use FSx for Lustre, fully managed shared storage built on the most popular high-performance parallel file system. With FSx for Lustre, you can quickly process massive datasets on demand and at scale with sub-millisecond latencies. The low-latency, high-throughput characteristics of FSx for Lustre are optimized for deep learning (DL), generative AI, and HPC workloads on EC2 UltraClusters. FSx for Lustre keeps the GPUs and AI chips in EC2 UltraClusters fed with data, accelerating the most demanding workloads. These workloads include large language model (LLM) training, generative AI inferencing, DL, genomics, and financial risk modeling. You can also get access to virtually unlimited, cost-effective storage with Amazon Simple Storage Service (Amazon S3).
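A common pattern behind the FSx for Lustre and Amazon S3 pairing described above is a Lustre file system linked to an S3 bucket, so training data lands in S3 and is lazy-loaded into the parallel file system. The sketch below builds the arguments for a boto3 `fsx.create_file_system` call; the bucket name and subnet ID are illustrative placeholders.

```python
# Sketch: parameters for an FSx for Lustre file system linked to an
# S3 prefix, so datasets in S3 can be processed through Lustre.
# Bucket and subnet names are placeholders.

def build_fsx_lustre_request(subnet_id="subnet-0abc",
                             bucket="example-training-bucket"):
    """Return keyword arguments for boto3's fsx.create_file_system."""
    return {
        "FileSystemType": "LUSTRE",
        # 1,200 GiB is the minimum capacity for SCRATCH_2 deployments.
        "StorageCapacity": 1200,
        "SubnetIds": [subnet_id],
        "LustreConfiguration": {
            # SCRATCH_2 suits short-lived, throughput-heavy jobs;
            # persistent deployment types exist for durable storage.
            "DeploymentType": "SCRATCH_2",
            # ImportPath links the file system to an S3 prefix.
            "ImportPath": f"s3://{bucket}/training-data/",
        },
    }

# With credentials configured, the request would be submitted as:
#   import boto3
#   fsx = boto3.client("fsx")
#   fsx.create_file_system(**build_fsx_lustre_request())
```

Compute instances then mount the file system over the same high-bandwidth network, which is how the GPUs and AI chips stay fed with data.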

Instances supported