Amazon EC2

Amazon EC2 UltraClusters

Run HPC and ML applications at scale

Get started with P6e-GB200

Get Started with Trn2

Why Amazon EC2 UltraClusters?

Amazon Elastic Compute Cloud (Amazon EC2) UltraClusters can help you scale to thousands of GPUs or purpose-built ML AI chips, such as AWS Trainium, to get on-demand access to a supercomputer. They democratize access to supercomputing-class performance for machine learning (ML), generative AI, and high performance computing (HPC) developers through a simple pay-as-you-go usage model without any setup or maintenance costs. Amazon EC2 instances that are deployed in EC2 UltraClusters include P6e-GB200, P6-B200, P5en, P5e, P5, P4d, Trn2, and Trn1 instances.

EC2 UltraClusters consist of thousands of accelerated EC2 instances that are co-located in a given AWS Availability Zone and interconnected using Elastic Fabric Adapter (EFA) networking in a petabit-scale nonblocking network. EC2 UltraClusters also provide access to Amazon FSx for Lustre, a fully managed shared storage built on the most popular high-performance, parallel file system to quickly process massive datasets on demand and at scale with sub-millisecond latencies. EC2 UltraClusters provide scale-out capabilities for distributed ML training and tightly coupled HPC workloads.

Benefits

EC2 UltraClusters help you reduce training times and time-to-solution from weeks to just a few days. This helps you iterate at a faster pace and get your deep learning (DL), generative AI, and HPC applications to market more quickly.

EC2 UltraClusters are supported on a growing list of EC2 instances and give you the flexibility to choose the right compute option to maximize performance while keeping costs under control for your workload.

Features

High-performance networking

EC2 instances deployed in EC2 UltraClusters are interconnected with EFA networking to improve performance for distributed training workloads and tightly coupled HPC workloads. P6e-GB200 UltraServers deliver up to 28.8 terabits per second of total EFAv4 networking. P6-B200 instances deliver up to 3.2 terabits per second of EFAv4 networking. Trn2 UltraServers have 12.8 terabits per second of EFAv3 networking. P5en, P5e, P5, and Trn2 instances deliver up to 3,200 Gbps; Trn1 instances deliver up to 1,600 Gbps; and P4d instances deliver up to 400 Gbps of EFA networking. EFA is also coupled with NVIDIA GPUDirect Remote Direct Memory Access (RDMA) (P6-B200, P5en, P5e, P5, P4d) and NeuronLink (Trn2, Trn1) to enable low-latency accelerator-to-accelerator communication between servers with operating system bypass.

High-performance storage

EC2 UltraClusters use FSx for Lustre, fully managed shared storage built on the most popular high-performance parallel file system. With FSx for Lustre, you can quickly process massive datasets on demand and at scale, and deliver sub-millisecond latencies. The low-latency and high-throughput characteristics of FSx for Lustre are optimized for DL, generative AI, and HPC workloads on EC2 UltraClusters. FSx for Lustre keeps the GPUs and AI chips in EC2 UltraClusters fed with data, accelerating the most demanding workloads. These workloads include large language model (LLM) training, generative AI inferencing, DL, genomics, and financial risk modeling. You can also get access to virtually unlimited cost-effective storage with Amazon Simple Storage Service (Amazon S3).

Instances and UltraServers supported

P6e-GB200 UltraServers

Accelerated by NVIDIA GB200 NVL72, P6e-GB200 instances in an UltraServer configuration offer the highest GPU AI training and inference performance in Amazon EC2.

Learn more

P6-B200 instances

Amazon EC2 P6-B200 instances, accelerated by NVIDIA Blackwell GPUs, offer high- performance instances for AI training, inference, and HPC.

Learn more

Trn2 instances and UltraServers

Powered by AWS Trainium2 AI chips, Trn2 instances offer up to 30 to 40% better price-performance over comparable GPU-based instances.

Learn more

P5en, P5e, and P5 instances

Powered by NVIDIA H200 Tensor Core GPUs, P5en and P5e instances provide the high performance in Amazon EC2 for ML training and HPC applications. P5 instances are powered by NVIDIA H100 Tensor Core GPUs.

Learn more

P4d instances

Learn more

Trn1 instances

Powered by AWS Trainium AI chips, Trn1 instances are purpose built for high-performance ML training. They offer up to 50% cost-to-train savings over comparable EC2 instances.

Learn more

Get started

Getting started

Sign up for an AWS account

Instantly get access to the AWS Free Tier

Tutorial

Learn with 10-minute tutorials

Explore and learn with simple tutorials

Did you find what you were looking for today?

Let us know so we can improve the quality of the content on our pages

Amazon EC2 UltraClusters

Why Amazon EC2 UltraClusters?

Benefits

Features

High-performance networking

High-performance storage

Instances and UltraServers supported

P6e-GB200 UltraServers

P6-B200 instances

Trn2 instances and UltraServers

P5en, P5e, and P5 instances

P4d instances

Trn1 instances

Get started

Sign up for an AWS account

Learn with 10-minute tutorials

Did you find what you were looking for today?

Learn

Resources

Developers

Help

Amazon EC2 UltraClusters

Why Amazon EC2 UltraClusters?

Benefits

Faster time to solution for distributed training and HPC

On-demand access to an exascale supercomputer

Flexibility to optimize performance and cost

Features

High-performance networking

High-performance storage

Instances and UltraServers supported

P6e-GB200 UltraServers

P6-B200 instances

Trn2 instances and UltraServers

P5en, P5e, and P5 instances

P4d instances

Trn1 instances

Get started

Sign up for an AWS account

Learn with 10-minute tutorials

Did you find what you were looking for today?

Learn

Resources

Developers

Help