Amazon EC2 P4 Instances
High performance for ML training and HPC applications in the cloud
Why Amazon EC2 P4 Instances?
Amazon Elastic Compute Cloud (Amazon EC2) P4d instances deliver high performance for machine learning (ML) training and high performance computing (HPC) applications in the cloud. P4d instances are powered by NVIDIA A100 Tensor Core GPUs and deliver industry-leading high-throughput, low-latency networking, with support for 400 Gbps instance networking. P4d instances provide up to 60% lower cost to train ML models, with an average of 2.5x better performance for deep learning models, compared to previous-generation P3 and P3dn instances.
P4d instances are deployed in clusters called Amazon EC2 UltraClusters that comprise high performance compute, networking, and storage in the cloud. Each EC2 UltraCluster is one of the most powerful supercomputers in the world, helping you run your most complex multinode ML training and distributed HPC workloads. You can easily scale from a few to thousands of NVIDIA A100 GPUs in the EC2 UltraClusters based on your ML or HPC project needs.
Researchers, data scientists, and developers can use P4d instances to train ML models for use cases such as natural language processing, object detection and classification, and recommendation engines. They can also use them to run HPC applications such as pharmaceutical discovery, seismic analysis, and financial modeling. Unlike on-premises systems, you can access virtually unlimited compute and storage capacity, scale your infrastructure based on business needs, and spin up a multinode ML training job or a tightly coupled distributed HPC application in minutes, without any setup or maintenance costs.
Benefits
Reduce ML training time from days to minutes
With the latest-generation NVIDIA A100 Tensor Core GPUs, each P4d instance delivers on average 2.5x better DL performance compared to previous-generation P3 instances. EC2 UltraClusters of P4d instances help everyday developers, data scientists, and researchers run their most complex ML and HPC workloads by giving access to supercomputing-class performance without any upfront costs or long-term commitments. The reduced training time with P4d instances boosts productivity, helping developers focus on their core mission of building ML intelligence into business applications.
Run the most complex multinode ML training with high efficiency
Developers can seamlessly scale to up to thousands of GPUs with EC2 UltraClusters of P4d instances. High-throughput, low-latency networking with support for 400 Gbps instance networking, Elastic Fabric Adapter (EFA), and GPUDirect RDMA technology help rapidly train ML models using scale-out/distributed techniques. EFA uses the NVIDIA Collective Communications Library (NCCL) to scale to thousands of GPUs, and GPUDirect RDMA technology enables low-latency GPU-to-GPU communication between P4d instances.
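To make the networking stack above concrete, the sketch below shows environment variables commonly set before launching a distributed job that uses NCCL over EFA. The variable names come from libfabric and NCCL; the specific values are illustrative assumptions, not required settings, and the right tuning depends on your workload.

```python
import os

# Sketch: environment commonly configured for NCCL-over-EFA on P4d instances.
# Values here are example assumptions; tune them for your own cluster.
efa_env = {
    "FI_PROVIDER": "efa",           # tell libfabric to use the EFA provider
    "FI_EFA_USE_DEVICE_RDMA": "1",  # enable GPUDirect RDMA between instances
    "NCCL_DEBUG": "INFO",           # log NCCL transport selection at startup
}
os.environ.update(efa_env)
```

With this in place, a framework's distributed launcher (for example, one process per GPU across instances) picks up EFA as the transport for NCCL collectives without code changes.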
Lower the infrastructure costs for ML training and HPC
P4d instances deliver up to 60% lower cost to train ML models compared to P3 instances. Additionally, P4d instances are available for purchase as Spot Instances. Spot Instances take advantage of unused EC2 instance capacity and can lower your EC2 costs significantly with up to a 90% discount from On-Demand prices. With the lower cost of ML training with P4d instances, budgets can be reallocated to build more ML intelligence into business applications.
Easily get started and scale with AWS services
AWS Deep Learning AMIs (DLAMIs) and Amazon Deep Learning Containers make it easier to deploy P4d DL environments in minutes as they contain the required DL framework libraries and tools. You can also more easily add your own libraries and tools to these images. P4d instances support popular ML frameworks, such as TensorFlow, PyTorch, and MXNet. Additionally, P4d instances are supported by major AWS services for ML, management, and orchestration, such as Amazon SageMaker, Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Elastic Container Service (Amazon ECS), AWS Batch, and AWS ParallelCluster.
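As an illustration of launching a P4d environment from a DLAMI, the sketch below builds the parameters an EC2 `run_instances` call takes, including an EFA network interface. The AMI ID and subnet ID are placeholders, not real resources; look up the current DLAMI ID for your Region before launching.

```python
# Sketch of an EC2 run_instances request for a P4d instance from a DLAMI.
# ImageId and SubnetId are placeholder values (assumptions for illustration).
run_instances_request = {
    "ImageId": "ami-0123456789abcdef0",  # placeholder DLAMI ID for your Region
    "InstanceType": "p4d.24xlarge",
    "MinCount": 1,
    "MaxCount": 1,
    "NetworkInterfaces": [
        {
            "DeviceIndex": 0,
            "SubnetId": "subnet-0123456789abcdef0",  # placeholder subnet
            "InterfaceType": "efa",  # attach an Elastic Fabric Adapter
        }
    ],
}
```

Passing this dictionary to a boto3 EC2 client's `run_instances` call (with valid IDs) would launch the instance; the same shape applies when scripting larger cluster launches.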
Features
Powered by NVIDIA A100 Tensor Core GPUs
NVIDIA A100 Tensor Core GPUs deliver unprecedented acceleration at scale for ML and HPC. NVIDIA A100’s third-generation Tensor Cores accelerate every precision workload, speeding time to insight and time to market. Each A100 GPU offers over 2.5x the compute performance compared to the previous-generation V100 GPU and comes with 40 GB HBM2 (in P4d instances) or 80 GB HBM2e (in P4de instances) of high-performance GPU memory. Higher GPU memory particularly benefits workloads that train on large datasets of high-resolution data. NVIDIA A100 GPUs use the NVSwitch GPU interconnect, so each GPU can communicate with every other GPU in the same instance at the same 600 GB/s bidirectional throughput and with single-hop latency.
High-performance networking
P4d instances provide 400 Gbps networking to help customers scale out distributed workloads, such as multinode training, more efficiently, with high-throughput networking between P4d instances as well as between a P4d instance and storage services such as Amazon Simple Storage Service (Amazon S3) and Amazon FSx for Lustre. EFA is a custom network interface designed by AWS to help scale ML and HPC applications to thousands of GPUs. To further reduce latency, EFA is coupled with NVIDIA GPUDirect RDMA to enable low-latency GPU-to-GPU communication between servers with OS bypass.
High-throughput, low-latency storage
Access petabyte-scale high-throughput, low-latency storage with FSx for Lustre, or virtually unlimited cost-effective storage with Amazon S3, at 400 Gbps speeds. For workloads that need fast access to large datasets, each P4d instance also includes 8 TB of NVMe-based SSD storage with 16 GB/s read throughput.
Built on the AWS Nitro System
The P4d instances are built on the AWS Nitro System, which is a rich collection of building blocks that offloads many of the traditional virtualization functions to dedicated hardware and software to deliver high performance, high availability, and high security while also reducing virtualization overhead.
Customer testimonials
Here are some examples of how customers and partners have achieved their business goals with Amazon EC2 P4 instances.
Toyota Research Institute (TRI), TRI-AD, GE Healthcare, HEAVY.AI, Zenotech Ltd., Aon, and Rad AI
Product details
| Instance Size | vCPUs | Instance Memory (GiB) | GPUs – A100 | GPU Memory | Network Bandwidth (Gbps) | GPUDirect RDMA | GPU Peer to Peer | Instance Storage (GB) | EBS Bandwidth (Gbps) |
|---|---|---|---|---|---|---|---|---|---|
| p4d.24xlarge | 96 | 1152 | 8 | 320 GB HBM2 | 400 ENA and EFA | Yes | 600 GB/s NVSwitch | 8 x 1000 NVMe SSD | 19 |
| p4de.24xlarge | 96 | 1152 | 8 | 640 GB HBM2e | 400 ENA and EFA | Yes | 600 GB/s NVSwitch | 8 x 1000 NVMe SSD | 19 |
Getting started with P4d instances for ML
Using Amazon SageMaker
Amazon SageMaker is a fully managed service for building, training, and deploying ML models. When used together with P4d instances, customers can easily scale to tens, hundreds, or thousands of GPUs to train a model quickly at any scale without worrying about setting up clusters and data pipelines.
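For illustration, the sketch below shows the shape of a SageMaker `CreateTrainingJob` request targeting P4d capacity. The training image URI, role ARN, and S3 path are placeholders you would replace with your own resources; the field names follow the SageMaker API.

```python
# Sketch of a SageMaker CreateTrainingJob request on P4d capacity.
# Image URI, role ARN, and S3 paths are placeholder assumptions.
training_job = {
    "TrainingJobName": "example-p4d-training-job",
    "AlgorithmSpecification": {
        "TrainingImage": "<your-training-image-uri>",  # placeholder
        "TrainingInputMode": "File",
    },
    "RoleArn": "<your-sagemaker-execution-role-arn>",  # placeholder
    "ResourceConfig": {
        "InstanceType": "ml.p4d.24xlarge",
        "InstanceCount": 2,       # scale out across multiple P4d instances
        "VolumeSizeInGB": 200,
    },
    "OutputDataConfig": {"S3OutputPath": "s3://<your-bucket>/output/"},
    "StoppingCondition": {"MaxRuntimeInSeconds": 86400},
}
```

Passing this dictionary to a boto3 SageMaker client's `create_training_job` call (with valid values) starts the managed job; increasing `InstanceCount` is how you scale to more GPUs without managing the cluster yourself.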
Using DLAMIs or Deep Learning Containers
DLAMIs provide ML practitioners and researchers with the infrastructure and tools to accelerate DL in the cloud, at any scale. Deep Learning Containers are Docker images preinstalled with DL frameworks that make it easier to deploy custom ML environments quickly by letting you skip the complicated process of building and optimizing your environments from scratch.
Using Amazon EKS or Amazon ECS
If you prefer to manage your own containerized workloads through container orchestration services, you can deploy P4d instances with Amazon EKS or Amazon ECS.
Getting started with P4d instances for HPC
P4d instances are ideal for running engineering simulations, computational finance, seismic analysis, molecular modeling, genomics, rendering, and other GPU-based HPC workloads. HPC applications often require high network performance, fast storage, large amounts of memory, high compute capabilities, or all of the above. P4d instances support EFA, which enables HPC applications using the Message Passing Interface (MPI) to scale to thousands of GPUs. AWS Batch and AWS ParallelCluster help HPC developers quickly build and scale distributed HPC applications.
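As a sketch of how an HPC cluster with P4d compute might be described, the fragment below is a minimal AWS ParallelCluster (v3-style) configuration with a Slurm queue of P4d instances and EFA enabled. The subnet ID and key name are placeholders, and a real configuration would need values appropriate to your account and Region.

```yaml
# Minimal AWS ParallelCluster (v3-style) config sketch; IDs are placeholders.
Region: us-east-1
Image:
  Os: alinux2
HeadNode:
  InstanceType: c5.2xlarge
  Networking:
    SubnetId: subnet-0123456789abcdef0   # placeholder
  Ssh:
    KeyName: your-key-pair               # placeholder
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: gpu
      ComputeResources:
        - Name: p4d
          InstanceType: p4d.24xlarge
          MaxCount: 8
          Efa:
            Enabled: true                # MPI traffic over EFA
      Networking:
        SubnetId: subnet-0123456789abcdef0   # placeholder
        PlacementGroup:
          Enabled: true                  # keep instances close for low latency
```

Enabling EFA and a placement group on the compute queue is what lets MPI collectives scale across instances with low latency.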