Posted On: Oct 4, 2019

Amazon SageMaker now supports ml.p3dn.24xlarge, the most powerful P3 instance optimized for machine learning applications. This instance provides faster networking, which helps remove data transfer bottlenecks and optimizes the utilization of GPUs to deliver maximum performance for training deep learning models.

The ml.p3dn.24xlarge instances provide up to 100 Gbps of networking throughput, 96 custom Intel® Xeon® Scalable (Skylake) vCPUs, 8 NVIDIA® V100 Tensor Core GPUs with 32 GB of memory each, 300 GB/s NVLink GPU interconnect, and 1.8 TB of local NVMe-based SSD storage. Compared to the next largest P3 instance, the 4X increase in network throughput, coupled with faster processors and local NVMe-based SSD storage, enables developers to efficiently distribute their machine learning training jobs across several ml.p3dn.24xlarge instances and remove data transfer and preprocessing bottlenecks.

Below is a comparison of Amazon SageMaker ml.p3dn.24xlarge instances with the existing Amazon SageMaker ML P3 instances.

| ML Instance Type | GPUs (Tesla V100) | GPU Peer-to-Peer | GPU Memory (GB) | vCPUs | Memory (GB) | Network Bandwidth | EBS Bandwidth | Local Instance Storage |
|---|---|---|---|---|---|---|---|---|
| ml.p3.2xlarge | 1 | N/A | 16 | 8 (Broadwell) | 61 | Up to 10 Gbps | 1.5 Gbps | N/A |
| ml.p3.8xlarge | 4 | NVLink | 64 | 32 (Broadwell) | 244 | 10 Gbps | 7 Gbps | N/A |
| ml.p3.16xlarge | 8 | NVLink | 128 | 64 (Broadwell) | 488 | 25 Gbps | 14 Gbps | N/A |
| ml.p3dn.24xlarge | 8 | NVLink | 256 | 96 (Skylake) | 768 | 100 Gbps | 14 Gbps | 2 x 900 GB NVMe SSD |

Amazon SageMaker ml.p3dn.24xlarge instances are available in the US East (N. Virginia) and US West (Oregon) AWS Regions. With these instances, customers can use the 1.8 TB of local NVMe-based SSD storage, eliminating the need to create and pay for additional ML storage volumes. Visit the Amazon SageMaker documentation to learn more about using local NVMe-based SSD storage on this instance type. Visit the P3 page to learn more about how P3 instances are being used by AWS customers.
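As a rough illustration of requesting this instance type, the snippet below sketches the ResourceConfig block of the SageMaker CreateTrainingJob API. The field names follow that API, but the specific values (instance count, volume size) are illustrative assumptions, not recommendations.

```python
# Illustrative ResourceConfig for a SageMaker CreateTrainingJob request.
# Field names follow the CreateTrainingJob API; the values are assumptions.
resource_config = {
    "InstanceType": "ml.p3dn.24xlarge",  # 8x V100 (32 GB each), 100 Gbps networking
    "InstanceCount": 2,                  # distribute training across multiple instances
    "VolumeSizeInGB": 1,                 # keep the additional EBS ML storage volume minimal;
                                         # scratch data can use the 1.8 TB local NVMe SSD
}

# This dict would be passed (e.g., via boto3's "sagemaker" client) as:
#   client.create_training_job(..., ResourceConfig=resource_config, ...)
print(resource_config["InstanceType"])
```

Because the local NVMe SSD storage is attached to the instance itself, the configured EBS ML storage volume can stay small for workloads whose working set fits in the 1.8 TB of local storage.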