AWS Deep Learning AMI GPU PyTorch 2.0 (Ubuntu 20.04)
Release Date: March 28, 2023
Created On: March 27, 2023
Last Updated: March 15, 2024
For help getting started, please see the AWS Deep Learning AMI Developer Guide.
AMI Name format:
- Deep Learning Proprietary Nvidia Driver AMI GPU PyTorch 2.0.${PATCH_VERSION} (Ubuntu 20.04) ${YYYY-MM-DD}
- Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.0.${PATCH_VERSION} (Ubuntu 20.04) ${YYYY-MM-DD}
The AMI includes the following:
- Supported AWS Service: EC2
- Operating System: Ubuntu 20.04
- Compute Architecture: x86
- Supported EC2 Instances: (P2 not supported)
- Please refer to Important changes to DLAMI
- OSS Nvidia driver DLAMIs are recommended to be used for G4dn, G5, P4, P5.
- Deep Learning with Proprietary Nvidia Driver supports P3, P3dn, G3, G4dn, G5.
- Deep Learning with OSS Nvidia Driver supports G4dn, G5, P4, P5.
- Python: /opt/conda/envs/pytorch/bin/python
- NVIDIA Driver: 535.54.03
- NVIDIA CUDA 12.1 stack:
- CUDA, NCCL and cuDNN installation path: /usr/local/cuda-12.1/
- Default CUDA: 12.1
- PATH /usr/local/cuda points to /usr/local/cuda-12.1/
- Updated below env vars:
- LD_LIBRARY_PATH to have /usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/local/cuda:/usr/local/cuda/targets/x86_64-linux/lib
- PATH to have /usr/local/cuda/bin/:/usr/local/cuda/include/
- Compiled NCCL Version for 12.1: 2.18.3
- Note: PyTorch package comes with statically linked custom NCCL 2.18.3 and it won’t use system NCCL.
- NCCL Tests Location:
- all_reduce, all_gather and reduce_scatter: /usr/local/cuda-xx.x/efa/test-cuda-xx.x/
- To run NCCL tests, LD_LIBRARY_PATH is already updated with the needed paths.
- Common PATHs are already added to LD_LIBRARY_PATH:
- /opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/opt/aws-ofi-nccl/lib:/usr/local/lib:/usr/lib
- LD_LIBRARY_PATH is updated with CUDA version paths
- /usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/local/cuda:/usr/local/cuda/targets/x86_64-linux/lib
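As a quick sanity check, the colon-separated LD_LIBRARY_PATH value can be compared against the entries listed above; a minimal sketch (the helper name and sample string below are illustrative, not part of the AMI):

```python
def missing_paths(ld_library_path, expected):
    """Return the expected entries absent from a colon-separated path string."""
    entries = ld_library_path.split(":")
    return [p for p in expected if p not in entries]

# Illustrative value; on the AMI you would pass os.environ["LD_LIBRARY_PATH"].
sample = "/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/usr/local/lib"
print(missing_paths(sample, ["/opt/aws-ofi-nccl/lib", "/usr/local/lib", "/usr/lib"]))
# → ['/opt/aws-ofi-nccl/lib', '/usr/lib']
```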
- EFA Installer: 1.29.0
- Nvidia GDRCopy: 2.4
- AWS OFI NCCL: 1.7.2-aws
- Installation path: /opt/aws-ofi-nccl/ . Path /opt/aws-ofi-nccl/lib is added to LD_LIBRARY_PATH.
- Tests path for ring, message_transfer: /opt/aws-ofi-nccl/tests
- Note: The PyTorch package also comes with a dynamically linked AWS OFI NCCL plugin as the conda package aws-ofi-nccl-dlc, and PyTorch will use that package instead of the system AWS OFI NCCL.
- EBS volume type: gp3
- Python version: 3.10
- Query AMI-ID with AWSCLI (example region is us-east-1):
- OSS Nvidia Driver (Recommended to be used for G4dn, G5, P4, P5):
- aws ec2 describe-images --region us-east-1 --owners amazon --filters 'Name=name,Values=Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.0.? (Ubuntu 20.04) ????????' 'Name=state,Values=available' --query 'reverse(sort_by(Images, &CreationDate))[:1].ImageId' --output text
- Proprietary Nvidia Driver:
- aws ec2 describe-images --region us-east-1 --owners amazon --filters 'Name=name,Values=Deep Learning Proprietary Nvidia Driver AMI GPU PyTorch 2.0.? (Ubuntu 20.04) ????????' 'Name=state,Values=available' --query 'reverse(sort_by(Images, &CreationDate))[:1].ImageId' --output text
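The --query expression reverse(sort_by(Images, &CreationDate))[:1].ImageId returns the most recently created matching AMI. The same newest-first selection can be sketched client-side on the JSON that describe-images returns (the image IDs and dates below are made up for illustration):

```python
# Hypothetical describe-images output; IDs and dates are illustrative only.
response = {
    "Images": [
        {"ImageId": "ami-aaa", "CreationDate": "2023-12-05T10:00:00.000Z"},
        {"ImageId": "ami-bbb", "CreationDate": "2024-03-12T10:00:00.000Z"},
        {"ImageId": "ami-ccc", "CreationDate": "2023-09-26T10:00:00.000Z"},
    ]
}

# Equivalent of reverse(sort_by(Images, &CreationDate))[:1].ImageId:
# ISO-8601 timestamps compare correctly as plain strings.
newest = max(response["Images"], key=lambda img: img["CreationDate"])
print(newest["ImageId"])  # → ami-bbb
```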
Note
P5 Instance:
- DeviceIndex is unique to each NetworkCard, and must be a non-negative integer less than the limit of ENIs per NetworkCard. On P5, the number of ENIs per NetworkCard is 2, meaning that the only valid values for DeviceIndex are 0 and 1. Below is an example EC2 P5 instance launch command using the AWS CLI, showing NetworkCardIndex values 0-31 with DeviceIndex 0 for the first interface and DeviceIndex 1 for the remaining 31 interfaces.
aws ec2 run-instances --region $REGION \
--instance-type $INSTANCETYPE \
--image-id $AMI --key-name $KEYNAME \
--iam-instance-profile "Name=dlami-builder" \
--tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=$TAG}]" \
--network-interfaces "NetworkCardIndex=0,DeviceIndex=0,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
"NetworkCardIndex=1,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
"NetworkCardIndex=2,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
"NetworkCardIndex=3,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
"NetworkCardIndex=4,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
....
....
....
"NetworkCardIndex=31,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa"
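Since the 32 --network-interfaces arguments follow a simple pattern (DeviceIndex 0 only on the first card, 1 on all others), they can be generated rather than typed; a sketch, where the $SG and $SUBNET placeholders stand in for a real security group and subnet ID:

```python
def p5_network_interfaces(sg="$SG", subnet="$SUBNET", cards=32):
    """Build the --network-interfaces arguments for a P5 launch."""
    args = []
    for card in range(cards):
        # Only the first network card uses DeviceIndex=0; all others use 1.
        device_index = 0 if card == 0 else 1
        args.append(
            f"NetworkCardIndex={card},DeviceIndex={device_index},"
            f"Groups={sg},SubnetId={subnet},InterfaceType=efa"
        )
    return args

interfaces = p5_network_interfaces()
print(len(interfaces))   # → 32
print(interfaces[0])
```

The resulting strings can be passed directly after --network-interfaces in the run-instances command above.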
Horovod:
- Horovod is supported in the current pytorch conda environment on the DLAMI. However, Horovod will be removed from the conda environment in the upcoming PyTorch 2.1 release. Customers will be able to install the Horovod libraries on their DLAMIs by following the Horovod guidelines for their distributed training jobs.
Version 2.0.1
Release Date: 2024-03-12
AMI Names:
- Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.0.1 (Ubuntu 20.04) 20240312
Updated
- Updated the DLAMI with the OSS Nvidia driver to add G4dn and G5 support; current support now looks like below:
- Deep Learning with Proprietary Nvidia Driver supports P3, P3dn, G3, G4dn, G5.
- Deep Learning with OSS Nvidia Driver supports G4dn, G5, P4, P5.
- OSS Nvidia driver DLAMIs are recommended to be used for G4dn, G5, P4, P5.
Version 2.0.1
Release Date: 2023-12-05
AMI Names:
Please refer to Important changes to DLAMI
- Deep Learning Proprietary Nvidia Driver AMI GPU PyTorch 2.0.1 (Ubuntu 20.04) 20231205
- Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.0.1 (Ubuntu 20.04) 20231205
Added
- AWS Deep Learning AMI (DLAMI) is split into two separate groups:
- DLAMI that uses Nvidia Proprietary Driver (to support P3, P3dn, G3, G5, G4dn).
- DLAMI that uses Nvidia OSS Driver to enable EFA (to support P4, P5).
- Please refer to the public announcement for more information on the DLAMI split.
- AWS cli queries for above are in the release notes under bullet point Query AMI-ID with AWSCLI (example region is us-east-1)
Updated
- EFA updated from 1.26.1 to 1.29.0
- GDRCopy updated from 2.3 to 2.4
Version 2.0.1
Release Date: 2023-09-29
AMI Name: Deep Learning AMI GPU PyTorch 2.0.1 (Ubuntu 20.04) 20230926
Added
- Added net.naming-scheme changes to fix an unpredictable network interface naming issue (link) seen on P5. This change is made by setting net.naming-scheme=v247 in the Linux boot arguments in the file /etc/default/grub
- Updated AWS OFI NCCL plugin from v1.7.1 to v1.7.2
Version 2.0.1
Release Date: 2023-08-22
AMI Name: Deep Learning AMI GPU PyTorch 2.0.1 (Ubuntu 20.04) 20230822
Updated
- Added P5 EC2 instance support
- Updated PyTorch 2.0.1 from supporting CUDA 11.8 to supporting CUDA 12.1
- Updated NCCL from 2.16.2 to 2.18.3
- Updated AWS OFI NCCL plugin v1.5.0 to v1.7.1
- Updated EFA plugin from 1.21.0 to 1.24.1
- Nvidia driver updated from 525.85.12 to 535.54.03
- Added c-state changes to disable idle state of processor by setting the max c-state to C1. This change is made by setting `intel_idle.max_cstate=1 processor.max_cstate=1` in the linux boot arguments in file /etc/default/grub
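As an illustration, the resulting kernel command line in /etc/default/grub would look roughly like the fragment below (the other flags on the line are placeholders for whatever the image already sets; the change takes effect after running update-grub and rebooting):

```
# /etc/default/grub (sketch; surrounding flags are illustrative)
GRUB_CMDLINE_LINUX_DEFAULT="console=tty1 console=ttyS0 intel_idle.max_cstate=1 processor.max_cstate=1"
```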
- For P5 instances: DeviceIndex is unique to each NetworkCard, and must be a non-negative integer less than the limit of ENIs per NetworkCard. On P5, the number of ENIs per NetworkCard is 2, meaning that the only valid values for DeviceIndex are 0 and 1. Below is an example EC2 P5 instance launch command using the AWS CLI, showing NetworkCardIndex values 0-31 with DeviceIndex 0 for the first interface and DeviceIndex 1 for the remaining 31 interfaces.
- aws ec2 run-instances --region $REGION \
--instance-type $INSTANCETYPE \
--image-id $AMI --key-name $KEYNAME \
--iam-instance-profile "Name=dlami-builder" \
--tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=$TAG}]" \
--network-interfaces "NetworkCardIndex=0,DeviceIndex=0,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
"NetworkCardIndex=1,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
"NetworkCardIndex=2,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
"NetworkCardIndex=3,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
"NetworkCardIndex=4,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
....
....
....
"NetworkCardIndex=31,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa"
Known Issue
- A “CUDA error: driver shutting down” can be observed with autograd when not explicitly passing device into PyTorch calls such as torch.tensor(). This error impacts pytorch-2.0.1 built with CUDA 12.1 support. The workaround is to explicitly pass a device into PyTorch calls; for more details please check this post. This will be fixed in our next release.
- Installing pillow>=9.5.0 from conda-forge along with torchvision can cause a dependency conflict due to the deprecation of jpeg in the conda-forge ecosystem in favor of libjpeg-turbo. We have notified the torchvision maintainers about this issue via vision#7660; the official fix/migration to libjpeg-turbo will come in early October 2023.
- Conda installing opencv along with the latest AWS PyTorch release can cause a dependency conflict. You can work around this problem by using pip to install opencv.
Version 2.0.1
Release Date: 2023-06-12
AMI Name: Deep Learning AMI GPU PyTorch 2.0.1 (Ubuntu 20.04) 20230609
Updated
- Updated PyTorch from 2.0.0 to 2.0.1
Version 2.0.0
Release Date: 2023-03-28
AMI Name: Deep Learning AMI GPU PyTorch 2.0.0 (Ubuntu 20.04) 20230328
Added
- Initial release of the Deep Learning AMI GPU PyTorch 2.0.0 (Ubuntu 20.04) series, including a pytorch conda environment complemented with NVIDIA Driver R525, cuda=11.8.0, cudnn=8.7.0, nccl=2.16.2, and efa=1.21.0.
- Added system custom NCCL 2.16.2 supporting the dynamic buffer depth patch for CUDA version 11.8.
- This is added to run NCCL tests located at /usr/local/cuda-11.8/efa/test-cuda-11.8/
- Custom NCCL source code available at: https://github.com/NVIDIA/nccl/tree/inc_nsteps
- Also, the PyTorch package comes with a statically linked custom NCCL 2.16.2 and won’t use the system custom NCCL.
- AWS OFI NCCL is added to run NCCL tests. The PyTorch package also comes with a dynamically linked AWS OFI NCCL plugin as the conda package aws-ofi-nccl-dlc, and PyTorch will use that package instead of the system AWS OFI NCCL.
- Torch.Compile Support
- PT 2.0 includes the use of torch.compile() for training. Please refer to PyTorch Documentation here on usage.
- Torch.compile is a beta feature of PyTorch 2.0 and we are continuing to test it extensively on AWS. As of this release:
- Torch.compile has been tested on P4, P3 and G5 instances.
- Torch.compile has been tested with its default setting, using TorchInductor as the backend.
- Torch.compile has been tested at float32 precision for training.
- NOTE: The “Triton”-based TorchInductor is not supported on G3 instances. In this case, “eager mode” will need to be used. Please see OpenAI Triton GPU compatibility here.
Removed
- We have temporarily not added the fastai package due to its pending inductor backend support. Once fastai adds support, we will add it back in upcoming releases.