AWS Deep Learning AMI GPU PyTorch 2.0 (Ubuntu 20.04)
Release Date: March 28, 2023
Created On: March 27, 2023
Last Updated: March 15, 2024
For help getting started, please see the AWS Deep Learning AMI Developer Guide.
AMI Name format:
- Deep Learning Proprietary Nvidia Driver AMI GPU PyTorch 2.0.${PATCH_VERSION} (Ubuntu 20.04) ${YYYY-MM-DD}
- Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.0.${PATCH_VERSION} (Ubuntu 20.04) ${YYYY-MM-DD}
The AMI includes the following:
- Supported AWS Service: EC2
- Operating System: Ubuntu 20.04
- Compute Architecture: x86
- Supported EC2 Instances: (P2 not supported)
- Please refer to Important changes to DLAMI
- OSS Nvidia driver DLAMIs are recommended to be used for G4dn, G5, P4, P5.
- Deep Learning with Proprietary Nvidia Driver supports P3, P3dn, G3, G4dn, G5.
- Deep Learning with OSS Nvidia Driver supports G4dn, G5, P4, P5.
- Python: /opt/conda/envs/pytorch/bin/python
- NVIDIA Driver: 535.54.03
- NVIDIA CUDA 12.1 stack:
- CUDA, NCCL and cuDNN installation path: /usr/local/cuda-12.1/
- Default CUDA: 12.1
- PATH /usr/local/cuda points to /usr/local/cuda-12.1/
- Updated below env vars:
- LD_LIBRARY_PATH to have /usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/local/cuda:/usr/local/cuda/targets/x86_64-linux/lib
- PATH to have /usr/local/cuda/bin/:/usr/local/cuda/include/
- Compiled NCCL Version for 12.1: 2.18.3
- Note: PyTorch package comes with statically linked custom NCCL 2.18.3 and it won’t use system NCCL.
- NCCL Tests Location:
- all_reduce, all_gather and reduce_scatter: /usr/local/cuda-xx.x/efa/test-cuda-xx.x/
- To run NCCL tests, LD_LIBRARY_PATH is already updated with the needed paths.
- Common PATHs are already added to LD_LIBRARY_PATH:
- /opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/opt/aws-ofi-nccl/lib:/usr/local/lib:/usr/lib
- LD_LIBRARY_PATH is updated with CUDA version paths
- /usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/local/cuda:/usr/local/cuda/targets/x86_64-linux/lib
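As a quick sanity check, the colon-separated LD_LIBRARY_PATH value can be compared against the entries listed above; a minimal sketch (the helper name and sample string below are illustrative, not part of the AMI):

```python
def missing_paths(ld_library_path, expected):
    """Return the expected entries absent from a colon-separated path string."""
    entries = ld_library_path.split(":")
    return [p for p in expected if p not in entries]

# Illustrative value; on the AMI you would pass os.environ["LD_LIBRARY_PATH"].
sample = "/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/usr/local/lib"
print(missing_paths(sample, ["/opt/aws-ofi-nccl/lib", "/usr/local/lib", "/usr/lib"]))
# → ['/opt/aws-ofi-nccl/lib', '/usr/lib']
```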
- EFA Installer: 1.29.0
- Nvidia GDRCopy: 2.4
- AWS OFI NCCL: 1.7.2-aws
- Installation path: /opt/aws-ofi-nccl/ . Path /opt/aws-ofi-nccl/lib is added to LD_LIBRARY_PATH.
- Tests path for ring, message_transfer: /opt/aws-ofi-nccl/tests
- Note: The PyTorch package also comes with a dynamically linked AWS OFI NCCL plugin as the conda package aws-ofi-nccl-dlc, and PyTorch will use that package instead of the system AWS OFI NCCL.
- EBS volume type: gp3
- Python version: 3.10
- Query AMI-ID with AWSCLI (example region is us-east-1):
- OSS Nvidia Driver (Recommended to be used for G4dn, G5, P4, P5):
- aws ec2 describe-images --region us-east-1 --owners amazon --filters 'Name=name,Values=Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.0.? (Ubuntu 20.04) ????????' 'Name=state,Values=available' --query 'reverse(sort_by(Images, &CreationDate))[:1].ImageId' --output text
- Proprietary Nvidia Driver:
- aws ec2 describe-images --region us-east-1 --owners amazon --filters 'Name=name,Values=Deep Learning Proprietary Nvidia Driver AMI GPU PyTorch 2.0.? (Ubuntu 20.04) ????????' 'Name=state,Values=available' --query 'reverse(sort_by(Images, &CreationDate))[:1].ImageId' --output text
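The --query expression reverse(sort_by(Images, &CreationDate))[:1].ImageId returns the most recently created matching AMI. The same newest-first selection can be sketched client-side on the JSON that describe-images returns (the image IDs and dates below are made up for illustration):

```python
# Hypothetical describe-images output; IDs and dates are illustrative only.
response = {
    "Images": [
        {"ImageId": "ami-aaa", "CreationDate": "2023-12-05T10:00:00.000Z"},
        {"ImageId": "ami-bbb", "CreationDate": "2024-03-12T10:00:00.000Z"},
        {"ImageId": "ami-ccc", "CreationDate": "2023-09-26T10:00:00.000Z"},
    ]
}

# Equivalent of reverse(sort_by(Images, &CreationDate))[:1].ImageId:
# ISO-8601 timestamps compare correctly as plain strings.
newest = max(response["Images"], key=lambda img: img["CreationDate"])
print(newest["ImageId"])  # → ami-bbb
```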
Note
P5 Instance:
- DeviceIndex is unique to each NetworkCard, and must be a non-negative integer less than the limit of ENIs per NetworkCard. On P5, the number of ENIs per NetworkCard is 2, meaning that the only valid values for DeviceIndex are 0 and 1. Below is an example EC2 P5 instance launch command using the AWS CLI, showing NetworkCardIndex values 0-31 with DeviceIndex 0 for the first interface and DeviceIndex 1 for the remaining 31 interfaces.
aws ec2 run-instances --region $REGION \
--instance-type $INSTANCETYPE \
--image-id $AMI --key-name $KEYNAME \
--iam-instance-profile "Name=dlami-builder" \
--tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=$TAG}]" \
--network-interfaces "NetworkCardIndex=0,DeviceIndex=0,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
"NetworkCardIndex=1,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
"NetworkCardIndex=2,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
"NetworkCardIndex=3,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
"NetworkCardIndex=4,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
....
....
....
"NetworkCardIndex=31,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa"
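Since the 32 --network-interfaces arguments follow a simple pattern (DeviceIndex 0 only on the first card, 1 on all others), they can be generated rather than typed; a sketch, where the $SG and $SUBNET placeholders stand in for a real security group and subnet ID:

```python
def p5_network_interfaces(sg="$SG", subnet="$SUBNET", cards=32):
    """Build the --network-interfaces arguments for a P5 launch."""
    args = []
    for card in range(cards):
        # Only the first network card uses DeviceIndex=0; all others use 1.
        device_index = 0 if card == 0 else 1
        args.append(
            f"NetworkCardIndex={card},DeviceIndex={device_index},"
            f"Groups={sg},SubnetId={subnet},InterfaceType=efa"
        )
    return args

interfaces = p5_network_interfaces()
print(len(interfaces))   # → 32
print(interfaces[0])
```

The resulting strings can be passed directly after --network-interfaces in the run-instances command above.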
Horovod:
- Horovod is supported in the current pytorch conda environment on the DLAMI. However, Horovod will be removed from the conda environment in the upcoming PyTorch 2.1 release. Customers will be able to install the Horovod libraries on their DLAMIs by following the Horovod guidelines for their distributed training jobs.
Version 2.0.1
Release Date: 2024-03-12
AMI Names:
- Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.0.1 (Ubuntu 20.04) 20240312
Updated
- Updated the DLAMI with the OSS Nvidia driver to add G4dn and G5 support; current support now looks like below:
- Deep Learning with Proprietary Nvidia Driver supports P3, P3dn, G3, G4dn, G5.
- Deep Learning with OSS Nvidia Driver supports G4dn, G5, P4, P5.
- OSS Nvidia driver DLAMIs are recommended to be used for G4dn, G5, P4, P5.
Version 2.0.1
Release Date: 2023-12-05
AMI Names:
Please refer to Important changes to DLAMI
- Deep Learning Proprietary Nvidia Driver AMI GPU PyTorch 2.0.1 (Ubuntu 20.04) 20231205
- Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.0.1 (Ubuntu 20.04) 20231205
Added
- AWS Deep Learning AMI (DLAMI) is split into two separate groups:
- DLAMI that uses Nvidia Proprietary Driver (to support P3, P3dn, G3, G5, G4dn).
- DLAMI that uses Nvidia OSS Driver to enable EFA (to support P4, P5).
- Please refer to the public announcement for more information on the DLAMI split.
- AWS cli queries for above are in the release notes under bullet point Query AMI-ID with AWSCLI (example region is us-east-1)
Updated
- EFA updated from 1.26.1 to 1.29.0
- GDRCopy updated from 2.3 to 2.4
Version 2.0.1
Release Date: 2023-09-29
AMI Name: Deep Learning AMI GPU PyTorch 2.0.1 (Ubuntu 20.04) 20230926
Added
- Added net.naming-scheme changes to fix an unpredictable network interface naming issue (link) seen on P5. This change is made by setting net.naming-scheme=v247 in the Linux boot arguments in the file /etc/default/grub
- Updated AWS OFI NCCL plugin from v1.7.1 to v1.7.2
Version 2.0.1
Release Date: 2023-08-22
AMI Name: Deep Learning AMI GPU PyTorch 2.0.1 (Ubuntu 20.04) 20230822
Updated
- Added P5 EC2 instance support
- Updated PyTorch 2.0.1 from supporting CUDA 11.8 to supporting CUDA 12.1
- Updated NCCL from 2.16.2 to 2.18.3
- Updated AWS OFI NCCL plugin v1.5.0 to v1.7.1
- Updated EFA plugin from 1.21.0 to 1.24.1
- Nvidia driver updated from 525.85.12 to 535.54.03
- Added c-state changes to disable idle state of processor by setting the max c-state to C1. This change is made by setting `intel_idle.max_cstate=1 processor.max_cstate=1` in the linux boot arguments in file /etc/default/grub
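As an illustration, the resulting kernel command line in /etc/default/grub would look roughly like the fragment below (the other flags on the line are placeholders for whatever the image already sets; the change takes effect after running update-grub and rebooting):

```
# /etc/default/grub (sketch; surrounding flags are illustrative)
GRUB_CMDLINE_LINUX_DEFAULT="console=tty1 console=ttyS0 intel_idle.max_cstate=1 processor.max_cstate=1"
```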
- For P5 instances: DeviceIndex is unique to each NetworkCard, and must be a non-negative integer less than the limit of ENIs per NetworkCard. On P5, the number of ENIs per NetworkCard is 2, meaning that the only valid values for DeviceIndex are 0 and 1. Below is an example EC2 P5 instance launch command using the AWS CLI, showing NetworkCardIndex values 0-31 with DeviceIndex 0 for the first interface and DeviceIndex 1 for the remaining 31 interfaces.
- aws ec2 run-instances --region $REGION \
--instance-type $INSTANCETYPE \
--image-id $AMI --key-name $KEYNAME \
--iam-instance-profile "Name=dlami-builder" \
--tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=$TAG}]" \
--network-interfaces "NetworkCardIndex=0,DeviceIndex=0,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
"NetworkCardIndex=1,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
"NetworkCardIndex=2,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
"NetworkCardIndex=3,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
"NetworkCardIndex=4,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
....
....
....
"NetworkCardIndex=31,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa"
Known Issue
- A “CUDA error: driver shutting down” can be observed with autograd when not explicitly passing device into PyTorch calls such as torch.tensor(). This error impacts pytorch-2.0.1 built with CUDA 12.1 support. The workaround is to explicitly pass a device into PyTorch calls; for more details please check this post. This will be fixed in our next release.
- Installing pillow>=9.5.0 from conda-forge along with torchvision can cause a dependency conflict due to the deprecation of jpeg in the conda-forge ecosystem in favor of libjpeg-turbo. We have notified the torchvision maintainers about this issue via vision#7660; the official fix/migration to libjpeg-turbo will come in early October 2023.
- Conda installing opencv along with the latest AWS PyTorch release can cause a dependency conflict. You can work around this problem by using pip to install opencv.
Version 2.0.1
Release Date: 2023-06-12
AMI Name: Deep Learning AMI GPU PyTorch 2.0.1 (Ubuntu 20.04) 20230609
Updated
- Updated PyTorch from 2.0.0 to 2.0.1
Version 2.0.0
Release Date: 2023-03-28
AMI Name: Deep Learning AMI GPU PyTorch 2.0.0 (Ubuntu 20.04) 20230328
Added
- Initial release of the Deep Learning AMI GPU PyTorch 2.0.0 (Ubuntu 20.04) series, including a pytorch conda environment complemented with NVIDIA Driver R525, cuda=11.8.0, cudnn=8.7.0, nccl=2.16.2, and efa=1.21.0.
- Added system custom NCCL 2.16.2 supporting the dynamic buffer depth patch for CUDA version 11.8.
- This is added to run NCCL tests located at /usr/local/cuda-11.8/efa/test-cuda-11.8/
- Custom NCCL source code available at: https://github.com/NVIDIA/nccl/tree/inc_nsteps
- Also, the PyTorch package comes with a statically linked custom NCCL 2.16.2 and won’t use the system custom NCCL.
- AWS OFI NCCL is added to run NCCL tests. The PyTorch package also comes with a dynamically linked AWS OFI NCCL plugin as the conda package aws-ofi-nccl-dlc, and PyTorch will use that package instead of the system AWS OFI NCCL.
- Torch.Compile Support
- PT 2.0 includes the use of torch.compile() for training. Please refer to PyTorch Documentation here on usage.
- Torch.compile is a beta feature of PyTorch 2.0 and we are continuing to test it extensively on AWS. As of this release:
- Torch.compile has been tested on P4, P3 and G5 instances.
- Torch.compile has been tested with its default setting, using TorchInductor as the backend.
- Torch.compile has been tested at float32 precision for training.
- NOTE: The “Triton”-based TorchInductor is not supported on G3 instances. In this case, “eager mode” will need to be used. Please see OpenAI Triton GPU compatibility here.
Removed
- We have temporarily not added the fastai package due to its pending inductor backend support. Once fastai adds support, we will add it back in upcoming releases.