AWS Deep Learning Containers v3.4 for MXNet
The AWS Deep Learning Containers for MXNet include containers for Training for CPU and GPU, optimized for performance and scale on AWS.
Release Date: March 20, 2020
Created On: March 20, 2020
Last Updated: March 21, 2020
These Docker images have been tested with Amazon SageMaker, EC2, ECS, and EKS, and provide stable versions of NVIDIA CUDA, cuDNN, Intel MKL, and other required software components to deliver a seamless user experience for deep learning workloads. All software components in these images are scanned for security vulnerabilities and updated or patched in accordance with AWS security best practices.
Release Notes
Security Advisory
- AWS recommends that customers monitor critical security updates in the AWS Security Bulletin
Highlights of the Release
- Introduced SageMaker Python SDK for MXNet Training for Py3 version with sagemaker==1.50.17
- Updated smexperiments package for MXNet Training for Py3 version with smexperiments==0.1.7
- Updated GluonNLP package for MXNet Training with gluonnlp==0.9.1
- Updated SMDebug package for MXNet Training for Py3 version with smdebug==0.7.1
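The SageMaker Python SDK introduced in this release can be used to launch training jobs that run this container. Below is a minimal sketch using the SDK 1.x estimator API; the entry-point script, IAM role, and S3 paths are placeholders, not part of this release.

```python
# Minimal sketch: launch an MXNet 1.6.0 (Py3) training job with the
# SageMaker Python SDK 1.x API (sagemaker==1.50.17 in this release).
# The entry-point script, IAM role ARN, and S3 URI are placeholders.
from sagemaker.mxnet import MXNet

estimator = MXNet(
    entry_point="train.py",                               # hypothetical training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder IAM role
    train_instance_count=1,
    train_instance_type="ml.p3.2xlarge",
    framework_version="1.6.0",                            # MXNet version in this container
    py_version="py3",
    hyperparameters={"epochs": 2, "batch-size": 128},     # example hyperparameters
)

estimator.fit("s3://my-bucket/my-training-data")          # placeholder S3 input
```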
Prepackaged Deep Learning Frameworks Included
- Apache MXNet: MXNet is a flexible, efficient, portable, and scalable open source library for deep learning. It supports declarative and imperative programming models across a wide variety of programming languages, making it powerful yet simple to code deep learning applications. MXNet is efficient, inherently supporting automatic parallel scheduling of portions of source code that can be parallelized over a distributed environment. MXNet is also portable, using memory optimizations that allow it to run on anything from mobile phones to full servers.
- branch/tag used: 1.6.0
- Justification: AWS MXNet 1.6.0 with improved functionality (DGL Preview, NumPy-like operators)
- Supported with CUDA 10.1 and Intel MKL-DNN
- Keras: Deep Learning Library for Python
- MXNet integration with Keras v2.2.4.2
- Justification: Stable release
- Horovod: Horovod is a distributed training framework. The goal of Horovod is to make it easy to take a single-GPU deep learning program and train it on many GPUs. Horovod nodes communicate directly with each other instead of going through a centralized node and average gradients using the ring-allreduce algorithm. A minimal usage sketch follows this list.
- branch/tag used: v0.19.0
- Justification: Stable and well tested
- GluonNLP: GluonNLP provides implementations of state-of-the-art deep learning models in NLP, as well as building blocks for text data pipelines and models. It is designed for engineers, researchers, and students to quickly prototype research ideas and products based on these models.
- branch/tag used: v0.9.1
- Justification: Stable and well tested
- SageMaker Python SDK: SageMaker Python SDK is an open source library for training and deploying machine learning models on Amazon SageMaker. With the SDK, you can train and deploy models using the popular deep learning frameworks Apache MXNet, PyTorch, and TensorFlow.
- branch/tag used: v1.50.17
- Justification: Stable and well tested
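As a companion to the Horovod description above, the following is a minimal sketch of the ring-allreduce training pattern using the horovod.mxnet API shipped in this container (horovod==0.19.0); the model, data, and hyperparameters are placeholders. A script like this would typically be launched with one process per GPU, for example via horovodrun or MPI.

```python
# Minimal sketch: distributed data-parallel training with Horovod and Gluon.
# The model and the random batch below are placeholders.
import mxnet as mx
from mxnet import autograd, gluon
import horovod.mxnet as hvd

hvd.init()                                                    # one process per device
ctx = mx.gpu(hvd.local_rank()) if mx.context.num_gpus() > 0 else mx.cpu()

net = gluon.nn.Dense(10, in_units=20)                         # placeholder model
net.initialize(ctx=ctx)

# Keep all workers consistent by broadcasting the initial parameters from rank 0.
hvd.broadcast_parameters(net.collect_params(), root_rank=0)

# Wrap the optimizer so gradients are averaged with ring-allreduce.
trainer = hvd.DistributedTrainer(net.collect_params(), "sgd",
                                 optimizer_params={"learning_rate": 0.01 * hvd.size()})

loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
data = mx.nd.random.uniform(shape=(32, 20), ctx=ctx)          # placeholder batch
label = mx.nd.random.randint(0, 10, shape=(32,), ctx=ctx).astype("float32")

with autograd.record():
    loss = loss_fn(net(data), label)
loss.backward()
trainer.step(data.shape[0])
print("rank", hvd.rank(), "loss", loss.mean().asscalar())
```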
Bill of Materials: List of all components
- CPU: Training container
- aws-mxnet-mkl==1.6.0
- sagemaker-containers==2.8.1
- sagemaker-mxnet-training==3.1.6
- sagemaker==1.50.17 (for py3 only)
- horovod==0.19.0
- gluonnlp==0.9.1 (for py3 only)
- scipy==1.2.2
- scikit-learn==0.20.4
- pandas==0.24.2 (for py2)
- pandas==0.25.1 (for py3)
- Pillow==6.2.0
- h5py==2.10.0
- keras-mxnet==2.2.4.2
- requests==2.22.0
- numpy==1.16.5 (for py2)
- numpy==1.17.2 (for py3)
- smdebug==0.7.1 (for py3 only)
- smexperiments==0.1.7 (for py3 only)
- dgl==0.4.1 (for py3 only)
- awscli==1.18.21 (for py3 only)
- GPU: Training container
- cuda-command-line-tools-10-1
- libcudnn7=7.6.0.64-1+cuda10.1
- libnccl2=2.4.8-1+cuda10.1
- libcublas-10.2.1.243
- cuda-cufft-10-1
- cuda-curand-10-1
- cuda-cusolver-10-1
- cuda-cusparse-10-1
- aws-mxnet-cu101mkl==1.6.0
- sagemaker-containers==2.8.1
- sagemaker-mxnet-training==3.1.6
- sagemaker==1.50.17 (for py3 only)
- gluonnlp==0.9.1 (for py3 only)
- horovod==0.19.0
- scipy==1.2.2
- scikit-learn==0.20.4
- pandas==0.24.2 (for py2)
- pandas==0.25.1 (for py3)
- Pillow==6.2.0
- h5py==2.10.0
- keras-mxnet==2.2.4.2
- requests==2.22.0
- numpy==1.16.5 (for py2)
- numpy==1.17.2 (for py3)
- smdebug==0.7.1 (for py3 only)
- smexperiments==0.1.7 (for py3 only)
- dgl-cu101==0.4.1 (for py3 only)
- awscli==1.18.21 (for py3 only)
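Inside a running container, the versions listed in this bill of materials can be confirmed directly. A small sketch; the package selection below is illustrative, not exhaustive.

```python
# Minimal sketch: print installed versions of a few packages from the bill of
# materials to confirm which build of the container is in use.
import pkg_resources

for pkg in ("aws-mxnet-cu101mkl", "sagemaker", "sagemaker-mxnet-training",
            "horovod", "gluonnlp", "keras-mxnet", "smdebug", "numpy"):
    try:
        print(pkg, pkg_resources.get_distribution(pkg).version)
    except pkg_resources.DistributionNotFound:
        print(pkg, "not installed (e.g. CPU image or Python 2 image)")
```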
Python 2.7 and Python 3.6 Support
Python 2.7 and Python 3.6 are supported in the containers for all of these installed Deep Learning Frameworks:
- MXNet
- Keras
End of Life Notices
The Python open source community officially ended support for Python 2 on January 1, 2020. The MXNet community has also announced that the MXNet 1.6.0 release will be the last one supporting Python 2. DLC releases for subsequent versions of the MXNet framework will not contain Python 2 containers. Updates to the Python 2 DLC will be provided on previously published DLC versions only if there are security fixes published by the open source community for those versions. Previous releases of the MXNet DLC that contain Python 2 will continue to be available.
CPU Instance Type Support
The containers support CPU instance types. MXNet is built with support for the Intel MKL-DNN v1.0 library.
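This can be confirmed from inside a CPU container; a minimal sketch using the mxnet.runtime feature flags:

```python
# Minimal sketch: check whether the installed MXNet build has MKL-DNN enabled.
from mxnet.runtime import Features

print("MKLDNN enabled:", Features().is_enabled("MKLDNN"))
```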
GPU Instance Type support
The containers support GPU instance types and contain the following software components for GPU support:
- CUDA 10.1 / cuDNN 7.6.5.32-1+cuda10.1 / NCCL 2.4.8-1+cuda10.1
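Similarly, the GPU build flags and visible devices can be checked from inside a GPU container; a minimal sketch:

```python
# Minimal sketch: confirm CUDA/cuDNN/NCCL build flags and visible GPUs.
import mxnet as mx
from mxnet.runtime import Features

features = Features()
for flag in ("CUDA", "CUDNN", "NCCL"):
    print(flag, "enabled:", features.is_enabled(flag))
print("Visible GPUs:", mx.context.num_gpus())

# Small smoke test on GPU 0 (only meaningful when at least one GPU is visible).
if mx.context.num_gpus() > 0:
    x = mx.nd.ones((2, 2), ctx=mx.gpu(0))
    print((x * 2).asnumpy())
```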
AWS Regions support
Available in the following regions:
| Region | Code |
| --- | --- |
| US East (Ohio) | us-east-2 |
| US East (N. Virginia) | us-east-1 |
| US West (Oregon) | us-west-2 |
| US West (N. California) | us-west-1 |
| Asia Pacific (Mumbai) | ap-south-1 |
| Asia Pacific (Seoul) | ap-northeast-2 |
| Asia Pacific (Singapore) | ap-southeast-1 |
| Asia Pacific (Sydney) | ap-southeast-2 |
| Asia Pacific (Tokyo) | ap-northeast-1 |
| Canada (Central) | ca-central-1 |
| EU (Frankfurt) | eu-central-1 |
| EU (Ireland) | eu-west-1 |
| EU (London) | eu-west-2 |
| EU (Paris) | eu-west-3 |
| SA (Sao Paulo) | sa-east-1 |
| EU (Stockholm) | eu-north-1 |
| AP East (Hong Kong) | ap-east-1 |
| ME South (Bahrain) | me-south-1 |
Build and Test
- Built on: c5.18xlarge
- Tested on: c4.8xlarge, c5.18xlarge, g3.16xlarge, m4.16xlarge, p2.16xlarge, p3.16xlarge, p3dn.24xlarge
- Tested with MNIST and ResNet-50/ImageNet datasets on EC2, ECS AMI (Amazon Linux AMI 2.0.20190614), EKS AMI (1.11-v20190614), and Amazon SageMaker.
Known Issues
- Training certain models (e.g. ImageNet) with multiple GPUs and very large batch sizes may exhibit a larger memory footprint compared to previous versions. If you experience an out-of-memory error, we suggest reducing the training batch size.