Posted On: Jul 8, 2022

Amazon SageMaker model training now supports heterogeneous clusters, which let you launch a training job that uses multiple instance types. This capability can reduce your training cost by running each part of model training on the most suitable instance type. For example, we recently trained a ResNet-50 computer vision model on a heterogeneous cluster with ml.g5.xlarge and ml.c5n.2xlarge instances. The job cost 13% less than training the same model on a cluster with only ml.g5.xlarge instances, and it reached the same accuracy.

Certain machine learning workloads combine tasks that each benefit from a different instance type. For example, training computer vision models often involves combining the GPU-intensive task of neural network model training with the CPU-intensive task of data processing and augmentation. Running both tasks on a single instance type can lead to low GPU utilization and, as a result, wasted resources.

The heterogeneous clusters capability lets you run a SageMaker training job on multiple instance types, with GPU-intensive tasks on instance types like ml.p4d.24xlarge and CPU-intensive tasks on instance types like ml.c5n.18xlarge. This flexibility can increase GPU utilization and therefore improve overall cost-effectiveness. Heterogeneous clusters are available at no additional charge.
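
As a rough illustration of how such a job might be described, the sketch below builds the ResourceConfig portion of a CreateTrainingJob request with two instance groups. The group names, instance counts, and volume size are hypothetical values chosen for the example, and the helper function is not part of any AWS SDK. Treat it as a starting point under those assumptions, not a complete job definition.

```python
def build_resource_config(gpu_type, gpu_count, cpu_type, cpu_count, volume_gb=100):
    """Return a ResourceConfig dict with one GPU group and one CPU group.

    The group names ("gpu_group", "data_group") and the default volume
    size are assumptions made for this sketch.
    """
    return {
        "InstanceGroups": [
            {
                # Group intended for GPU-intensive neural network training.
                "InstanceGroupName": "gpu_group",
                "InstanceType": gpu_type,
                "InstanceCount": gpu_count,
            },
            {
                # Group intended for CPU-intensive data processing and augmentation.
                "InstanceGroupName": "data_group",
                "InstanceType": cpu_type,
                "InstanceCount": cpu_count,
            },
        ],
        "VolumeSizeInGB": volume_gb,
    }


# Hypothetical sizing: one GPU instance fed by two CPU instances.
resource_config = build_resource_config("ml.p4d.24xlarge", 1, "ml.c5n.18xlarge", 2)
total_instances = sum(g["InstanceCount"] for g in resource_config["InstanceGroups"])
print(total_instances)
```

The resulting dictionary would be passed as the ResourceConfig field of a training job request, alongside the other required fields (algorithm specification, role, output location, and so on), which are omitted here.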

To learn more, see the documentation for heterogeneous clusters. To get started, log in to the Amazon SageMaker console.