Posted On: Jun 29, 2023

AWS Batch now supports specifying a minimum number of vCPUs (Min vCPUs) for Multi-Node Parallel (MNP) Jobs. MNP jobs allow users to run large-scale tightly coupled workloads, such as ML training, across multiple Amazon Elastic Compute Cloud (Amazon EC2) instances. With this launch, customers can now retain a specified number of vCPUs on a compute environment (CE) even when there are no jobs running. This feature enables customers to maintain a warm pool of healthy instances for MNP jobs, helping prevent situations where capacity is returned to EC2 due to rapid scale down.

To configure Min vCPUs for MNP, customers can specify the desired number of min vCPUs either in the AWS Batch console or through the CreateComputeEnvironment or UpdateComputeEnvironment API. AWS Batch is designed to scale MNP capacity to the level defined by the customer and to retain that capacity even after all jobs in the compute environment have completed. Refer to the AWS Batch User Guide for more details regarding this feature.
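As an illustration of the API path, the following is a minimal sketch in Python using the AWS SDK (boto3). It assumes boto3 is installed and AWS credentials are configured. The helper names and the compute environment name "mnp-ce" are hypothetical and chosen only for this example; the request fields computeEnvironment and computeResources.minvCpus are the ones UpdateComputeEnvironment accepts.

```python
from typing import Any, Dict


def build_min_vcpus_update(compute_environment: str, min_vcpus: int) -> Dict[str, Any]:
    """Build keyword arguments for the Batch UpdateComputeEnvironment API.

    Hypothetical helper: it only assembles the request, so it can be
    inspected or tested without calling AWS.
    """
    if min_vcpus < 0:
        raise ValueError("min_vcpus must be non-negative")
    return {
        "computeEnvironment": compute_environment,   # name or ARN of the CE
        "computeResources": {"minvCpus": min_vcpus},  # capacity to retain when idle
    }


def apply_min_vcpus(compute_environment: str, min_vcpus: int) -> Dict[str, Any]:
    """Send the update to AWS Batch (requires boto3 and valid credentials)."""
    import boto3  # imported lazily so the builder above has no AWS dependency

    client = boto3.client("batch")
    return client.update_compute_environment(
        **build_min_vcpus_update(compute_environment, min_vcpus)
    )


if __name__ == "__main__":
    # "mnp-ce" is a placeholder compute environment name.
    print(build_min_vcpus_update("mnp-ce", 96))
```

Calling apply_min_vcpus("mnp-ce", 96) would ask AWS Batch to keep 96 vCPUs in that compute environment; the value to choose depends on the instance types and node counts the MNP jobs need.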

MNP jobs allow users to run large-scale, high-performance computing workloads, such as those for Large Language Models, across multiple Amazon EC2 instances. By extending Min vCPUs to MNP jobs, customers can easily identify and retain healthy instances for future jobs, avoiding additional boot time and hardware checks before every job run. This feature is now available in all AWS Regions where AWS Batch is currently available. To learn more about AWS Batch, see the AWS Batch User Guide. To learn more about the AWS Batch API, see the AWS Batch API Reference.