AWS Batch now supports gang-scheduling on Amazon EKS using multi-node parallel jobs
Today, AWS announces the general availability of Multi-Node Parallel (MNP) jobs in AWS Batch on Amazon Elastic Kubernetes Service (Amazon EKS). With AWS Batch MNP jobs, you can run tightly coupled High Performance Computing (HPC) applications, such as training multi-layer AI/ML models. AWS Batch helps you launch, configure, and manage nodes in your Amazon EKS cluster without manual intervention.
You can configure MNP jobs using the RegisterJobDefinition API or the job definitions section of the AWS Batch Management Console. With MNP jobs, you can run AWS Batch on Amazon EKS workloads that span multiple Amazon Elastic Compute Cloud (Amazon EC2) instances. AWS Batch MNP jobs support any IP-based inter-instance communications framework, such as NVIDIA Collective Communications Library (NCCL), Gloo, Message Passing Interface (MPI), or Unified Collective Communication (UCC), as well as machine learning and parallel computing libraries such as PyTorch and Dask. For more information, see the Multi-Node Parallel jobs page in the AWS Batch User Guide.
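As a rough illustration of the API route, the sketch below builds a request payload for a multi-node job definition. The helper name build_mnp_job_definition, the job name, the image URI, and the resource values are hypothetical. The field names (type, nodeProperties, nodeRangeProperties, eksProperties) reflect one reading of the RegisterJobDefinition request shape and should be checked against the AWS Batch API Reference before use.

```python
def build_mnp_job_definition(name, image, num_nodes, command):
    """Build a RegisterJobDefinition-style payload for an MNP job on EKS.

    Hypothetical helper: it only assembles a dictionary and makes no AWS calls.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")

    # Pod specification shared by every node in the range (assumed values).
    pod_properties = {
        "containers": [
            {
                "name": "worker",
                "image": image,
                "command": list(command),
                "resources": {"requests": {"cpu": "4", "memory": "8Gi"}},
            }
        ]
    }

    return {
        "jobDefinitionName": name,
        "type": "multinode",
        "nodeProperties": {
            "numNodes": num_nodes,
            "mainNode": 0,  # index of the node treated as the main node
            "nodeRangeProperties": [
                {
                    # "first:last" node index range; here, all nodes.
                    "targetNodes": f"0:{num_nodes - 1}",
                    "eksProperties": {"podProperties": pod_properties},
                }
            ],
        },
    }


payload = build_mnp_job_definition(
    name="demo-mnp-training",                      # hypothetical name
    image="public.ecr.aws/example/trainer:latest",  # hypothetical image
    num_nodes=4,
    command=["python", "train.py"],
)
```

With the AWS SDK for Python installed and credentials configured, a payload of this kind would typically be submitted with boto3.client("batch").register_job_definition(**payload).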
AWS Batch supports developers, scientists, and engineers in running efficient batch processing for ML model training, simulations, and analysis at any scale. Multi-Node Parallel jobs are available in any AWS Region where AWS Batch is available.