Containers

Tag: PyTorch

Fault tolerant distributed machine learning training with the TorchElastic Controller for Kubernetes

Introduction

Kubernetes enables machine learning teams to run training jobs distributed across fleets of powerful GPU instances such as Amazon EC2 P3, reducing training time from days to hours. However, distributed training comes with limitations compared to the more traditional microservice-based applications typically associated with Kubernetes. Distributed training jobs are not fault tolerant, and a […]