Containers
Tag: PyTorch
Fault tolerant distributed machine learning training with the TorchElastic Controller for Kubernetes
Introduction Kubernetes enables machine learning teams to run training jobs distributed across fleets of powerful GPU instances like Amazon EC2 P3, reducing training time from days to hours. However, distributed training comes with limitations compared to the more traditional microservice based applications typically associated with Kubernetes. Distributed training jobs are not fault tolerant, and a […]