Announcing Managed Tiered Checkpointing for Amazon SageMaker HyperPod

Posted on: Sep 8, 2025

Today, Amazon Web Services (AWS) announces the general availability of managed tiered checkpointing for Amazon SageMaker HyperPod, a new capability designed to reduce model recovery time and minimize loss of training progress. As AI training scales, the likelihood of infrastructure failures increases, making efficient checkpointing critical. Traditional checkpointing methods can be slow and resource-intensive, especially for large models. SageMaker HyperPod's managed tiered checkpointing addresses this by using CPU memory to store frequent checkpoints for rapid recovery, while periodically persisting data to Amazon S3 for long-term durability. This hybrid approach minimizes training loss and significantly reduces the time to resume training after a failure.

With managed tiered checkpointing, organizations can train reliably and with high throughput on large-scale clusters. The solution allows customers to configure checkpoint frequency and retention policies across both in-memory and persistent storage tiers. By storing checkpoints frequently in memory, customers can recover quickly while minimizing storage costs. Because the capability is integrated with PyTorch's Distributed Checkpoint (DCP), customers can implement checkpointing with only a few lines of code while gaining the performance benefits of in-memory storage.
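The hybrid workflow described above (frequent checkpoints to CPU memory, periodic persistence to Amazon S3) can be sketched as a simple tiering policy. This is an illustrative Python sketch only, not the library's actual API: the class name `TieredCheckpointPolicy` and the `s3_interval` parameter are assumptions for illustration. In practice, the integration goes through PyTorch Distributed Checkpoint, which accepts pluggable storage writers and readers.

```python
# Illustrative sketch: models the tiering decision the announcement
# describes (every checkpoint lands in fast CPU memory; every Nth
# checkpoint is also persisted to S3 for durability).
# Names here are hypothetical, not the sagemaker-checkpointing API.

class TieredCheckpointPolicy:
    def __init__(self, s3_interval: int):
        # Persist to S3 every `s3_interval` checkpoints.
        self.s3_interval = s3_interval

    def tiers_for_step(self, step: int) -> list:
        tiers = ["cpu_memory"]            # fast tier: every checkpoint
        if step % self.s3_interval == 0:
            tiers.append("s3")            # durable tier: periodic
        return tiers

policy = TieredCheckpointPolicy(s3_interval=10)
print(policy.tiers_for_step(7))   # ['cpu_memory']
print(policy.tiers_for_step(20))  # ['cpu_memory', 's3']
```

The design point the policy captures is that recovery normally reads the most recent in-memory checkpoint, falling back to the last S3 copy only when the fast tier is unavailable.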

This feature is currently available for SageMaker HyperPod clusters using the Amazon EKS orchestrator. Customers can enable managed tiered checkpointing by specifying a parameter in the CreateCluster or UpdateCluster API when creating or updating a HyperPod cluster. Customers can then use the sagemaker-checkpointing Python library to implement managed tiered checkpointing with minimal changes to their training scripts.
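A hedged sketch of what enabling the feature on an existing cluster might look like with boto3. The SageMaker client's `update_cluster` operation is real, but the `TieredStorageConfig` parameter name and shape below are assumptions for illustration; consult the UpdateCluster API reference for the actual field.

```python
# Builds an UpdateCluster request that enables managed tiered
# checkpointing. The "TieredStorageConfig" key and its contents are
# hypothetical placeholders -- check the UpdateCluster API reference
# for the real parameter.

def build_update_cluster_request(cluster_name: str) -> dict:
    return {
        "ClusterName": cluster_name,
        # Hypothetical parameter enabling the in-memory + S3 tiers.
        "TieredStorageConfig": {"Mode": "Enable"},
    }

request = build_update_cluster_request("my-hyperpod-cluster")
# Applying it requires AWS credentials and an existing HyperPod cluster:
# import boto3
# boto3.client("sagemaker").update_cluster(**request)
print(request["ClusterName"])  # my-hyperpod-cluster
```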

Managed tiered checkpointing is available in all AWS Regions where SageMaker HyperPod is currently available. To learn more, refer to the blog post and documentation.