Overview

Cedana AI Compute Fabric provides system-level checkpointing and migration for GPU workloads running on Amazon EKS and Slurm.
Cedana makes execution state portable across nodes and instances, allowing AI training, fine-tuning, inference, and distributed workloads to pause, move, and resume without losing progress.
Unlike application-level checkpoints, Cedana operates transparently at the system layer, requiring no code changes while preserving full process state, GPU memory, and distributed context.
By decoupling AI workloads from fixed infrastructure, Cedana increases GPU utilization and delivers up to 2x higher AI job throughput per GPU. Workloads automatically recover from node failures, spot interruptions, and maintenance events without restarting from scratch.
Teams can dynamically reprioritize jobs, rebalance clusters, consolidate underutilized GPUs, and safely run long jobs on Spot instances.
The result:
- Improved reliability
- Reduced wasted compute
- Lower cloud costs
- Shorter queue times
- Higher productivity per $/GPU
Cedana integrates in minutes with Amazon EKS and Slurm environments, and supports single-node and distributed multi-GPU/CPU workloads.
Ideal for AI startups, research labs, enterprises, and platform teams operating multi-tenant GPU clusters, Cedana enables infrastructure automation, spot resilience, SLA enforcement, and efficient AI factory operations across AWS.
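For contrast with the transparent, system-level approach described above, application-level checkpointing requires the training code to save and reload its own state. A minimal generic sketch (the file name, loop, and values are illustrative only; this is not Cedana code):

```python
# Application-level checkpointing: the training code itself must persist
# and restore its state. Everything here is a generic illustration of the
# approach that system-level checkpointing replaces.
import os
import pickle

CKPT = "checkpoint.pkl"  # illustrative checkpoint path

def train(steps: int) -> dict:
    # Resume from a prior checkpoint if one exists.
    state = {"step": 0, "loss": None}
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            state = pickle.load(f)
    while state["step"] < steps:
        state["step"] += 1
        state["loss"] = 1.0 / state["step"]  # stand-in for a real update
        with open(CKPT, "wb") as f:          # explicit save every step
            pickle.dump(state, f)
    return state

print(train(5)["step"])  # 5
```

An application-level checkpoint captures only what the author chose to serialize; GPU memory, optimizer state, open connections, and distributed context must each be handled by hand, which is the gap a system-level approach closes.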
Highlights
- Cloud-Native GPU Checkpointing for Amazon EKS: Automatically checkpoint and migrate AI workloads across Amazon EKS without code changes. Preserve full execution state, including GPU memory and distributed processes, enabling seamless recovery from node failures, spot interruptions, and autoscaling events.
- Increase Throughput 2x and Reduce GPU Wait Times: Boost AI training and inference throughput by eliminating lost work from failures and preemptions. Cedana improves GPU utilization, enables dynamic job prioritization on Amazon EKS and Slurm, and reduces queue times across multi-tenant GPU clusters.
- Automate Spot Instances for Long-Running AI Jobs: Run training and stateful inference workloads reliably on Amazon EC2 Spot Instances without losing progress. Cedana automatically checkpoints and resumes GPU workloads across interruptions, enabling resilient Spot usage, lower cloud costs, and significantly higher throughput per $/GPU on Amazon EKS.
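The throughput and cost-per-GPU claims above rest on avoiding lost work at interruptions. A back-of-the-envelope sketch of the effect (the interruption interval and loss figures are illustrative assumptions, not vendor benchmarks):

```python
# Toy model: fraction of GPU-hours that produce retained progress on
# interruptible capacity. Without checkpointing, an interruption discards
# all work since the last restart; with checkpointing, only work since the
# last checkpoint. All numbers are illustrative assumptions.

def useful_fraction(interrupt_every_h: float, lost_per_interrupt_h: float) -> float:
    """Fraction of GPU-hours whose progress survives interruptions."""
    return max(0.0, 1.0 - lost_per_interrupt_h / interrupt_every_h)

# Assume a Spot interruption every 6 hours on average:
no_ckpt = useful_fraction(6.0, 3.0)    # restart loses ~half the interval's work
with_ckpt = useful_fraction(6.0, 0.1)  # resume loses only minutes of work

print(no_ckpt, with_ckpt)  # roughly 2x more useful GPU-hours retained
```

Under these assumed numbers, checkpointing roughly doubles the useful work per GPU-hour, which is the mechanism behind the "up to 2x" figure.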
Details
Pricing
| Dimension | Description | Cost/unit |
|---|---|---|
| $/GB | Instance Memory under Management | $2.00 |
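As a worked example of the pricing dimension above (the 80 GB workload size is hypothetical, used only to show the arithmetic):

```python
# Illustrative cost estimate for the "$/GB Instance Memory under Management"
# dimension. The rate comes from the pricing table; the 80 GB figure is a
# hypothetical workload, not part of the listing.
RATE_PER_GB = 2.00  # USD per GB of instance memory under management

def cost(memory_gb: float) -> float:
    """Usage cost for a given amount of instance memory under management."""
    return memory_gb * RATE_PER_GB

print(cost(80.0))  # 160.0
```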
Vendor refund policy
Contact our support team for refund information.
Legal
Vendor terms and conditions
Content disclaimer
Delivery details
Software as a Service (SaaS)
SaaS delivers cloud-based software applications to customers over the internet on a subscription basis. You pay recurring usage fees through your AWS bill, while the vendor handles deployment and infrastructure management, ensuring scalability, reliability, and seamless integration with other AWS services.
Resources
Vendor resources
Support
Vendor support
For support, email support@cedana.ai.
AWS infrastructure support
AWS Support is a one-on-one, fast-response support channel that is staffed 24x7x365 with experienced and technical support engineers. The service helps customers of all sizes and technical abilities to successfully utilize the products and features provided by Amazon Web Services.