Amazon SageMaker AI Now Supports Capacity-Aware Inference with Automatic Instance Fallback

Posted on: May 1, 2026

Amazon SageMaker AI inference endpoints now support flexible provisioning across a prioritized list of instance types. When your preferred instance type lacks sufficient capacity, SageMaker AI automatically provisions from the next available option in your list, so endpoint creation and autoscaling continue without manual intervention. For teams running AI/ML models in production, this means endpoints come up reliably and scale on demand even when a specific instance type is capacity-constrained.

With instance pool support, you define a prioritized list of instance types, and SageMaker AI provisions capacity by working through that list in order. This applies to endpoint creation, updates, and scaling. When scaling down, SageMaker AI removes the lowest-priority instances first, preserving your preferred infrastructure as the fleet contracts. Instance pools work with Single Model Endpoints, InferenceComponent-based endpoints, and Asynchronous Inference endpoints, including endpoints that scale to zero: when such an endpoint scales back up, SageMaker AI provisions from the highest-priority pool that has available capacity.
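The ordering rules above can be modeled in a few lines. The following is a minimal, hypothetical Python simulation, not the SageMaker API: `InstancePoolFleet`, `scale_out`, `scale_in`, and the per-type `capacity` counts are illustrative names and assumptions, and the instance types are examples only. It shows scale-out walking the priority list in order, scale-in removing lowest-priority instances first, and a scaled-to-zero fleet restarting from the highest-priority pool with capacity.

```python
from dataclasses import dataclass, field


@dataclass
class InstancePoolFleet:
    """Hypothetical simulation of priority-ordered instance pools (not the SageMaker API)."""

    priority: list[str]        # instance types, highest priority first
    capacity: dict[str, int]   # assumed free capacity per instance type
    fleet: list[str] = field(default_factory=list)  # instance types currently running

    def scale_out(self, n: int) -> list[str]:
        """Add up to n instances, taking each from the first pool with free capacity."""
        added = []
        for _ in range(n):
            for itype in self.priority:
                if self.capacity.get(itype, 0) > 0:
                    self.capacity[itype] -= 1
                    self.fleet.append(itype)
                    added.append(itype)
                    break
            else:
                # Assumption: stop when every pool is exhausted; the real
                # service's behavior in this case is not modeled here.
                break
        return added

    def scale_in(self, n: int) -> list[str]:
        """Remove up to n instances, lowest-priority instance types first."""
        rank = {itype: i for i, itype in enumerate(self.priority)}
        # Order the fleet by priority so the lowest-priority instances sit at the end.
        self.fleet.sort(key=rank.__getitem__)
        removed = []
        for _ in range(min(n, len(self.fleet))):
            itype = self.fleet.pop()
            # Assumption: a released instance returns its capacity to the pool.
            self.capacity[itype] = self.capacity.get(itype, 0) + 1
            removed.append(itype)
        return removed


if __name__ == "__main__":
    pools = InstancePoolFleet(
        priority=["ml.p5.48xlarge", "ml.p4d.24xlarge", "ml.g5.48xlarge"],
        capacity={"ml.p5.48xlarge": 1, "ml.p4d.24xlarge": 2, "ml.g5.48xlarge": 5},
    )
    print(pools.scale_out(4))  # one p5, then both p4d, then one g5
    print(pools.scale_in(2))   # the g5 is removed first, then a p4d
```

Keeping the fleet sorted by priority makes "remove lowest priority first" a simple pop from the end; a real control plane would also weigh factors this sketch ignores, such as in-flight requests and health checks.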

Because fallback instance types differ in GPU memory and compute capability, you can specify a different optimized model for each instance type in your priority list. You can prepare these artifacts yourself or use SageMaker AI inference recommendations, which automatically generate hardware-specific optimized configurations for each instance type. Per-instance-type CloudWatch metrics also give you visibility into latency, throughput, GPU utilization, and instance count by hardware type within a single endpoint.
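To make the per-instance-type pairing concrete, here is a hypothetical Python sketch, not a SageMaker API or configuration format: a priority-ordered mapping from instance type to a model artifact optimized for that hardware, plus a helper that groups running instances by type, loosely mirroring the per-instance-type instance count that the CloudWatch metrics report. `MODEL_BY_INSTANCE_TYPE`, `fleet_summary`, and the S3 URIs are placeholders chosen for illustration.

```python
from collections import Counter

# Hypothetical deployment plan, highest priority first (dicts keep insertion order).
# Instance types are illustrative; the S3 URIs are placeholders, not real locations.
MODEL_BY_INSTANCE_TYPE = {
    "ml.p5.48xlarge": "s3://example-bucket/llm/p5-optimized/",
    "ml.p4d.24xlarge": "s3://example-bucket/llm/p4d-optimized/",
    "ml.g5.48xlarge": "s3://example-bucket/llm/g5-optimized/",
}


def fleet_summary(running, plan=MODEL_BY_INSTANCE_TYPE):
    """Group running instances by type, pairing each type with its optimized artifact."""
    counts = Counter(running)
    unknown = set(counts) - set(plan)
    if unknown:
        raise ValueError(f"instance types not in plan: {sorted(unknown)}")
    # Report in priority order so the preferred hardware appears first;
    # types with no running instances are omitted.
    return [(itype, plan[itype], counts[itype]) for itype in plan if counts[itype]]
```

For example, a fleet of one `ml.p5.48xlarge` and two `ml.g5.48xlarge` instances yields two entries, each naming the artifact that instance type would serve.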

This capability is available today in US East (N. Virginia), US East (Ohio), US West (Oregon), Canada (Central), South America (São Paulo), Europe (Ireland), Europe (London), Europe (Frankfurt), Europe (Stockholm), Europe (Zurich), Asia Pacific (Tokyo), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Mumbai), and Asia Pacific (Jakarta). To learn more, visit the Amazon SageMaker AI documentation.
