Posted On: Nov 8, 2021

Amazon SageMaker Pipelines, a purpose-built service that lets customers define and orchestrate their model building steps, now supports resuming a failed or stopped pipeline execution and configuring retry policies for pipeline steps.

SageMaker Pipelines provides a variety of step types, such as processing, training, register model, and callback. Using these steps, customers can productionize their ML model building workflows as SageMaker Pipelines. With these newly launched features, customers gain more operational control and flexibility when executing their pipelines.

Previously, customers had to start a new execution if a pipeline execution failed or was stopped. Now, they can resume the execution from the steps that failed or were stopped. This makes pipelines easier to debug and saves time and resources, because previously successful steps are not re-executed.
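As a minimal sketch of resuming an execution programmatically, the helper below wraps the `retry_pipeline_execution` operation of the boto3 SageMaker client. The helper name `resume_pipeline_execution` and the ARN in the usage comment are hypothetical placeholders, and the client is passed in so the function can be exercised without AWS credentials.

```python
import uuid
from typing import Any, Optional


def resume_pipeline_execution(
    sm_client: Any, execution_arn: str, token: Optional[str] = None
) -> str:
    """Resume a failed or stopped pipeline execution and return its ARN.

    `sm_client` is assumed to be a boto3 SageMaker client, for example
    `boto3.client("sagemaker")`. This helper is illustrative only.
    """
    response = sm_client.retry_pipeline_execution(
        PipelineExecutionArn=execution_arn,
        # Idempotency token; a random UUID string is used when none is given.
        ClientRequestToken=token or str(uuid.uuid4()),
    )
    return response["PipelineExecutionArn"]


# Usage (requires AWS credentials; the ARN below is a placeholder):
#   import boto3
#   arn = "arn:aws:sagemaker:us-east-1:111122223333:pipeline/demo/execution/abc"
#   resume_pipeline_execution(boto3.client("sagemaker"), arn)
```

Passing the client as an argument is a design choice for this sketch: it keeps region and credential handling outside the helper and allows a stub client in tests.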

Customers can also now configure retry policies for pipeline steps using the following parameters: the maximum number of retry attempts, the time interval between retry attempts, the rate at which the retry interval grows (the backoff rate), and the maximum time span over which retries are allowed. These parameters can be configured at the pipeline or step level and can optionally be customized for specific error types. Using this feature, customers can operationalize their model building pipelines and incorporate fail-safe policies for transient or intermittent errors.
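To illustrate how the four retry parameters interact, the following sketch computes the wait times a policy would produce. It is a local model, not the service's implementation: the `RetryPolicy` class, its field names, and the exact cutoff semantics are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RetryPolicy:
    """Hypothetical container for the retry parameters described above."""
    interval_seconds: float = 1.0             # wait before the first retry
    backoff_rate: float = 2.0                 # multiplier applied to each later wait
    max_attempts: Optional[int] = None        # cap on the number of retries
    expire_after_mins: Optional[float] = None # cap on total time spent waiting


def retry_delays(policy: RetryPolicy) -> List[float]:
    """Return the wait (in seconds) before each retry this policy allows."""
    if policy.max_attempts is None and policy.expire_after_mins is None:
        raise ValueError("set max_attempts or expire_after_mins to bound retries")
    delays: List[float] = []
    delay = policy.interval_seconds
    elapsed = 0.0
    while True:
        if policy.max_attempts is not None and len(delays) >= policy.max_attempts:
            break
        # Assumed semantics: stop once the next wait would exceed the time budget.
        if (policy.expire_after_mins is not None
                and elapsed + delay > policy.expire_after_mins * 60):
            break
        delays.append(delay)
        elapsed += delay
        delay *= policy.backoff_rate
    return delays
```

For example, a policy with a 5-second interval, a backoff rate of 2, and at most 4 attempts yields waits of 5, 10, 20, and 40 seconds. A policy with a 30-second interval, a backoff rate of 2, and a 2-minute time span yields waits of 30 and 60 seconds under this model.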

These features are available in all AWS regions where Amazon SageMaker is available. To get started, create a new SageMaker Pipeline from the Amazon SageMaker SDK or Studio and visit our documentation pages on resume and retry policies.