Optimizing and Scaling Machine Learning Training with Managed Spot Training for Amazon SageMaker

Amazon SageMaker is a fully managed machine learning service. With Amazon SageMaker, data scientists and developers can quickly build and train machine learning models, and then deploy them into a production-ready hosted environment.

Amazon EC2 Spot Instances offer spare compute capacity available in the AWS Cloud at steep discounts compared to On-Demand prices. EC2 can interrupt Spot Instances with two minutes of notification when EC2 needs the capacity back. You can use Spot Instances for various fault-tolerant and flexible applications. Some examples are machine learning, analytics, containerized workloads, high-performance computing (HPC), stateless web servers, rendering, CI/CD, and other test and development workloads.

With Amazon SageMaker, you can run your training jobs on EC2 Spot Instances by using Managed Spot Training. Managed Spot Training uses Amazon EC2 Spot Instances to run training jobs instead of On-Demand Instances. You can specify which training jobs use Spot Instances and a stopping condition that specifies how long Amazon SageMaker waits for a job to run using Spot Instances. Metrics and logs generated during training runs are available in Amazon CloudWatch. Amazon SageMaker restarts your training job if a Spot Instance is interrupted. You can also configure Managed Spot Training jobs to use checkpoints: Amazon SageMaker copies checkpoint data from a local path to Amazon S3, and when the job is restarted, it copies the data from Amazon S3 back into the local path.
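The checkpoint behavior described above only helps if the training script itself looks for an existing checkpoint when it starts. The following is a minimal sketch of that pattern, not the tutorial notebook's code: the toy loop, the state.json file name, and the interrupt_after parameter (which simulates a Spot interruption) are all illustrative assumptions. /opt/ml/checkpoints is the default local checkpoint path in SageMaker training containers.

```python
import json
import os
import tempfile

# Default local checkpoint directory inside a SageMaker training container.
DEFAULT_CHECKPOINT_DIR = "/opt/ml/checkpoints"


def train(total_epochs, checkpoint_dir=DEFAULT_CHECKPOINT_DIR, interrupt_after=None):
    """Run a toy training loop that resumes from the last checkpoint, if any.

    interrupt_after is a hypothetical knob that simulates a Spot interruption
    once that many epochs have run in this invocation.
    Returns the number of epochs completed so far.
    """
    os.makedirs(checkpoint_dir, exist_ok=True)
    path = os.path.join(checkpoint_dir, "state.json")  # illustrative file name

    # Resume from a checkpoint left by a previous attempt, if one exists.
    state = {"epoch": 0}
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)

    epochs_run = 0
    while state["epoch"] < total_epochs:
        if interrupt_after is not None and epochs_run >= interrupt_after:
            return state["epoch"]  # simulated interruption
        state["epoch"] += 1  # stand-in for one epoch of real training
        epochs_run += 1
        # Write to a temporary file first, then rename it over the old checkpoint.
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, path)
    return state["epoch"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(train(5, d, interrupt_after=2))  # first attempt stops after 2 epochs
        print(train(5, d))                     # restart resumes and finishes at 5
```

In a real job the loop body would save model weights rather than a counter, but the shape is the same: load if present, train, save periodically.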

About this Tutorial
Time 10-20 minutes      
Cost Less than $10
Use Case Compute, Machine Learning
Products Amazon SageMaker, EC2 Spot Instances
Level 300
Last Updated April 10, 2020

Step 1: Access Amazon SageMaker

1.1 — Open a browser and navigate to the Amazon SageMaker console. Alternatively, you can search for SageMaker, or locate Amazon SageMaker under the Machine Learning section of the console landing page. If you already have an AWS account, log in to the console. Otherwise, create a new AWS account to get started.

1.2 — On the top right corner, select the region where you want to conduct SageMaker Training.

1.3 — Click on Notebook instances within the Overview section or the left panel under Notebook.

Step 2: Launch a notebook instance

2.1 — Click on Create notebook instance in the Notebook instances window.

2.2 — Enter a name such as “ManagedSpotTraining” into the Notebook instance name field in the Notebook instance settings section.

Scroll down to the Permissions and encryption section and select Create a new role from the IAM role dropdown.

Leave all remaining options at their default settings.

2.3 — Select Any S3 bucket from the Create an IAM role modal window and then click Create role.

2.4 — Click Create notebook instance.

Step 3: Open sample notebook

3.1 — Click on the Open JupyterLab link next to the Notebook Instance you created in Step 2.

3.2 — Click on the Amazon SageMaker sample notebooks icon on the left panel to access a list of sample notebooks.

3.3 — Locate and click on the managed_spot_training_object_detection.ipynb sample notebook under the Introduction to Amazon Algorithms section.

This action will open a read-only copy of this notebook.

3.4 — Click on Create a Copy located in the top right corner of the newly opened sample notebook.

Note: You must create a copy of the read-only notebook in order to execute the notebook in Step 4.

3.5 — Click Create Copy.

Step 4: Execute example notebook

4.1 — Click on the “Run the selected cells and advance” button to step through each cell of the Example Notebook. The blue bar to the left of the cell indicates the currently selected cell.

This notebook is an end-to-end example introducing the Amazon SageMaker Object Detection algorithm. In this sample, we demonstrate how to train an object detection model on the Pascal VOC dataset using the Single Shot MultiBox Detector (SSD) algorithm. The notebook is configured to use Managed Spot Training for Amazon SageMaker to train this model on Spot Instances.

Object detection is the process of identifying and localizing objects in an image. A typical object detection solution takes an image as input and returns a bounding box around each object of interest, along with a label identifying the object that the box encloses.
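To make the idea of a labeled bounding box concrete, here is a small sketch of how one detection might be represented and mapped onto an image. It assumes each detection carries a class index, a confidence score, and corner coordinates normalized to the range 0 to 1; the Detection type and to_pixel_box helper are illustrative names, not part of the notebook.

```python
from typing import NamedTuple, Tuple


class Detection(NamedTuple):
    """One detected object; coordinates are assumed normalized to [0, 1]."""
    class_id: int
    score: float
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def to_pixel_box(det: Detection, width: int, height: int) -> Tuple[int, int, int, int]:
    """Scale a normalized box to pixel coordinates for an image of the given size."""
    return (
        round(det.xmin * width),
        round(det.ymin * height),
        round(det.xmax * width),
        round(det.ymax * height),
    )


if __name__ == "__main__":
    det = Detection(class_id=0, score=0.9, xmin=0.25, ymin=0.5, xmax=0.75, ymax=1.0)
    print(to_pixel_box(det, width=400, height=200))  # (100, 100, 300, 200)
```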

This sample notebook is a modified version of the notebook located here.

Continue running the notebook using the “Run the selected cells and advance” button until you reach the section described in step 4.3.

4.2 — As you execute cells throughout the notebook, output appears below each cell. While a cell is running, its indicator shows [*]. When the cell finishes, the indicator changes to [N], where N is the order in which the cell was executed.

Continue executing cells within this notebook until you reach the Object Detection using Managed Spot Training section.


4.3 — When you reach the Object Detection using Managed Spot Training section, pause to review the following configuration options.

train_use_spot_instances = True
train_max_run = 3600
train_max_wait = 3600 if train_use_spot_instances else None

These variables are used when configuring the Amazon SageMaker training job as described in step 4.4 below. They configure the training job to use Spot Instances when provisioning instances, and they set the timeout behavior: how long Amazon SageMaker should wait for Spot Instances to become available, and the maximum total duration of the training job.
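For readers who work with the low-level CreateTrainingJob API rather than the Python SDK, the same settings correspond to the EnableManagedSpotTraining flag and the StoppingCondition structure. The sketch below builds only that fragment of a request; the helper name spot_training_settings is an assumption for illustration. It also checks one constraint of the API, namely that MaxWaitTimeInSeconds must be at least MaxRuntimeInSeconds.

```python
def spot_training_settings(use_spot, max_run_seconds, max_wait_seconds=None):
    """Build the Spot-related fragment of a CreateTrainingJob request.

    Hypothetical helper: it returns a plain dict and makes no AWS calls.
    """
    settings = {
        "EnableManagedSpotTraining": use_spot,
        "StoppingCondition": {"MaxRuntimeInSeconds": max_run_seconds},
    }
    if use_spot:
        if max_wait_seconds is None:
            raise ValueError("max_wait_seconds is required when use_spot is True")
        if max_wait_seconds < max_run_seconds:
            raise ValueError("max_wait_seconds must be >= max_run_seconds")
        settings["StoppingCondition"]["MaxWaitTimeInSeconds"] = max_wait_seconds
    return settings


if __name__ == "__main__":
    # Same values as the notebook variables: Spot enabled, 3600 s run, 3600 s wait.
    print(spot_training_settings(True, 3600, 3600))
```

The returned dict would be merged with the rest of the request (algorithm, role, input and output configuration) before calling the API.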

4.4 — The sample notebook uses the SageMaker Python SDK, which provides high-level interfaces that simplify the training and deployment of models in Amazon SageMaker. One of these interfaces is the sagemaker.estimator.Estimator interface. With this interface, configuring a training job for Managed Spot Training is as simple as passing a few additional parameters when the Estimator is instantiated.




The options passed to the Estimator use the variables defined in step 4.3 to set train_use_spot_instances to True, train_max_run to 3600 seconds, and train_max_wait to 3600 seconds. More details on these parameters can be found below.

train_use_spot_instances (bool) – Specifies whether to use SageMaker Managed Spot instances for training. If enabled, then the train_max_wait arg should also be set. (default: False).

train_max_run (int) – Timeout in seconds for training (default: 24 * 60 * 60). After this amount of time Amazon SageMaker terminates the job regardless of its current status.

train_max_wait (int) – Timeout in seconds waiting for spot training instances (default: None). After this amount of time Amazon SageMaker will stop waiting for Spot instances to become available.

You can learn more about the SageMaker Python SDK and the Estimator interface in the documentation provided here.
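The parameter names above are those of version 1 of the SageMaker Python SDK, which this tutorial uses. Version 2 of the SDK dropped the train_ prefix (use_spot_instances, max_run, max_wait). The sketch below shows one way to produce the right keyword arguments for either version; the helper spot_estimator_kwargs is a hypothetical name, and the snippet deliberately does not import sagemaker or create an Estimator.

```python
def spot_estimator_kwargs(use_spot, max_run, max_wait, sdk_major_version=2):
    """Return Spot-related keyword arguments for sagemaker.estimator.Estimator.

    Hypothetical helper: SDK v1 uses the train_ prefix, SDK v2 does not.
    """
    prefix = "train_" if sdk_major_version < 2 else ""
    return {
        prefix + "use_spot_instances": use_spot,
        prefix + "max_run": max_run,
        prefix + "max_wait": max_wait if use_spot else None,
    }


if __name__ == "__main__":
    # Values from step 4.3 of this tutorial.
    train_use_spot_instances = True
    train_max_run = 3600
    train_max_wait = 3600 if train_use_spot_instances else None

    kwargs = spot_estimator_kwargs(
        train_use_spot_instances, train_max_run, train_max_wait
    )
    print(kwargs)
    # {'use_spot_instances': True, 'max_run': 3600, 'max_wait': 3600}
```

The resulting dict can be unpacked into the Estimator constructor alongside the notebook's other arguments (image, role, instance type, and so on).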

4.5 — Continue running the notebook using the “Run the selected cells and advance” button until you reach the end of the notebook.

Take note of the following command:

od_model.fit(inputs=data_channels, logs=True)

When executed, this command starts the model training job using the provided configuration options. While the training job runs, progress is printed in the cell output.

While the model training job is being executed you can learn more about Amazon SageMaker Managed Spot Training here.

The training job will take less than 10 minutes to complete.

Step 5: Review savings

5.1 — When the training job is completed, the SageMaker Python SDK will output the savings achieved by using Managed Spot Training, as well as the total training seconds and billable training seconds.

You can calculate the savings from using Managed Spot Training with the formula (1 - BillableTimeInSeconds / TrainingTimeInSeconds) * 100. For example, if BillableTimeInSeconds is 100 and TrainingTimeInSeconds is 500, the savings is 80%.
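The savings formula is simple enough to check by hand, but a small function makes it easy to reuse. This is a sketch with an illustrative function name; it uses the algebraically equivalent form 100 * (training - billable) / training.

```python
def spot_savings_percent(billable_seconds, training_seconds):
    """Percentage saved with Managed Spot Training.

    Equivalent to (1 - billable_seconds / training_seconds) * 100.
    """
    if training_seconds <= 0:
        raise ValueError("training_seconds must be positive")
    return 100.0 * (training_seconds - billable_seconds) / training_seconds


if __name__ == "__main__":
    # The example from the text: 100 billable seconds out of 500 training seconds.
    print(spot_savings_percent(100, 500))  # 80.0
```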

You can learn more about how this savings is calculated here.

5.2 — You can also view this savings through the console.

Return to the Amazon SageMaker console and click on Training jobs under Training in the left-hand navigation panel.

Click on the recently completed training job to view details.

5.3 — Within the job details you can view the Managed Spot Training savings as well as metrics around the total training seconds and billable training seconds.
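The same two metrics are also available programmatically: the DescribeTrainingJob API response includes TrainingTimeInSeconds and BillableTimeInSeconds. The sketch below separates the AWS call from the arithmetic so the summary can be tried without credentials. The function names are illustrative, and the example dict is not real job output; it reuses the numbers from the worked example in step 5.1.

```python
def summarize_training_job(description):
    """Extract timing fields from a DescribeTrainingJob response and compute savings."""
    training = description["TrainingTimeInSeconds"]
    billable = description["BillableTimeInSeconds"]
    return {
        "training_seconds": training,
        "billable_seconds": billable,
        "savings_percent": 100.0 * (training - billable) / training,
    }


def fetch_training_job(job_name):
    """Call DescribeTrainingJob. Needs boto3, AWS credentials, and a real job name."""
    import boto3  # imported here so the summary above works without boto3 installed

    client = boto3.client("sagemaker")
    return client.describe_training_job(TrainingJobName=job_name)


if __name__ == "__main__":
    # Illustrative fragment only, using the numbers from step 5.1.
    example = {"TrainingTimeInSeconds": 500, "BillableTimeInSeconds": 100}
    print(summarize_training_job(example))
```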

You can learn more about how this savings is calculated here.


You’ve now trained an object detection model using an Amazon SageMaker training job on EC2 Spot Instances. You are ready to integrate Spot Instances into your Amazon SageMaker training jobs and start optimizing your machine learning workloads for cost and performance.


Using Amazon SageMaker Managed Spot Training

Now that you have learned how to use EC2 Spot Instances with Amazon SageMaker, you’re ready to apply Managed Spot Training and the other best practices you learned to your own workloads. If you would like to continue your learning, we recommend following the self-paced workshop located here.

Read the documentation

Learn about the functionality and capabilities of Amazon SageMaker Managed Spot Training by reading the Managed Spot Training in Amazon SageMaker guide.

Explore Amazon EC2 Spot Instances

If you want to learn more about Amazon EC2 Spot Instances, visit the Amazon EC2 Spot Instances product page to explore documentation, videos, blogs, and more.