Optimizing and Scaling Machine Learning Training
with Managed Spot Training for Amazon SageMaker
Amazon SageMaker is a fully managed machine learning service. With Amazon SageMaker, data scientists and developers can quickly build and train machine learning models, and then deploy them into a production-ready hosted environment.
Amazon EC2 Spot Instances offer spare compute capacity available in the AWS Cloud at steep discounts compared to On-Demand prices. EC2 can interrupt Spot Instances with two minutes of notification when EC2 needs the capacity back. You can use Spot Instances for various fault-tolerant and flexible applications. Some examples are machine learning, analytics, containerized workloads, high-performance computing (HPC), stateless web servers, rendering, CI/CD, and other test and development workloads.
With Amazon SageMaker, you can use EC2 Spot Instances for your training jobs through Managed Spot Training. Managed Spot Training runs training jobs on Amazon EC2 Spot Instances instead of On-Demand Instances. You can specify which training jobs use Spot Instances and set a stopping condition that specifies how long Amazon SageMaker waits for a job to run using Spot Instances. Metrics and logs generated during training runs are available in CloudWatch. Amazon SageMaker restarts your training job if a Spot Instance is interrupted. You can also configure Managed Spot Training jobs to use checkpoints: Amazon SageMaker copies checkpoint data from a local path to Amazon S3, and when the job is restarted, it copies the data from Amazon S3 back into the local path.
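The checkpoint behavior described above only helps if the training code writes its state to the local path and reads it back on restart. The following is a minimal sketch of that pattern for a custom training script, not the mechanism used by the built-in algorithm in this tutorial. It assumes /opt/ml/checkpoints is the local path synchronized with Amazon S3; the function names and the JSON file layout are hypothetical and chosen only for illustration.

```python
import json
import os
import tempfile

# Assumption: this is the local path that SageMaker synchronizes with Amazon S3
# when checkpointing is configured. Pass another directory when testing locally.
DEFAULT_CHECKPOINT_DIR = "/opt/ml/checkpoints"


def save_checkpoint(state, epoch, checkpoint_dir=DEFAULT_CHECKPOINT_DIR):
    """Write the training state for one epoch into checkpoint_dir."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    final_path = os.path.join(checkpoint_dir, f"epoch-{epoch:05d}.json")
    # Write to a temporary file first, then rename, so an interruption
    # never leaves a half-written checkpoint under the final name.
    fd, tmp_path = tempfile.mkstemp(dir=checkpoint_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump({"epoch": epoch, "state": state}, handle)
    os.replace(tmp_path, final_path)
    return final_path


def load_latest_checkpoint(checkpoint_dir=DEFAULT_CHECKPOINT_DIR):
    """Return the most recent checkpoint as a dict, or None if there is none."""
    if not os.path.isdir(checkpoint_dir):
        return None
    names = sorted(
        name for name in os.listdir(checkpoint_dir)
        if name.startswith("epoch-") and name.endswith(".json")
    )
    if not names:
        return None
    with open(os.path.join(checkpoint_dir, names[-1])) as handle:
        return json.load(handle)


def train(total_epochs, checkpoint_dir=DEFAULT_CHECKPOINT_DIR):
    """Toy training loop that resumes from the latest checkpoint if present."""
    checkpoint = load_latest_checkpoint(checkpoint_dir)
    start_epoch = checkpoint["epoch"] + 1 if checkpoint else 0
    state = checkpoint["state"] if checkpoint else {"steps": 0}
    for epoch in range(start_epoch, total_epochs):
        state["steps"] += 1  # stand-in for one epoch of real training work
        save_checkpoint(state, epoch, checkpoint_dir)
    return state
```

If a Spot interruption stops the job after a few epochs, calling train again with the same directory continues from the epoch after the last saved checkpoint instead of starting over.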
What you will accomplish
In this tutorial, you will:
- Launch and execute a SageMaker notebook instance
- Train an object detection model
Before starting this tutorial, you will need:
- An AWS account: If you don't already have an account, follow the Setting Up Your Environment getting started guide for a quick overview.
1.1 — Open a browser and navigate to the Amazon SageMaker console. Alternatively, you can search for SageMaker, or locate Amazon SageMaker under the Machine Learning section of the console landing page. If you already have an AWS account, log in to the console. Otherwise, create a new AWS account to get started.
1.2 — In the top right corner, select the Region where you want to run SageMaker training.
1.3 — Click on Notebook instances within the Overview section or the left panel under Notebook.
2.1 — Click on Create notebook instance in the Notebook instances window.
2.2 — Enter a name such as “ManagedSpotTraining” into the Notebook instance name field in the Notebook instance settings section.
Scroll down to the Permissions and encryption section and select Create a new role from the IAM role dropdown.
Leave all remaining options at their default settings.
2.3 — Select Any S3 bucket from the Create an IAM role modal window and then click Create role.
2.4 — Click Create notebook instance.
3.1 — Click on the Open JupyterLab link next to the Notebook Instance you created in Step 2.
3.2 — Click on the Amazon SageMaker sample notebooks icon on the left panel to access a list of sample notebooks.
3.3 — Locate and click on the managed_spot_training_object_detection.ipynb sample notebook under the Introduction to Amazon Algorithms section.
This action will open a read-only copy of this notebook.
3.4 — Click on Create a Copy located in the top right corner of the newly opened sample notebook.
Note: You must create a copy of the read-only notebook in order to execute the notebook in Step 4.
3.5 — Click Create Copy.
4.1 — Click on the “Run the selected cells and advance” button to step through each cell of the Example Notebook. The blue bar to the left of the cell indicates the currently selected cell.
This notebook is an end-to-end example introducing the Amazon SageMaker Object Detection algorithm. In this sample, we will demonstrate how to train an object detection model on the Pascal VOC dataset using the Single Shot MultiBox Detector (SSD) algorithm. The notebook is configured to use Managed Spot Training for Amazon SageMaker to train this model on Spot Instances.
Object detection is the process of identifying and localizing objects in an image. A typical object detection solution takes an image as input and returns a bounding box around each object of interest, along with a label identifying the object inside the box.
This sample notebook is a modified version of the notebook located here.
Continue running the notebook using the “Run the selected cells and advance” button until you reach Step 4.3.
4.2 — As you execute cells throughout the notebook, you will see output after each cell runs. While a cell is running, its indicator shows [*]. When the cell finishes, the indicator changes to [N], where N is the order in which the cell was executed.
Continue executing cells within this notebook until you reach the Object Detection using Managed Spot Training section.
4.3 — When you reach the Object Detection using Managed Spot Training section, pause to review the following configuration options.
train_use_spot_instances = True
train_max_run = 3600
train_max_wait = 3600 if train_use_spot_instances else None
These variables are used when configuring the Amazon SageMaker training job as described in Step 4.4 below. They configure the training job to use Spot Instances when provisioning instances for the training job. They also configure the timeout behavior: how long Amazon SageMaker should wait for Spot Instances to become available, and the maximum total duration of the training job.
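To make the relationship between these timeouts concrete, here is a minimal sketch of how the same settings appear in the lower-level CreateTrainingJob request, using its EnableManagedSpotTraining and StoppingCondition fields. The helper name spot_training_settings is hypothetical. The checks encode the service rule, as I understand it, that the maximum wait time covers both waiting for capacity and running the job, so it must not be smaller than the maximum run time.

```python
def spot_training_settings(use_spot, max_run_seconds, max_wait_seconds=None):
    """Build the spot-related fragment of a CreateTrainingJob request.

    Hypothetical helper for illustration; it only builds and validates a dict
    and does not call AWS.
    """
    if max_run_seconds <= 0:
        raise ValueError("max_run_seconds must be positive")

    stopping_condition = {"MaxRuntimeInSeconds": max_run_seconds}
    if use_spot:
        if max_wait_seconds is None:
            raise ValueError("max_wait_seconds is required when use_spot is True")
        # The wait budget includes the run time, so it cannot be shorter.
        if max_wait_seconds < max_run_seconds:
            raise ValueError("max_wait_seconds must be >= max_run_seconds")
        stopping_condition["MaxWaitTimeInSeconds"] = max_wait_seconds

    return {
        "EnableManagedSpotTraining": use_spot,
        "StoppingCondition": stopping_condition,
    }


# Values matching the notebook configuration in this step.
settings = spot_training_settings(use_spot=True, max_run_seconds=3600, max_wait_seconds=3600)
```

With both values set to 3600, the job has one hour in total to obtain Spot capacity and finish training. Raising the wait value above the run value leaves extra room for waiting on capacity or for restarts after an interruption.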
4.4 — The sample notebook uses the SageMaker Python SDK, which provides high-level interfaces that simplify the training and deployment of models in Amazon SageMaker. One of these interfaces is sagemaker.estimator.Estimator. With this interface, configuring a training job to use Managed Spot Training only requires passing a few additional parameters when the estimator is instantiated.
train_use_spot_instances=train_use_spot_instances,
train_max_run=train_max_run,
train_max_wait=train_max_wait
These options use the variables defined in Step 4.3 to set train_use_spot_instances to True, train_max_run to 3600 seconds, and train_max_wait to 3600 seconds. More details on these parameters can be found below.
train_use_spot_instances (bool) – Specifies whether to use SageMaker Managed Spot instances for training. If enabled, then the train_max_wait arg should also be set (default: False).
train_max_run (int) – Timeout in seconds for training (default: 24 * 60 * 60). After this amount of time Amazon SageMaker terminates the job regardless of its current status.
train_max_wait (int) – Timeout in seconds waiting for spot training instances (default: None). After this amount of time Amazon SageMaker will stop waiting for Spot instances to become available.
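The parameter names above are the ones this notebook passes to the estimator. Version 2 of the SageMaker Python SDK renamed them to use_spot_instances, max_run, and max_wait. The sketch below builds the keyword arguments under either naming so they can be unpacked into the Estimator constructor. The helper spot_estimator_kwargs and its sdk_major_version argument are hypothetical and exist only for this illustration.

```python
# Names used in this tutorial mapped to the names used by SDK version 2 and later.
V1_TO_V2_NAMES = {
    "train_use_spot_instances": "use_spot_instances",
    "train_max_run": "max_run",
    "train_max_wait": "max_wait",
}


def spot_estimator_kwargs(use_spot, max_run, max_wait, sdk_major_version=1):
    """Return the spot-related Estimator keyword arguments.

    Hypothetical helper: sdk_major_version selects which naming scheme to use.
    """
    v1_kwargs = {
        "train_use_spot_instances": use_spot,
        "train_max_run": max_run,
        # A wait timeout only applies when Spot Instances are requested.
        "train_max_wait": max_wait if use_spot else None,
    }
    if sdk_major_version < 2:
        return v1_kwargs
    return {V1_TO_V2_NAMES[name]: value for name, value in v1_kwargs.items()}


kwargs_v1 = spot_estimator_kwargs(True, 3600, 3600)
kwargs_v2 = spot_estimator_kwargs(True, 3600, 3600, sdk_major_version=2)

# Usage, assuming the remaining estimator arguments are defined as in the notebook:
# estimator = sagemaker.estimator.Estimator(<other arguments>, **kwargs_v1)
```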
You can learn more about the SageMaker Python SDK and the Estimator interface in the documentation provided here.
4.5 — Continue running the notebook using the “Run the selected cells and advance” button until you reach the end of the notebook.
Take note of the following command:
When executed, this command starts the model training job using the provided configuration options. While the training job runs, progress is printed in the cell output.
While the model training job is being executed you can learn more about Amazon SageMaker Managed Spot Training here.
The training job will take less than 10 minutes to complete.
5.1 — When the training job is completed, the SageMaker Python SDK will output the savings achieved when using Managed Spot Training as well as metrics around the total training seconds and billable training seconds.
You can calculate the savings from using Managed Spot Training with the formula (1 - BillableTimeInSeconds / TrainingTimeInSeconds) * 100. For example, if BillableTimeInSeconds is 100 and TrainingTimeInSeconds is 500, the savings is 80%.
You can learn more about how this savings is calculated here.
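The savings formula from this step can be written as a small function. This is a minimal sketch with hypothetical function names. The second helper assumes a dictionary shaped like the DescribeTrainingJob response, which reports TrainingTimeInSeconds and BillableTimeInSeconds for a completed job.

```python
def managed_spot_savings_percent(training_seconds, billable_seconds):
    """Return the Managed Spot Training savings as a percentage."""
    if training_seconds <= 0:
        raise ValueError("training_seconds must be positive")
    return (1 - billable_seconds / training_seconds) * 100


def savings_from_job_description(description):
    """Compute savings from a dict shaped like a DescribeTrainingJob response."""
    return managed_spot_savings_percent(
        description["TrainingTimeInSeconds"],
        description["BillableTimeInSeconds"],
    )


# The worked example from the text: 100 billable seconds out of 500 training seconds.
example = savings_from_job_description(
    {"TrainingTimeInSeconds": 500, "BillableTimeInSeconds": 100}
)
print(f"Managed Spot Training savings: {example:.1f}%")
```

For a real job, the dictionary could come from the describe_training_job call of a boto3 SageMaker client, which requires AWS credentials and the training job name.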
5.2 — You can also view this savings through the console.
Return to the Amazon SageMaker console and click Training jobs under Training in the left navigation panel.
Click on the recently completed training job to view details.
5.3 — Within the job details you can view the Managed Spot Training savings as well as metrics around the total training seconds and billable training seconds.
Congratulations! You’ve now trained an Object Detection model using an Amazon SageMaker Training job on EC2 Spot Instances. Now you are ready to integrate Spot Instances into your Amazon SageMaker training jobs and start optimizing your Machine Learning workloads for cost and performance.