Train and tune a deep learning model at scale with Amazon SageMaker

In this tutorial, you learn how to use Amazon SageMaker to build, train, and tune a TensorFlow deep learning model.

Amazon SageMaker is a fully managed service that provides machine learning (ML) developers and data scientists with the ability to build, train, and deploy ML models quickly. Amazon SageMaker provides you with everything you need to train and tune models at scale without the need to manage infrastructure. You can use Amazon SageMaker Studio, the first integrated development environment (IDE) for machine learning, to quickly visualize experiments and track training progress without ever leaving the familiar Jupyter Notebook interface. Within Amazon SageMaker Studio, you can use Amazon SageMaker Experiments to track, evaluate, and organize experiments easily.

In this tutorial, you learn how to:

  1. Set up Amazon SageMaker Studio
  2. Download a public dataset using an Amazon SageMaker Studio Notebook and upload it to Amazon S3
  3. Create an Amazon SageMaker Experiment to track and manage training jobs
  4. Run a TensorFlow training job on a fully managed GPU instance using one-click training with Amazon SageMaker
  5. Improve accuracy by running a large-scale Amazon SageMaker Automatic Model Tuning job to find the best model hyperparameters
  6. Visualize training results

You’ll be using the CIFAR-10 dataset to train a model in TensorFlow to classify images into 10 classes. This dataset consists of 60,000 32x32 color images, split into 40,000 images for training, 10,000 images for validation, and 10,000 images for testing.
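If you want a quick look at the raw data before building the pipeline, the snippet below loads CIFAR-10 through Keras and prints the array shapes. This is only an illustrative sketch; the tutorial itself works with TFRecord files produced by the conversion script in Step 3, which creates the training/validation/test partitions described above.

import tensorflow as tf

# Download CIFAR-10 as in-memory NumPy arrays (for inspection only).
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
print(x_train.shape)  # (50000, 32, 32, 3): 32x32 RGB images
print(y_train.shape)  # (50000, 1): integer class labels 0-9
print(x_test.shape)   # (10000, 32, 32, 3)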

About this Tutorial
Time: 1 hour
Cost: Approx. $100
Use Case: Machine Learning
Products: Amazon SageMaker
Audience: Developer
Level: Intermediate
Last Updated: July 1, 2021

Step 1. Create an AWS Account

The cost of this tutorial is approximately $100.


Step 2. Set up Amazon SageMaker Studio

Complete the following steps to onboard to Amazon SageMaker Studio and set up your Amazon SageMaker Studio Control Panel.

Note: For more information, see Get Started with Amazon SageMaker Studio in the Amazon SageMaker documentation.


a. Sign in to the Amazon SageMaker console.

Note: In the top right corner, make sure to select an AWS Region where SageMaker Studio is available. For a list of Regions, see Onboard to Amazon SageMaker Studio.


b. In the Amazon SageMaker navigation pane, choose Amazon SageMaker Studio.
 
Note: If you are using Amazon SageMaker Studio for the first time, you must complete the Studio onboarding process. When onboarding, you can choose to use either AWS Single Sign-On (AWS SSO) or AWS Identity and Access Management (IAM) as your authentication method. When you use IAM authentication, you can choose either the Quick start or the Standard setup procedure. If you are unsure of which option to choose, see Onboard to Amazon SageMaker Studio and ask your IT administrator for assistance. For simplicity, this tutorial uses the Quick start procedure.

c. In the Get started box, choose Quick start and specify a user name.

d. For Execution role, choose Create an IAM role. In the dialog box that appears, choose Any S3 bucket and choose Create role.

Amazon SageMaker creates a role with the required permissions and assigns it to your instance.  


e. Choose Submit.

Step 3. Download the dataset

Amazon SageMaker Studio notebooks are one-click Jupyter notebooks that contain everything you need to build and test your training scripts. SageMaker Studio also includes experiment tracking and visualization so that it’s easy to manage your entire machine learning workflow in one place.

Complete the following steps to create a SageMaker Notebook, download the dataset, convert the dataset into TensorFlow supported TFRecord format, and then upload the dataset to Amazon S3.  

Note: For more information, see Use Amazon SageMaker Studio Notebooks in the Amazon SageMaker documentation.


a. In the Amazon SageMaker Studio Control Panel, choose Open Studio.

b. In JupyterLab, on the File menu, choose New Launcher. In the Notebooks and compute resources section, for Select a SageMaker image, choose TensorFlow 1.15 Python 3.6 (optimized for CPU). Then, for Notebook, choose Python 3.
 
Note: In this step, you’re choosing a CPU instance which is used to run the SageMaker Notebook that downloads the dataset, builds your training scripts, submits Amazon SageMaker training jobs, and visualizes results. The training job itself runs on a separate instance type that you can specify, such as a GPU instance as you’ll see in Step 5.

c. Copy and paste the following code block into the code cell and select Run.
 
This code downloads the generate_cifar10_tfrecords.py script, downloads the CIFAR-10 dataset, and converts it into TFRecord format.
 
Note: While the code runs, an * appears between the square brackets. After a few seconds, the code execution completes and the * is replaced with a number.
# Download the conversion script from the amazon-sagemaker-examples repository
!wget https://raw.githubusercontent.com/aws/amazon-sagemaker-examples/master/advanced_functionality/tensorflow_bring_your_own/utils/generate_cifar10_tfrecords.py
!pip install ipywidgets
!python generate_cifar10_tfrecords.py --data-dir cifar10
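For context, the conversion script stores each image and label as a tf.train.Example with the feature keys 'image' and 'label', which the training script in Step 5 parses. The following is a simplified sketch of that serialization (not the actual script); it assumes a channels-first uint8 image, matching the reshape performed by the parser in Step 5.

import numpy as np
import tensorflow as tf

def serialize_example(image, label):
    # image: uint8 array of shape (3, 32, 32) (channels-first); label: integer class ID.
    feature = {
        'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image.tobytes()])),
        'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(label)])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

# Write a single dummy record to illustrate the format.
with tf.io.TFRecordWriter('example.tfrecords') as writer:
    dummy_image = np.zeros((3, 32, 32), dtype=np.uint8)
    writer.write(serialize_example(dummy_image, 0))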

d. Upload the dataset to your default Amazon SageMaker S3 bucket. Copy and paste the following code into the code cell, and select Run.
 
You should see the Amazon S3 location for your dataset as the output.  
import time, os, sys
import sagemaker, boto3
import numpy as np
import pandas as pd

sess = boto3.Session()
sm   = sess.client('sagemaker')
role = sagemaker.get_execution_role()
sagemaker_session = sagemaker.Session(boto_session=sess)

datasets = sagemaker_session.upload_data(path='cifar10', key_prefix='datasets/cifar10-dataset')
datasets 

Step 4. Create an Amazon SageMaker Experiment

Now that you have downloaded and staged your dataset in Amazon S3, you can create an Amazon SageMaker Experiment. An experiment is a collection of processing and training jobs related to the same machine learning project. Amazon SageMaker Experiments automatically manages and tracks your training runs for you.

Complete the following steps to create a new experiment.

Note: For more information, see Experiments in the Amazon SageMaker documentation.


a. In your Jupyter Notebook, copy and paste the following code block into the code cell and select Run.
 
The code uses the smexperiments Python package to create an experiment named sagemaker-training-experiments. This package comes pre-installed on Amazon SageMaker Studio Notebooks. You can customize the experiment name and description.
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent

training_experiment = Experiment.create(
                                experiment_name = "sagemaker-training-experiments", 
                                description     = "Experiment to track cifar10 training trials", 
                                sagemaker_boto_client=sm)

b. In the left toolbar, choose Components and registries (triangle icon), and then choose Experiments and trials. The new experiment sagemaker-training-experiments appears in the list.

Step 5. Create the trial and training script

To train a classifier on the CIFAR-10 dataset, you need a training script. In this step, you create your trial and training script for the TensorFlow training job. Each trial is an iteration of your end-to-end training job. In addition to the training job, a trial can also track preprocessing and postprocessing jobs, as well as datasets and other metadata. A single experiment can include multiple trials, which makes it easy for you to track multiple iterations over time within the Amazon SageMaker Studio Experiments pane.

Complete the following steps to create a new trial and training script for the TensorFlow training job.

Note: For more information, see Use TensorFlow with Amazon SageMaker in the Amazon SageMaker documentation.


a. In your Jupyter Notebook, copy and paste the following code block into the code cell and select Run.
 
This code creates a new trial and associates it with the Experiment you created in Step 4.  
single_gpu_trial = Trial.create(
    trial_name = 'sagemaker-single-gpu-training', 
    experiment_name = training_experiment.experiment_name,
    sagemaker_boto_client = sm,
)

trial_comp_name = 'single-gpu-training-job'
experiment_config = {"ExperimentName": training_experiment.experiment_name, 
                       "TrialName": single_gpu_trial.trial_name,
                       "TrialComponentDisplayName": trial_comp_name}



b. In the left toolbar, choose Components and registries (triangle icon). Double-click sagemaker-training-experiments to display the associated trials. The new trial sagemaker-single-gpu-training appears in the list.

c. On the File menu, choose New, then Text File. In the code editor, copy and paste the following TensorFlow code into the newly created file.
 
This script implements TensorFlow code to read the CIFAR-10 dataset and train a ResNet50 model.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.layers import Input, Dense, Flatten
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.optimizers import Adam, SGD
import argparse
import os
import re
import time

HEIGHT = 32
WIDTH = 32
DEPTH = 3
NUM_CLASSES = 10

def single_example_parser(serialized_example):
    """Parses a single tf.Example into image and label tensors."""
    # Dimensions of the images in the CIFAR-10 dataset.
    # See http://www.cs.toronto.edu/~kriz/cifar.html for a description of the
    # input format.
    features = tf.io.parse_single_example(
        serialized_example,
        features={
            'image': tf.io.FixedLenFeature([], tf.string),
            'label': tf.io.FixedLenFeature([], tf.int64),
        })
    image = tf.io.decode_raw(features['image'], tf.uint8)
    image.set_shape([DEPTH * HEIGHT * WIDTH])

    # Reshape from [depth * height * width] to [depth, height, width].
    image = tf.cast(
        tf.transpose(tf.reshape(image, [DEPTH, HEIGHT, WIDTH]), [1, 2, 0]),
        tf.float32)
    label = tf.cast(features['label'], tf.int32)
    
    image = train_preprocess_fn(image)
    label = tf.one_hot(label, NUM_CLASSES)
    
    return image, label

def train_preprocess_fn(image):

    # Resize the image to add four extra pixels on each side.
    image = tf.image.resize_with_crop_or_pad(image, HEIGHT + 8, WIDTH + 8)

    # Randomly crop a [HEIGHT, WIDTH] section of the image.
    image = tf.image.random_crop(image, [HEIGHT, WIDTH, DEPTH])

    # Randomly flip the image horizontally.
    image = tf.image.random_flip_left_right(image)
    return image

def get_dataset(filenames, batch_size):
    """Read the images and labels from 'filenames'."""
    # Repeat infinitely.
    dataset = tf.data.TFRecordDataset(filenames).repeat().shuffle(10000)

    # Parse records.
    dataset = dataset.map(single_example_parser, num_parallel_calls=tf.data.experimental.AUTOTUNE)

    # Batch it up.
    dataset = dataset.batch(batch_size, drop_remainder=True)
    return dataset

def get_model(input_shape, learning_rate, weight_decay, optimizer, momentum):
    input_tensor = Input(shape=input_shape)
    base_model = keras.applications.resnet50.ResNet50(include_top=False,
                                                          weights='imagenet',
                                                          input_tensor=input_tensor,
                                                          input_shape=input_shape,
                                                          classes=None)
    x = Flatten()(base_model.output)
    predictions = Dense(NUM_CLASSES, activation='softmax')(x)
    model = Model(inputs=base_model.input, outputs=predictions)
    return model

def main(args):
    # Hyper-parameters
    epochs       = args.epochs
    lr           = args.learning_rate
    batch_size   = args.batch_size
    momentum     = args.momentum
    weight_decay = args.weight_decay
    optimizer    = args.optimizer

    # SageMaker options
    training_dir   = args.training
    validation_dir = args.validation
    eval_dir       = args.eval

    train_dataset = get_dataset(training_dir+'/train.tfrecords',  batch_size)
    val_dataset   = get_dataset(validation_dir+'/validation.tfrecords', batch_size)
    eval_dataset  = get_dataset(eval_dir+'/eval.tfrecords', batch_size)
    
    input_shape = (HEIGHT, WIDTH, DEPTH)
    model = get_model(input_shape, lr, weight_decay, optimizer, momentum)
    
    # Optimizer
    if optimizer.lower() == 'sgd':
        opt = SGD(lr=lr, decay=weight_decay, momentum=momentum)
    else:
        opt = Adam(lr=lr, decay=weight_decay)

    # Compile model
    model.compile(optimizer=opt,
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    
    # Train model
    history = model.fit(train_dataset, steps_per_epoch=40000 // batch_size,
                        validation_data=val_dataset, 
                        validation_steps=10000 // batch_size,
                        epochs=epochs)
                        
    
    # Evaluate model performance
    score = model.evaluate(eval_dataset, steps=10000 // batch_size, verbose=1)
    print('Test loss    :', score[0])
    print('Test accuracy:', score[1])
    
    # Save model to model directory
    model.save(f'{os.environ["SM_MODEL_DIR"]}/{time.strftime("%m%d%H%M%S", time.gmtime())}', save_format='tf')


if __name__ == "__main__":
    
    parser = argparse.ArgumentParser()
    # Hyper-parameters
    parser.add_argument('--epochs',        type=int,   default=10)
    parser.add_argument('--learning-rate', type=float, default=0.01)
    parser.add_argument('--batch-size',    type=int,   default=128)
    parser.add_argument('--weight-decay',  type=float, default=2e-4)
    parser.add_argument('--momentum',      type=float, default=0.9)
    parser.add_argument('--optimizer',     type=str,   default='sgd')

    # SageMaker parameters
    parser.add_argument('--model_dir',        type=str)
    parser.add_argument('--training',         type=str,   default=os.environ['SM_CHANNEL_TRAINING'])
    parser.add_argument('--validation',       type=str,   default=os.environ['SM_CHANNEL_VALIDATION'])
    parser.add_argument('--eval',             type=str,   default=os.environ['SM_CHANNEL_EVAL'])
    
    args = parser.parse_args()
    main(args)

d. On the File menu, choose Rename File. In the New Name box, copy and paste cifar10-training-sagemaker.py and choose Rename. (Make sure that the new extension is .py and not .txt). Then, choose File and Save Python File.

Step 6. Run the TensorFlow training job and visualize the results

In this step, you run a TensorFlow training job using Amazon SageMaker. Training models is easy with Amazon SageMaker. You specify the location of your dataset in Amazon S3 and the type of training instance, and then Amazon SageMaker manages the training infrastructure for you.


Complete the following steps to run the TensorFlow training job and then visualize the results.

Note: For more information, see Use TensorFlow with Amazon SageMaker in the Amazon SageMaker documentation.


a. In your Jupyter Notebook, copy and paste the following code block into the code cell and select Run. Then, take a closer look at the code.
 
Note: If a ResourceLimitExceeded error appears, change the instance type to ml.c5.xlarge.
 
Note: You can safely ignore any deprecation warnings (for example, sagemaker.deprecations:train_instance_type has been renamed...). These warnings are due to version changes and do not cause training failures.
from sagemaker.tensorflow import TensorFlow

hyperparams={'epochs'       : 30,
             'learning-rate': 0.01,
             'batch-size'   : 256,
             'weight-decay' : 2e-4,
             'momentum'     : 0.9,
             'optimizer'    : 'adam'}

bucket_name = sagemaker_session.default_bucket()
output_path = f's3://{bucket_name}/jobs'
metric_definitions = [{'Name': 'val_acc', 'Regex': 'val_acc: ([0-9\\.]+)'}]

tf_estimator = TensorFlow(entry_point          = 'cifar10-training-sagemaker.py', 
                          output_path          = f'{output_path}/',
                          code_location        = output_path,
                          role                 = role,
                          train_instance_count = 1, 
                          train_instance_type  = 'ml.g4dn.xlarge',
                          framework_version    = '1.15.2', 
                          py_version           = 'py3',
                          script_mode          = True,
                          metric_definitions   = metric_definitions,
                          sagemaker_session    = sagemaker_session,
                          hyperparameters      = hyperparams)

job_name=f'tensorflow-single-gpu-{time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())}'
tf_estimator.fit({'training'  : datasets,
                  'validation': datasets,
                  'eval'      : datasets},
                 job_name = job_name,
                 experiment_config=experiment_config)

This code includes three parts:

  • Specifies training job hyperparameters
  • Calls an Amazon SageMaker Estimator function and provides training job details (name of the training script, what instance type to train on, framework version, etc.)
  • Calls the fit function to initiate the training job

Amazon SageMaker automatically provisions the requested instances, downloads the dataset, pulls the TensorFlow container, downloads the training script, and starts training.

In this example, you submit an Amazon SageMaker training job to run on ml.g4dn.xlarge, which is a GPU instance. Deep learning training is computationally intensive, and GPU instances are recommended for getting results faster.

After training is complete, you should see final accuracy results, training time, and billable time.

b. View the training summary. In the left toolbar, choose Components and registries (triangle icon). Double-click sagemaker-training-experiments, then double-click sagemaker-single-gpu-training, and double-click the newly created single-gpu-training-job Trial Component for your training job. Choose Metrics.

c. Visualize the results. Choose Charts, and then choose Add chart. In the Chart Properties pane, make the following selections:

  • Chart type: Line
  • X-axis dimension: Epoch
  • Y-axis: val_acc_EVAL_avg

You should see a graph showing the change in evaluation accuracy as training progresses, ending with the final accuracy in Step 6a.
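If you also want to work with the experiment metrics directly in your notebook, the SageMaker Python SDK includes an ExperimentAnalytics class that loads trial components into a pandas DataFrame. The following is a sketch that assumes the training_experiment and sagemaker_session objects created earlier; the exact columns depend on the parameters and metrics SageMaker captured.

from sagemaker.analytics import ExperimentAnalytics

# Load all trial components of the experiment into a pandas DataFrame.
trial_analytics = ExperimentAnalytics(
    sagemaker_session=sagemaker_session,
    experiment_name=training_experiment.experiment_name)

df = trial_analytics.dataframe()
print(df.head())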

Step 7. Tune the model with Amazon SageMaker automatic model tuning

In this step, you run an Amazon SageMaker automatic model tuning job to find the best hyperparameters and improve upon the training accuracy obtained in Step 6. To run a model tuning job, you need to provide Amazon SageMaker with hyperparameter ranges rather than fixed values, so that it can explore the hyperparameter space and automatically find the best values for you.

Complete the following steps to run the automatic model tuning job.

Note: For more information, see Perform Automatic Model Tuning in the Amazon SageMaker documentation.


a. In your Jupyter Notebook, copy and paste the following code block into the code cell and select Run. Then, take a closer look at the code.
 
Note: If a ResourceLimitExceeded error appears, change the instance type to ml.c5.xlarge.

Note: You can safely ignore any deprecation warnings (for example, sagemaker.deprecations:train_instance_type has been renamed...). These warnings are due to version changes and do not cause training failures.

from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner

hyperparameter_ranges = {
    'epochs'        : IntegerParameter(5, 30),
    'learning-rate' : ContinuousParameter(0.001, 0.1, scaling_type='Logarithmic'), 
    'batch-size'    : CategoricalParameter(['128', '256', '512']),
    'momentum'      : ContinuousParameter(0.9, 0.99),
    'optimizer'     : CategoricalParameter(['sgd', 'adam'])
}

objective_metric_name = 'val_acc'
objective_type = 'Maximize'
metric_definitions = [{'Name': 'val_acc', 'Regex': 'val_acc: ([0-9\\.]+)'}]

tf_estimator = TensorFlow(entry_point          = 'cifar10-training-sagemaker.py', 
                          output_path          = f'{output_path}/',
                          code_location        = output_path,
                          role                 = role,
                          train_instance_count = 1, 
                          train_instance_type  = 'ml.g4dn.xlarge',
                          framework_version    = '1.15', 
                          py_version           = 'py3',
                          script_mode          = True,
                          metric_definitions   = metric_definitions,
                          sagemaker_session    = sagemaker_session)

tuner = HyperparameterTuner(estimator             = tf_estimator,
                            objective_metric_name = objective_metric_name,
                            hyperparameter_ranges = hyperparameter_ranges,
                            metric_definitions    = metric_definitions,
                            max_jobs              = 16,
                            max_parallel_jobs     = 8,
                            objective_type        = objective_type)

job_name=f'tf-hpo-{time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())}'
tuner.fit({'training'  : datasets,
           'validation': datasets,
           'eval'      : datasets},
            job_name = job_name)

This code includes four parts:

  • Specifies ranges of values for the hyperparameters. These can be integer ranges (for example, number of epochs), continuous ranges (for example, learning rate), or categorical values (for example, optimizer type sgd or adam).
  • Calls an Estimator function similar to the one in Step 6
  • Creates a HyperparameterTuner object with hyperparameter ranges, maximum number of jobs, and number of parallel jobs to run
  • Calls the fit function to initiate the hyperparameter tuning job

Note: You can reduce the max_jobs variable from 16 to a smaller number to save tuning job costs. However, by reducing the number of tuning jobs, you reduce the chances of finding a better model. You can also reduce the max_parallel_jobs variable to a number less than or equal to max_jobs. You can get results faster when max_parallel_jobs is equal to max_jobs. Make sure that max_parallel_jobs is lower than the instance limits of your AWS account to avoid running into resource errors.


b. View the best hyperparameters. Open the Amazon SageMaker Console, and in the left navigation pane, under Training, choose Hyperparameter tuning jobs. Choose the tuning job, and then choose Best training job. You’ll see an improvement in the training accuracy (80%) compared to the results in Step 6 (60%).
 
Note: Your results may vary. You can further improve your results by increasing max_jobs, relaxing the hyperparameter ranges, and exploring other model architectures.
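You can also inspect the tuning results from your notebook rather than the console. The following is a sketch that assumes the tuner object created in step a; HyperparameterTuner exposes an analytics() helper, and column names such as FinalObjectiveValue may vary slightly across SDK versions.

# Name of the best training job found by the tuner.
print(tuner.best_training_job())

# All training jobs launched by the tuner, with their hyperparameters and objective values.
results_df = tuner.analytics().dataframe()
print(results_df.sort_values('FinalObjectiveValue', ascending=False).head())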

Step 8. Clean up

In this step, you terminate the resources you used in this lab.

Important: Terminating resources that are not actively being used reduces costs and is a best practice. Not terminating your resources will result in charges to your account.


Stop training jobs:

  1. Open the Amazon SageMaker Console.
  2. In the left navigation pane under Training, choose Training Jobs.
  3. Confirm that there are no training jobs with a status of In Progress. For any in-progress training jobs, you can either wait for the job to finish, or select the training job name and choose Stop. You can also stop in-progress jobs from your notebook, as shown in the sketch after this list.
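If you prefer to stop any remaining in-progress training jobs programmatically, the following is a minimal sketch using the boto3 SageMaker client (the sm client created in Step 3). Note that stopped jobs still incur charges for the time they were running.

# List in-progress training jobs and request that each one be stopped.
response = sm.list_training_jobs(StatusEquals='InProgress')
for job in response['TrainingJobSummaries']:
    print('Stopping:', job['TrainingJobName'])
    sm.stop_training_job(TrainingJobName=job['TrainingJobName'])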

(Optional) Clean up all training artifacts: If you want to clean up all training artifacts (models, preprocessed data sets, etc.), in your Jupyter notebook, copy and paste the following code and choose Run.

Note: Make sure to replace ACCOUNT_NUMBER with your AWS account number. The bucket name also includes the Region where you ran the tutorial (us-west-2 in this example); adjust it if you used a different Region.

!aws s3 rm --recursive s3://sagemaker-us-west-2-ACCOUNT_NUMBER/datasets/cifar10-dataset
!aws s3 rm --recursive s3://sagemaker-us-west-2-ACCOUNT_NUMBER/jobs
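Alternatively, if your notebook session from the earlier steps is still active, you can avoid hard-coding the account number and Region by reading the default bucket name from the SageMaker session. This is a sketch that assumes the sagemaker_session object created in Step 3:

# The default bucket is named sagemaker-<region>-<account-number>.
bucket_name = sagemaker_session.default_bucket()

!aws s3 rm --recursive s3://{bucket_name}/datasets/cifar10-dataset
!aws s3 rm --recursive s3://{bucket_name}/jobs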

Conclusion

Congratulations! You created, trained, and tuned a TensorFlow deep learning model with Amazon SageMaker.
 
You can continue your machine learning journey with SageMaker by following the next steps section below.

 


Next steps