AWS Machine Learning Blog

Train and deploy Keras models with TensorFlow and Apache MXNet on Amazon SageMaker

Keras is a popular and well-documented open source library for deep learning, while Amazon SageMaker provides you with easy tools to train and optimize machine learning models. Until now, you had to build a custom container to use both, but Keras is now part of the built-in TensorFlow environments for TensorFlow and Apache MXNet. Not only does this simplify the development process, it also allows you to use standard Amazon SageMaker features like script mode or automatic model tuning.

Keras’s excellent documentation, numerous examples, and active community make it a great choice for beginners and experienced practitioners alike. The library provides a high-level API that makes it easy to build all kind of deep learning architectures, with the option to use different backends for training and prediction: TensorFlow, Apache MXNet, and Theano.

In this post, I show you how to train and deploy Keras 2.x models on Amazon SageMaker, using the built-in TensorFlow environments for TensorFlow and Apache MXNet. In the process, you also learn the following:

  • To run the same Keras code on Amazon SageMaker that you run on your local machine, use script mode.
  • To optimize hyperparameters, launch automatic model tuning.
  • Deploy your models with Amazon Elastic Inference.

The Keras example

This example demonstrates training a simple convolutional neural network on the Fashion MNIST dataset. This dataset replaces the well-known MNIST dataset. It has the same number of classes (10), samples (60,000 for training, 10,000 for validation), and image properties (28×28 pixels, black and white). But it’s also much harder to learn, which makes for a more interesting challenge.

First, set up TensorFlow as your Keras backend (and switch to Apache MXNet later on). For more information, see the script.

The process is straightforward:

  • Grab optional parameters from the command line, or use default values if they’re missing.
  • Download the dataset and save it to the /data directory.
  • Normalize the pixel values, and one hot encode labels.
  • Build the convolutional neural network.
  • Train the model.
  • Save the model to TensorFlow Serving format for deployment.

Positioning your image channels can be tricky. Black and white images have a single channel (black), while color images have three channels (red, green, and blue). The library expects data to have a well-defined shape when training a model, describing the batch size, the height and width of images, and the number of channels. TensorFlow specifically requires the input shape formatted as (batch size, width, height, channels), with channels last. Meanwhile, MXNet expects (batch size, channels, width, height), with channels first. To avoid training issues created by using the wrong shape, I add a few lines of code to identify the active setting and reshape the dataset to compensate.

Now check that this code works by running it on a local machine, without using Amazon SageMaker.

$ python
Using TensorFlow backend.
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
<output removed>
Validation loss    : 0.2472819224089384
Validation accuracy: 0.9126

Training and deploying the Keras model

You must make a few minimal changes, but script mode does most of the work for you. Before invoking your code inside the TensorFlow environment, Amazon SageMaker sets four environment variables

  • SM_NUM_GPUS—The number of GPUs present on the instance.
  • SM_MODEL_DIR— The output location for the model.
  • SM_CHANNEL_TRAINING— The location of the training dataset.
  • SM_CHANNEL_VALIDATION—The location of the validation dataset.

You can use these values in your training code with just a simple modification:

parser.add_argument('--gpu-count', type=int, default=os.environ['SM_NUM_GPUS'])
parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
parser.add_argument('--training', type=str, default=os.environ['SM_CHANNEL_TRAINING'])
parser.add_argument('--validation', type=str, default=os.environ['SM_CHANNEL_VALIDATION'])

What about hyperparameters? No work needed there. Amazon SageMaker passes them as command line arguments to your code.

For more information, see the updated script,

Training on Amazon SageMaker

After deploying your Keras model, you can begin training on Amazon SageMaker. For more information, see the Fashion MNIST-SageMaker.ipynb notebook.

The process is straightforward:

  • Download the dataset.
  • Define the training and validation channels.
  • Configure the TensorFlow estimator, enabling script mode and passing some hyperparameters.
  • Train, deploy, and predict.

In the training log, you can see how Amazon SageMaker sets the environment variables and how it invokes the script with the three hyper parameters defined in the estimator:

/usr/bin/python --batch-size 256 --epochs 20 --learning-rate 0.01 --model_dir s3://sagemaker-eu-west-1-123456789012/sagemaker-tensorflow-scriptmode-2019-05-16-14-11-19-743/model

Because you saved your model in TensorFlow Serving format, Amazon SageMaker can deploy it just like any other TensorFlow model by calling the deploy() API on the estimator. Finally, you can grab some random images from the dataset and predict them with the model you just deployed.

Script mode makes it easy to train and deploy existing TensorFlow code on Amazon SageMaker. Just grab those environment variables, add command line arguments for your hyperparameters, save the model in the right place, and voilà!

Switching to the Apache MXNet backend

As mentioned earlier, Keras also supports MXNet as a backend. Many customers find that it trains faster than TensorFlow, so you may want to give it a shot.

Everything discussed above still applies (script mode, etc.). You only make two changes:

  • Use channels_first.
  • Save the model in MXNet format, creating an extra file (model-shapes.json) required to load the model for prediction.

For more information, see the training code for MXNet.

You can find the Amazon SageMaker steps in the notebook. Apache MXNet uses virtually the same process I just reviewed, aside from using the MXNet estimator.

Automatic model tuning on Keras

Automatic model tuning is a technique that helps you find the optimal hyperparameters for your training job, that is, the hyperparameters that maximize validation accuracy.

You have access to this feature by default because you’re using the built-in estimators for TensorFlow and MXNet. For the sake of brevity, I only show you how to use it with Keras-TensorFlow, but the process is identical for Keras-MXNet.

First, define the hyperparameters you’d like to tune, and their ranges. How about all of them? Thanks to script mode, your parameters are passed as command line arguments, allowing you to tune anything.

hyperparameter_ranges = {
    'epochs':        IntegerParameter(20, 100),
    'learning-rate': ContinuousParameter(0.001, 0.1, scaling_type='Logarithmic'), 
    'batch-size':    IntegerParameter(32, 1024),
    'dense-layer':   IntegerParameter(128, 1024),
    'dropout':       ContinuousParameter(0.2, 0.6)

When configuring automatic model tuning, define which metric to optimize on. Amazon SageMaker supports predefined metrics that it can read automatically from the training log for built-in algorithms (XGBoost, etc.) and frameworks (TensorFlow, MXNet, etc.). That’s not the case for Keras. Instead, you must tell Amazon SageMaker how to grab your metric from the log with a simple regular expression:

objective_metric_name = 'val_acc'
objective_type = 'Maximize'
metric_definitions = [{'Name': 'val_acc', 'Regex': 'val_acc: ([0-9\\.]+)'}]

Then, you define your tuning job, run it, and deploy the best model. No difference here.

Advanced users may insist on using early stopping to avoid overfitting, and they would be right. You can implement this in Keras using a built-in callback (keras.callbacks.EarlyStopping). However, this also creates difficulty in automatic model tuning.

You need Amazon SageMaker to grab the metric for the best epoch, not the last epoch. To overcome this, define a custom callback to log the best validation accuracy. Modify the regular expression accordingly so that Amazon SageMaker can find it in the training log.

For more information, see the 02-fashion-mnist notebook.


I covered a lot of ground in this post. You now know how to:

  • Train and deploy Keras models on Amazon SageMaker, using both the TensorFlow and the Apache MXNet built-in environments.
  • Use script mode to use your existing Keras code with minimal change.
  • Perform automatic model tuning on Keras metrics.

Thank you very much for reading. I hope this was useful. I always appreciate comments and feedback, either here or more directly on Twitter.

About the Author

Julien is the Artificial Intelligence & Machine Learning Evangelist for EMEA. He focuses on helping developers and enterprises bring their ideas to life. In his spare time, he reads the works of JRR Tolkien again and again.