AWS Machine Learning Blog

Using TensorFlow eager execution with Amazon SageMaker script mode

In this blog post, I’ll discuss how to use Amazon SageMaker script mode to train models with TensorFlow’s eager execution mode. Eager execution is the future of TensorFlow; although it is available now as an option in recent versions of TensorFlow 1.x, it will become the default mode in TensorFlow 2. I’ll provide a brief overview of script mode and eager execution, and then present a typical regression task scenario. Next, I’ll describe a workflow that solves this task using script mode and eager execution together. The notebook and related code for this blog post are available on GitHub. Let’s begin with a look at script mode.

Amazon SageMaker script mode

Amazon SageMaker provides APIs and prebuilt containers that make it easy to train and deploy models using several popular machine learning (ML) and deep learning frameworks such as TensorFlow. You can use Amazon SageMaker to train and deploy models using custom TensorFlow code without having to worry about building containers or managing the underlying infrastructure. The Amazon SageMaker Python SDK TensorFlow estimators and the Amazon SageMaker open source TensorFlow container make it easy to write a TensorFlow script and then simply run it in Amazon SageMaker. The preferred way to leverage these capabilities is to use script mode.

Amazon SageMaker script mode was launched around AWS re:Invent 2018. It replaces the previous legacy mode, which required structuring training code around a defined interface of specific functions and the TensorFlow Estimator API. Starting with TensorFlow version 1.11, you can use script mode with Amazon SageMaker prebuilt TensorFlow containers to train TensorFlow models with the same kind of training script you would use outside SageMaker. Your script mode code does not need to comply with any specific Amazon SageMaker-defined interface or use any specific TensorFlow API.

Although a script mode training script is very similar to a training script you might use outside of Amazon SageMaker, you also can access useful properties of the Amazon SageMaker training environment through various environment variables that Amazon SageMaker sets. For example, these environment variables specify the dataset location (local or in Amazon S3) and hyperparameters for the algorithm. As shown in the following code snippet, if your code is written in Python, the code that does the actual training is typically placed in a main guard (if __name__ == "__main__") since Amazon SageMaker imports the script. The main guard prevents the training code from running when the script is imported; it runs only when Amazon SageMaker executes the script for training.


import tensorflow as tf
import tensorflow.contrib.eager as tfe

# parse_args, get_train_data, get_test_data, and get_model are helper
# functions defined earlier in the training script (see the GitHub repository);
# eager execution is enabled near the top of the script with
# tf.enable_eager_execution()

if __name__ == "__main__":

    args, _ = parse_args()

    x_train, y_train = get_train_data(args.train)
    x_test, y_test = get_test_data(args.test)

    device = '/cpu:0'
    print(device)
    batch_size = args.batch_size
    epochs = args.epochs
    print('batch_size = {}, epochs = {}'.format(batch_size, epochs))

    with tf.device(device):

        model = get_model()
        optimizer = tf.train.GradientDescentOptimizer(0.1)
        model.compile(optimizer=optimizer, loss='mse')
        model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs,
                  validation_data=(x_test, y_test))

        # evaluate on test set
        scores = model.evaluate(x_test, y_test, batch_size, verbose=2)
        print("Test MSE :", scores)

        # save a checkpoint for loading the model locally in the notebook
        saver = tfe.Saver(model.variables)
        saver.save(args.model_dir + '/weights.ckpt')
        # create a separate SavedModel for deployment to a SageMaker endpoint with TensorFlow Serving
        tf.contrib.saved_model.save_keras_model(model, args.model_dir)

In the flow of a typical script mode script, first the command line arguments are fetched, then the data is loaded, and then the model is set up. Next is the actual training, either using convenience methods supplied by tf.keras, or using a training loop that you define. Saved models should go into the /opt/ml/model directory of the container, from which they will be automatically uploaded to Amazon S3 prior to teardown of the container when training is completed.
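
The parse_args helper called at the top of the main guard is defined earlier in train.py and is not shown above. As a rough illustration, a minimal version might look like the following sketch, which assumes SageMaker’s standard SM_CHANNEL_TRAIN, SM_CHANNEL_TEST, and SM_MODEL_DIR environment variables and the epochs and batch_size hyperparameters used later in this post; hyperparameters arrive as command line arguments, while data and model locations come from the environment.

import argparse
import os

def parse_args():
    parser = argparse.ArgumentParser()

    # Hyperparameters are passed by Amazon SageMaker as command line arguments
    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--batch_size', type=int, default=128)

    # Data and model directories come from environment variables set by the
    # Amazon SageMaker training environment
    parser.add_argument('--model_dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--test', type=str,
                        default=os.environ.get('SM_CHANNEL_TEST'))

    return parser.parse_known_args()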

TensorFlow eager execution

Eager execution is the future of TensorFlow, and it’s a major paradigm shift. Recently introduced as a more intuitive and dynamic alternative to the original graph mode of TensorFlow, eager execution will become the default mode of TensorFlow 2.

The interface of eager execution is imperative: operations are executed immediately, rather than being used to build a static computational graph. Advantages of eager execution include a more intuitive interface with natural control flow and less boilerplate, simplified debugging, and support for dynamic models and almost all of the available TensorFlow operations. Another key difference between eager execution and graph mode is that in graph mode, program state such as variables is stored globally, and a state object’s lifetime is managed by a tf.Session object. By contrast, in eager execution the lifetime of state objects is determined by the lifetime of their corresponding Python objects. This makes it easier to reason about how your code will work, as well as to debug it.
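
As a simple illustration of the imperative style, here is a minimal sketch for a recent TensorFlow 1.x version: eager execution is enabled once, and operations then return concrete values immediately, with no graph construction or tf.Session required.

import tensorflow as tf

tf.enable_eager_execution()

x = tf.constant([[1., 2.], [3., 4.]])
y = tf.matmul(x, x)

# The result is available immediately as a concrete value
print(y.numpy())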

In addition to these advantages, eager execution also works with the tf.keras API to make rapid prototyping even easier. If tf.keras is used, the model can be built using the tf.keras functional API or by subclassing tf.keras.Model. After model setup with tf.keras, you can simply compile the model, call the fit method to train it, evaluate it on a test set, and save it. If you’re not using the tf.keras convenience methods, you define your own training loop and use tf.GradientTape to record operations for later automatic differentiation. Whether you use tf.keras or not, you can now use eager execution with Amazon SageMaker’s prebuilt TensorFlow containers, which was not possible with legacy mode but is now enabled by script mode.
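
For example, a custom training loop built around tf.GradientTape might look like the following minimal sketch (the model, loss, and data here are illustrative, and eager execution is assumed to be enabled):

import tensorflow as tf

tf.enable_eager_execution()

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.train.GradientDescentOptimizer(0.1)

def train_step(x, y):
    # Record the forward pass so gradients can be computed afterwards
    with tf.GradientTape() as tape:
        predictions = tf.squeeze(model(x))
        loss = tf.reduce_mean(tf.square(predictions - y))
    # Differentiate the loss with respect to the model variables and update them
    grads = tape.gradient(loss, model.variables)
    optimizer.apply_gradients(zip(grads, model.variables))
    return loss

A full training loop then simply iterates over epochs and batches, calling train_step on each batch.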

Workflow initial steps: Data preprocessing and local mode

To demonstrate how eager execution works with script mode, we’ll focus on presenting a relatively complete workflow within Amazon SageMaker. The workflow includes local and hosted training, as well as inference, in the context of a straightforward regression task. The task involves predicting house prices based on the well-known, public Boston Housing dataset. This dataset contains 13 features that apply to the housing stock of towns in the Boston area, including average number of rooms, accessibility to radial highways, adjacency to the Charles River, and so on. To follow along with this blog post, we recommend that you set up an Amazon SageMaker notebook instance. If you don’t have one already, see the Amazon SageMaker Developer Guide for instructions. You can upload the notebook and related code from the GitHub repository for this blog post.
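
As a rough sketch of the preprocessing step (the notebook in the GitHub repository contains the full version), the dataset can be loaded through tf.keras and saved as NumPy files in local train and test directories. The directory names below are illustrative and correspond to the train_dir and test_dir variables referenced by the local mode code later in this post; in practice you would typically also standardize the features before saving them.

import os
import numpy as np
import tensorflow as tf

# Load the Boston Housing data via the tf.keras datasets API
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.boston_housing.load_data()

train_dir = os.path.join(os.getcwd(), 'data/train')
test_dir = os.path.join(os.getcwd(), 'data/test')
os.makedirs(train_dir, exist_ok=True)
os.makedirs(test_dir, exist_ok=True)

# Save the arrays so the training script can read them from its input channels
np.save(os.path.join(train_dir, 'x_train.npy'), x_train)
np.save(os.path.join(train_dir, 'y_train.npy'), y_train)
np.save(os.path.join(test_dir, 'x_test.npy'), x_test)
np.save(os.path.join(test_dir, 'y_test.npy'), y_test)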

After preprocessing the data and writing a training script, the next step is to make sure your code is working as expected. For example, you might train the model for only a few epochs, or train it on only a small sample of the dataset rather than the full dataset. A convenient way to do this is to use Amazon SageMaker local mode training. To train in local mode, it is necessary to have Docker Compose or NVIDIA-Docker-Compose (for GPU) installed on the notebook instance. The example code has a setup shell script you can run to check this and install any missing software.

The following code snippet shows how to set up a TensorFlow Estimator and then start a training job for only a few epochs to confirm that the code is working. One of the key parameters for an Estimator is train_instance_type, which is the kind of hardware on which the training will run. In the case of local mode, we simply set this parameter to ‘local’ to invoke local mode training on the CPU, or to ‘local_gpu’ if the instance has a GPU. Other parameters of note are the algorithm’s hyperparameters, which are passed in as a dictionary, and the Boolean script_mode parameter, which indicates that we are using script mode.

import sagemaker
from sagemaker.tensorflow import TensorFlow

model_dir = '/opt/ml/model'
train_instance_type = 'local'
hyperparameters = {'epochs': 10, 'batch_size': 128}
local_estimator = TensorFlow(entry_point='train.py',
                       model_dir=model_dir,
                       train_instance_type=train_instance_type,
                       train_instance_count=1,
                       hyperparameters=hyperparameters,
                       role=sagemaker.get_execution_role(),
                       base_job_name='tf-eager-scriptmode-bostonhousing',
                       framework_version='1.12.0',
                       py_version='py3',
                       script_mode=True)

inputs = {'train': f'file://{train_dir}',
          'test': f'file://{test_dir}'}

local_estimator.fit(inputs)

To start a training job, we call local_estimator.fit(inputs), where inputs is a dictionary whose keys are named channels (here train and test) and whose values point to the dataset locations. The local_estimator.fit(inputs) invocation downloads a prebuilt TensorFlow container (Python 3, CPU version) to the notebook instance and then simulates an Amazon SageMaker training job. When training starts, the TensorFlow container executes the train.py script, passing hyperparameters as command line script arguments. You can confirm that the script is working by viewing the logs that are output in the notebook cell, including metrics for each epoch of training.

After we’ve confirmed with local mode that the code is working, we also have a model checkpoint saved in Amazon S3 that we can retrieve and load anywhere, including our notebook instance. As a further sanity check, we can then use the model to make predictions and compare them with the test set. If you do so for the Boston Housing dataset, keep in mind that the housing values are in units of $1000s. (In case you’re wondering why the actual values seem relatively low compared to today’s big city housing prices: the paper referencing the dataset was originally published in 1978.) After we confirm that our code is working, let’s move on to hosted training.
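
As a sketch of that sanity check, assuming the model.tar.gz artifact referenced by local_estimator.model_data has been downloaded and extracted into the working directory, and that the same get_model function from train.py is available in the notebook:

import numpy as np
import tensorflow as tf
import tensorflow.contrib.eager as tfe

tf.enable_eager_execution()

x_test = np.load('data/test/x_test.npy')

# Rebuild the architecture, run one batch through it to create its variables,
# then restore the trained weights from the extracted checkpoint
model = get_model()
model(tf.constant(x_test[:1], dtype=tf.float32))
tfe.Saver(model.variables).restore('weights.ckpt')

# Predicted house prices, in units of $1000s
predictions = model(tf.constant(x_test[:10], dtype=tf.float32))
print(predictions.numpy())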

Hosted training in Amazon SageMaker

Hosted training is preferred for doing complete training on the full dataset, especially for large-scale, distributed training. When we did the local mode training, the data was accessed from local directories. Keep in mind that Amazon S3 also can be used to hold training data for local mode if you would prefer to keep all of your data in one place. However, before starting hosted training, the data must be uploaded to an Amazon S3 bucket, as shown in the notebook.
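
A minimal sketch of that upload, using the SageMaker Python SDK’s upload_data helper (the S3 key prefixes here are illustrative), could look like the following; the resulting inputs dictionary replaces the file:// paths used for local mode.

import sagemaker

s3_prefix = 'tf-eager-scriptmode-bostonhousing'

# Upload the locally preprocessed data to the default SageMaker S3 bucket
train_s3 = sagemaker.Session().upload_data(path='data/train/',
                                           key_prefix=s3_prefix + '/data/train')
test_s3 = sagemaker.Session().upload_data(path='data/test/',
                                          key_prefix=s3_prefix + '/data/test')

# These S3 URIs are the inputs for hosted training
inputs = {'train': train_s3, 'test': test_s3}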

After uploading the data, we’re ready to set up an Amazon SageMaker Estimator object. It is similar to the local mode Estimator, except that (1) the train_instance_type is set to a specific instance type instead of ‘local’, and (2) the inputs argument to the fit invocation is set to Amazon S3 locations. Also, since we’re ready to do full-scale training, the number of epochs has been increased. With these changes, we simply call the fit method again to start the actual hosted training.

train_instance_type = 'ml.c4.xlarge'
hyperparameters = {'epochs': 30, 'batch_size': 128}

estimator = TensorFlow(entry_point='train.py',
                       model_dir=model_dir,
                       train_instance_type=train_instance_type,
                       train_instance_count=1,
                       hyperparameters=hyperparameters,
                       role=sagemaker.get_execution_role(),
                       base_job_name='tf-eager-scriptmode-bostonhousing',
                       framework_version='1.12.0',
                       py_version='py3',
                       script_mode=True)

estimator.fit(inputs)

predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')
results = predictor.predict(x_test[:10])['predictions'] 

As with local mode training, hosted training produces a model saved in Amazon S3 that we can retrieve and load. We can then make predictions and compare them with the test set. This also demonstrates the modularity of Amazon SageMaker. Having trained the model in Amazon SageMaker, you can now take the model out of Amazon SageMaker and run it anywhere.

Alternatively, you can deploy the model using the Amazon SageMaker hosted endpoints functionality. To do so with TensorFlow, the model must be saved in the TensorFlow SavedModel format required by TensorFlow Serving, rather than as a model checkpoint. As shown in the last code snippet above, deployment to a hosted endpoint is accomplished with a single line of code by calling the Estimator’s deploy method.
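
When you’re done experimenting with the endpoint, remember to delete it to avoid ongoing charges. For example, one way to do this with the predictor object returned by deploy:

# Delete the hosted endpoint when it is no longer needed
sagemaker.Session().delete_endpoint(predictor.endpoint)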

Conclusion

One of the goals of Amazon SageMaker is to enable data scientists and developers to quickly and easily build, train, and deploy ML models. When script mode is combined with TensorFlow eager execution mode, it’s easy to set up a workflow that takes you from rapid prototyping to large-scale training and deployment, and that you can reuse for a wide variety of data science projects. If you prefer TensorFlow’s original static computational graph mode, you also can use script mode. It’s your choice, and it’s just one of the many flexible options provided by Amazon SageMaker.


About the Author

Brent Rabowsky focuses on data science at AWS, and leverages his expertise to help AWS customers with their own data science projects.