Transfer learning for custom labels using a TensorFlow container and “bring your own algorithm” in Amazon SageMaker
Data scientists and developers can use the Amazon SageMaker fully managed machine learning service to build and train machine learning (ML) models, and then directly deploy them into a production-ready hosted environment.
In this blog post we’ll show you how to use Amazon SageMaker to do transfer learning using a TensorFlow container with our own code for training and inference.
Transfer learning is a well-known technique that is used in computer vision problems to re-train an existing trained neural network, such as AlexNet or ResNet, for additional custom labels. Amazon SageMaker also supports transfer learning for image classification through the built-in image classification algorithm and you could use your own labelled image data to re-train a ResNet network. See this image classification documentation for more details on Amazon SageMaker. To understand when to use transfer learning and related guidelines, refer to this.
Although the Amazon SageMaker built-in image classification algorithm is great for a variety of use cases, you might have scenarios where you need a different combination of the pre-trained network and the image data on which it has been trained. For example, some of the criteria to keep in mind are – similarity of new data set to the original data set, size of the new data set, number of labels required, accuracy of the model, size of the trained model, and, last but not least, the amount of compute power needed to re-train. If, for example, you are trying to deploy the trained model on a handheld device, you might be better off with a smaller footprint model such as MobileNet. Alternatively, if you want more compute-efficient models, Xception would be better than VGG16 or Inception.
In this blog post, we take an inception v3 network pre-trained on an ImageNet dataset and re-train it using the Caltech-256 dataset (Griffin, G. Holub, AD. Perona, P. The Caltech 256. Caltech Technical Report). Amazon SageMaker makes it very easy to bundle your own container and import it into Amazon Elastic Container Registry (Amazon ECR). Alternatively, you could use the container provided by Amazon SageMaker: https://github.com/aws/sagemaker-tensorflow-containers. We will customize the Amazon SageMaker TensorFlow container with our own transfer learning code in TensorFlow framework. Then we’ll import this container into Amazon ECR, and use it for model training and inferencing.
I hope this gives you sufficient background, now let us start putting it together.
Launching and preparing the environment
We’ll leverage the Jupyter Notebook instance provided by Amazon SageMaker to customize the provided TensorFlow container. We will import the container to this notebook environment before registering into Amazon ECR. The same container will be used both for training and inferencing. Note that we’ll build on a previous blog post, which explains the basics of importing your own container. The difference is that in this blog post we are customizing the TensorFlow container.
Launch an Amazon SageMaker notebook instance
Log in to the AWS Management Console and go to the Amazon SageMaker console. The notebook instance has all the building blocks to build a custom container and the Docker container Image.
Open the Amazon SageMaker Dashboard. Choose Create notebook instance.
For the purposes of this blog post, we will do the following:
- Place the notebook instance in a subnet inside a VPC, which has internet access.
- You can pick up any of the instance options from the drop-down list, but we recommend ml.m4.xlarge at the minimum.
- Create a new IAM role or attach an existing one. Make sure that this IAM role allows Amazon ECR full access and S3 Bucket access. This access is in addition to the default access that you get when creating a new IAM role for Amazon SageMaker. For this blog post, these permissions should be sufficient to get started, however, if you want to further narrow down the scope of permissions and limit them to the resources, see Amazon SageMaker Roles in the documentation.
- Also, create a security group in this VPC, which has at least port 80 and 8888 opened.
- Enable internet access for this notebook. For purposes of this blog, this should be sufficient, but for additional security considerations of a notebook instance, review Notebook Instance Security in the documentation.
- Keep the rest of the options at the default settings.
Choose Create notebook instance and you should see an instance launching. (It’s in Pending status.) –
Wait for the status to change from Pending to InService.
After the notebook instance is in service, choose Open. Now you have a fully working Jupyter notebook with variety of different environments pre-configured and pre-built for you.
A screen similar to the following opens.
You can see all the pre-configured environments by choosing on New in the top-right corner.
Choose Terminal to launch a terminal.
We will be using this terminal screen for most of our workshop.
Get the Amazon SageMaker provided TensorFlow container
The containers for TensorFlow, MXNet, and Chainer have been provided. We will customize the TensorFlow container with our own training and inference code. To begin, we will first clone the TensorFlow container Git repository from AWS.
In the terminal window, execute the following command:
This should clone the repository. See the following screenshot:
Now our environment is ready to customize the container.
Customize the TensorFlow container for transfer learning
Let’s clone the source code required for the TensorFlow Docker. Go to the directory containing the sample Dockerfiles:
Execute the following to clone the TensorFlow code for transfer learning and inferencing:
After copying you should be able to see a structure as follows:
Note that our helper scripts and actual code are available in a folder called tfblog, which was downloaded from the Git repository. This folder has a file called Dockerfile_blog.cpu, which is a customized version of Dockerfile.cpu which was provided at the TensorFlow container repository. The only major changes are the following:
- Changed WORKDIR to /opt/program as our code for training and inferencing will be available from that relative path when the container is running.
- Some environment variables so that the STDOUT and STDERR messages are readily streamed to Amazon CloudWatch Logs.
- TensorFlow code for transfer learning, inferencing, and helper code for web access is copied over to the container.
Apart from these changes the rest of the Dockerfile is same as the one available on the Amazon SageMaker repository here.
Now that we have the environment and code necessary for training and inferencing, it’s time to build the container.
Building the Amazon SageMaker-provided TensorFlow container
We will now build and push the Docker container to the container registry:
Check the files present in tfblog/tensorflow_code. Look for important files, such as train, which is used while training the model. The Amazon SageMaker invokes Docker container so that this script is called during training process. Similarly, there is a script called predictor.py that loads the trained model parameters and through a wrapper function in serve waits for prediction queries. It parses the input data (image/ text/ csv) and calls the model function. The predictions are then replied back to the user. Note that the code in train and predictor.py is taken from the image retraining part and labeling of images part of the code. The primary changes ensure that the inference code runs as a Flask application instance and responds to ping and invocation requests from Amazon SageMaker. This sample notebook shows you how to package the Docker container and the folder structure required for it to work in Amazon SageMaker.
Now build the container and give it a name, such ‘blog_img_5’or a name that you prefer. This image will be registered in ECR:
Execute the script and you should see a screen similar to the following:
Your container is built and registered with ECR. You can verify this:
Test the container locally for training and inference
Now that the Dockerfile is ready with training and prediction code, it’s time to upload some sample data and test out the model locally before publishing to Amazon SageMaker. Ensure that the training data is placed in this path because this folder path is referenced by the container. The helper script for initiating local training will mount this path to the container instance (see https://github.com/amitkshgit/tfblog/blob/master/local_test/train_local.sh).
You can use the Caltech-256 dataset, as mentioned earlier.
Make sure that the folder structure has all the images belonging to a label grouped in their own folder:
Now, let’s test the training locally from within the SageMaker notebook instance. Go to folder called local_test, which has utilities for building, deploying, and running predictions locally before publishing into Amazon SageMaker.
This is very useful for testing and debugging purposes. Note the helper script called train_local.sh and how it launches the container, mounts the training data path, and passes ‘train’ as the entry point.
Copy a small amount of data to test the functionality ( for example, to ensure that ‘train’ and ‘predict’ are invoked) and not the training accuracy.
To start the local training:
You should see the training progress kicked-off. TensorFlow first downloads the InceptionV3 model, which has already been trained on ImageNet, and starts training the new labels from our sample test data.
Using the container through the Amazon SageMaker console
Now that we have locally tested and validated the functionality of the container, we can use it from the Amazon SageMaker console because its image has already been registered in ECR. Instead of copying data locally, Amazon SageMaker allows us to use the data from Amazon S3. In addition, we can view the logs in Amazon CloudWatch and track the progress of the training. Most importantly we can create the prediction endpoints, which can then be used for inferences.
Make a note of the ECR repository name because it will be needed for our training job as well as for creating the model endpoint.. Open the Amazon ECS console, choose Repositories, and you should see the Repository URL. Take a note of the entire URI.
Run training through Amazon SageMaker
Now we’ll create the training job from the Amazon SageMaker console. Choose Training Jobs, then Create Training Job.. You should see page similar to the one below – make sure to
- Provide the IAM role that was used while launching the notebook instance.
- Set Algorithm to Custom because we are using our own imported container now
- Provide the right ECR image URI for the Docker image that was imported for the TensorFlow container
- For the purposes of this blog post, the settings we just discussed are sufficient. For production purposes, if you want to read more about how training jobs can be protected by using VPC, read Protect Training Jobs by Using an Amazon Virtual Private Cloud in the documentation.
For Resource configuration, Instance type choose ml.m4.4xlarge, and increase Additional volume per instance (GB) (storage) to 20.
We are not changing hyperparameters in this scenario.
Next, we’ll load the training data and specify the right paths in Amazon S3. Make sure that the data is organized as files in the labelled folders and available in an Amazon S3 path. Provide this path to the S3 location as shown in the following screenshot –
Choose Create training job. It will take a few minutes to launch the training container and copy all of the data to it. The training does not start until the container is up and running, and the data has been loaded. A successful job submission should look like the following:
Choose the job name and go to the CloudWatch console to review the metrics that are being streamed. You should see all stdout messages being printed to CloudWatch Logs.
When the status of the job changes to Completed, this means that the job has uploaded the model to Amazon S3. We’ll take this model artifact to create the endpoints for prediction.
Deploy the trained model from Amazon S3 in Amazon SageMaker
Now we’ll deploy the model and call it for predictions. After the training job is complete, choose it, and you should see an option for Create Model:
Populate the page with the rest of the details for this model, as shown in the following screenshot:
Leave the rest of the options at the defaults. Notice that the console has automatically picked the right container image and model artifacts, which are the parameters for the trained model. For the purposes of this blog post these settings are sufficient, but if you want to read more about the protecting your models using VPC, see Protect Models by Using an Amazon Virtual Private Cloud in the documentation.
After the model has been deployed, we’ll create the endpoint configuration. Endpoint configurations specify which models to deploy, relative traffic weighting, and hardware requirements.
Choose the deployed model, and you should see options for Endpoint configurations.
Create the endpoint configuration. Then, choose Endpoints to create an endpoint that will use this endpoint configuration. Give a name to this endpoint and select the existing endpoint configuration:
Select the endpoint configuration that you created in the last step. After it has been created, choose it to see the progress of creation of this endpoint:
It will take a while to create the endpoint. But after it is created, it can be invoked from the notebook. After it is created, its status changes to InService.
Run predictions by calling the Amazon SageMaker endpoint
Open the conda_python2 environment:
Paste the following code into a cell:
If the mode is trained and deployed properly you should get a response similar to the following:
Clean-up and delete the resources
Remember to clean up the resources that were provisioned so you can avoid costs after running this scenario.
To delete the endpoint, highlight the endpoint that was created and select Delete:
Delete endpoint configuration
Similarly, delete the endpoint configuration by selecting the appropriate endpoint configuration and selecting Delete:
Next we’ll delete the model:
Delete notebook instance
Finally delete the notebook instance:
To conclude, I hope this blog post gave you some ideas for taking existing code in a framework, bundling it into an Amazon SageMaker-provided container, and porting it over to Amazon SageMaker. The containerized approach that this service provides makes it very easy for developers and data scientists to focus on their core strengths. You can use any framework you choose, but you still can leverage the wide array of AWS services, such as Amazon S3, CloudWatch, Amazon ECR, and IAM to scale and operate your infrastructure.
About the Author
Amit Sharma is an AWS solutions architect specializing in Analytics and Machine Leaning services. He helps various AWS customers and partners with technical guidance on related projects, thus enabling them to leverage services for their business benefit.