AWS Machine Learning Blog

Training and deploying models using TensorFlow 2 with the Object Detection API on Amazon SageMaker

With the rapid growth of object detection techniques, several frameworks with packaged pre-trained models have been developed to provide users easy access to transfer learning. For example, GluonCV, Detectron2, and the TensorFlow Object Detection API are three popular computer vision frameworks with pre-trained models.

In this post, we use Amazon SageMaker to build, train, and deploy an EfficientDet model using the TensorFlow Object Detection API. It’s built on top of TensorFlow 2, which makes it easy to construct, train, and deploy object detection models.

It also provides the TensorFlow 2 Detection Model Zoo, which is a collection of pre-trained detection models we can use to accelerate our endeavor.

SageMaker is a fully managed service that provides developers and data scientists the ability to build, train, and deploy ML models quickly. SageMaker removes the heavy lifting from each step of the ML process to make it easier to develop high-quality models.

Walkthrough overview

This post demonstrates how to do the following:

  • Label images using SageMaker Ground Truth
  • Generate the dataset TFRecords and label map using SageMaker Processing
  • Fine-tune an EfficientDet model with TensorFlow 2 on SageMaker
  • Monitor your model training with TensorBoard and SageMaker Debugger
  • Deploy your model on a SageMaker endpoint and visualize predictions

Prerequisites

If you want to try out each step yourself, make sure that you have the following in place:

The code repo contains the following folders, each with a step-by-step walkthrough via notebooks.

Preparing the data

You can follow this section by running the cells in this notebook.

The dataset

In this post, we use a dataset from iNaturalist.org and train a model to recognize bees from RGB images.

This dataset contains 500 images of bees that iNaturalist users have uploaded to record and identify their observations. We only use images that users have licensed under a CC0 license.

We placed the dataset in Amazon S3 as a single .zip archive that you can download directly or retrieve by following the instructions in the prepare_data.ipynb notebook on your instance.

The archive contains 500 .jpg image files, and an output.manifest file, which we explain later in the post. We also have 10 test images in the 3_predict/test_images notebook folder that we use to visualize our model predictions.

Labeling images using SageMaker Ground Truth

To train an ML model, you need large, high-quality, labeled datasets. Labeling thousands of images can become tedious and time-consuming. Thankfully, Ground Truth makes it easy to crowdsource this task. Ground Truth offers easy access to public and private human labelers for annotating datasets. It provides built-in workflows and interfaces for common labeling tasks, including drawing bounding boxes for object detection.

You can now move on to creating labeling jobs in Ground Truth. In this post, we don’t cover each step in creating a labeling job. It’s already covered in detail in the post Amazon SageMaker Ground Truth – Build Highly Accurate Datasets and Reduce Labeling Costs by up to 70%.

For our dataset, we follow the recommended workflow from the post Create high-quality instructions for Amazon SageMaker Ground Truth labeling jobs to create our labeling instructions for the labeler.

The following screenshot shows an example of a labeling job configuration in Ground Truth.

At the end of a labeling job, Ground Truth saves an output manifest file in Amazon S3, where each line corresponds to a single image and its labeled bounding boxes, alongside some metadata. See the following code:

{"source-ref":"s3://sagemaker-remars/datasets/na-bees/500/10006450.jpg","bees-500":{"annotations":[{"class_id":0,"width":95.39999999999998,"top":256.2,"height":86.80000000000001,"left":177}],"image_size":[{"width":500,"depth":3,"height":500}]},"bees-500-metadata":{"job-name":"labeling-job/bees-500","class-map":{"0":"bee"},"human-annotated":"yes","objects":[{"confidence":0.75}],"creation-date":"2019-05-16T00:15:58.914553","type":"groundtruth/object-detection"}}
{"source-ref":"s3://sagemaker-remars/datasets/na-bees/500/10022723.jpg","bees-500":{"annotations":[{"class_id":0,"width":93.8,"top":228.8,"height":135,"left":126.8}],"image_size":[{"width":375,"depth":3,"height":500}]},"bees-500-metadata":{"job-name":"labeling-job/bees-500","class-map":{"0":"bee"},"human-annotated":"yes","objects":[{"confidence":0.82}],"creation-date":"2019-05-16T00:41:33.384412","type":"groundtruth/object-detection"}}
{"source-ref":"s3://sagemaker-remars/datasets/na-bees/500/10059108.jpg","bees-500":{"annotations":[{"class_id":0,"width":157.39999999999998,"top":188.60000000000002,"height":131.2,"left":110.8}],"image_size":[{"width":375,"depth":3,"height":500}]},"bees-500-metadata":{"job-name":"labeling-job/bees-500","class-map":{"0":"bee"},"human-annotated":"yes","objects":[{"confidence":0.8}],"creation-date":"2019-05-16T00:57:28.636681","type":"groundtruth/object-detection"}}
{"source-ref":"s3://sagemaker-remars/datasets/na-bees/500/10250726.jpg","bees-500":{"annotations":[{"class_id":0,"width":77.20000000000002,"top":204,"height":94.4,"left":79.6}],"image_size":[{"width":375,"depth":3,"height":500}]},"bees-500-metadata":{"job-name":"labeling-job/bees-500","class-map":{"0":"bee"},"human-annotated":"yes","objects":[{"confidence":0.81}],"creation-date":"2019-05-16T00:34:21.300882","type":"groundtruth/object-detection"}}

For your convenience, we previously completed a labeling job called bees-500 and included the augmented manifest file output.manifest in the dataset.zip archive. In the provided notebook, we upload this dataset to the default S3 bucket before data preparation.
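To get a quick feel for the labels before processing, you can parse a few lines of the augmented manifest yourself. The following is a minimal sketch, assuming output.manifest has been downloaded locally from dataset.zip and that the labeling job attribute is named bees-500, as in the examples above:

import json

# Sketch: print the first few entries of the Ground Truth augmented manifest
with open('output.manifest') as f:
    for line in list(f)[:3]:
        entry = json.loads(line)
        print(entry['source-ref'])
        for box in entry['bees-500']['annotations']:
            print(f"  bee box: left={box['left']}, top={box['top']}, "
                  f"width={box['width']}, height={box['height']}")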

Generating TFRecords and the dataset label map

To use our dataset in the TensorFlow Object Detection API, we must first combine its images and labels and convert them into the TFRecord file format. The TFRecord format is a simple format for storing a sequence of binary records, which makes data reading and processing more efficient. We also need to generate a label map, which defines the mapping between a class ID and a class name.
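For reference, a label map for our single-class dataset typically looks like the following snippet (the Object Detection API reserves ID 0 for the background class, so class IDs start at 1):

item {
  id: 1
  name: 'bee'
}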

In the provided preprocessing notebook, we build a custom SageMaker Processing job with our own processing container. We first build a Docker container with the necessary TensorFlow image, Python libraries, and code to run those steps and push it to an Amazon Elastic Container Registry (Amazon ECR) repository. We then launch a processing job, which runs the pushed container and prepares the data for training. See the following code:

from sagemaker.processing import Processor, ProcessingInput, ProcessingOutput

data_processor = Processor(role=role,
                           image_uri=container, 
                           instance_count=1, 
                           instance_type='ml.m5.xlarge',
                           volume_size_in_gb=30, 
                           max_runtime_in_seconds=1200,
                           base_job_name='tf2-object-detection')

input_folder = '/opt/ml/processing/input'
ground_truth_manifest = '/opt/ml/processing/input/output.manifest'
label_map = '{"0": "bee"}' # each class ID should map to the human readable equivalent
output_folder = '/opt/ml/processing/output'

data_processor.run(
    arguments= [
        f'--input={input_folder}',
        f'--ground_truth_manifest={ground_truth_manifest}',
        f'--label_map={label_map}',
        f'--output={output_folder}'
    ],
    inputs = [
        ProcessingInput(
            input_name='input',
            source=s3_input,
            destination=input_folder
        )
    ],
    outputs= [
        ProcessingOutput(
            output_name='tfrecords',
            source=output_folder,
            destination=f's3://{bucket}/data/bees/tfrecords'
        )
    ]
)

The job takes the .jpg images, the output.manifest file, and the dictionary of classes as Amazon S3 inputs. It splits the dataset into training and validation sets, generates the TFRecord and label_map.pbtxt files, and outputs them into the Amazon S3 destination of our choice.

Out of the total of 500 images, we use 450 for training and 50 for validation. During training, the algorithm uses the first set to fit the model and the second set for evaluation.

You should end up with three files named label_map.pbtxt, train.records, and validation.records in the Amazon S3 destination you defined (s3://{bucket}/data/bees/tfrecords).
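As a quick check, you can list the generated files directly from the notebook; the following is a sketch assuming the bucket variable defined earlier in the notebook and the default output prefix used above:

import boto3

# Sketch: list the processing job outputs in the destination prefix
s3 = boto3.client('s3')
response = s3.list_objects_v2(Bucket=bucket, Prefix='data/bees/tfrecords/')
for obj in response.get('Contents', []):
    print(obj['Key'])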

We can now move to model training!

Fine-tuning an EfficientDet model with TensorFlow 2 on SageMaker

You can follow this section by running the cells in this notebook.

Building a TensorFlow 2 Object Detection API Docker container

In this step, we first build and push a Docker container based on the official TensorFlow GPU image.

We install the TensorFlow Object Detection API and the sagemaker-training-toolkit library to make it easily compatible with SageMaker.
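The exact Dockerfile lives in the repository, but a simplified sketch of the idea looks like the following (the base image tag and paths here are illustrative, not the repo's exact file):

FROM tensorflow/tensorflow:2.2.0-gpu

RUN apt-get update && apt-get install -y git protobuf-compiler

# Install the TensorFlow 2 Object Detection API
RUN git clone https://github.com/tensorflow/models.git /tensorflow/models
WORKDIR /tensorflow/models/research
RUN protoc object_detection/protos/*.proto --python_out=. && \
    cp object_detection/packages/tf2/setup.py . && \
    pip install .

# Make the container compatible with SageMaker script mode
RUN pip install sagemaker-training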

SageMaker offers several ways to run our custom container. For more information, see Amazon SageMaker Custom Training containers. For this post, we use script mode and instantiate our SageMaker estimator as a CustomFramework. This allows us to work dynamically with our training code stored in the source_dir folder and avoids having to push a new container image to Amazon ECR at every code change.
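CustomFramework is a thin wrapper defined in the repository code; a possible minimal definition, assuming the SageMaker Python SDK v2 Framework base class, is sketched below:

from sagemaker.estimator import Framework

class CustomFramework(Framework):
    # Sketch: run script mode training with our own Docker image
    def __init__(self, entry_point, source_dir=None, hyperparameters=None, **kwargs):
        super().__init__(entry_point, source_dir, hyperparameters, **kwargs)

    def create_model(self, **kwargs):
        # We only use this estimator for training; deployment happens
        # separately with the TensorFlow Serving Model class
        return None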

The following screenshot shows the corresponding training folder structure.

Setting up TensorBoard real-time monitoring using SageMaker Debugger

To capture real-time model training and performance metrics, we use TensorBoard and SageMaker Debugger. We start by defining a TensorBoardOutputConfig in which we specify the S3 path where we save the TensorFlow checkpoints. See the following code:

from sagemaker.debugger import TensorBoardOutputConfig

tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path=tensorboard_s3_prefix,
    container_local_output_path='/opt/training/'
)

Each time the training script writes data to the container_local_output_path, SageMaker uploads it to Amazon S3, allowing us to monitor in real time.
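In other words, any TensorBoard event files written under /opt/training/ inside the container show up in S3 shortly after. The TF2 Object Detection API training loop does this for us when model_dir points to /opt/training, but purely as an illustration, a custom training script could write its own summaries like this:

import tensorflow as tf

# Sketch: summaries written under the local TensorBoard path are mirrored to S3 by Debugger
writer = tf.summary.create_file_writer('/opt/training/train')
with writer.as_default():
    tf.summary.scalar('example_metric', 0.5, step=0)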

Training a TensorFlow 2 object detection model using SageMaker

We fine-tune a pre-trained EfficientDet model available in the TensorFlow 2 Object Detection Model Zoo, because it offers a good balance between accuracy on the COCO 2017 dataset and inference efficiency.

We save the model checkpoint and its base pipeline.config in the source_dir folder, along with our training code.

We then adjust the pipeline.config so TensorFlow 2 can find the TFRecord and label_map.pbtxt files when they are loaded inside the container from Amazon S3.
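For illustration, the relevant fragment of pipeline.config ends up looking something like the following; the exact container paths depend on the channel names you pass to the estimator, so treat these values as an example:

train_input_reader {
  label_map_path: "/opt/ml/input/data/train/label_map.pbtxt"
  tf_record_input_reader {
    input_path: "/opt/ml/input/data/train/train.records"
  }
}
eval_input_reader {
  label_map_path: "/opt/ml/input/data/train/label_map.pbtxt"
  tf_record_input_reader {
    input_path: "/opt/ml/input/data/train/validation.records"
  }
}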

Your source_dir folder should now look like the following screenshot.

We use run_training.sh as the training entry point. This is the main script that SageMaker runs during training, and it performs the following steps (a simplified sketch of the script follows the list):

  1. Launch the model training based on the specified hyperparameters.
  2. Launch the model evaluation based on the last checkpoint saved during the training.
  3. Prepare the trained model for inference using the exporter script.
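The actual script lives in the repository's source_dir folder; a simplified sketch of what it does, using the standard TF2 Object Detection API scripts, could look like this (hyperparameter values passed in by SageMaker are shown as shell variables for brevity):

#!/bin/bash
# Sketch of run_training.sh: train, evaluate, then export for inference

# 1. Train based on the specified hyperparameters
python model_main_tf2.py \
    --model_dir=/opt/training \
    --pipeline_config_path=pipeline.config \
    --num_train_steps=${NUM_TRAIN_STEPS}

# 2. Evaluate the last checkpoint saved during training
python model_main_tf2.py \
    --model_dir=/opt/training \
    --pipeline_config_path=pipeline.config \
    --checkpoint_dir=/opt/training

# 3. Export the trained model so SageMaker packages it as model.tar.gz
python exporter_main_v2.py \
    --input_type=image_tensor \
    --pipeline_config_path=pipeline.config \
    --trained_checkpoint_dir=/opt/training \
    --output_directory=/opt/ml/model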

You’re ready to launch the training job with the following commands:

hyperparameters = {
    "model_dir":"/opt/training",        
    "pipeline_config_path": "pipeline.config",
    "num_train_steps": 1000,    
    "sample_1_of_n_eval_examples": 1
}

estimator = CustomFramework(image_uri=container,
                            role=role,
                            entry_point='run_training.sh',
                            source_dir='source_dir/',
                            instance_count=1,
                            instance_type='ml.p3.8xlarge',
                            hyperparameters=hyperparameters,
                            tensorboard_output_config=tensorboard_output_config,
                            base_job_name='tf2-object-detection')

# We make sure to specify wait=False, so our notebook is not waiting for the training job to finish.

estimator.fit(inputs, wait=False)

When the job is running, Debugger captures the TensorBoard data into the chosen Amazon S3 location, and we can monitor the progress in real time with TensorBoard. Because we indicated the log directory when configuring the TensorBoardOutputConfig object, we can use that S3 path as the --logdir parameter.

Now, we can start up the TensorBoard server with the following command:

job_artifacts_path = estimator.latest_job_tensorboard_artifacts_path()
tensorboard_s3_output_path = f'{job_artifacts_path}/train'

!TF_CPP_MIN_LOG_LEVEL=3 AWS_REGION=<ADD YOUR REGION HERE> tensorboard --logdir=$tensorboard_s3_output_path

TensorBoard runs on your notebook instance, and you can open it by visiting the URL https://your-notebook-instance-name.notebook.your-region.sagemaker.aws/proxy/6006/.

The following screenshot shows the TensorBoard dashboard after the training is over.

We can also look at the TensorBoard logs generated by the evaluation step. These are accessible under the following eval folder:

tensorboard_s3_output_path = f'{job_artifacts_path}/eval'
region_name = 'eu-west-1'

!TF_CPP_MIN_LOG_LEVEL=3 AWS_REGION=$region_name tensorboard --logdir=$tensorboard_s3_output_path

This allows us to compare the ground truth data (right image in the following screenshot) and the predictions (left image).

Deploying your object detection model into a SageMaker endpoint

When the training is complete, the model is exported as a TensorFlow inference graph (a .pb file) and saved by SageMaker in a model.tar.gz archive in Amazon S3.

SageMaker provides a managed TensorFlow Serving environment that makes it easy to deploy TensorFlow models.

To access the model_artefact path, you can open the training job on the SageMaker console, as in the following screenshot.
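Alternatively, if the estimator object from the training step is still in scope, you can retrieve the artifact path programmatically; a short sketch:

# Sketch: the SDK exposes the S3 URI of model.tar.gz once training completes
model_artefact = estimator.model_data
print(model_artefact)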

When you have the S3 model artifact path, you can use the following code to create a SageMaker endpoint:

from sagemaker.tensorflow.serving import Model
from sagemaker.utils import name_from_base

model_artefact = '<your-model-s3-path>'

model = Model(model_data=model_artefact,
              name=name_from_base('tf2-object-detection'),
              role=role,
              framework_version='2.2')
              
predictor = model.deploy(initial_instance_count=1, instance_type='ml.m5.xlarge')

When the endpoint is up and running, we can send prediction requests to it with test images and visualize the results using the Matplotlib library:

# image_file_to_tensor is a helper defined in the notebook that loads the .jpg into a NumPy array
img = image_file_to_tensor('test_images/22673445.jpg')

input = {
  'instances': [img.tolist()]
}

detections = predictor.predict(input)['predictions'][0]
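The exported detection model returns, among other keys, detection_boxes (normalized [ymin, xmin, ymax, xmax] coordinates), detection_classes, and detection_scores. A minimal Matplotlib sketch for drawing the more confident boxes, assuming img is the NumPy array loaded above, could look like the following:

import matplotlib.pyplot as plt
import matplotlib.patches as patches

# Sketch: draw boxes with a confidence score above 0.5 on the test image
height, width = img.shape[:2]
fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(img)
for box, score in zip(detections['detection_boxes'], detections['detection_scores']):
    if score < 0.5:
        continue
    ymin, xmin, ymax, xmax = box
    rect = patches.Rectangle((xmin * width, ymin * height),
                             (xmax - xmin) * width, (ymax - ymin) * height,
                             fill=False, edgecolor='red', linewidth=2)
    ax.add_patch(rect)
plt.show()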

The following screenshot shows an example of our output.

Summary

In this post, we covered an end-to-end process of collecting and labeling data using Ground Truth, preparing and converting the data to TFRecord format, and training and deploying a custom object detection model using the TensorFlow Object Detection API.

Get started today! You can learn more about SageMaker and kick off your own machine learning experiments and solutions by visiting the Amazon SageMaker console.


About the Authors

Sofian Hamiti is an AI/ML specialist Solutions Architect at AWS. He helps customers across industries accelerate their AI/ML journey by helping them build and operationalize end-to-end machine learning solutions.

Othmane Hamzaoui is a Data Scientist working in the AWS Professional Services team. He is passionate about solving customer challenges using Machine Learning, with a focus on bridging the gap between research and business to achieve impactful outcomes. In his spare time, he enjoys running and discovering new coffee shops in the beautiful city of Paris.