Build Highly Accurate Training Datasets with Amazon SageMaker Ground Truth

In this tutorial, you’ll learn how to use Amazon SageMaker Ground Truth to build a highly accurate training dataset for an image classification use case. Amazon SageMaker Ground Truth enables you to build highly accurate training datasets for labeling jobs that include a variety of use cases, such as image classification, object detection, semantic segmentation, and many more. With Amazon SageMaker Ground Truth, you can get easy access to human labelers, and use either built-in or custom workflows and interfaces for common labeling tasks.

Amazon SageMaker Ground Truth helps you to reduce the time and effort to create datasets by using machine learning to automatically apply labels to data. This is achieved by constantly learning from labels created by human labelers.

During this tutorial, you’ll label a dataset with images of vehicles such as cars, trucks, limousines, vans, and motorcycles (bikes).

Amazon SageMaker Ground Truth gives you access to different workforce options:

  • Amazon Mechanical Turk – You get access to an on-demand, 24/7 workforce of over 500,000 independent contractors worldwide. This option is recommended for non-sensitive data.
  • Private workforce – You can set up access for a team of your own employees or contractors. This option is recommended for sensitive data or when domain expertise is required for the labeling job.
  • Vendor workforce – You can get access to a list of third-party vendors approved by Amazon, who specialize in providing data labeling services, available through AWS Marketplace.

For this tutorial, we use Amazon Mechanical Turk.

To create a labeling job with Amazon SageMaker Ground Truth, follow these steps:

  1. Prepare the images that need to be labeled and upload them to Amazon Simple Storage Service (Amazon S3).
  2. Specify the possible categories for the labels.
  3. Create instructions for human labelers.
  4. Write a labeling job specification.
  5. Submit a job to Amazon Mechanical Turk.

About this Tutorial

Time: 10+ minutes
Use Case: Machine Learning
Products: Amazon SageMaker Ground Truth, Amazon Mechanical Turk, Amazon Simple Storage Service
Last Updated: August 23, 2019

Before you begin

Before you begin this tutorial, you must have an AWS account. If you do not already have an account, click Sign up for AWS and create a new account.


Step 1 – Log in to the Amazon SageMaker console


Open the AWS Management Console in a new window, so you can keep this tutorial open.


In the AWS Console search bar, type SageMaker and select Amazon SageMaker to open the service console.


Step 2 – Create an Amazon SageMaker notebook instance for data preparation

In this step, you download a sample dataset that needs to be labeled and upload it to an Amazon Simple Storage Service (Amazon S3) bucket that you create. Because the service expects to get the dataset from Amazon S3, you must complete these steps before you start a labeling job with Amazon SageMaker Ground Truth.

To download the sample dataset and upload it to Amazon S3, you use an Amazon SageMaker notebook instance. To upload the dataset, your Amazon SageMaker notebook instance needs secure access to Amazon S3. To provide these permissions, Amazon SageMaker can create a new AWS Identity and Access Management (IAM) role with the required permissions and assign it to your instance for you.

Note – You can use any client with permissions to access your Amazon S3 buckets to perform these steps. In this tutorial, we use an Amazon SageMaker notebook instance for simplicity and convenience. If you want to use a local client with the AWS Command Line Interface (AWS CLI), Python, and boto3 installed, you can proceed to Step 3 – 3.


From the Amazon SageMaker > Notebook instances page, select Create notebook instance.


In the Create notebook instance section, in the Notebook instance name text box, enter a name for the notebook instance.

For example, in this tutorial we specified GroundTruthDatasetInstance as the instance name.


To create an IAM role, from the IAM role drop-down list, select Create a new role.


In the Create an IAM role dialog box, select Any S3 bucket.

This allows your Amazon SageMaker instance to access all the Amazon S3 buckets in your account.

If you already have a bucket that you’d like to use instead, select Specific S3 buckets and specify the bucket name.


Select Create role.
Amazon SageMaker creates the AmazonSageMaker-ExecutionRole-*** role. 


Keep the default settings for the other options and click Create notebook instance.

In the Notebook instances section, the new GroundTruthDatasetInstance notebook instance entry appears, with a Status of Pending.
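
If you prefer to script this step instead of using the console, the same instance can be requested through boto3. The following is a minimal sketch, not the tutorial's own method: the helper names, the placeholder role ARN, and the ml.t2.medium instance type are assumptions chosen for illustration, and the client must be created with credentials that are allowed to call SageMaker.

```python
def notebook_request(name, role_arn, instance_type="ml.t2.medium"):
    """Build keyword arguments for SageMaker's create_notebook_instance call.

    The instance type default is an assumption; pick whatever size you need.
    """
    return {
        "NotebookInstanceName": name,
        "InstanceType": instance_type,
        "RoleArn": role_arn,
    }


def create_notebook(sm_client, name, role_arn):
    """Create the notebook instance and wait until its status is InService.

    sm_client is expected to be boto3.client("sagemaker").
    """
    sm_client.create_notebook_instance(**notebook_request(name, role_arn))
    waiter = sm_client.get_waiter("notebook_instance_in_service")
    waiter.wait(NotebookInstanceName=name)
```

For example, `create_notebook(boto3.client("sagemaker"), "GroundTruthDatasetInstance", role_arn)` would start the instance, where `role_arn` is the ARN of the AmazonSageMaker-ExecutionRole-*** role created above.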

Step 3 – Prepare and upload your dataset to Amazon S3

In this step, you use the Amazon SageMaker notebook instance that you created in Step 2 to prepare your dataset for the Amazon SageMaker Ground Truth labeling job and upload it to Amazon S3. The images that you upload to Amazon S3 for the labeling job are from the publicly available Google Open Images Dataset [1], which has several categories of images. For this tutorial, download only images of trucks, limousines, vans, cars, and motorcycles (bikes). Because the images in the Google Open Images Dataset are already labeled, you can use this information to verify the quality of the labeling job after you get the results.


In the Notebook instances section, after the GroundTruthDatasetInstance Status changes from Pending to InService, from the Actions column, select Open Jupyter.


After GroundTruthDatasetInstance appears in the Jupyter Files tab, from the New drop-down list, select conda_python3.


To upload your images to Amazon S3, copy the following code into the code cells in your instance.

The default bucket name for this tutorial is sm-gt-dataset.

To specify a different bucket name, change the BUCKET variable in the following Python script and replace sm-gt-dataset in the remainder of the tutorial with your bucket name.

Note – To shorten the code blocks, you can split the code into multiple cells.
To create a new cell, select File > +.
Or, select Insert > Insert Cell Below.
import itertools
import boto3

# Download and process the Open Images annotations.
# The annotations CSV URL is missing from the command below; append it before running.
!wget -O openimgs-annotations.csv
with open('openimgs-annotations.csv', 'r') as f:
    all_labels = [line.strip().split(',') for line in f.readlines()]

# Extract image IDs in each of our desired classes
ims = {}
ims['Truck'] = [label[0] for label in all_labels if (label[2] == '/m/07r04' and label[3] == '1')][:300]
ims['Limousine'] = [label[0] for label in all_labels if (label[2] == '/m/01lcw4' and label[3] == '1')][:300]
ims['Van'] = [label[0] for label in all_labels if (label[2] == '/m/0h2r6' and label[3] == '1')][:300]
ims['Car'] = [label[0] for label in all_labels if (label[2] == '/m/0pg52' and label[3] == '1')][:300]
ims['Motorcycle'] = [label[0] for label in all_labels if (label[2] == '/m/04_sv' and label[3] == '1')][:300]
num_classes = len(ims)

# Prepare data and upload to your S3 bucket
BUCKET = 'sm-gt-dataset'
EXP_NAME = 'ground-truth-demo' # Any valid S3 prefix.
for key in ims.keys():
    ims[key] = set(ims[key])

# Create a new bucket for images to be labeled
s3 = boto3.client('s3')
sess = boto3.session.Session()
region = sess.region_name
s3.create_bucket(Bucket=BUCKET,
                 CreateBucketConfiguration={'LocationConstraint': region})

# Copy the images to your local bucket
for img_id, img in enumerate(itertools.chain.from_iterable(ims.values())):
    copy_source = {
        'Bucket': 'open-images-dataset',
        'Key': 'test/{}.jpg'.format(img)
    }
    s3.copy(copy_source, BUCKET, '{}/images/{}.jpg'.format(EXP_NAME, img))


To run the code, select Cell > Run All.
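
One detail to watch when the script creates the bucket: Amazon S3 rejects a LocationConstraint of us-east-1, so in that Region the CreateBucketConfiguration argument has to be left out. The helper below is a small sketch of one way to handle this; the function name is hypothetical and not part of the tutorial's script.

```python
def create_bucket_kwargs(bucket, region):
    """Return keyword arguments for s3.create_bucket that work in any Region.

    us-east-1 is the default S3 location and must not be passed as a
    LocationConstraint, so the configuration is only added for other Regions.
    """
    kwargs = {"Bucket": bucket}
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs
```

With this helper, the bucket creation call in the script could be written as `s3.create_bucket(**create_bucket_kwargs(BUCKET, region))`.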

After the code executes, the dataset is available in the following default location:
Amazon S3 > sm-gt-dataset > ground-truth-demo > images.

This folder should now contain 1014 images of vehicles including cars, trucks, limousines, vans, and motorcycles (bikes).

Note – If you specified a different name for your bucket, the dataset is available in the folder for the bucket you created:
Amazon S3 > your_bucket_name > ground-truth-demo > images.
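
To confirm the upload without browsing the console, you can count the objects under the images prefix. This is a sketch under the assumption that `s3_client` is a `boto3.client('s3')` with read access to your bucket; the function name is hypothetical. It uses the list_objects_v2 paginator because a single listing call returns at most 1,000 keys.

```python
def count_images(s3_client, bucket, prefix, suffix=".jpg"):
    """Count the objects under a prefix whose keys end with the given suffix."""
    paginator = s3_client.get_paginator("list_objects_v2")
    total = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages for an empty prefix carry no "Contents" key at all.
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(suffix):
                total += 1
    return total
```

For example, `count_images(boto3.client('s3'), 'sm-gt-dataset', 'ground-truth-demo/images/')` returns the number of .jpg files that were copied into the bucket.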


[1] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, T. Duerig, and V. Ferrari. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. arXiv:1811.00982, 2018.

Disclaimer – The Open Images Dataset V5, which is created by Google Inc., has not been modified for this tutorial. The annotations are licensed by Google Inc. under the CC BY 4.0 license. The images are listed as having a CC BY 2.0 license. For detailed statistics about the data and evaluation of models trained on it, see the paper cited above.

Step 4 – Create an Amazon SageMaker Ground Truth labeling job

In this step, you create a SageMaker Ground Truth labeling job for the dataset you prepared in Step 3. The goal is to apply labels to these images to sort them into five categories – car, truck, limousine, van, and motorcycle (bike).  


From the Amazon SageMaker dashboard, select Labeling Jobs > Create labeling job.


In the Job overview section, in the Job name text box, enter vehicle-labeling-demo.


Specify the location of your input dataset. Amazon SageMaker Ground Truth expects an input manifest file with a reference to an image in each line.

For example, each image must have an entry in the manifest file in the following format:

{"source-ref": "s3://sm-gt-dataset/ground-truth-demo/images/2563c7e7e3432a6e.jpg"}
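
If you would rather build the manifest yourself, the format above is simple to generate: one JSON object per line, each with a source-ref key. The sketch below assumes you already have the list of image keys in your bucket; the function name is hypothetical.

```python
import json


def manifest_lines(bucket, keys):
    """Return one JSON Lines manifest entry per S3 object key."""
    return [
        json.dumps({"source-ref": "s3://{}/{}".format(bucket, key)})
        for key in keys
    ]
```

Writing `"\n".join(manifest_lines(bucket, keys))` to a file and uploading it to Amazon S3 gives you an input manifest in the same shape as the example entry.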


To have Amazon SageMaker Ground Truth automatically create this manifest, in the Input dataset location section, click Create manifest file.


In the Input dataset location text box, enter the location of the images and click Create.

For example: s3://sm-gt-dataset/ground-truth-demo/images/

Make sure you enter the correct bucket and folder names that you specified in Step 3 – 3.


After the manifest has been created, click Use this manifest.


Specify the path to the Amazon S3 bucket where you want to store your labeled dataset.

For example: s3://sm-gt-dataset/ground-truth-demo/labeled_data/


Select the IAM role created in Step 2.
Or, follow the instructions in the drop-down list to create a new IAM role.


In the Additional configuration section, you can select options to label subsets of your dataset and specify encryption settings.

For this example, do not select any options.


In the Task type window, from the Task category drop-down list, select Image.


For the Task selection option, select Image classification. Click Next.


In the Select workers and configure tool section, for the Worker types, select Public.


From the Price per task drop-down list, keep the default option.

This is a low-complexity task that takes less than five seconds for labelers, so the default option is correct.

If your task is more complex, such as object detection or semantic segmentation, you should choose a higher price per task.


Select these check boxes:

  • The dataset does not contain adult content.
  • I understand that my dataset will be viewed by the Amazon Mechanical Turk public workforce and I acknowledge that my dataset does not contain personally identifiable information (PII).


Make sure the Automated data labeling checkbox is cleared.

Select this option only if you have more than 1,250 images, which is the required threshold to enable automated labeling with active learning. For larger datasets, automated data labeling can reduce the total cost of labeling.

For more information about when to select automated data labeling, see Using Automated Data Labeling in the Amazon SageMaker Developer Guide.


In the Additional configuration section, in the Number of workers per dataset object text box, keep the default setting of 3 workers.


In the Image classification labeling tool section, in both template text boxes, add instructions for the labelers. Click Submit.

These images show an example of the instructions in the template and a preview of how the labeling tool appears to the labelers.


To verify your new labeling job was added, in the console, select Amazon SageMaker > Labeling Jobs.

The new vehicle-labeling-demo labeling job should appear with a Status of In progress and a Task type of Image classification.

Note – This job can take several hours to complete.

After the data is labeled by the Amazon Mechanical Turk public workforce, the Status changes to Complete.


To see the job progress, select Amazon SageMaker > Labeling Jobs > vehicle-labeling-demo.
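
You can also poll the job from code with the SageMaker describe_labeling_job API. The following sketch assumes `sm_client` is a `boto3.client('sagemaker')`; the function name and the shape of the returned dictionary are illustrative choices. Note that the API reports status strings such as InProgress and Completed, which differ slightly from the wording shown in the console.

```python
def labeling_progress(sm_client, job_name):
    """Summarize the status and label counts of a Ground Truth labeling job."""
    resp = sm_client.describe_labeling_job(LabelingJobName=job_name)
    counters = resp["LabelCounters"]
    return {
        "status": resp["LabelingJobStatus"],
        "labeled": counters["TotalLabeled"],
        "unlabeled": counters["Unlabeled"],
    }
```

For example, `labeling_progress(boto3.client('sagemaker'), 'vehicle-labeling-demo')` returns the current status together with how many images have been labeled so far.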

Step 5 – Review labeling job results

In this step, you review the results of the labeling job after it is complete.


In the console, select Amazon SageMaker > Labeling jobs > vehicle-labeling-demo.

In the Labeled dataset objects section, the thumbnails of images from your dataset appear with the corresponding labels below them.


To see the full results of the labeling job, in the Labeling job summary section, click the Output dataset location link.

For example: s3://sm-gt-dataset/ground-truth-demo/labeled_data/vehicle-labeling-demo/  


Navigate to the manifests directory and open the output.manifest file.

For example: s3://sm-gt-dataset/ground-truth-demo/labeled_data/vehicle-labeling-demo/manifests/output/output.manifest

The output.manifest is a JSON Lines format file, which is similar to the autogenerated input.manifest file from Step 4, but includes these additional fields:

  • vehicle-labeling-demo – This field specifies the label as a numeric class index (0–4).
  • vehicle-labeling-demo-metadata – This field specifies labeling metadata, such as the confidence score, the name of the job, the string name for the label (such as car, truck, or van), whether the label was human or machine annotated (active learning), and other information.
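
As a rough illustration of how these fields can be read, the sketch below parses the lines of an output.manifest file into plain Python dictionaries. The metadata key names used here (class-name, confidence, human-annotated) are assumptions based on the field descriptions above, so check them against your own file; the function name is hypothetical.

```python
import json


def parse_output_manifest(lines, job_name="vehicle-labeling-demo"):
    """Turn output.manifest lines into a list of simple label records."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines, such as a trailing newline
        entry = json.loads(line)
        meta = entry[job_name + "-metadata"]
        records.append({
            "image": entry["source-ref"],
            "label_id": int(entry[job_name]),
            "label_name": meta.get("class-name"),
            "confidence": meta.get("confidence"),
            "human_annotated": meta.get("human-annotated") == "yes",
        })
    return records
```

After downloading the file, `parse_output_manifest(open('output.manifest'))` gives one record per labeled image.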

You can use the label information in the output.manifest file to train an image classifier that can classify images of vehicles into car, truck, limousine, van, or motorcycle (bike).

For more information about how to use the output.manifest file with Amazon SageMaker to train models, see the Easily train models using datasets labeled by Amazon SageMaker Ground Truth blog post.
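
Because the Open Images annotations already contain a label for each image, you can compare them with the labels from the job to estimate labeling quality, as mentioned in Step 3. The sketch below assumes you have built two dictionaries that map an image ID to a class name, one from output.manifest and one from the original annotations; the function name and the dictionaries are hypothetical. An image that belongs to more than one class in Open Images would need extra handling that this sketch does not attempt.

```python
def label_accuracy(predicted, reference):
    """Return the fraction of shared images whose labels agree (case-insensitive)."""
    shared = [img for img in predicted if img in reference]
    if not shared:
        return 0.0
    correct = sum(
        1 for img in shared
        if predicted[img].lower() == reference[img].lower()
    )
    return correct / len(shared)
```

A result close to 1.0 suggests the labelers agreed with the original annotations for most images.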

Step 6 – Terminate your resources

The final step of this tutorial is to terminate your Amazon SageMaker related resources. Terminating resources that you are not actively using reduces your costs and is a best practice. Any resources that you do not terminate will result in charges to your account. 


In the AWS Management Console, select Amazon SageMaker > Notebook instances > GroundTruthDatasetInstance.


From the Actions drop-down list, select Stop.

Note – After your instance is stopped, it does not incur charges.

To remove the instance after it's stopped, from the Actions drop-down list, select Delete.
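
The same cleanup can be scripted. This is a minimal sketch that assumes `sm_client` is a `boto3.client('sagemaker')`; the function name is hypothetical. An instance must be stopped before it can be deleted, so the code waits for the stop to finish first.

```python
def stop_and_delete_notebook(sm_client, name):
    """Stop a notebook instance, wait until it is stopped, then delete it."""
    sm_client.stop_notebook_instance(NotebookInstanceName=name)
    waiter = sm_client.get_waiter("notebook_instance_stopped")
    waiter.wait(NotebookInstanceName=name)
    sm_client.delete_notebook_instance(NotebookInstanceName=name)
```

For example, `stop_and_delete_notebook(boto3.client('sagemaker'), 'GroundTruthDatasetInstance')` removes the instance created in Step 2.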



You have learned how to use Amazon SageMaker Ground Truth to build a training dataset with Amazon Mechanical Turk. You can now use these labeled images to train image classification models that can classify images of vehicles into the five labeled categories – cars, trucks, vans, limousines, and motorcycles.

To learn more about how to use labeled images from Ground Truth jobs to easily train models, see the Easily train models using datasets labeled by Amazon SageMaker Ground Truth blog post.
