Build a highly accurate training dataset with Amazon SageMaker Ground Truth

In this tutorial, you’ll learn how to use Amazon SageMaker Ground Truth to build a highly accurate training dataset for an image classification use case. Amazon SageMaker Ground Truth enables you to build highly accurate training datasets for labeling jobs that include a variety of use cases, such as image classification, object detection, semantic segmentation, and many more. With Amazon SageMaker Ground Truth, you can get easy access to human labelers, and use either built-in or custom workflows and interfaces for common labeling tasks.

Amazon SageMaker Ground Truth helps you to reduce the time and effort to create datasets by using machine learning to automatically apply labels to data. This is achieved by constantly learning from labels created by human labelers.

Amazon SageMaker Ground Truth gives you access to different workforce options:

  • Amazon Mechanical Turk – You get access to an on-demand, 24/7 workforce of over 500,000 independent contractors worldwide. This option is recommended for non-sensitive data.
  • Private – You can set up access for a team of your own employees or contractors. This option is recommended for sensitive data or when domain expertise is required for the labeling job.
  • Vendor managed – You can get access to a list of third-party vendors approved by Amazon, who specialize in providing data labeling services, available through AWS Marketplace.

For this tutorial, you use Amazon Mechanical Turk to label a dataset with images of vehicles such as cars, trucks, limousines, vans, and motorcycles (bikes). In this tutorial, you complete the following steps:

  1. Create an Amazon SageMaker Notebook instance
  2. Download the public dataset and upload the dataset to Amazon S3
  3. Create the Amazon SageMaker Ground Truth labeling job
  4. Review the results of the labeling job
  5. Clean up your resources

About this Tutorial

Time: 30 minutes
Cost: Approx. $120
Use Case: Machine Learning
Products: Amazon SageMaker Ground Truth
Audience: Data analysts, Developers
Level: Intermediate
Last Updated: February 1, 2021

Before you begin

You must have an AWS account to complete this tutorial. If you do not already have an account, choose Sign up for AWS and create a new account.

Step 1. Create an Amazon SageMaker notebook instance for data preparation

In this step, you create the notebook instance that you use to download and process your data. As part of the creation process, you also create an Identity and Access Management (IAM) role that allows Amazon SageMaker to access data in Amazon S3.

a. Sign in to the Amazon SageMaker console, and in the top right corner, select your preferred AWS Region. This tutorial uses the US West (Oregon) Region.

b. In the left navigation pane, choose Notebook instances, then choose Create notebook instance.

c. On the Create notebook instance page, in the Notebook instance settings box, fill in the following fields:

  • For Notebook instance name, type SageMaker-Ground-Truth-Tutorial.
  • For Notebook instance type, choose ml.t2.medium.
  • For Elastic inference, keep the default selection of none.

d. In the Permissions and encryption section, for IAM role, choose Create a new role, and in the Create an IAM role dialog box, select Any S3 bucket and choose Create role.

Note: If you already have a bucket that you’d like to use instead, choose Specific S3 buckets and specify the bucket name.

Amazon SageMaker creates the AmazonSageMaker-ExecutionRole-*** role. 

e. Keep the default settings for the remaining options and choose Create notebook instance.

In the Notebook instances section, the new SageMaker-Ground-Truth-Tutorial notebook instance is displayed with a Status of Pending. The notebook is ready when the Status changes to InService.
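The console steps above can also be scripted. The following is a minimal sketch, assuming boto3 is installed and AWS credentials are configured. The role ARN is a placeholder, and `build_notebook_request` and `create_notebook` are helper names invented for illustration, not part of the tutorial.

```python
def build_notebook_request(name, role_arn, instance_type="ml.t2.medium"):
    """Return the parameters for SageMaker's CreateNotebookInstance call."""
    return {
        "NotebookInstanceName": name,
        "InstanceType": instance_type,
        "RoleArn": role_arn,
    }


def create_notebook(request, region_name="us-west-2"):
    """Create the notebook instance and block until it is InService."""
    import boto3  # imported here so the request builder works without boto3

    sm = boto3.client("sagemaker", region_name=region_name)
    sm.create_notebook_instance(**request)
    sm.get_waiter("notebook_instance_in_service").wait(
        NotebookInstanceName=request["NotebookInstanceName"]
    )


request = build_notebook_request(
    "SageMaker-Ground-Truth-Tutorial",
    # Placeholder ARN (assumption): replace with the role created in step d.
    "arn:aws:iam::123456789012:role/AmazonSageMaker-ExecutionRole-EXAMPLE",
)
# create_notebook(request)  # uncomment to create the instance (incurs charges)
```

Keeping the request as a plain dictionary makes it easy to review the settings before any resource is created.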

Step 2. Prepare and upload your dataset to Amazon S3

In this step, you use your Amazon SageMaker notebook instance to prepare your dataset for the Amazon SageMaker Ground Truth labeling job and upload it to Amazon S3.

The images that you upload to Amazon S3 for the labeling job are from the publicly available Google Open Images Dataset (see footnote 1), which has several categories of images. This tutorial downloads only images of trucks, limousines, vans, cars, and motorcycles (bikes). Because the images in the Google Open Images Dataset are already labeled, you can use this information to verify the quality of the labeling job after you get the results.

a. After your SageMaker-Ground-Truth-Tutorial notebook instance status changes to InService, choose Open Jupyter.

b. In Jupyter, choose New and then choose conda_python3.

c. In a new code cell on your Jupyter notebook, copy and paste the following code and choose Run.

Note: Make sure to replace the BUCKET name sm-gt-dataset-*** with the name of your S3 bucket.

import itertools
import boto3

# Download and process the Open Images annotations.
# Add the URL of the Open Images annotations CSV to the wget command below before running.
!wget -O openimgs-annotations.csv
with open('openimgs-annotations.csv', 'r') as f:
    all_labels = [line.strip().split(',') for line in f.readlines()]
# Extract image ids in each of our desired classes
ims = {}
ims['Truck'] = [label[0] for label in all_labels if (label[2] == '/m/07r04' and label[3] == '1')][:300]
ims['Limousine'] = [label[0] for label in all_labels if (label[2] == '/m/01lcw4' and label[3] == '1')][:300]
ims['Van'] = [label[0] for label in all_labels if (label[2] == '/m/0h2r6' and label[3] == '1')][:300]
ims['Car'] = [label[0] for label in all_labels if (label[2] == '/m/0pg52' and label[3] == '1')][:300]
ims['Motorcycle'] = [label[0] for label in all_labels if (label[2] == '/m/04_sv' and label[3] == '1')][:300]
num_classes = len(ims)

# Prepare data and upload to your S3 bucket
BUCKET = 'sm-gt-dataset-***'
EXP_NAME = 'ground-truth-demo' # Any valid S3 prefix.
for key in ims.keys():
    ims[key] = set(ims[key])

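# Create a new bucket for images to be labeled
s3 = boto3.client('s3')
sess = boto3.session.Session()
region = sess.region_name
if region == 'us-east-1':
    # us-east-1 is the default Region and does not accept a LocationConstraint
    s3.create_bucket(Bucket=BUCKET)
else:
    s3.create_bucket(Bucket=BUCKET,
                     CreateBucketConfiguration={'LocationConstraint': region})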

# Copy the images to your local bucket
for img_id, img in enumerate(itertools.chain.from_iterable(ims.values())):
    copy_source = {
        'Bucket': 'open-images-dataset',
        'Key': 'test/{}.jpg'.format(img)
    }
    s3.copy(copy_source, BUCKET, '{}/images/{}.jpg'.format(EXP_NAME, img))

d. After the code runs, open the Amazon S3 console and browse to [your-bucket-name] > /ground-truth-demo > /images.

The /images folder should contain 1,014 image files.
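You can also check the file count from the notebook instead of the console. The following is a minimal sketch that assumes a boto3 S3 client; `count_objects` is a helper name invented for illustration. It uses the `list_objects_v2` paginator because a single listing call returns at most 1,000 keys.

```python
def count_objects(s3_client, bucket, prefix):
    """Count the objects under a prefix, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    total = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page that has no matching keys.
        total += len(page.get("Contents", []))
    return total


# In the notebook, reusing s3, BUCKET, and EXP_NAME from the previous cell:
# print(count_objects(s3, BUCKET, EXP_NAME + "/images/"))
```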

1. A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, T. Duerig, and V. Ferrari. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. arXiv:1811.00982, 2018.

Disclaimer – The Open Images Dataset V5, which is created by Google Inc. and available at, has not been modified for this tutorial. The annotations are licensed by Google Inc. under the CC BY 4.0 license. The images are listed as having a CC BY 2.0 license. For detailed statistics about the data and evaluation of models trained on it, see the paper cited above.

Step 3. Create the Amazon SageMaker Ground Truth labeling job

In this step, you create a SageMaker Ground Truth labeling job for the dataset you prepared and uploaded to Amazon S3. The labeling job applies labels to these images to sort them into five categories – car, truck, limousine, van, and motorcycle (bike).  

a. In the left navigation pane of the Amazon SageMaker console, select Labeling Jobs. Then, choose Create labeling job.

b. In the Job overview section, in the Job name text box, enter vehicle-labeling-demo. For Input data setup, choose Automated data setup.

Note: With Automated data setup, Amazon SageMaker Ground Truth can use active learning to automate the labeling of your input data. Active learning is a machine learning technique that identifies data that should be labeled by your workers. In Ground Truth, this functionality is called automated data labeling. Automated data labeling helps to reduce the cost and time that it takes to label your dataset compared to using only humans. For more information, see Automate Data Labeling.

c. In the Data setup section, fill these fields and make the following selections:

  • For S3 location for input datasets, specify the path to the images folder in your S3 bucket: s3://<your-bucket-name>/ground-truth-demo/images/
  • For S3 location for output datasets, choose Specify a new location. Then, specify the path where you want to store your labeled dataset: s3://<your-bucket-name>/ground-truth-demo/labeled_data/
  • For Data type, choose Image.
  • For IAM Role, choose the AmazonSageMaker-ExecutionRole-*** role you created in Step 1.

d. Choose Complete data setup. A confirmation message appears noting that the input data connection was successful.
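Behind the scenes, the data setup produces an input manifest: a JSON Lines file with one object per image, each carrying a `source-ref` field that holds the image's S3 URI. The following is a minimal sketch of that format. The bucket name and `build_manifest_lines` are hypothetical, and the file that Ground Truth generates may differ in ordering or naming.

```python
import json


def build_manifest_lines(bucket, keys):
    """Return one JSON line per image, each with a source-ref S3 URI."""
    return [
        json.dumps({"source-ref": "s3://{}/{}".format(bucket, key)})
        for key in keys
    ]


lines = build_manifest_lines(
    "example-bucket",  # hypothetical bucket name
    [
        "ground-truth-demo/images/img1.jpg",
        "ground-truth-demo/images/img2.jpg",
    ],
)
print("\n".join(lines))
```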

e. In the Task type section, for Task category, choose Image. For Task selection, choose Image Classification (Single Label).

f. For Enable enhanced image access, keep the default selection and choose Next.

g. On the Select workers and configure tool page, for Worker types, choose Amazon Mechanical Turk. Select the following check boxes:

  • The dataset does not contain adult content.
  • You understand and agree that the Amazon Mechanical Turk workforce consists of independent contractors located worldwide and that you should not share confidential information, personal information or protected health information with this workforce.

Keep the remaining default selections.

h. In the Image classification (Single Label) labeling tool section, enter the following information:

  • For brief description of task, copy and paste this description: Please select the label that best matches the image below. You can choose only 1 label per image.
  • For Select an option, add entries for each of these labels: Car, Bike, Van, Truck, Limousine.
  • Expand Additional instructions, and append the following text to Step 3: If there are multiple vehicles in a single image, choose the most prominent vehicle in the image.

Note: Due to the simplicity of this labeling job, you do not need to provide a good or bad example for this labeling job.

Optionally, choose Preview to view how the labeling tool appears to the labelers.

Then, choose Create.

The new vehicle-labeling-demo labeling job appears with a Status of In progress and a Task type of Image Classification (Single Label).

Note: This labeling job may take several hours to complete.

After the data is labeled by the Amazon Mechanical Turk public workforce, the Status changes to Complete.

To see the job progress, select Amazon SageMaker > Labeling Jobs > vehicle-labeling-demo.
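Progress can also be polled from code. The following is a minimal sketch that condenses a `describe_labeling_job` response from the boto3 SageMaker client into one line. `summarize_job` is a helper name invented for illustration, and the sample response is hand-written rather than real output.

```python
def summarize_job(response):
    """Condense a DescribeLabelingJob response into a one-line summary."""
    counters = response.get("LabelCounters", {})
    return "{}: {} labeled ({} human, {} machine), {} unlabeled".format(
        response["LabelingJobStatus"],
        counters.get("TotalLabeled", 0),
        counters.get("HumanLabeled", 0),
        counters.get("MachineLabeled", 0),
        counters.get("Unlabeled", 0),
    )


# With credentials configured, the response would come from the API:
# import boto3
# sm = boto3.client("sagemaker")
# response = sm.describe_labeling_job(LabelingJobName="vehicle-labeling-demo")

# Hand-written sample response, for illustration only.
sample = {
    "LabelingJobStatus": "InProgress",
    "LabelCounters": {
        "TotalLabeled": 120,
        "HumanLabeled": 120,
        "MachineLabeled": 0,
        "Unlabeled": 894,
    },
}
print(summarize_job(sample))
```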

Step 4. Review labeling job results

In this step, you review the results of the completed labeling job.

a. In the left navigation pane of the Amazon SageMaker console, choose Labeling jobs, then choose vehicle-labeling-demo.

On the vehicle-labeling-demo details page, the Labeled dataset objects section shows the thumbnails of images from your dataset with the corresponding labels below them.

b. To see the full results of the labeling job, in the Labeling job summary section, click the Output dataset location link.

For example: s3://<your-bucket-name>/ground-truth-demo/labeled_data/vehicle-labeling-demo/  

c. In the S3 console, choose the manifests folder and navigate to /output/output.manifest.

d. Choose Object Actions, then Open.

The output.manifest is a JSON Lines format file that includes these fields:

  • source-ref: This field specifies the Amazon S3 location of the image, as listed in its entry in the input manifest file. Because you selected Automated data setup in Step 3, Amazon SageMaker Ground Truth created these entries and the input manifest file for you.
  • vehicle-labeling-demo: This field specifies the label as a numeric value (0–4) that corresponds to the label name.
  • vehicle-labeling-demo-metadata: This field specifies labeling metadata, such as the confidence score, the name of the job, the string name for the label (such as car, truck, or van), whether the label was applied by a human or by a machine (active learning), and other information.

You can use the label information in the output.manifest file to train an image classifier that can classify images of vehicles into car, truck, limousine, van, or motorcycle (bike).
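As a first look at the results, you can tally the labels in the manifest. The following is a minimal sketch that assumes the job name vehicle-labeling-demo and the `class-name` metadata key that Ground Truth writes for image classification jobs. `count_labels` is a helper name invented for illustration, and the sample records are made up.

```python
import json
from collections import Counter


def count_labels(manifest_lines, job_name="vehicle-labeling-demo"):
    """Count class names across the records of an output manifest."""
    counts = Counter()
    for line in manifest_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        metadata = record.get(job_name + "-metadata", {})
        counts[metadata.get("class-name", "<unlabeled>")] += 1
    return counts


def _sample_record(image, index, name):
    """Build a made-up manifest line for illustration."""
    return json.dumps({
        "source-ref": "s3://example-bucket/ground-truth-demo/images/" + image,
        "vehicle-labeling-demo": index,
        "vehicle-labeling-demo-metadata": {
            "class-name": name,
            "confidence": 0.9,
            "human-annotated": "yes",
        },
    })


sample_lines = [
    _sample_record("img1.jpg", 0, "Car"),
    _sample_record("img2.jpg", 2, "Van"),
    _sample_record("img3.jpg", 0, "Car"),
]
print(count_labels(sample_lines))
```

To run this against the real file, download output.manifest and pass the open file object as `manifest_lines`.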

For more information about how to use the output.manifest file with Amazon SageMaker to train models, see the Easily train models using datasets labeled by Amazon SageMaker Ground Truth blog post.
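Because the Open Images files already carry labels, you can estimate labeling quality by comparing the two sets. The following is a minimal sketch on toy data. `agreement_rate` and the image IDs are hypothetical, and the alias mapping assumes that the Bike label in the labeling tool corresponds to the Motorcycle class in the download code. An image that appears under more than one Open Images class would need extra handling.

```python
def agreement_rate(predicted, expected, aliases=None):
    """Fraction of images present in both mappings whose labels agree."""
    aliases = aliases or {}
    shared = [img for img in predicted if img in expected]
    if not shared:
        return 0.0
    matches = sum(
        1
        for img in shared
        if aliases.get(predicted[img], predicted[img]) == expected[img]
    )
    return matches / len(shared)


# Toy data: worker labels versus the classes used when downloading the images.
predicted = {"img1": "Car", "img2": "Bike", "img3": "Van", "img4": "Truck"}
expected = {"img1": "Car", "img2": "Motorcycle", "img3": "Truck", "img4": "Truck"}

rate = agreement_rate(predicted, expected, aliases={"Bike": "Motorcycle"})
print("Agreement: {:.0%}".format(rate))
```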

Step 5. Clean up

In the following steps, you clean up the resources you created in this tutorial.

It is a best practice to delete instances and resources that you are no longer using so that you are not continually charged for them.

Delete Amazon SageMaker Notebook instance

5.1 — Navigate to the Amazon SageMaker console, and in the left pane, choose Notebook instances.

5.2 — Choose the tutorial SageMaker-Ground-Truth-Tutorial instance.

5.3 — Choose Actions, then choose Stop.

Note: After your instance is stopped, it does not incur charges.

5.4 — Choose Actions, then choose Delete. Choose Delete.

Delete other resources

Optionally, stop the SageMaker Ground Truth labeling job, delete the Amazon S3 bucket, and delete the Amazon SageMaker IAM role created for this tutorial.
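The notebook cleanup can also be scripted. The following is a minimal sketch that assumes a boto3 SageMaker client and an instance that is currently running, since the stop call fails for an instance that is already stopped. `delete_notebook` is a helper name invented for illustration.

```python
def delete_notebook(sm_client, name):
    """Stop a notebook instance, wait until it has stopped, then delete it."""
    sm_client.stop_notebook_instance(NotebookInstanceName=name)
    # An instance must be fully stopped before it can be deleted.
    sm_client.get_waiter("notebook_instance_stopped").wait(
        NotebookInstanceName=name
    )
    sm_client.delete_notebook_instance(NotebookInstanceName=name)


# With credentials configured:
# import boto3
# sm = boto3.client("sagemaker")
# delete_notebook(sm, "SageMaker-Ground-Truth-Tutorial")
# A labeling job that is still running can be stopped with:
# sm.stop_labeling_job(LabelingJobName="vehicle-labeling-demo")
```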


You used Amazon SageMaker Ground Truth to build a training dataset with Amazon Mechanical Turk. You can now use these labeled images to train image classification models that can classify images of vehicles into the five labeled categories – cars, trucks, vans, limousines, and motorcycles.

Learn more about Amazon SageMaker Ground Truth features

Find out more about the features of Amazon SageMaker Ground Truth.

Find more Amazon SageMaker Ground Truth resources

See Amazon SageMaker Ground Truth developer resources for videos, labs, documentation, and more.