Label Training Data for Machine Learning

TUTORIAL

Overview

In this tutorial, learn how to set up a labeling job in Amazon SageMaker Ground Truth to annotate training data for your machine learning (ML) model. 

A labeled dataset is critical to supervised training of an ML model. Many organizations have huge datasets, but lack labels associated with the data. Using Amazon SageMaker Ground Truth, you can easily label data with the option to use human annotators through Amazon Mechanical Turk, third-party vendors, or your own private workforce. 

For this tutorial, you use SageMaker Ground Truth to label a set of images of vehicles, including airplanes, cars, ferries, helicopters, and motorbikes. Because this tutorial uses a non-sensitive dataset, you use the Amazon Mechanical Turk option.

What you will accomplish

In this guide, you will:

  • Create and configure a data labeling job
  • Review the results of the labeling job

Prerequisites

Before starting this guide, you will need:

  • AWS experience: Beginner
  • Time to complete: 30 minutes
  • Cost to complete: See SageMaker pricing to estimate cost for this tutorial.
  • Requires: You must be logged into an AWS account.
  • Services used: Amazon SageMaker Ground Truth
  • Last updated: July 6, 2022

Implementation

Step 1: Set up an Amazon SageMaker notebook instance

In the AWS console search bar, enter SageMaker, and then choose Amazon SageMaker to open the SageMaker console.

In the left navigation pane, choose Notebook, choose Notebook instances, and then choose Create notebook instance.

On the Create notebook instance page, under Notebook instance settings, for Notebook instance name, enter SageMaker-Ground-Truth-Tutorial. For Notebook instance type, select ml.t2.medium.

In the Permissions and encryption section, for IAM role, choose Create a new role. In the Create an IAM role dialog box, select Any S3 bucket and choose Create role. In production environments, as a best practice, limit S3 bucket access to a specific IAM role with the minimum required permissions. Note the name of this role; you need it for cleanup at the end of the tutorial.

SageMaker creates the AmazonSageMaker-ExecutionRole-<role-id> role. Keep the default settings for the remaining settings and choose Create notebook instance.

In the Notebook instances section, the newly created SageMaker-Ground-Truth-Tutorial notebook instance is displayed with a status of Pending. The notebook is ready when the status changes to InService.
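The console steps above can also be sketched with the AWS SDK for Python. This is a minimal sketch, not the tutorial's method: it assumes `sagemaker_client` is a `boto3.client("sagemaker")` object and that `role_arn` (a hypothetical input) is the ARN of the execution role created above. The helper names are illustrative.

```python
NOTEBOOK_NAME = "SageMaker-Ground-Truth-Tutorial"


def create_tutorial_notebook(sagemaker_client, role_arn):
    """Request the notebook instance described in Step 1.

    Assumption: sagemaker_client is boto3.client("sagemaker") and
    role_arn is the ARN of the AmazonSageMaker-ExecutionRole-<role-id> role.
    """
    request = {
        "NotebookInstanceName": NOTEBOOK_NAME,
        "InstanceType": "ml.t2.medium",
        "RoleArn": role_arn,
    }
    sagemaker_client.create_notebook_instance(**request)
    return request


def notebook_status(sagemaker_client, name=NOTEBOOK_NAME):
    """Return the instance status string, for example 'Pending' or 'InService'."""
    response = sagemaker_client.describe_notebook_instance(
        NotebookInstanceName=name
    )
    return response["NotebookInstanceStatus"]
```

Passing the client in as a parameter keeps the helpers easy to try against a stub before you run them against a real account.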

Step 2: Create a labeling job

The sample images to be labeled in this tutorial are pulled from the publicly available Caltech 101 dataset (Li, F.-F., Andreeto, M., Ranzato, M. A., & Perona, P. (2022). Caltech 101 (Version 1.0) [Data set]. CaltechDATA), which contains pictures in 101 object categories. To minimize the cost of this tutorial, you use a sample set of 10 images, with two images from each of the following categories: airplanes, cars, ferries, helicopters, and motorbikes. The steps to launch a labeling job for a larger dataset are the same as the ones in this tutorial. The sample set of 10 images is already available in the Amazon S3 bucket sagemaker-sample-files.

In this step, you use your SageMaker notebook instance to write Python code that uploads the sample images from the sagemaker-sample-files S3 bucket to your default S3 bucket sagemaker-<your-Region>-<your-aws-account-id>. After the SageMaker-Ground-Truth-Tutorial notebook instance changes status to InService, choose Open Jupyter.

In Jupyter, choose New, and then select conda_python3.

Click on Untitled.ipynb to open the notebook. In a new code cell in the Jupyter notebook, copy and paste the following code and run the cell.

import sagemaker

# Look up the default SageMaker bucket for this account and Region
# (sagemaker-<your-Region>-<your-aws-account-id>).
sess = sagemaker.Session()
bucket = sess.default_bucket()

# Copy the ten sample images into your bucket.
# The leading "!" runs the line as a shell command from the notebook cell.
!aws s3 sync s3://sagemaker-sample-files/datasets/image/caltech-101/inference/ s3://{bucket}/ground-truth-demo/images/

print('Copy and paste the link below into a web browser to confirm that the ten images were uploaded to your bucket:')
print(f'https://s3.console.aws.amazon.com/s3/buckets/{bucket}/ground-truth-demo/images/')

print('\nWhen SageMaker prompts you for the S3 location for input datasets, paste the S3 URL below:')
print(f's3://{bucket}/ground-truth-demo/images/')

print('\nWhen SageMaker prompts you to Specify a new location, paste the S3 URL below:')
print(f's3://{bucket}/ground-truth-demo/labeled-data/')

After the code runs successfully, open the Amazon S3 console and navigate to the sagemaker-<your-Region>-<your-aws-account-id>/ground-truth-demo/images location to confirm that the ten images have been uploaded.
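If you prefer to confirm the upload from the notebook instead of the console, the check can be sketched as follows. This assumes `s3_client` is a `boto3.client("s3")` object; the helper name and the list of image extensions are illustrative choices, not part of the tutorial.

```python
IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png")


def list_image_keys(s3_client, bucket, prefix="ground-truth-demo/images/"):
    """Return the keys under `prefix` whose names end in an image extension.

    Assumption: s3_client is boto3.client("s3"). The paginator follows
    continuation tokens, so the helper also works for larger datasets.
    """
    keys = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page has no matching objects.
        for obj in page.get("Contents", []):
            if obj["Key"].lower().endswith(IMAGE_SUFFIXES):
                keys.append(obj["Key"])
    return keys
```

In the notebook from this step, `len(list_image_keys(boto3.client("s3"), bucket))` should return 10 if the copy succeeded.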

Open the SageMaker console. On the left navigation panel, choose Ground Truth, Labeling jobs. Then choose Create labeling job.

On the Specify job details page, under Job overview, enter vehicle-labeling-demo in the Job name box. Under Input data setup, select Automated data setup.

You can use the automated data setup to create manifest files for your labeling jobs in the SageMaker Ground Truth console using images, videos, video frames, text (.txt) files, and comma-separated value (.csv) files stored in Amazon S3. When you use automated data setup, you specify an Amazon S3 location where your input data is stored and specify the input data type, and SageMaker Ground Truth looks for the files that match that type in the location you specify.
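To make the idea of a manifest file concrete, here is a rough sketch of what an input manifest for image data looks like: one JSON object per line, each with a `source-ref` key that points at an object in Amazon S3. The bucket name and file names below are hypothetical placeholders, and the helper is only an illustration of the format, not the code that Ground Truth runs.

```python
import json


def build_input_manifest(bucket, keys):
    """Return JSON Lines text with one {"source-ref": ...} entry per image key."""
    lines = [json.dumps({"source-ref": f"s3://{bucket}/{key}"}) for key in keys]
    return "\n".join(lines) + "\n"


# Hypothetical bucket and file names, used only to show the shape of the output.
example = build_input_manifest(
    "sagemaker-example-bucket",
    [
        "ground-truth-demo/images/example_0001.jpg",
        "ground-truth-demo/images/example_0002.jpg",
    ],
)
print(example, end="")
```

With automated data setup you do not need to write this file yourself; the sketch is useful mainly for recognizing the `source-ref` entries when they reappear in the output manifest.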

In the Data setup section, enter the following information. Alternatively, for the two S3 locations, you can paste the corresponding values printed by the Jupyter notebook cell you ran earlier.

  • For S3 location for input datasets, choose Browse S3, and then select the S3 location s3://sagemaker-<your-Region>-<your-aws-account-id>/ground-truth-demo/images/. (This is the location where you uploaded the images in a previous step.)
  • For S3 location for output datasets, select Specify a new location. Then, specify the path where the labeling results should be stored: s3://sagemaker-<your-Region>-<your-aws-account-id>/ground-truth-demo/labeled-data/.
  • For Data type, select Image.
  • For IAM Role, select Create a new role.

In the Create an IAM role popup, select Any S3 bucket, and then choose Create.

SageMaker Ground Truth creates the IAM role automatically and enters it into the IAM Role box. Choose Complete data setup. The confirmation message Input data connection successful appears.

In the Task type section, for Task category, select Image. For Task selection, select Image Classification (Single Label), and then choose Next.

On the Select workers and configure tool page, for Worker types, select Amazon Mechanical Turk.

Select The dataset does not contain adult content.

Select You understand and agree that the Amazon Mechanical Turk workforce consists of independent contractors located worldwide and that you should not share confidential information, personal information or protected health information with this workforce.

In the Image classification (Single Label) labeling tool section, enter the following information:

For brief description of task, enter Please select the label that best matches the image below. You can choose only 1 label per image.

For Select an option, enter the following labels in separate boxes: Airplane, Car, Ferry, Helicopter, Motorbike.

Expand Additional instructions, and append the following text to Step 3: If there are multiple vehicles in a single image, choose the most prominent vehicle in the image.
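The console builds the label list for you. If you later create labeling jobs through the API instead, the same five labels are supplied as a label category configuration file. The sketch below shows one plausible way to generate that JSON; the helper name and the uniqueness check are illustrative assumptions, so compare the output with the Ground Truth documentation before relying on it.

```python
import json

LABELS = ["Airplane", "Car", "Ferry", "Helicopter", "Motorbike"]


def build_label_category_config(labels):
    """Return a label category configuration as a Python dict.

    The checks below are sanity checks for this sketch, not documented
    service limits.
    """
    cleaned = [label.strip() for label in labels]
    if any(not label for label in cleaned):
        raise ValueError("label names must not be empty")
    if len(set(cleaned)) != len(cleaned):
        raise ValueError("label names must be unique")
    return {
        "document-version": "2018-11-28",
        "labels": [{"label": label} for label in cleaned],
    }


print(json.dumps(build_label_category_config(LABELS), indent=2))
```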

To see how the labeling tool appears to the labelers, choose Preview.

Choose Create.

The new vehicle-labeling-demo labeling job is listed under the Labeling jobs section in the SageMaker console with a Status of In progress and a Task type of Image Classification (Single Label). The labeling job could take several minutes to complete. After the data is labeled by the Amazon Mechanical Turk public workforce, the Status changes to Complete.
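Rather than refreshing the console, you can poll the job status from the notebook. The following is a minimal sketch that assumes `sagemaker_client` is a `boto3.client("sagemaker")` object. Note that the API reports status values such as `InProgress` and `Completed`, which the console displays as In progress and Complete. The helper name and polling interval are illustrative.

```python
import time

# Statuses after which the job will not change state again.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_labeling_job(sagemaker_client, job_name,
                          poll_seconds=30, sleep=time.sleep):
    """Poll DescribeLabelingJob until the job reaches a terminal status.

    Assumption: sagemaker_client is boto3.client("sagemaker").
    `sleep` is injectable so the loop can be exercised without waiting.
    """
    while True:
        response = sagemaker_client.describe_labeling_job(
            LabelingJobName=job_name
        )
        status = response["LabelingJobStatus"]
        counters = response.get("LabelCounters", {})
        print(f"{job_name}: {status}, "
              f"labeled so far: {counters.get('TotalLabeled', 0)}")
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
```

For this tutorial you would call `wait_for_labeling_job(client, "vehicle-labeling-demo")` and continue to Step 3 once it returns `Completed`.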

Step 3: Review the labeling job results

Reviewing the labeling job results is an important step because it helps you assess the labeling quality and identify if you need to improve the instructions and data.

In the left navigation pane of the SageMaker console, choose Labeling jobs, and then choose vehicle-labeling-demo.

On the vehicle-labeling-demo details page, the Labeled dataset objects section shows the thumbnails of the images from your dataset with the corresponding labels as captions.

To access the full results of the labeling job, in the Labeling job summary section, choose the Output dataset location link.

Choose manifests, output, output.manifest.

Choose Open to download the labeling results in JSON Lines format. JSON Lines is a newline-delimited format for storing structured data in which each line is a valid JSON value.
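Because each line is an independent JSON value, a JSON Lines file can be read one line at a time with the standard library. The sketch below is a generic reader; the sample records are made up for illustration and are not real labeling output.

```python
import json


def parse_json_lines(text):
    """Parse JSON Lines text into a list of Python values, skipping blank lines."""
    records = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as err:
            # Report which line failed so a damaged file is easy to inspect.
            raise ValueError(f"line {line_number} is not valid JSON: {err}") from err
    return records


# Made-up sample: two records separated by a blank line.
sample = '{"id": 1, "label": "Car"}\n\n{"id": 2, "label": "Ferry"}\n'
for record in parse_json_lines(sample):
    print(record["id"], record["label"])
```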

The output.manifest includes the following data: 

source-ref: Specifies the Amazon S3 location of the image, as listed in the input manifest file. Because you selected Automated data setup in Step 2, Amazon SageMaker Ground Truth automatically created the input manifest file and its entries.

vehicle-labeling-demo: Specifies the target label as a zero-indexed numeric value. For the five image classes in this example, the labels are 0, 1, 2, 3, and 4.

vehicle-labeling-demo-metadata: Specifies labeling metadata, such as the confidence score, the job name, the label string name (for example, airplane, car, ferry, helicopter, or motorbike), and whether the label was assigned by a human or by a machine (active learning).

You can parse the output.manifest file to create a labeled dataset for downstream applications such as image classification. For more information about how to use the output.manifest file with Amazon SageMaker to train models, read the blog post Easily train models using datasets labeled by SageMaker Ground Truth.
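As a rough illustration of that parsing step, the sketch below turns manifest text into a list of rows and flags low-confidence labels for review. The metadata key names (`class-name`, `confidence`) are what the author of this sketch expects in image classification output; verify them against your own output.manifest. The sample record, the bucket name, and the 0.6 threshold are made-up values.

```python
import json

JOB_NAME = "vehicle-labeling-demo"


def extract_labels(manifest_text, job_name=JOB_NAME):
    """Return one dict per manifest line with image URI, label, and confidence."""
    rows = []
    for line in manifest_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        metadata = record.get(f"{job_name}-metadata", {})
        rows.append({
            "image": record["source-ref"],
            "label_index": record.get(job_name),
            "label_name": metadata.get("class-name"),
            "confidence": metadata.get("confidence"),
        })
    return rows


def needs_review(rows, threshold=0.6):
    """Return rows with a missing confidence or one below `threshold`."""
    return [row for row in rows
            if row["confidence"] is None or row["confidence"] < threshold]


# Made-up record that mimics the three keys described above.
sample_manifest = json.dumps({
    "source-ref": "s3://sagemaker-example-bucket/ground-truth-demo/images/example_0001.jpg",
    "vehicle-labeling-demo": 0,
    "vehicle-labeling-demo-metadata": {"class-name": "Airplane", "confidence": 0.52},
}) + "\n"

rows = extract_labels(sample_manifest)
print(rows[0]["label_name"], len(needs_review(rows)))
```

Reading the label name from the metadata avoids having to assume how the numeric indexes map to label strings.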

Step 4: Clean up the resources

It is a best practice to delete resources that you no longer need so that you don't incur unintended charges.

To delete the tutorial data from the S3 bucket, do the following:

  • Open the Amazon S3 console. On the navigation bar, choose Buckets, choose sagemaker-<your-Region>-<your-account-id>, and then select the checkbox next to ground-truth-demo. Then, choose Delete.
  • In the Delete objects dialog box, verify that you have selected the correct object to delete, and then enter permanently delete into the Permanently delete objects confirmation box.
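The same cleanup can be sketched in code. This is a minimal sketch that assumes `s3_client` is a `boto3.client("s3")` object and that the bucket is not versioned (on a versioned bucket, this call only adds delete markers). The helper name is illustrative. Deletion is permanent, so double-check the prefix before running anything like this.

```python
def delete_prefix(s3_client, bucket, prefix="ground-truth-demo/"):
    """Delete every object under `prefix` and return how many were removed.

    Assumption: s3_client is boto3.client("s3"). Each list page holds at
    most 1,000 keys, which matches the per-request limit of delete_objects.
    """
    if not prefix:
        # Guard against wiping the whole bucket by accident.
        raise ValueError("refusing to delete with an empty prefix")
    deleted = 0
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        objects = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if objects:
            s3_client.delete_objects(Bucket=bucket, Delete={"Objects": objects})
            deleted += len(objects)
    return deleted
```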

Open the AWS IAM console by typing IAM in the search bar of the AWS console and selecting IAM. On the IAM console left navigation panel, choose Roles. To find the IAM roles created for this tutorial (one in Step 1 and one in Step 2), enter Amazon in the search bar. Under Role name, select each role, and choose Delete. Note that this action requires admin privileges associated with your account.

To open the SageMaker console, enter SageMaker into the AWS console search bar, and choose Amazon SageMaker from the search results. In the SageMaker console left pane, choose Notebook instances, and then select SageMaker-Ground-Truth-Tutorial. For Actions, select Stop.

After the instance status changes to Stopped, choose Actions, then select Delete. Choose Delete in the confirmation popup.
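The stop-then-delete sequence above can be sketched with the SDK as well. This assumes `sagemaker_client` is a `boto3.client("sagemaker")` object and relies on the client's built-in waiters for notebook instances; the helper name is illustrative.

```python
def delete_notebook(sagemaker_client, name="SageMaker-Ground-Truth-Tutorial"):
    """Stop the notebook instance, wait until it is stopped, then delete it.

    Assumption: sagemaker_client is boto3.client("sagemaker"). An instance
    must be stopped before it can be deleted, hence the first waiter.
    """
    sagemaker_client.stop_notebook_instance(NotebookInstanceName=name)
    sagemaker_client.get_waiter("notebook_instance_stopped").wait(
        NotebookInstanceName=name
    )
    sagemaker_client.delete_notebook_instance(NotebookInstanceName=name)
    sagemaker_client.get_waiter("notebook_instance_deleted").wait(
        NotebookInstanceName=name
    )
```

If the instance is already stopped, skip the first two calls and start from the delete request.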

Conclusion

Congratulations! You have finished the Label Training Data for Machine Learning tutorial. 

In this tutorial, you used Amazon SageMaker Ground Truth and Amazon Mechanical Turk to build a training dataset for machine learning. 

You can continue your machine learning journey with Amazon SageMaker by following the next steps section below.

Create an ML model automatically

Learn how to use AutoML to develop ML models without writing code.

Deploy a trained model

Learn how to deploy a trained ML model for inference.

Find more hands-on tutorials

Explore other machine learning tutorials to dive deeper.