Label Training Data for Machine Learning

TUTORIAL

Overview

In this tutorial, learn how to set up a labeling job in Amazon SageMaker Ground Truth to annotate training data for your machine learning (ML) model. 

A labeled dataset is critical to supervised training of an ML model. Many organizations have huge datasets, but lack labels associated with the data. Using Amazon SageMaker Ground Truth, you can easily label data with the option to use human annotators through Amazon Mechanical Turk, third-party vendors, or your own private workforce. 

For this tutorial, you use SageMaker Ground Truth to label a set of images of vehicles, including airplanes, cars, ferries, helicopters, and motorbikes. Because this tutorial uses a non-sensitive dataset, you use the Amazon Mechanical Turk option.

What you will accomplish

In this guide, you will:

  • Create and configure a data labeling job
  • Review the results of the labeling job

Prerequisites

Before starting this guide, you will need:

  • AWS experience: Beginner
  • Time to complete: 30 minutes
  • Cost to complete: See SageMaker pricing to estimate cost for this tutorial.
  • Requires: You must be logged into an AWS account.
  • Services used: Amazon SageMaker Ground Truth
  • Last updated: March 27, 2023

Implementation

Step 1: Set up your Amazon SageMaker Studio domain

With Amazon SageMaker, you can work visually using the console or programmatically using either SageMaker Studio or SageMaker notebooks. In this tutorial, you use a SageMaker Studio notebook to copy the sample images into your S3 bucket and the SageMaker console to create the labeling job. The notebook requires a SageMaker Studio domain.

If you already have a SageMaker Studio domain in the US East (N. Virginia) Region, follow the SageMaker Studio setup guide to attach the required AWS IAM policies to your SageMaker Studio account, then skip Step 1, and proceed directly to Step 2.

If you don't have an existing SageMaker Studio domain, continue with Step 1 to run an AWS CloudFormation template that creates a SageMaker Studio domain and adds the permissions required for the rest of this tutorial.

1.1  Choose the AWS CloudFormation stack link. This link opens the AWS CloudFormation console and creates your SageMaker Studio domain and a user named studio-user. It also adds the required permissions to your SageMaker Studio account. In the CloudFormation console, confirm that US East (N. Virginia) is the Region displayed in the upper right corner. The stack name should be CFN-SM-IM-Lambda-Catalog; do not change it. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names, and then choose Create stack. The stack takes about 10 minutes to create all the resources.

This stack assumes that you already have a public VPC set up in your account. If you do not have a public VPC, see VPC with a single public subnet to learn how to create a public VPC.

Step 2: Set up a SageMaker Studio notebook

In this step, you will launch a new SageMaker Studio notebook, install the necessary open source libraries, and set up the SageMaker variables required to interact with other services, including Amazon Simple Storage Service (Amazon S3).

2.1  Enter SageMaker Studio in the console search bar, and then choose SageMaker Studio.

2.2  Choose US East (N. Virginia) from the Region dropdown list in the upper right corner of the SageMaker console. To launch the app, select Studio from the left navigation pane, select studio-user as the user profile, and choose the Open Studio button.

2.3  SageMaker Studio displays the Creating application screen. The application takes a few minutes to load.

2.4  Open the SageMaker Studio interface. From the top menu, choose File, New, Notebook.

2.5  In the Set up notebook environment dialog box, under Image, select Data Science. The Python 3 kernel is selected automatically. Under Instance type, choose ml.t3.medium. Choose Select.

2.6  The kernel on the top right corner of the notebook should now display Python 3 (Data Science).

Step 3: Create a labeling job

The sample images to be labeled in this tutorial are pulled from the publicly available Caltech 101 dataset (Li, F.-F., Andreeto, M., Ranzato, M. A., & Perona, P. (2022). Caltech 101 (Version 1.0) [Data set]. CaltechDATA), which contains pictures in 101 object categories. To minimize the cost of this tutorial, you use a sample set of 10 images, with two images from each of the following categories: airplanes, cars, ferries, helicopters, and motorbikes. The steps to launch a labeling job for a larger dataset are the same as the ones in this tutorial. The sample set of 10 images is already available in the Amazon S3 bucket sagemaker-sample-files.

3.1  In this step, you use your SageMaker Studio notebook to write Python code that uploads the sample images from the sagemaker-sample-files S3 bucket to your default S3 bucket sagemaker-<your-Region>-<your-aws-account-id>. 

3.2  In the Jupyter notebook, in a new code cell, copy and paste the following code and run the cell. This will ensure you are on the current version of SageMaker.
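The code cell for this step does not appear in this text. As a stand-in, the sketch below assumes the intent is to upgrade the SageMaker Python SDK (the sagemaker package on PyPI) with pip in the interpreter that runs the notebook kernel. The helper names installed_version, pip_upgrade_command, and upgrade are hypothetical and are not part of the original tutorial.

```python
import subprocess
import sys
from importlib import metadata


def installed_version(package="sagemaker"):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def pip_upgrade_command(package="sagemaker"):
    """Build a pip command that upgrades the package for the current interpreter."""
    # sys.executable is the Python interpreter behind the running notebook kernel.
    return [sys.executable, "-m", "pip", "install", "--upgrade", "--quiet", package]


def upgrade(package="sagemaker"):
    """Run the upgrade (assumption: pip is available) and report the version found afterwards."""
    subprocess.check_call(pip_upgrade_command(package))
    return installed_version(package)
```

Calling upgrade() in a notebook cell runs the installation. If the package was already imported, restarting the kernel may be needed before the new version is picked up.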

Note: The CloudFormation stack should create the necessary permissions and buckets. If you receive an Amazon S3 access denied error, see https://repost.aws/knowledge-center/sagemaker-s3-accessdenied-training.

3.3  In the Jupyter notebook, in a new code cell, copy and paste the following code and run the cell.

import sagemaker

# Resolve the default SageMaker bucket for this account and Region
# (sagemaker-<your-Region>-<your-aws-account-id>).
sess = sagemaker.Session()
bucket = sess.default_bucket()

# Copy the ten sample images into your bucket. The leading "!" runs the
# AWS CLI from the notebook cell, and {bucket} is filled in by the notebook.
!aws s3 sync s3://sagemaker-sample-files/datasets/image/caltech-101/inference/ s3://{bucket}/ground-truth-demo/images/

print('Copy and paste the link below into a web browser to confirm that the ten images were uploaded to your bucket:')
print(f'https://s3.console.aws.amazon.com/s3/buckets/{bucket}/ground-truth-demo/images/')

print('\nWhen prompted by SageMaker to enter the S3 location for input datasets, you can paste in the S3 URL below:')
print(f's3://{bucket}/ground-truth-demo/images/')

print('\nWhen prompted by SageMaker to specify a new location, you can paste in the S3 URL below:')
print(f's3://{bucket}/ground-truth-demo/labeled-data/')

3.4  After the code runs successfully, open the Amazon S3 console and navigate to the sagemaker-<your-Region>-<your-aws-account-id>/ground-truth-demo/images location to confirm that the ten images have been uploaded.
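The same check can be done from the notebook instead of the console. The following is a minimal sketch, assuming boto3 is available in the kernel and that the sample files use common image extensions; the helper names image_keys and list_keys are hypothetical.

```python
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")  # assumption: the sample images use one of these


def image_keys(keys, extensions=IMAGE_EXTENSIONS):
    """Keep only the object keys that look like image files."""
    return [key for key in keys if key.lower().endswith(extensions)]


def list_keys(bucket, prefix, s3_client=None):
    """List every object key under a prefix, following pagination."""
    if s3_client is None:
        import boto3  # imported lazily so image_keys works without AWS access
        s3_client = boto3.client("s3")
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages for an empty prefix carry no "Contents" entry.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys
```

With the bucket variable from the previous cell, len(image_keys(list_keys(bucket, "ground-truth-demo/images/"))) should come to 10 if the sync copied the full sample set.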

3.5  Open the SageMaker console. In the left navigation pane, select Ground Truth, Labeling jobs. Then choose Create labeling job.

3.6  In the Specify job details page, under Job overview, enter vehicle-labeling-demo in the Job name box. Under Input data setup, select Automated data setup.

You can use the automated data setup to create manifest files for your labeling jobs in the SageMaker Ground Truth console using images, videos, video frames, text (.txt) files, and comma-separated value (.csv) files stored in Amazon S3. When you use automated data setup, you specify an Amazon S3 location where your input data is stored and specify the input data type, and SageMaker Ground Truth looks for the files that match that type in the location you specify.
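To make the idea of a manifest file concrete, the sketch below builds the kind of input manifest that automated data setup produces for images: a JSON Lines file with one source-ref entry per object. The function name build_input_manifest and the example bucket and keys in the usage note are hypothetical; the console creates the real file for you in this tutorial.

```python
import json


def build_input_manifest(bucket, keys):
    """Return JSON Lines text with one {"source-ref": ...} object per S3 key."""
    lines = [json.dumps({"source-ref": f"s3://{bucket}/{key}"}) for key in keys]
    # Each record sits on its own line; the file ends with a newline.
    return "\n".join(lines) + "\n"
```

For example, build_input_manifest("my-bucket", ["ground-truth-demo/images/a.jpg"]) returns a single line that points at s3://my-bucket/ground-truth-demo/images/a.jpg.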

3.7  In the Data setup section: For S3 location for input datasets, choose Browse S3, then select the S3 location s3://sagemaker-<your-Region>-<your-aws-account-id>/ground-truth-demo/images/. (This is the location where you uploaded the images in a previous step.) 

For S3 location for output datasets, select Specify a new location. Then, specify the path where the labeled images should be stored: s3://sagemaker-<your-Region>-<your-aws-account-id>/ground-truth-demo/labeled-data/. 

For Data type, select Image. For IAM Role, select Create a new role. Alternatively, for the two S3 locations, you can paste the values printed by the notebook cell you ran in step 3.3.

3.8  In the Create an IAM role popup, select Any S3 bucket, then choose Create.

3.9  SageMaker Ground Truth creates the IAM role automatically and enters it into the IAM Role box. Choose Complete data setup. The confirmation message Input data connection successful appears.

3.10  In the Task type section, for Task category, select Image. For Task selection, select Image Classification (Single Label), and then choose Next.

3.11  On the Select workers and configure tool page, for Worker types, select Amazon Mechanical Turk.

Select The dataset does not contain adult content.

Select You understand and agree that the Amazon Mechanical Turk workforce consists of independent contractors located worldwide and that you should not share confidential information, personal information or protected health information with this workforce.

3.12  In the Image classification (Single Label) labeling tool section, enter the following information:

For brief description of task, enter Please select the label that best matches the image below. You can choose only 1 label per image.

For Select an option, enter the following labels in separate boxes: Airplane, Car, Ferry, Helicopter, Motorbike.

Expand Additional instructions, and append the following text to Step 3: If there are multiple vehicles in a single image, choose the most prominent vehicle in the image.

To see how the labeling tool appears to the labelers, choose Preview.

Choose Create.
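The console stores the labels you entered for you. If the same job were created through the API instead, the labels would be supplied as a label category configuration file in Amazon S3. The following is a minimal sketch of that file's contents, assuming the commonly documented layout with a document-version field and a labels list; the function name label_category_config is hypothetical.

```python
import json

LABELS = ["Airplane", "Car", "Ferry", "Helicopter", "Motorbike"]


def label_category_config(labels):
    """Build a label category configuration document for a list of label names."""
    if len(set(labels)) != len(labels):
        raise ValueError("label names must be unique")
    return {
        "document-version": "2018-11-28",  # assumption: schema version string used by Ground Truth
        "labels": [{"label": name} for name in labels],
    }


# JSON text that could be saved to S3 and referenced when creating a job through the API.
config_json = json.dumps(label_category_config(LABELS), indent=2)
```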

3.13  The new vehicle-labeling-demo labeling job is listed under the Labeling jobs section in the SageMaker console with a Status of In progress and a Task type of Image Classification (Single Label). The labeling job could take several minutes to complete. After the data is labeled by the Amazon Mechanical Turk public workforce, the Status changes to Complete.
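Instead of refreshing the console, the job status can be polled from the notebook. The sketch below assumes a boto3 SageMaker client and its describe_labeling_job call; the function name wait_for_labeling_job and the 60-second default interval are hypothetical choices. The API reports status strings such as InProgress and Completed, which may be worded slightly differently from the console.

```python
import time

# Statuses after which the job will not change any further.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_labeling_job(sm_client, job_name, poll_seconds=60, sleep=time.sleep):
    """Poll describe_labeling_job until the job reaches a terminal status."""
    while True:
        response = sm_client.describe_labeling_job(LabelingJobName=job_name)
        status = response["LabelingJobStatus"]
        counters = response.get("LabelCounters", {})
        print(f"{job_name}: {status}, labeled so far: {counters.get('TotalLabeled', 0)}")
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
```

A typical call would be wait_for_labeling_job(boto3.client("sagemaker"), "vehicle-labeling-demo"). The sleep parameter exists only so that the loop can be exercised without real waiting.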

Step 4: Review the labeling job results

Reviewing the labeling job results is an important step because it helps you assess the labeling quality and identify if you need to improve the instructions and data.

4.1  In the left navigation pane of the SageMaker console, select Labeling jobs, and then choose vehicle-labeling-demo.

 

4.2  On the vehicle-labeling-demo details page, the Labeled dataset objects section shows the thumbnails of the images from your dataset with the corresponding labels as captions.

 

4.3  To access the full results of the labeling job, in the Labeling job summary section, choose the Output dataset location link.

4.4  Choose manifests, output, output.manifest.

Choose Open to download the labeling results in JSON Lines format. JSON Lines is a newline delimited format to store structured data where each line is a valid JSON value.

 

4.5  The output.manifest includes the following data: 

source-ref: Specifies the location of the image entry in the input manifest file. Because you selected Automated data setup in Step 3, Amazon SageMaker Ground Truth automatically created these entries and the input manifest file.

vehicle-labeling-demo: Specifies the target label as a zero-indexed numeric value. For the five image classes in this example, the labels are 0, 1, 2, 3, and 4.

vehicle-labeling-demo-metadata: Specifies labeling metadata, such as the confidence score, the job name, the label string name (for example, airplane, car, ferry, helicopter, or motorbike), and whether the label was produced by a human or by a machine (active learning).

You can parse the output.manifest file to create a labeled dataset for downstream applications such as image classification. For more information about how to use the output.manifest file with Amazon SageMaker to train models, read the blog post Easily train models using datasets labeled by SageMaker Ground Truth.
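As an illustration of that parsing step, the sketch below reads the JSON Lines text of output.manifest into a list of records. It assumes the label attribute is named after the job (vehicle-labeling-demo) and that the metadata object uses the keys class-name and confidence; the function name parse_output_manifest is hypothetical, so check the key names against your own file.

```python
import json


def parse_output_manifest(text, label_attribute="vehicle-labeling-demo"):
    """Turn output.manifest text (JSON Lines) into a list of labeled records."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        entry = json.loads(line)
        meta = entry.get(f"{label_attribute}-metadata", {})
        records.append({
            "image": entry["source-ref"],
            "label_index": entry[label_attribute],
            "label_name": meta.get("class-name"),    # assumed metadata key
            "confidence": meta.get("confidence"),    # assumed metadata key
        })
    return records
```

After downloading the file, parse_output_manifest(open("output.manifest").read()) gives one record per image, which can then be filtered by confidence or grouped by label before training.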

Step 5: Clean up your AWS resources

It is a best practice to delete resources that you no longer need so that you don't incur unintended charges.

5.1  To delete the S3 bucket, do the following: 

  • Open the Amazon S3 console. In the left navigation pane, select Buckets, sagemaker-<your-Region>-<your-aws-account-id>, and then select the checkbox next to ground-truth-demo. Then, choose Delete.
  • In the Delete objects dialog box, verify that you have selected the proper object to delete and enter permanently delete into the Permanently delete objects confirmation box.
  • Once this is complete and the bucket is empty, you can delete the sagemaker-<your-Region>-<your-aws-account-id> bucket by following the same procedure again.

5.2  The Data Science kernel used for running the notebook image in this tutorial will accumulate charges until you either stop the kernel or perform the following steps to delete the apps. For more information, see Shut Down Resources in the Amazon SageMaker Developer Guide.

To delete the SageMaker Studio apps, do the following: In the SageMaker console, select Domains, and then choose StudioDomain. From the User profiles list, select studio-user, and then delete all the apps listed under Apps by choosing Delete app. To delete the JupyterServer, choose Action, then choose Delete. Wait until the Status changes to Deleted.

5.3  If you used an existing SageMaker Studio domain in Step 1, skip the rest of Step 5 and proceed directly to the conclusion section.

If you ran the CloudFormation template in Step 1 to create a new SageMaker Studio domain, continue with the following steps to delete the domain, user, and the resources created by the CloudFormation template.

5.4  To open the CloudFormation console, enter CloudFormation into the AWS console search bar, and select CloudFormation from the search results.

5.5  Open the CloudFormation console. In the left navigation pane, select Stacks. From the status dropdown list, select Active. Under Stack name, choose CFN-SM-IM-Lambda-catalog to open the stack details page.

5.6  On the CFN-SM-IM-Lambda-catalog stack details page, choose Delete to delete the stack along with the resources it created in Step 1.

Conclusion

Congratulations! You have finished the Label Training Data for Machine Learning tutorial.
 
In this tutorial, you used Amazon SageMaker Ground Truth and Amazon Mechanical Turk to build a training dataset for machine learning.
 
You can continue your machine learning journey with Amazon SageMaker by following the next steps section below.



Create an ML model automatically

Learn how to use AutoML to develop ML models without writing code.

Deploy a trained model

Learn how to deploy a trained ML model for inference.

Find more hands-on tutorials

Explore other machine learning tutorials to dive deeper.