Label Training Data for Machine Learning
In this tutorial, learn how to set up a labeling job in Amazon SageMaker Ground Truth to annotate training data for your machine learning (ML) model.
A labeled dataset is critical to supervised training of an ML model. Many organizations have huge datasets, but lack labels associated with the data. Using Amazon SageMaker Ground Truth, you can easily label data with the option to use human annotators through Amazon Mechanical Turk, third-party vendors, or your own private workforce.
For this tutorial, you use SageMaker Ground Truth to label a set of images of vehicles, including airplanes, cars, ferries, helicopters, and motorbikes. Because this tutorial uses a non-sensitive dataset, you use the Amazon Mechanical Turk option.
What you will accomplish
In this guide, you will:
- Create and configure a data labeling job
- Review the results of the labeling job
Before starting this guide, you will need:
- An AWS account: If you don't already have an account, follow the Setting Up Your AWS Environment getting started guide for a quick overview.
1.1 Choose the AWS CloudFormation stack link. This link opens the AWS CloudFormation console, where you create the stack that sets up your SageMaker Studio domain and a user named studio-user. The stack also adds the required permissions to your SageMaker Studio account. In the CloudFormation console, confirm that US East (N. Virginia) is the Region displayed in the upper right corner. The stack name should be CFN-SM-IM-Lambda-Catalog; do not change it. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names, and then choose Create stack. The stack takes about 10 minutes to create all the resources.
This stack assumes that you already have a public VPC set up in your account. If you do not have a public VPC, see VPC with a single public subnet to learn how to create a public VPC.
2.1 Enter SageMaker Studio in the console search bar, and then choose SageMaker Studio.
2.2 Choose US East (N. Virginia) from the Region dropdown list in the upper right corner of the SageMaker console. To launch the app, select Studio from the left navigation pane, select studio-user as the user profile, and choose the Open Studio button.
2.3 SageMaker Studio displays the Creating application screen. The application takes a few minutes to load.
2.4 Open the SageMaker Studio interface. From the top menu, choose File, New, Notebook.
2.5 In the Set up notebook environment dialog box, under Image, select Data Science. The Python 3 kernel is selected automatically. Under Instance type, choose ml.t3.medium. Choose Select.
2.6 The kernel on the top right corner of the notebook should now display Python 3 (Data Science).
4.1 In the left navigation pane of the SageMaker console, select Labeling jobs, and then choose vehicle-labeling-demo.
4.2 On the vehicle-labeling-demo details page, the Labeled dataset objects section shows the thumbnails of the images from your dataset with the corresponding labels as captions.
4.3 To access the full results of the labeling job, in the Labeling job summary section, choose the Output dataset location link.
4.4 Choose manifests, output, output.manifest.
Choose Open to download the labeling results in JSON Lines format. JSON Lines is a newline-delimited format for storing structured data in which each line is a valid JSON value.
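Because each line of a JSON Lines file is a complete JSON value, the downloaded file can be read one line at a time. The following is a minimal sketch of that idea using only the Python standard library. The function name parse_json_lines and the sample bucket and image names are hypothetical placeholders, not values from this tutorial.

```python
import json


def parse_json_lines(text):
    """Parse JSON Lines text into a list of Python values, one per non-empty line."""
    records = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, such as a trailing newline
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as err:
            raise ValueError(f"line {line_number} is not valid JSON: {err}") from err
    return records


# Hypothetical two-line sample; the bucket and file names are placeholders.
sample = (
    '{"source-ref": "s3://example-bucket/images/img1.jpg", "vehicle-labeling-demo": 0}\n'
    '{"source-ref": "s3://example-bucket/images/img2.jpg", "vehicle-labeling-demo": 3}\n'
)
records = parse_json_lines(sample)
print(len(records), records[1]["vehicle-labeling-demo"])
```

To use this with the downloaded file, read its contents as text and pass them to the function. Parsing line by line also makes it easy to report which line is malformed if the file was truncated during download.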
4.5 The output.manifest includes the following data:
source-ref: Specifies the location of the image entry in the input manifest file. Because you selected Automated data setup in Step 2, Amazon SageMaker Ground Truth automatically created these entries and the input manifest file.
vehicle-labeling-demo: Specifies the target label as a zero-indexed numeric value. For the five image classes in this example, the labels are 0, 1, 2, 3, and 4.
vehicle-labeling-demo-metadata: Specifies labeling metadata, such as the confidence score, the job name, the label string name (for example, airplane, car, ferry, helicopter, or motorbike), and whether the label was assigned by a human or by a machine (active learning).
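The three fields described above can be pulled out of a single record as shown in this sketch. The top-level keys come from the descriptions above, but the metadata key names used here (class-name, confidence, human-annotated, job-name) and all sample values are assumptions; check them against a line of your own output.manifest before relying on them. The helper name summarize_record is hypothetical.

```python
import json

JOB_NAME = "vehicle-labeling-demo"


def summarize_record(record, job_name=JOB_NAME):
    """Collect the image location, numeric label, and selected metadata of one record."""
    # Assumed metadata key names; verify against your own manifest.
    metadata = record.get(f"{job_name}-metadata", {})
    return {
        "image": record["source-ref"],
        "label_index": record[job_name],
        "class_name": metadata.get("class-name"),
        "confidence": metadata.get("confidence"),
        "human_annotated": metadata.get("human-annotated") == "yes",
    }


# Hypothetical manifest line; the bucket, file name, and values are placeholders.
line = json.dumps({
    "source-ref": "s3://example-bucket/images/ferry_01.jpg",
    "vehicle-labeling-demo": 2,
    "vehicle-labeling-demo-metadata": {
        "class-name": "ferry",
        "confidence": 0.92,
        "human-annotated": "yes",
        "job-name": "labeling-job/vehicle-labeling-demo",
    },
})

summary = summarize_record(json.loads(line))
print(summary["class_name"], summary["label_index"], summary["confidence"])
```

Using .get() for the metadata fields means a record that lacks one of the assumed keys yields None for that field instead of raising an error.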
You can parse the output.manifest file to create a labeled dataset for downstream applications such as image classification. For more information about how to use the output.manifest file with Amazon SageMaker to train models, read the blog post Easily train models using datasets labeled by SageMaker Ground Truth.
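As a rough illustration of that parsing step, the sketch below turns manifest lines into (image location, label) pairs and counts the examples per class. It is one possible approach, not the method used in the referenced blog post. The function name load_labeled_dataset, the min_confidence parameter, the confidence metadata key, and the sample lines are all assumptions introduced for illustration.

```python
import json
from collections import Counter


def load_labeled_dataset(lines, job_name="vehicle-labeling-demo", min_confidence=0.0):
    """Return (image location, numeric label) pairs from manifest lines.

    Records whose confidence is below min_confidence are dropped. A record with
    no confidence value is treated as 0.0, so it is kept only when
    min_confidence is 0.0.
    """
    dataset = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if job_name not in record:
            continue  # skip objects that did not receive a label
        metadata = record.get(f"{job_name}-metadata", {})
        # "confidence" is an assumed metadata key name.
        if metadata.get("confidence", 0.0) < min_confidence:
            continue
        dataset.append((record["source-ref"], record[job_name]))
    return dataset


# Hypothetical manifest lines; bucket, file names, and scores are placeholders.
manifest_lines = [
    json.dumps({"source-ref": "s3://example-bucket/a.jpg", "vehicle-labeling-demo": 0,
                "vehicle-labeling-demo-metadata": {"confidence": 0.95}}),
    json.dumps({"source-ref": "s3://example-bucket/b.jpg", "vehicle-labeling-demo": 1,
                "vehicle-labeling-demo-metadata": {"confidence": 0.40}}),
    json.dumps({"source-ref": "s3://example-bucket/c.jpg", "vehicle-labeling-demo": 0,
                "vehicle-labeling-demo-metadata": {"confidence": 0.88}}),
]

dataset = load_labeled_dataset(manifest_lines, min_confidence=0.5)
label_counts = Counter(label for _, label in dataset)
print(dataset)
print(label_counts)

# With the downloaded file, pass the open file object instead of a list:
# with open("output.manifest") as f:
#     dataset = load_labeled_dataset(f)
```

Filtering on confidence is optional. Setting min_confidence to 0.0 keeps every labeled record, which is a reasonable default when you have not yet inspected the score distribution.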