Labeling mixed-source, industrial datasets with Amazon SageMaker Ground Truth
Before you can use any kind of supervised machine learning (ML) algorithm, the data has to be labeled. Amazon SageMaker Ground Truth simplifies and accelerates this task: it lets you define workflows for labeling various kinds of data, such as text, video, or images, without writing any code, and provides pre-built templates for classifying the content of images or videos or for verifying existing labels. Although these templates cover a wide range of use cases in which the data to be labeled is in a single format or from a single source, industrial workloads often require labeling data from different sources and in different formats. This post explores the use case of industrial welding data consisting of sensor readings and images to show how to implement customized, complex, mixed-source labeling workflows using Ground Truth.
For this post, you deploy an AWS CloudFormation template in your AWS account to provision the foundational resources for implementing this labeling workflow. This provides you with hands-on experience for the following topics:
- Creating a private labeling workforce in Ground Truth
- Creating a custom labeling job using the Ground Truth framework with the following components:
  - Designing a pre-labeling AWS Lambda function that pulls data from different sources and runs a format conversion where necessary
  - Implementing a customized labeling user interface in Ground Truth using crowd templates that dynamically loads the data generated by the pre-labeling Lambda function
  - Consolidating labels from multiple workers using a customized post-labeling Lambda function
- Configuring a custom labeling job using Ground Truth with a customized interface for displaying multiple pieces of data that have to be labeled as a single item
Before diving deep into the implementation, I provide an introduction to the use case and show how the Ground Truth custom labeling framework eases the implementation of highly complex labeling workflows. To make full use of this post, you need an AWS account in which you can deploy CloudFormation templates. The total cost incurred on your account for following this post is under $1.
Labeling complex datasets for industrial welding quality control
Although the mechanisms discussed in this post are generally applicable to any labeling workflow with different data formats, I use data from a welding quality control use case. In this use case, the manufacturing company running the welding process wants to predict whether the welding result will be OK or whether one of a number of anomalies has occurred during the process. To implement this using a supervised ML model, you need labeled data to train the model: datasets representing welding processes that are labeled to indicate whether the process was normal or not. We implement this labeling process (not the ML or modeling process) using Ground Truth, which allows welding experts to assess the result of a welding process and assign this result to a dataset consisting of images and sensor data.
The CloudFormation template creates an Amazon Simple Storage Service (Amazon S3) bucket in your AWS account that contains images (prefix `images`) and CSV files (prefix `sensor_data`). The images contain pictures taken during an industrial welding process similar to the following, where a welding beam is applied onto a metal surface (for image source, see TIG Stainless Steel 304):
The CSV files contain sensor data representing current, electrode position, and voltage measured by sensors on the welding machine. For the full dataset, see the GitHub repo. A raw sample of this CSV data is as follows:
The first column of the data is a timestamp normalized to the start of the welding process. Each row consists of various sensor values associated with that timestamp: the first value after the timestamp is the electrode position, the second is the current, and the third is the voltage (the other values are irrelevant here). For instance, the row with timestamp `1`, 100 milliseconds after the start of the welding process, has an electrode position of `94.79`, a current of `1464`, and a corresponding voltage value.
Because it’s difficult for humans to make assessments using the raw CSV data, I also show how to preprocess such data on the fly for labeling and turn it into more easily readable plots. This way, the welding experts can view the images and the plots to make their assessment about the welding process.
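To illustrate this preprocessing step, the following minimal Python sketch parses raw sensor CSV text into one series per signal. The column layout (timestamp, electrode position, current, voltage) follows the description above, while the function name and the sample values are purely illustrative. In the deployed Lambda function, each series would then be rendered into a plot image (for example with matplotlib) and uploaded to Amazon S3.

```python
import csv
import io

# Assumed column layout based on the description above:
# timestamp, electrode position, current, voltage, ...
SIGNALS = {"position": 1, "current": 2, "voltage": 3}

def parse_sensor_csv(csv_text):
    """Parse raw welding sensor CSV into one value series per signal."""
    timestamps = []
    series = {name: [] for name in SIGNALS}
    for row in csv.reader(io.StringIO(csv_text)):
        timestamps.append(float(row[0]))
        for name, col in SIGNALS.items():
            series[name].append(float(row[col]))
    return timestamps, series

# Example with two made-up rows (values are illustrative only)
ts, series = parse_sensor_csv("0,94.70,1450,23.0\n1,94.79,1464,23.1\n")
```

Each of the resulting series can then be plotted against the timestamps to produce the human-readable graphs shown to the labeling workforce.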
Deploying the CloudFormation template
To simplify the setup and configurations needed in the following, I created a CloudFormation template that deploys several foundations into your AWS account. To start this process, complete the following steps:
- Sign in to your AWS account.
- Choose one of the following links, depending on which AWS Region you’re using:
- Keep all the parameters at their defaults and select both acknowledgements: I acknowledge that AWS CloudFormation might create IAM resources with custom names and I acknowledge that AWS CloudFormation might require the following capability: CAPABILITY_AUTO_EXPAND.
- Choose Create stack to start the deployment.
The deployment takes about 3–5 minutes, during which time a bucket with data to label, some AWS Lambda functions, and an AWS Identity and Access Management (IAM) role are deployed. The process is complete when the status of the deployment switches to CREATE_COMPLETE.
The Outputs tab has additional information, such as the Amazon S3 path to the manifest file, which you use throughout this post. Therefore, it’s recommended to keep this browser tab open and follow the rest of the post in another tab.
Creating a Ground Truth labeling workforce
Ground Truth offers three options for defining workforces that complete the labeling: Amazon Mechanical Turk, vendor-specific workforces, and private workforces. In this section, we configure a private workforce because we want to complete the labeling ourselves. Create a private workforce with the following steps:
- On the Amazon SageMaker console, under Ground Truth, choose Labeling workforces.
- On the Private tab, choose Create private team.
- Enter a name for the labeling workforce. For our use case, I enter welding-experts.
- Select Invite new workers by email.
- Enter your email address, an organization name, and a contact email (which may be the same as the one you just entered).
- Choose Create private team.
The console confirms the creation of the labeling workforce at the top of the screen. When you refresh the page, the new workforce shows on the Private tab, under Private teams.
You also receive an email with login instructions, including a temporary password and a link to open the login page.
- Choose the link and use your email and temporary password to authenticate and change the password for the login.
It’s recommended to keep this browser tab open so you don’t have to log in again. This concludes all necessary steps to create your workforce.
Configuring a custom labeling job
In this section, we create a labeling job and use this job to explain the details and data flow of a custom labeling job.
- On the Amazon SageMaker console, under Ground Truth, choose Labeling jobs.
- Choose Create labeling job.
- Enter a name for your labeling job, such as WeldingLabelJob1.
- Choose Manual data setup.
- For Input dataset location, enter the `ManifestS3Path` value from the CloudFormation stack Outputs tab.
- For Output dataset location, enter the `ProposedOutputPath` value from the CloudFormation stack Outputs tab.
- For IAM role, choose Enter a custom IAM role ARN.
- Enter the `SagemakerServiceRoleArn` value from the CloudFormation stack Outputs tab.
- For the task type, choose Custom.
- Choose Next.
The IAM role is a customized role created by the CloudFormation template that allows Ground Truth to invoke Lambda functions and access Amazon S3.
- Choose to use a private labeling workforce.
- From the drop-down menu, choose the workforce welding-experts.
- For task timeout and task expiration time, 1 hour is sufficient.
- The number of workers per dataset object is 1.
- In the Lambda functions section, for Pre-labeling task Lambda function, choose the function that starts with
- For Post-labeling task Lambda function, choose the function that starts with
- Enter the following code into the templates section. This HTML code specifies the interface that the workers in the private labeling workforce see when labeling items. For our use case, the template displays four images, and the categories to classify welding results are as follows:
The wizard for creating the labeling job has a preview function in the section Custom labeling task setup, which you can use to check if all configurations work properly.
- To preview the interface, choose Preview.
This opens a new browser tab and shows a test version of the labeling interface, similar to the following screenshot.
- To create the labeling job, choose Create.
Ground Truth sets up the labeling job as specified, and the dashboard shows its status.
To finalize the labeling job that you configured, you log in to the worker portal and assign labels to different data items consisting of images and data plots. The details on how the different components of the labeling job work together are explained in the next section.
- On the Amazon SageMaker console, under Ground Truth, choose Labeling workforces.
- On the Private tab, choose the link for Labeling portal sign-in URL.
When Ground Truth is finished preparing the labeling job, you can see it listed in the Jobs section. If it’s not showing up, wait a few minutes and refresh the tab.
- Choose Start working.
This launches the labeling UI, which allows you to assign labels to mixed datasets consisting of welding images and plots for current, electrode position, and voltage.
For this use case, you can assign seven different labels to a single dataset. These different classes and labels are defined in the HTML of the UI, but you can also insert them dynamically using the pre-labeling Lambda function (discussed in the next section). Because we don’t actually use the labeled data for ML purposes, you can assign the labels randomly to the five items that are displayed by Ground Truth for this labeling job.
After labeling all the items, the UI switches back to the list with available jobs. This concludes the section about configuring and launching the labeling job. In the next section, I explain the mechanics of a custom labeling job in detail and also dive deep into the different elements of the HTML interface.
Custom labeling deep dive
A custom labeling job combines the data to be labeled with three components to create a workflow that allows workers from the labeling workforce to assign labels to each item in the dataset:
- Pre-labeling Lambda function – Generates the content to be displayed on the labeling interface using the manifest file specified during the configuration of the labeling job. For this use case, the function also converts the CSV files into human readable plots and stores these plots as images in the S3 bucket under the prefix
- Labeling interface – Uses the output of the pre-labeling function to generate a user interface. For this use case, the interface displays four images (the picture taken during the welding process and the three graphs for current, electrode position, and voltage) and a form that allows workers to classify the welding process.
- Label consolidation Lambda function – Allows you to implement custom strategies to consolidate classifications of one or several workers into a single response. For our workforce, this is very simple because there is only a single worker whose labels are consolidated into a file, which is stored by Ground Truth into Amazon S3.
Before we analyze these three components, I provide insights into the structure of the manifest file, which describes the data sources for the labeling job.
Manifest and dataset files
The manifest file conforms to the JSON Lines format, in which each line represents one item to label. Ground Truth expects either a key `source` or `source-ref` in each line of the file. For this use case, I use `source`, and the mapped value must be a string representing an Amazon S3 path. For this post, we only label five items, and the JSON lines are similar to the following code:
For our use case with multiple input formats and files, each line in the manifest points to a dataset file that is also stored on Amazon S3. Our dataset is a JSON document, which contains references to the welding images and the CSV file with the sensor data:
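As a sketch of how these two levels fit together, the following Python snippet builds such a dataset document and a five-line manifest. The bucket name, file names, and the `image`/`sensor_data` key names are illustrative assumptions, not the exact schema used by the CloudFormation template.

```python
import json

# Hypothetical bucket and file names for illustration
bucket = "s3://my-labeling-bucket"

# One dataset file per welding process, referencing an image and a CSV file
dataset = {
    "image": {"file": f"{bucket}/images/weld-1.jpg"},
    "sensor_data": {"file": f"{bucket}/sensor_data/weld-1.csv"},
}

# The manifest is JSON Lines: one {"source": ...} entry per dataset file
manifest = "\n".join(
    json.dumps({"source": f"{bucket}/datasets/dataset-{i}.json"})
    for i in range(1, 6)
)
```

Each manifest line thus points at one dataset file, and the dataset file in turn points at all the raw inputs that belong to a single welding process.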
Ground Truth takes each line of the manifest file and triggers the pre-labeling Lambda function, which we discuss next.
Pre-labeling Lambda function
A pre-labeling Lambda function creates a JSON object that is used to populate the item-specific portions of the labeling interface. For more information, see Processing with AWS Lambda.
Before Ground Truth displays an item for labeling to a worker, it runs the pre-labeling function and forwards the information in the manifest’s JSON line to the function. For our use case, the event passed to the function is as follows:
Although I omit the implementation details here (for those interested, the code is deployed with the CloudFormation template for review), the function for our labeling job uses this input to complete the following steps:
- Download the dataset file referenced in the `source` field of the input (see the preceding code).
- Download the CSV file containing the sensor data, which the dataset file references.
- Generate plots for current, electrode position, and voltage from the contents of the CSV file.
- Upload the plot files to Amazon S3.
- Generate a JSON object containing the references to the aforementioned plot files and the welding image referenced in the dataset file.
When these steps are complete, the function returns a JSON object with two parts:
- taskInput – Fully customizable JSON object that contains information to be displayed on the labeling UI.
- isHumanAnnotationRequired – A string representing a Boolean value (`True` or `False`), which you can use to exclude objects from being labeled by humans. I don’t use this flag for this use case because we want to label all the provided data items.
For more information, see Processing with AWS Lambda.
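The overall shape of such a function can be sketched as follows. The event structure (a `dataObject` carrying the manifest line) and the `taskInput`/`isHumanAnnotationRequired` output contract follow the Ground Truth documentation; the S3 download is replaced by an injected `fetch` callable so the sketch runs without AWS access, and the dataset key names and plot-generation steps are only illustrative.

```python
import json

def lambda_handler(event, context=None, fetch=None):
    """Pre-labeling sketch for one manifest line.

    `fetch` stands in for the boto3 S3 download used in the real function;
    the dataset key names below are illustrative assumptions.
    """
    # Ground Truth passes the manifest line in event["dataObject"]
    source = event["dataObject"]["source"]
    dataset = json.loads(fetch(source))  # the dataset JSON document

    # The real function would also download the CSV referenced by the
    # dataset, render the three plots, upload them to S3, and add their
    # S3 paths to task_input here.
    task_input = {
        "image": {"file": dataset["image"]["file"]},
        "title": "Please classify the welding process shown below",
    }
    return {"taskInput": task_input, "isHumanAnnotationRequired": "true"}

# Usage with a stubbed download instead of S3
stub = lambda uri: json.dumps({"image": {"file": "s3://bucket/images/weld-1.jpg"}})
result = lambda_handler({"dataObject": {"source": "s3://bucket/datasets/dataset-1.json"}}, fetch=stub)
```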
Because I want to show the welding images and the three graphs for current, electrode position, and voltage, the result of the Lambda function is as follows for the first dataset:
In the preceding code, the `taskInput` is fully customizable; the function returns the Amazon S3 paths to the images to display, and also a title, which has some non-functional text. Next, I show how to access these different parts of the `taskInput` JSON object when building the customized labeling UI displayed to workers by Ground Truth.
Labeling UI: Accessing taskInput content
Ground Truth uses the output of the Lambda function to fill in content into the HTML skeleton that is provided at the creation of the labeling job. In general, the contents of the `taskInput` output object are accessed using `task.input` in the HTML code.
For instance, to retrieve the Amazon S3 path where the welding image is stored from the output, you need to access the path `taskInput/image/file`. Because the `taskInput` object from the function output is mapped to `task.input` in the HTML, the corresponding reference to the welding image file is `task.input.image.file`. This reference is directly integrated into the HTML code of the labeling UI to display the welding image:
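A minimal version of such a template might look like the following. The elements are SageMaker's crowd HTML components (`crowd-form`, `crowd-classifier`), but the classifier name, category list, and styling shown here are illustrative placeholders rather than the exact template used in this post:

```html
<crowd-form>
  <crowd-classifier
    name="weldingClassification"
    header="{{ task.input.title }}"
    categories="['OK', 'Not Sure']">
    <classification-target>
      <!-- grant_read_access creates a short-lived, signed URL for the image -->
      <img src="{{ task.input.image.file | grant_read_access }}" style="max-width: 40%;">
    </classification-target>
  </crowd-classifier>
</crowd-form>
```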
The `grant_read_access` filter is needed for files in S3 buckets that aren’t publicly accessible. It ensures that the URL passed to the browser contains a short-lived access token for the image, and thereby avoids having to make resources publicly accessible for labeling jobs. This is often mandatory because the data to be labeled, such as machine data, is confidential. Because the pre-labeling function has also converted the CSV files into plots and images, their integration into the UI is analogous.
Label consolidation Lambda function
The second Lambda function configured for the custom labeling job runs when all workers have labeled an item or the time limit of the labeling job is reached. The key task of this function is to derive a single label from the responses of the workers. Additionally, the function can be used for any kind of further processing of the labeled data, such as storing it on Amazon S3 in a format ideally suited for the ML pipeline that you use.
Although there are different possible strategies to consolidate labels, I focus on the cornerstones of the implementation for such a function and show how they translate to our use case. The consolidation function is triggered by an event similar to the following JSON code:
The key item in this event is the `payload`, which contains an `s3Uri` pointing to a file stored on Amazon S3. This payload file contains the list of datasets that have been labeled and the labels assigned to them by workers. The following code is an example of such a list entry:
Along with an identifier that you could use to determine which worker labeled the item, each entry lists the labels that have been assigned to the dataset. In the case of multiple workers, there are multiple entries in `annotations`. Because I created a single worker that labeled all the items for this post, there is only a single entry: the file `dataset-5.json` has been labeled with `Not Sure` for the classifier.
The label consolidation function has to iterate over all list entries and determine for each dataset a label to use as the ground truth for supervised ML training. Ground Truth expects the function to return a list containing an entry for each dataset item with the following structure:
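A sketch of such a consolidation function is shown below; it takes the parsed payload list and performs a simple majority vote. The `datasetObjectId`/`consolidatedAnnotation` output structure follows the Ground Truth documentation, while the classifier name `weldingClassification`, the label attribute name, and the sample payload are illustrative placeholders.

```python
import json
from collections import Counter

def consolidate(payload_items, label_attribute_name="WeldingLabelJob1"):
    """Majority vote over the workers' annotations for each dataset item."""
    consolidated = []
    for item in payload_items:
        # Each annotation's content is a JSON string produced by the labeling UI
        votes = Counter(
            json.loads(a["annotationData"]["content"])["weldingClassification"]["label"]
            for a in item["annotations"]
        )
        consolidated.append({
            "datasetObjectId": item["datasetObjectId"],
            "consolidatedAnnotation": {
                "content": {label_attribute_name: {"label": votes.most_common(1)[0][0]}}
            },
        })
    return consolidated

# Example: a single worker labeled dataset-5.json as "Not Sure"
payload = [{
    "datasetObjectId": "4",
    "annotations": [{
        "workerId": "private.us-east-1.abc123",
        "annotationData": {"content": json.dumps({"weldingClassification": {"label": "Not Sure"}})},
    }],
}]
result = consolidate(payload)
```

With a single worker the vote is trivial, but the same structure extends to several workers per item, or to any other consolidation strategy you plug in.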