Identifying worker labeling efficiency using Amazon SageMaker Ground Truth

A critical success factor in machine learning (ML) is the cleanliness and accuracy of training datasets. Training with mislabeled or inaccurate data can lead to a poorly performing model. But how can you easily determine if the labeling team is accurately labeling data?

One way is to manually sift through the results one worker at a time and tally up accuracy data. A better way is to automate the analysis for quicker, more accurate results.

This post focuses on how you can analyze the accuracy of human labelers using Amazon SageMaker Ground Truth. The post walks you through setting up and submitting a labeling job in Ground Truth, and from there, analyzing the accuracy of the workers who performed the annotations.

This post is targeted at anyone interested in learning how to perform a Text Classification labeling job with Amazon SageMaker using the Mechanical Turk (MTurk) worker type, and subsequently verifying the quality of the labels. The post walks you through a tutorial in which you perform each step of the process, so no prior knowledge of ML/AI or Amazon SageMaker is required.

Use case

This post aims to create a model that can predict whether a sentence is referencing a person, animal, or plant. You can train your model on some examples of each, but before you do that, you need to have an MTurk workforce annotate the training data correctly.

Prerequisites

To complete the tutorial, you need to set up the prerequisites for the following:

An AWS account
IAM permissions
An Amazon S3 bucket
An Amazon SageMaker notebook instance
A Jupyter notebook

Setting up an AWS account

If you haven’t already, create an AWS account. This tutorial incurs AWS usage charges, so be sure to shut down and delete resources when you are finished.

Setting up IAM permissions

If you’ve created labeling jobs in the past with Ground Truth, you may already have the permissions needed to complete this tutorial. Those permissions include:

AmazonSageMakerFullAccess policy for the end user – To access the Amazon SageMaker console and perform the steps in this tutorial, you need the AmazonSageMakerFullAccess policy applied to the user, group, or role you use for this tutorial. If you do not already have this policy, ask your administrator to grant it.
AmazonSageMakerFullAccess policy for the Jupyter Notebook and Ground Truth execution roles – The Jupyter notebook and Ground Truth labeling jobs you create in this tutorial need to run with a role that also has the AmazonSageMakerFullAccess policy. You can create the role one time and use it for both.

If you can create these roles yourself, the Amazon SageMaker GUI walks you through a wizard to set them up. If you do not have access to create these roles, ask your administrator to create them for you by completing the following steps:

From the IAM console, choose Roles.
Choose Create Role.
For Service, choose SageMaker.
Choose Next: Permissions. AmazonSageMakerFullAccess should be displayed.
Choose Next: Tags.
Choose Next: Review.
For Role name, enter sagemaker-execution-role.
Choose Create.

The administrator can then provide the ARN of this newly created role to the end user.

Setting up an S3 bucket

By default, the AmazonSageMakerFullAccess policy only grants access to S3 buckets that contain sagemaker in their name. Name your bucket accordingly, or modify the policy to grant the appropriate access.

Part of the labeling (and subsequent labeling verification steps) requires you to read and write files to S3. To make that possible, you need to create an S3 bucket. For more information, see Step 1: Create an Amazon S3 Bucket.

There is no need for public access to this bucket, so do not grant it.

Setting up an Amazon SageMaker notebook instance

An Amazon SageMaker notebook instance is a fully managed Amazon EC2 compute instance that runs the Jupyter Notebook app. For more information, see Step 2: Create an Amazon SageMaker Notebook Instance. Be sure to set up your notebook instance in the same Region as your S3 bucket.

Per the previous instructions, you must specify the IAM role under which the notebook instance runs. The specifics of this step depend on the permissions you possess.

If you have access to create the execution role, then when you create the IAM role, restrict access to the specific bucket you created for this tutorial in the Specific S3 buckets field. This gives the role the least permissive access. See the following screenshot.

If you do not have access to create the execution role and your administrator created the execution role for you, instead of choosing to create a new role, choose Enter a custom IAM role ARN and enter the administrator-provided ARN for the execution role.

Setting up a Jupyter notebook

To test everything is running correctly, create a test Jupyter notebook in the notebook instance you created in the previous step.

Running an ml.t2.medium for an hour against the materials in this tutorial is estimated to incur the following costs:

$0.016 per GB data in
$0.016 per GB data out
$0.01 for data storage
$0.05 for compute time

The total is approximately $0.09 (assuming about 1 GB of data transferred in each direction).

For more information about Notebook pricing, see Amazon SageMaker Pricing.

Creating and uploading the manifest files

Now that you have met the prerequisites, you can create the input and golden standard manifest files.

Creating the dataset files

Create a file locally named dataset.manifest that contains the following code:

{"source":"His nose could detect over 1 trillion odors!"}
{"source":"Why do fish live in salt water? Because pepper makes them sneeze!"}
{"source":"What did the buffalo say to his son when he went away on a trip? Bison!"}
{"source":"Why do plants go to therapy? To get to the roots of their problems!"}
{"source":"What do you call a nervous tree? A sweaty palm!"}
{"source":"Some kids in my family really like birthday cakes and stars!"}
{"source":"A small portion of the human population carries a fabella bone."}

Create a file locally named golden.manifest that contains the following code:

{"0":"person"}
{"1":"animal"}
{"2":"animal"}
{"3":"plant"}
{"6":"person"}
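If you prefer to generate the two files rather than type them by hand, the following is a minimal Python sketch. It assumes only that both manifests are JSON Lines files (one JSON object per line), as shown above. The function name write_manifests and the temporary output directory are illustrative choices, not part of the tutorial.

```python
import json
import tempfile
from pathlib import Path

# Sentences copied from the dataset.manifest listing in this post.
SENTENCES = [
    "His nose could detect over 1 trillion odors!",
    "Why do fish live in salt water? Because pepper makes them sneeze!",
    "What did the buffalo say to his son when he went away on a trip? Bison!",
    "Why do plants go to therapy? To get to the roots of their problems!",
    "What do you call a nervous tree? A sweaty palm!",
    "Some kids in my family really like birthday cakes and stars!",
    "A small portion of the human population carries a fabella bone.",
]

# Golden labels keyed by zero-based line index in dataset.manifest.
GOLDEN = {0: "person", 1: "animal", 2: "animal", 3: "plant", 6: "person"}


def write_manifests(directory):
    """Write dataset.manifest and golden.manifest as JSON Lines files."""
    directory = Path(directory)
    dataset_path = directory / "dataset.manifest"
    golden_path = directory / "golden.manifest"
    compact = (",", ":")  # no spaces, matching the listings in this post
    with dataset_path.open("w", encoding="utf-8") as f:
        for sentence in SENTENCES:
            f.write(json.dumps({"source": sentence}, separators=compact) + "\n")
    with golden_path.open("w", encoding="utf-8") as f:
        for index, label in GOLDEN.items():
            f.write(json.dumps({str(index): label}, separators=compact) + "\n")
    return dataset_path, golden_path


# Hypothetical output location: a fresh temporary directory.
dataset_path, golden_path = write_manifests(tempfile.mkdtemp())
print(dataset_path.read_text(encoding="utf-8"))
print(golden_path.read_text(encoding="utf-8"))
```

Generating the files this way avoids stray quotes or trailing commas, which would make a line invalid JSON.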

dataset.manifest

The dataset.manifest file is used as the input dataset for the labeling job. Each object in dataset.manifest contains a line of text describing either a person, animal, or plant. One or more of these lines of text is presented as tasks to your workers; they are responsible for correctly identifying which of the three classifications the line of text best fits.

For this post, dataset.manifest only has seven lines (workers can label up to seven objects), but this input dataset file could have up to 100,000 entries.

golden.manifest

Use the golden.manifest file as an answer sheet, or golden standard, to check your workers’ labeling accuracy against tasks in dataset.manifest. Each entry in the golden.manifest file holds the correct (golden) label for a specific object in the dataset.manifest file. You achieve this correlation by mapping each key in the golden.manifest file to a line number (where the index starts at zero) in the dataset.manifest file.

The following image shows how each golden standard answer in golden.manifest maps to its associated object or question in the dataset.manifest.

The image on the left represents the entries in golden.manifest, and the image on the right represents entries in dataset.manifest. Each key contained in golden.manifest represents exactly one golden answer associated with an object in dataset.manifest.

Each golden answer in the golden.manifest maps to a valid line index in the dataset.manifest file. However, this post doesn’t define answers for all the object indexes in the input dataset.manifest (specifically, indexes 4 and 5 have no golden answer).

You can effectively spot-check worker accuracy and make generalizations on that worker’s accuracy across the entire dataset by creating a diverse subset of golden answers dispersed evenly throughout the dataset.

You may want to provide a golden answer for every object in the input dataset.manifest. This may be useful if you want to perform a more exhaustive (potentially prerequisite or qualifying) measurement of a worker’s accuracy.
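The key-to-line mapping described in this section can be checked with a short script. The following Python sketch embeds the two manifests from this post as strings, joins each golden answer to its sentence, and reports which indexes have no golden answer. The helper names parse_json_lines and load_golden are hypothetical.

```python
import json

DATASET_MANIFEST = """\
{"source":"His nose could detect over 1 trillion odors!"}
{"source":"Why do fish live in salt water? Because pepper makes them sneeze!"}
{"source":"What did the buffalo say to his son when he went away on a trip? Bison!"}
{"source":"Why do plants go to therapy? To get to the roots of their problems!"}
{"source":"What do you call a nervous tree? A sweaty palm!"}
{"source":"Some kids in my family really like birthday cakes and stars!"}
{"source":"A small portion of the human population carries a fabella bone."}
"""

GOLDEN_MANIFEST = """\
{"0":"person"}
{"1":"animal"}
{"2":"animal"}
{"3":"plant"}
{"6":"person"}
"""


def parse_json_lines(text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def load_golden(text):
    """Merge the single-key objects into {line_index: label}."""
    golden = {}
    for entry in parse_json_lines(text):
        for key, label in entry.items():
            golden[int(key)] = label
    return golden


dataset = parse_json_lines(DATASET_MANIFEST)
golden = load_golden(GOLDEN_MANIFEST)

# Every golden key must be a valid zero-based line index in the dataset.
invalid = [i for i in golden if not 0 <= i < len(dataset)]
uncovered = [i for i in range(len(dataset)) if i not in golden]

for index, label in sorted(golden.items()):
    print(f"{index}: {label:<7} {dataset[index]['source']}")
print("Invalid golden keys:", invalid)
print("Indexes without a golden answer:", uncovered)
```

Running a check like this before you start the labeling job catches golden keys that point past the end of the dataset.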

Uploading the dataset files to S3

You can now upload both files to the root directory of your S3 bucket. When uploading these files to S3, you can accept all defaults. For more information, see How Do I Upload Files and Folders to an S3 Bucket?
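If you would rather script the upload than use the console, the following sketch shows one possible approach with the boto3 S3 client's upload_file call. The function name upload_manifests and the bucket name in the usage comment are hypothetical. The client is passed in as a parameter so the function can be exercised without AWS credentials.

```python
from pathlib import Path


def upload_manifests(s3_client, bucket, paths):
    """Upload each local file to the bucket root, keyed by its file name.

    s3_client is expected to behave like boto3.client("s3").
    """
    uploaded = []
    for path in map(Path, paths):
        # boto3 S3 client signature: upload_file(Filename, Bucket, Key)
        s3_client.upload_file(Filename=str(path), Bucket=bucket, Key=path.name)
        uploaded.append(f"s3://{bucket}/{path.name}")
    return uploaded


# Typical use (requires boto3 and AWS credentials, so it is not run here):
#   import boto3
#   upload_manifests(boto3.client("s3"), "your-sagemaker-bucket",
#                    ["dataset.manifest", "golden.manifest"])
```

Using the file name as the object key places both files in the bucket root, which matches the paths used in the rest of this post.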

Creating a labeling job

After you upload the files to S3, you can create your labeling job. Complete the following steps.

From the Amazon SageMaker console, choose Labeling jobs.
Choose Create labeling job.
Under Job overview, provide the following information:
1. For Job name, enter person-animal-plant.
2. Do not select I want to specify a label attribute name different from the labeling job name.
3. For Input dataset location, enter the S3 location of the manifest file that you created. This post uses s3://gec-sagemaker-blog/dataset.manifest.
4. For Output dataset location, enter the location where your output data is written. This post uses s3://gec-sagemaker-blog/output.
5. For IAM role, enter the IAM role under which the labeling job runs.

What you enter here depends on the permissions you possess.

If you have access to create the execution role, you can re-use the role you created earlier when you set up the notebook instance. Choose Use existing role, and select the existing role name (it begins with AmazonSageMaker-ExecutionRole-).

If you do not have access to create the execution role and your administrator created the execution role for you, choose Enter a custom IAM role ARN and enter the administrator-provided ARN for the execution role.

Under Task type, for Task category, choose Text.
For Task Selection, choose Text classification.
Choose Next.
On the Select workers and configure tool screen, enter the following:
1. For Worker Types, choose Amazon Mechanical Turk.
2. For Price per task, choose the default for Low Complexity Tasks.
3. Select The dataset does not contain adult content.
4. Select By selecting to have your dataset annotated by Amazon Mechanical Turk workers, you understand and agree…
5. Do not select Enable automated data labeling.
6. Under Additional configuration, for Number of workers per dataset object, enter 9.

On the same page, under the Text classification labeling section, in the left section, enter the following text:

Based on the general subject or topic of each sentence presented, please classify it as only one of the following: person, animal, or plant.

Examples:

Growing your own vegetables and fruits is a smart choice. - plant

A liger is the offspring of a lion and tiger. - animal

Einstein once said, Two things are infinite: the universe and human stupidity; and I'm not sure about the universe. - person

In the top section, enter the following text:

Based on the general subject or topic of each sentence presented, please classify it as only one of the following: person, animal, or plant. See the examples for more info.

Under Select an option, add the labels person, animal, and plant.
Choose Create.

For this portion of the walkthrough, you incur the following charges:

Seven objects to label, at $0.08 per object (totaling $0.56)
Seven objects, each labeled by nine unique workers, at $0.012 per worker per object (totaling $0.756, or about $0.76)

The total charge is $1.32.
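The arithmetic behind the total can be reproduced with a few lines of Python. This is a sketch that uses only the prices quoted above. The variable names are illustrative, and current prices may differ from these figures.

```python
from decimal import Decimal

OBJECTS = 7
WORKERS_PER_OBJECT = 9
PRICE_PER_OBJECT = Decimal("0.08")        # per labeled object
PRICE_PER_WORKER_TASK = Decimal("0.012")  # per worker, per object

labeling_cost = OBJECTS * PRICE_PER_OBJECT
worker_cost = OBJECTS * WORKERS_PER_OBJECT * PRICE_PER_WORKER_TASK
total = labeling_cost + worker_cost

print(f"labeling: ${labeling_cost}")  # 7 x 0.08
print(f"workers:  ${worker_cost}")    # 7 x 9 x 0.012
print(f"total:    ${total.quantize(Decimal('0.01'))}")
```

Decimal is used instead of float so the cents add up exactly. The worker portion comes to $0.756 before rounding, which is why the rounded total is $1.32.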

For more information about pricing, see Amazon SageMaker Ground Truth Pricing and Amazon SageMaker Pricing.

You can now see in the Labeling Jobs section that your new labeling job has the status In Progress, which means that the job is underway.

For more information about MTurk-based workers, see Using the Amazon Mechanical Turk Workforce.

When the job first starts, you see 0/7 labeled objects. When it’s complete, you see 7/7 with the status Complete. When the job is complete, proceed to the next section.

Analyzing worker labeling efficiency

After the labeling job has completed, you can review the labeling results produced by the MTurk workforce. Amazon SageMaker stores all the output in the location you specified when you created the labeling job. For this post, that is the output directory in your S3 bucket.

Output directory structure overview

If you view the S3 bucket used in this tutorial, and navigate to output/person-animal-plant/annotations/worker-response/iteration-1, you see seven subdirectories. Each correlates to a unique object referenced in the dataset.manifest file. You can find the results of labeling the first sentence in dataset.manifest in the 0 directory, the results of the second in the 1 directory, and so forth.

For more information, see Output Data.
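To make the directory layout concrete, the following sketch builds the S3 key prefix for each object's worker responses, following the path described above. The helper names are hypothetical. list_response_keys assumes a client that behaves like boto3.client("s3") and uses its list_objects_v2 call, which returns at most 1,000 keys per request (enough for this tutorial).

```python
def worker_response_prefix(output_prefix, job_name, object_index, iteration=1):
    """Build the key prefix that holds one object's worker responses."""
    return (
        f"{output_prefix}/{job_name}/annotations/worker-response/"
        f"iteration-{iteration}/{object_index}/"
    )


def list_response_keys(s3_client, bucket, prefix):
    """Return the object keys stored under a prefix (single page only)."""
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    return [item["Key"] for item in response.get("Contents", [])]


# One prefix per line of dataset.manifest (indexes 0 through 6).
prefixes = [
    worker_response_prefix("output", "person-animal-plant", index)
    for index in range(7)
]
for prefix in prefixes:
    print(prefix)
```

With real credentials, you could pass boto3.client("s3") and your bucket name to list_response_keys to find the JSON file inside each directory.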

Manually identifying worker accuracy

Review the JSON file located in the 0 directory. Your results may differ from the ones shown here because different workers performed your tasks.

The content in the JSON file looks similar to the following code:

...
,{
  "answers": [{
    "answerContent": {
      "crowd-classifier": {
        "label": "person"
      }
    },
    "submissionTime": "2019-09-27T02:41:34Z",
    "workerId": "public.us-east-1.ABC"
  }, {
    "answerContent": {
      "crowd-classifier": {
        "label": "person"
      }
    },
    "submissionTime": "2019-09-27T02:42:03Z",
    "workerId": "public.us-east-1.XYZ"
  },
  ...

This code example contains the results of two workers who were tasked with labeling the 0th (first) sentence in the dataset.manifest file as either a person, animal, or plant. That sentence is the following:

{"source":"His nose could detect over 1 trillion odors!"}

Because this sentence refers to someone’s nose, workers should label it as a person. From the JSON, you can see that worker ABC did indeed label it as a person. The next line shows that worker XYZ did the same.

You can examine the rest of the file and see if any other workers labeled it as something different. You can gain insight on who labeled what and at what accuracy level by also examining the other files.
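The manual check can be expressed in a few lines of Python. The sketch below assumes the worker response file is a single JSON object with an answers list shaped like the excerpt above. The sample document is a hypothetical reconstruction containing only the two answers shown, and labels_by_worker is an illustrative helper name.

```python
import json

# Hypothetical sample modeled on the excerpt; a real file has more answers.
SAMPLE_RESPONSE = """
{"answers": [
  {"answerContent": {"crowd-classifier": {"label": "person"}},
   "submissionTime": "2019-09-27T02:41:34Z",
   "workerId": "public.us-east-1.ABC"},
  {"answerContent": {"crowd-classifier": {"label": "person"}},
   "submissionTime": "2019-09-27T02:42:03Z",
   "workerId": "public.us-east-1.XYZ"}
]}
"""


def labels_by_worker(response_text):
    """Map each worker ID to the label that worker submitted."""
    document = json.loads(response_text)
    return {
        answer["workerId"]: answer["answerContent"]["crowd-classifier"]["label"]
        for answer in document["answers"]
    }


labels = labels_by_worker(SAMPLE_RESPONSE)
golden_label = "person"  # golden answer for index 0 in golden.manifest

# Workers whose label disagrees with the golden answer.
mismatches = [w for w, label in labels.items() if label != golden_label]

for worker_id, label in labels.items():
    print(worker_id, label)
print("Workers who disagreed with the golden answer:", mismatches)
```

Repeating this for every directory by hand is tedious, which motivates the automation in the next section.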

Automating identification of worker accuracy

For the results of this post, the first two sentences (which correlate with the JSON output of directories 0 and 1) are unanimously labeled correctly, but the results are not unanimous for the third sentence (via the JSON output of directory 2). See the following code:

...
, {
    "answerContent": {
      "crowd-classifier": {
        "label": "animal"
      }
    },
    "submissionTime": "2019-09-27T02:43:41Z",
    "workerId": "public.us-east-1.HJK"
  }, {
    "answerContent": {
      "crowd-classifier": {
        "label": "person"
      }
    },
...

The preceding results are for the sentence with index 2 (the third sentence in the dataset.manifest file):

{"source":"What did the buffalo say to his son when he went away on a trip? Bison!"}

There appears to be confusion about whether this sentence refers to a person or an animal. You may want to dig deeper to get a better idea of accuracy on a per-worker basis. This can help you determine whether the task is unclear or whether the worker is not trained well enough to perform the task accurately. For more information, see Create high-quality instructions for Amazon SageMaker Ground Truth labeling jobs.
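One quick way to spot objects like this is to tally the labels each object received. The following sketch uses hypothetical vote lists (the real per-object counts are not shown in this post) and a simple majority count. Ground Truth's own consolidation logic may differ, so treat the majority here only as an illustration.

```python
from collections import Counter

# Hypothetical votes from nine workers per object; not the post's real data.
VOTES = {
    0: ["person"] * 9,
    1: ["animal"] * 9,
    2: ["animal"] * 6 + ["person"] * 3,
}


def summarize_votes(labels):
    """Return vote counts, the majority label, and the agreement ratio."""
    counts = Counter(labels)
    top_label, top_count = counts.most_common(1)[0]
    return {
        "counts": dict(counts),
        "majority": top_label,
        "agreement": top_count / len(labels),
        "unanimous": len(counts) == 1,
    }


summary = {index: summarize_votes(labels) for index, labels in VOTES.items()}
contested = [index for index, s in summary.items() if not s["unanimous"]]

for index, s in summary.items():
    print(index, s["majority"], f"{s['agreement']:.0%}", s["counts"])
print("Objects without unanimous labels:", contested)
```

A low agreement ratio on one object points to an unclear task, while one worker disagreeing across many objects points to that worker.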

Setting up automated accuracy logic

Ground Truth recorded data in S3 that shows how each worker labeled each object. You also have the consolidated label for each object, which Ground Truth calculated from all workers’ annotations. In addition, you have the golden.manifest file as a reference for calculating accuracy.

To calculate worker accuracy, you use all of this data as input to Python code running in a Jupyter notebook.

Copying the notebook into the instance

To retrieve the notebook, complete the following steps:

On the Amazon SageMaker console, choose Notebook instances.
For the notebook you created during the setup process, choose Open Jupyter.
In the Jupyter GUI, choose the SageMaker Examples tab.
Scroll down to the Ground Truth Labeling Jobs drop-down menu and expand it.
For Identify Worker Accuracy.ipynb, choose Use.
Choose Create Copy.

The notebook should now be open. You can configure it to run against your specific Amazon SageMaker config.

Configuring the Jupyter notebook

The following assumes that Amazon SageMaker and S3 resources are all running from the same Region. If not, the following steps do not work.

To configure your notebook, scroll down to the last cell in the notebook and modify the following paths as appropriate to your personal config:

bucket – Replace 'gec-sagemaker-blog' with your S3 bucket name. Make sure you keep the bucket name quoted.

In the following steps, you set string values that are concatenated to bucket_root_url. Be sure to leave the bucket_root_url variable in place, and only modify the value that is being concatenated to it.

golden_answers – This should point to your golden.manifest file in S3. If you followed the previous labeling job steps, leave this value as the default golden.manifest.
worker_metrics_output – The name of the file that contains the accuracy metrics output. You can leave this value as the default worker_metrics.json.
labeling_job_output_location – The labeling job’s output directory. If you followed the labeling job steps, leave this value as the default output.
labeling_job_name – The name of the labeling job you created. If you followed the labeling job steps, leave this value as default.
Under File, choose Save and Checkpoint.
Under Cell, choose Run All.
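For orientation, the configuration cell might look roughly like the following sketch. The variable names come from the list above. How bucket_root_url is built and whether labeling_job_name is a plain string are assumptions, so check the actual cell in your copy of the notebook and do not paste this over it.

```python
# Replace with your own bucket name, keeping the quotes.
bucket = "gec-sagemaker-blog"

# Assumption: the notebook derives an s3:// prefix from the bucket name.
bucket_root_url = "s3://" + bucket + "/"

# Only the value concatenated to bucket_root_url should change.
golden_answers = bucket_root_url + "golden.manifest"
worker_metrics_output = bucket_root_url + "worker_metrics.json"
labeling_job_output_location = bucket_root_url + "output"

# Assumption: the job name is a plain string, not an S3 path.
labeling_job_name = "person-animal-plant"

for name, value in [
    ("golden_answers", golden_answers),
    ("worker_metrics_output", worker_metrics_output),
    ("labeling_job_output_location", labeling_job_output_location),
    ("labeling_job_name", labeling_job_name),
]:
    print(f"{name} = {value}")
```

Printing the resolved values before you run the analysis makes it easy to spot a misspelled bucket or path.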

You should see some output under the last cell of the notebook reporting results. This originates from line 41 of the write_worker_metrics function. This console print is for convenience so you don’t need to download the worker_metrics.json file from S3 each time you run the analysis.

If this output does become too large and a nuisance to print out in the console, you can comment it out and retrieve the results exclusively from the S3 worker_metrics.json file (or whatever value you gave for worker_metrics_output) in the root of your S3 bucket.

Examining the worker metrics output

Now that you automated this process, you can dive deeper into the results to get some insights into how your workers are performing.

The metrics include an object per worker, with the worker’s ID as the key. The value for each key contains the following attributes:

Total Number of Objects Annotated – This is the sum of all the objects the worker annotated (non-golden standard objects and golden standard objects).
Number of Golden Standard Objects Annotated – This is the number of golden standard objects the worker annotated. To deduce the number of non-golden standard objects the worker annotated, subtract this value from the total number of objects annotated.
Average Golden Standard Accuracy – When a worker annotates a golden standard object, the Python logic tracks the following:
- The number of golden standard objects the worker is asked to annotate.
- The total number of golden standard objects the worker answers correctly.

The score is the quotient of these two numbers: the number of golden standard objects the worker answered correctly, divided by the total number of golden standard objects the worker annotated. It is presented as the Average Golden Standard Accuracy score for that worker. To get the percentage, multiply this number by 100.

Average Accuracy Compared To Other Workers – For all objects (golden and non-golden) annotated by the worker, the worker’s responses are compared to the consolidated responses calculated by Ground Truth (which are derived from all workers’ annotations). The score is the fraction of the worker’s responses that match the consolidated response. To get the percentage, multiply this number by 100.
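The four attributes described above can be computed as in the following sketch. This is not the notebook's code. It is a minimal reimplementation under stated assumptions: annotations arrive as (worker ID, object index, label) tuples, the consolidated labels are supplied as a dictionary, and a worker with no golden standard objects gets a golden accuracy of 0.0. The worker IDs, annotations, and consolidated labels are hypothetical sample data.

```python
def worker_metrics(annotations, golden, consolidated):
    """Compute per-worker metrics from (worker_id, object_index, label) tuples."""
    stats = {}
    for worker_id, index, label in annotations:
        s = stats.setdefault(
            worker_id,
            {"total": 0, "golden_total": 0, "golden_correct": 0, "peer_correct": 0},
        )
        s["total"] += 1
        if index in golden:
            s["golden_total"] += 1
            s["golden_correct"] += int(label == golden[index])
        s["peer_correct"] += int(label == consolidated[index])

    report = {}
    for worker_id, s in stats.items():
        # Assumption: report 0.0 when the worker saw no golden objects.
        golden_accuracy = (
            s["golden_correct"] / s["golden_total"] if s["golden_total"] else 0.0
        )
        report[worker_id] = {
            "Total Number Of Objects Annotated": s["total"],
            "Number Of Golden Standard Objects Annotated": s["golden_total"],
            "Average Golden Standard Accuracy": golden_accuracy,
            "Average Accuracy Compared To Other Workers": s["peer_correct"] / s["total"],
        }
    return report


golden = {0: "person", 1: "animal", 2: "animal", 3: "plant", 6: "person"}
# Hypothetical consolidated labels for all seven objects.
consolidated = {0: "person", 1: "animal", 2: "animal", 3: "plant",
                4: "plant", 5: "person", 6: "person"}
# Hypothetical annotations from three workers.
annotations = [
    ("worker-A", 0, "person"), ("worker-A", 1, "animal"), ("worker-A", 4, "plant"),
    ("worker-B", 0, "person"), ("worker-B", 1, "animal"), ("worker-B", 2, "person"),
    ("worker-B", 3, "plant"), ("worker-B", 5, "person"),
    ("worker-C", 4, "plant"),
]

report = worker_metrics(annotations, golden, consolidated)
for worker_id, metrics in report.items():
    print(worker_id, metrics)
```

Keeping the golden comparison separate from the peer comparison lets you score workers even on objects that have no golden answer.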

Example output

In this first example, worker public.us-east-1.PDC performed a total of three annotations. The golden standard file referenced two out of the three. For the items they annotated in the golden standard file, they were 100% accurate. For all items they annotated (including items that may have been referenced in the golden answers file), relative to other workers, their accuracy was also 100%. See the following code:

...
 'public.us-east-1.PDC': {'Total Number Of Objects Annotated': 3,
                          'Number Of Golden Standard Objects Annotated': 2,
                          'Average Golden Standard Accuracy': 1.0,
                          'Average Accuracy Compared To Other Workers': 1.0,
                          },
...

In this second example, worker public.us-east-1.SRB performed a total of five annotations. Although they performed more annotations than the previous worker, their accuracy compared to the golden standard and to other workers (75% and 80%, respectively) wasn’t perfect. See the following code:

...
 'public.us-east-1.SRB': {'Total Number Of Objects Annotated': 5,
                          'Number Of Golden Standard Objects Annotated': 4,
                          'Average Golden Standard Accuracy': 0.75,
                          'Average Accuracy Compared To Other Workers': 0.8,
                          },
...

In this third example, the worker public.us-east-1.XYZ performed only one annotation. The object they annotated was not in the golden standard set. Therefore, the golden standard accuracy for this worker is reported as 0%. This is because there were no golden standard objects to score, not because the worker answered one incorrectly. However, the one annotation this worker did perform was 100% accurate relative to the results from the others who worked on it. See the following code:

...
 'public.us-east-1.XYZ': {'Total Number Of Objects Annotated': 1,
                          'Number Of Golden Standard Objects Annotated': 0,
                          'Average Golden Standard Accuracy': 0.0,
                          'Average Accuracy Compared To Other Workers': 1.0,
                          },
...
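When you review many workers, it can help to convert the scores to percentages and to separate workers who were never tested against a golden answer from workers who scored 0%. The following sketch does that for the three example entries above. The function name to_percent_rows and the sort order are illustrative choices.

```python
METRICS = {
    "public.us-east-1.PDC": {
        "Total Number Of Objects Annotated": 3,
        "Number Of Golden Standard Objects Annotated": 2,
        "Average Golden Standard Accuracy": 1.0,
        "Average Accuracy Compared To Other Workers": 1.0,
    },
    "public.us-east-1.SRB": {
        "Total Number Of Objects Annotated": 5,
        "Number Of Golden Standard Objects Annotated": 4,
        "Average Golden Standard Accuracy": 0.75,
        "Average Accuracy Compared To Other Workers": 0.8,
    },
    "public.us-east-1.XYZ": {
        "Total Number Of Objects Annotated": 1,
        "Number Of Golden Standard Objects Annotated": 0,
        "Average Golden Standard Accuracy": 0.0,
        "Average Accuracy Compared To Other Workers": 1.0,
    },
}


def to_percent_rows(metrics):
    """Return (worker_id, golden_pct or None, peer_pct) rows for review."""
    rows = []
    for worker_id, m in metrics.items():
        golden_count = m["Number Of Golden Standard Objects Annotated"]
        # None marks "not scored" so it is not mistaken for a 0% score.
        golden_pct = (
            round(m["Average Golden Standard Accuracy"] * 100, 1)
            if golden_count
            else None
        )
        peer_pct = round(m["Average Accuracy Compared To Other Workers"] * 100, 1)
        rows.append((worker_id, golden_pct, peer_pct))
    # Scored workers first, lowest golden accuracy at the top.
    rows.sort(key=lambda r: (r[1] is None, r[1] if r[1] is not None else 0.0))
    return rows


rows = to_percent_rows(METRICS)
for worker_id, golden_pct, peer_pct in rows:
    golden_text = "not scored" if golden_pct is None else f"{golden_pct}%"
    print(f"{worker_id}: golden={golden_text}, peers={peer_pct}%")
```

Sorting the lowest golden accuracy to the top puts the workers who most need review first.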

Cleaning up

Cleaning up helps keep things tidy, and also prevents additional costs from accruing.

Be sure to make an offline copy of the S3 and Jupyter notebook data before you delete them to save any data you’d like to reference at a later time.

To stop incurring S3 costs, delete the bucket you used for the tutorial and all files within it. To stop incurring Jupyter Notebook costs, stop and delete the notebook instance.

Conclusion

This post walked you through setting up and submitting a labeling job in Ground Truth, and from there, analyzing the accuracy of the workers who performed the labeling jobs.

The opportunities to evolve and grow this code are endless. You can add functionality and fixes as you see fit, and if you think others would benefit from them, create a pull request against the repo to make your changes available to everyone who walks through this post.

About the Authors

Dharani Srinivasan is a Software Development Engineer at Amazon. In her spare time, she likes to read all kinds of books and spend time with family.

Geremy Cohen is a Solutions Architect with AWS where he helps customers build cutting-edge, cloud-based solutions. In his spare time, he enjoys short walks on the beach, exploring the bay area with his family, fixing things around the house, breaking things around the house, and BBQing.