AWS Machine Learning Blog

Adding a data labeling workflow for named entity recognition with Amazon SageMaker Ground Truth

Launched at AWS re:Invent 2018, Amazon SageMaker Ground Truth enables you to efficiently and accurately label the datasets required to train machine learning (ML) systems. Ground Truth provides built-in labeling workflows that take human labelers step-by-step through tasks and provide tools to help them produce good results. Built-in workflows are currently available for object detection, image classification, text classification, and semantic segmentation labeling jobs.

Today, AWS launched support for a new use case: named entity recognition (NER). NER involves sifting through text data to locate noun phrases called named entities, and categorizing each with a label, such as “person,” “organization,” or “brand.” So, in the statement “I recently subscribed to Amazon Prime,” “Amazon Prime” would be the named entity and could be categorized as a “brand.”

You can broaden this use case to label longer spans of text and categorize those sequences with any pre-specified labels. For example, the following screenshot identifies spans of text in a performance review that demonstrate the Amazon leadership principle “Customer Obsession.”

Overview

In this post, I walk you through the creation of a NER labeling job:

  1. Gather a dataset.
  2. Create the labeling job.
  3. Select a workforce.
  4. Create task instructions.

For this exercise, your NER labeling task is to identify brand names from a dataset. I have provided a sample dataset of ten tweets from the Amazon Twitter account. Alternatively, feel free to bring your own dataset, and define a specific NER labeling task that is relevant to your use case.

Prerequisites

To follow the steps outlined in this post, you need an AWS account and access to AWS services.

Step 1: Gather your dataset and store data in Amazon S3

Gather the dataset to label, save it to a text file, and upload the file to Amazon S3. For example, I gathered 10 tweets, saved them to a text file with one tweet per return-separated line, and uploaded the text file to an S3 bucket called “ner-blog.” For your reference, the following box contains the uploaded tweets from the text file.

Don’t miss the 200th episode of Today’s Deals Live TODAY! Tune in to watch our favorite moments and celebrate our 200th episode milestone! #AmazonLive (link: https://amzn.to/2JQ2vDm) amzn.to/2JQ2vDm
It's the thought that counts, but our Online Returns Center makes gift exchanges and returns simple (just in case!) https: (link: https://amzn.to/2l6qYKG) amzn.to/2l6qYKG
Did you know you can trade in select Apple, Samsung, and other tablets? With the Amazon Trade-in program, you can receive an Amazon Gift Card + 25% off toward a new Fire tablet when you trade in your used tablets. (link: https://amzn.to/2Ybdu1Y) amzn.to/2Ybdu1Y
Thank you, Prime members, for making this #PrimeDay the largest shopping event in our history! You purchased more than 175 million items, from devices to groceries!
Hip hip for our Belei charcoal mask! This staple in our skincare line is a @SELFMagazine 2019 Healthy Beauty Award winner.
Looking to take your photography skills to the next level? Check out (link: http://amazon.com/primeday) amazon.com/primeday for an amazing camera deal.
Is a TV on your #PrimeDay wish list? Keep your eyes on (link: http://amazon.com/primeday) amazon.com/primeday for a TV deal soon.
Improve your musical talents by staying in tune on (link: http://amazon.com/primeday) amazon.com/primeday for an acoustic guitar deal launching soon.
.@LadyGaga’s new makeup line, @HausLabs, is available now for pre-order! #PrimeDay (link: http://amazon.com/hauslabs) amazon.com/hauslabs
#PrimeDay ends tonight, but the parade of deals is still going strong. Get these deals while they’re still hot! (link: https://amzn.to/2lgqZM3) amzn.to/2lgqZM3

Step 2: Create a labeling job

  1. In the Amazon SageMaker console, choose Labeling jobs, Create labeling job.
  2. To set up the input dataset location, choose Create manifest file.
  3. Point to the S3 location of the text file that you uploaded in Step 1, and select Text, Create.
  4. After the creation process finishes, choose Use this manifest, and complete the following fields:
    • Job name—Custom value.
    • Input dataset location—S3 location of the text file to label. (The previous step should have populated this field.)
    • Output dataset location—S3 location to which Amazon SageMaker sends labels and job metadata.
    • IAM Role—A role that has read and write permissions for this task’s Input dataset and Output dataset locations in S3.
  5. Under Task type, for Task Category, choose Text.
  6. For Task selection, select Named entity recognition.

Step 3: Selecting a labeling workforce

The Workers interface offers three Worker types:

The console includes other Workers settings, including Price per task and the optional Number of workers per dataset.

For this demo, use Public. Set Price per task at $0.024. Mechanical Turk workers should complete the relatively straightforward task of identifying brands in a tweet in 5–7 seconds.

Use the default value for Number of workers per dataset object (in this case, a single tweet), which is 3. SageMaker Ground Truth asks three workers to label each tweet and then consolidates those three workers’ responses into one high-fidelity label. To learn more about consolidation approaches, see Annotation Consolidation.

Step 4: Creating the labeling task instructions

While critically important, effective labeling instructions often require significant iteration and experimentation. To learn about best practices for creating high-quality instructions, see Create high-quality instructions for Amazon SageMaker Ground Truth labeling jobs. Our exercise focuses on identifying brand names in tweets. If there are no brand names in a specific tweet, the labeler has the option of indicating there are no brands in the tweet.

An example of labeling instructions is shown on the following screenshot.

Conclusion

In this post, I introduced Amazon SageMaker Ground Truth data labeling. I showed you how to gather a dataset, create a NER labeling job, select a workforce, create instructions, and launch the job. This is a small labeling job with only 10 tweets and should be completed within one hour by Mechanical Turk workers. Visit the AWS Management Console to get started.

As always, AWS welcomes feedback. Please submit comments or questions below.


About the Author

Vikram Madan is the Product Manager for Amazon SageMaker Ground Truth. He focusing on delivering products that make it easier to build machine learning solutions. In his spare time, he enjoys running long distances and watching documentaries.