Automated Data Labeling

Amazon SageMaker Ground Truth provides automated data labeling using machine learning. SageMaker Ground Truth will first select a random sample of data and send it to humans to be labeled. The results are then used to train a labeling model that attempts to label a new sample of raw data automatically. The labels are committed when the model can label the data with a confidence score that meets or exceeds a threshold you set. Where the confidence score falls below your threshold, the data is sent to human labelers. Some of the data labeled by humans is used to generate a new training dataset for the labeling model, and the model is automatically retrained to improve its accuracy. This process repeats with each sample of raw data to be labeled. The labeling model becomes more capable of automatically labeling raw data with each iteration, and less data is routed to humans. 
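The routing step described above can be sketched in a few lines of Python. This is only an illustration of the threshold rule, not Ground Truth's implementation: the `Prediction` class, the `route_predictions` function, and the sample scores are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    item_id: str
    label: str
    confidence: float  # labeling model's confidence score in [0, 1]


def route_predictions(predictions, threshold):
    """Split predictions into auto-committed labels and items sent to humans."""
    auto_labeled = {}
    needs_human = []
    for p in predictions:
        if p.confidence >= threshold:  # score meets or exceeds the threshold
            auto_labeled[p.item_id] = p.label
        else:  # score falls below the threshold
            needs_human.append(p.item_id)
    return auto_labeled, needs_human


# Hypothetical batch of model predictions.
batch = [
    Prediction("img-001", "dog", 0.97),
    Prediction("img-002", "cat", 0.62),
    Prediction("img-003", "dog", 0.90),
]
auto_labeled, needs_human = route_predictions(batch, threshold=0.90)
print(auto_labeled)  # {'img-001': 'dog', 'img-003': 'dog'}
print(needs_human)   # ['img-002']
```

In the full loop, some of the human-labeled items would then feed the next round of model training, so fewer items fall below the threshold over time.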

Flexibility in how you work with labeling professionals

Amazon SageMaker Ground Truth supports multiple choices for human labeling directly in the SageMaker Ground Truth Console. You can use your private team of labelers for in-house labeling jobs, especially for handling data that needs to stay within your organization.

If you want to scale up to a large number of labelers and your data does not contain confidential or personally identifiable information, you have access to an on-demand 24x7 workforce of over 500,000 independent contractors worldwide, powered by Amazon Mechanical Turk. Mechanical Turk is a crowdsourcing marketplace that connects your labeling jobs with a distributed workforce that can perform these tasks virtually.

Alternatively, you can use a third-party vendor who specializes in data labeling. These vendors have been screened by Amazon to provide high-quality labels and follow security processes. Labeling services from these vendors are provided through AWS Marketplace. All relevant details are provided including pricing and customer reviews to help you select the best vendor for your needs.

Easy instructions for human labeling

With Amazon SageMaker Ground Truth, you provide labeling guidance to human labelers to help ensure consistency. These detailed instructions are available to labelers within their labeling interface. The instructions include visual examples of good and bad labels to help labelers produce accurate, high-quality labels. You can update these instructions at any time, which makes it easy to add more detail to tasks that you see some labelers getting wrong or to adjust instructions as your needs evolve. A sample instruction is shown below.

SamurAI Instructions for Bounding Box

Use workflows to simplify labeling tasks

Amazon SageMaker Ground Truth provides built-in labeling workflows that take human labelers step-by-step through tasks and provide tools to help them produce good results. Built-in workflows are currently available for object detection, image classification, text classification, and semantic segmentation labeling jobs. 

In addition to the built-in workflows, SageMaker Ground Truth gives you the option to upload custom workflows. A custom workflow consists of an HTML interface and an accuracy improvement algorithm, both provided by you. The HTML interface provides the human labelers with all of the instructions and tools they need to complete the labeling task. The accuracy improvement algorithm is a function you write to tell SageMaker Ground Truth how it should assess the quality of labels that humans provide. The algorithm is used to find consensus on what is “right” when the same data is provided to multiple human labelers, as well as to identify and deemphasize labelers who tend to provide poor quality data. You upload both the HTML interface and the accuracy improvement algorithm using the SageMaker Ground Truth console. 
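As a rough illustration of what an accuracy improvement algorithm might do, the sketch below finds consensus with a weighted majority vote. It is a simplified stand-in, not the algorithm Ground Truth uses or the interface it expects: the function name, the input shapes, and the per-labeler weights are assumptions made for this example.

```python
from collections import defaultdict


def weighted_consensus(annotations, labeler_weights):
    """Pick the label with the largest total weight for one data item.

    annotations: list of (labeler_id, label) pairs for the same item.
    labeler_weights: dict mapping labeler_id to a weight; labelers with a
        history of poor labels get a lower weight, unknown labelers get 1.0.
    Returns the winning label and its share of the total weight.
    """
    totals = defaultdict(float)
    for labeler_id, label in annotations:
        totals[label] += labeler_weights.get(labeler_id, 1.0)
    winner = max(totals, key=totals.get)  # ties go to the label seen first
    share = totals[winner] / sum(totals.values())
    return winner, share


# Hypothetical case: two low-weight labelers disagree with one trusted labeler.
annotations = [("w1", "dog"), ("w2", "cat"), ("w3", "cat")]
weights = {"w1": 1.0, "w2": 0.3, "w3": 0.3}
label, share = weighted_consensus(annotations, weights)
print(label)  # dog
```

A plain majority vote would have picked "cat" here; the weights are what deemphasize the less reliable labelers.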

Object Detection

You can use the bounding box workflow to identify and label objects in images. A bounding box is a two-dimensional box drawn around one or more elements of an image. Computer vision models trained from images with labeled bounding boxes learn that the pixels within the box correspond to the specified label. Drawing bounding boxes is a fast and inexpensive way to label images. However, because the boxes often contain pixels unrelated to the subject of the label, a model can require larger amounts of training data before it reaches high accuracy.

The picture below shows the bounding box interface with an example task to identify all dogs in a given image. The interface allows you to specify clear examples of good and bad bounding boxes to help keep accuracy high. It also provides a link to the full set of labeling instructions and a clear, streamlined UI for creating bounding boxes. 

Bounding box

Image Classification

Image classification involves categorizing images against a pre-defined set of labels. The task differs from object detection because the entire image is labeled rather than individual elements within the image. Image classification is useful for scene detection models that need to consider the full context of the image. For example, in the image below, labelers are being asked to identify which sport is being played in a given image. 

Image classification

Text Classification

Text classification involves categorizing text strings against a pre-defined set of labels. Categorizing text into different labels is often used for natural language processing (NLP) models that identify things like topics (e.g., product descriptions, movie reviews), entities (e.g., names, places, dates), and sentiment. 

Text classification

Semantic Segmentation

For advanced labeling of images, you can use semantic segmentation to label the exact parts of an image that correspond to what your model needs to learn. Semantic segmentation requires more time and skill than bounding boxes. However, it provides very clean training data by labeling only the pixels associated with the subject. For example, the irregular shape of a car in an image could be captured exactly with semantic segmentation, whereas a bounding box would inevitably include background elements unrelated to the car because the box can only have four straight sides.
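The difference between the two label types can be made concrete with a small calculation. The sketch below takes a toy segmentation mask (1 marks a subject pixel), finds the tightest bounding box around it, and measures how much of that box is background. The mask and the function names are invented for illustration.

```python
def bounding_box(mask):
    """Return (top, left, bottom, right), inclusive, of the tightest box
    around all pixels marked 1 in a 2D list-of-lists mask."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], cols[0], rows[-1], cols[-1]


def background_fraction(mask):
    """Fraction of pixels inside the bounding box that are not the subject."""
    top, left, bottom, right = bounding_box(mask)
    box_area = (bottom - top + 1) * (right - left + 1)
    subject_pixels = sum(sum(row) for row in mask)
    return (box_area - subject_pixels) / box_area


# Toy mask of an irregular shape.
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
print(bounding_box(mask))         # (1, 1, 3, 4)
print(background_fraction(mask))  # one third of the box is background
```

The segmentation mask labels only the 8 subject pixels, while the 12-pixel box also labels 4 background pixels as the subject.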

Semantic Segmentation

Seamless integration into Amazon SageMaker

Training datasets created with SageMaker Ground Truth can be easily imported into Amazon SageMaker for use in model development and training. 
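As a sketch of what consuming a labeled dataset might look like, the example below parses a JSON Lines manifest in which each line pairs a data object with its label. The field names used here (`source-ref`, a label attribute named `sport`, and a `sport-metadata` object) and the S3 paths are assumptions for illustration; check the output of your own labeling job for the actual attribute names.

```python
import json

# Hypothetical manifest content: one JSON object per line.
manifest_text = """\
{"source-ref": "s3://example-bucket/images/001.jpg", "sport": 0, "sport-metadata": {"class-name": "soccer", "confidence": 0.94, "human-annotated": "yes"}}
{"source-ref": "s3://example-bucket/images/002.jpg", "sport": 1, "sport-metadata": {"class-name": "tennis", "confidence": 0.88, "human-annotated": "no"}}
"""


def load_labels(text, label_attribute):
    """Return (source, class_name, labeled_by_human) tuples from the manifest."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        entry = json.loads(line)
        meta = entry.get(f"{label_attribute}-metadata", {})
        records.append(
            (
                entry["source-ref"],
                meta.get("class-name"),
                meta.get("human-annotated") == "yes",
            )
        )
    return records


records = load_labels(manifest_text, "sport")
for source, class_name, by_human in records:
    print(source, class_name, by_human)
```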

Amazon SageMaker makes it easy to build machine learning models and get them ready for training by providing everything you need to label your training data quickly and to select and optimize the best algorithm and framework for your application. Amazon SageMaker includes hosted Jupyter notebooks that make it easy to explore and visualize your training data stored in Amazon S3. You can connect directly to data in S3, or use AWS Glue to move data from Amazon RDS, Amazon DynamoDB, and Amazon Redshift into S3 for analysis in your notebook.

To help you select your algorithm, Amazon SageMaker includes the most common machine learning algorithms, which have been pre-installed and optimized to deliver up to 10 times the performance you’ll find running these algorithms anywhere else. Amazon SageMaker also comes pre-configured to run TensorFlow, Apache MXNet, PyTorch, and Chainer in Docker containers. You can also download these open source containers to your local environment and use the Amazon SageMaker Python SDK to test your scripts in Local Mode before using Amazon SageMaker for training or hosting your model in production. You also have the option of using your own framework.

You can begin training your model with a single click in the Amazon SageMaker console. Amazon SageMaker manages all of the underlying infrastructure for you and can easily scale to train models at petabyte scale. To make the training process even faster and easier, Amazon SageMaker can automatically tune your model to achieve the highest possible accuracy.

Once your model is trained and tuned, Amazon SageMaker makes it easy to deploy in production so you can start generating predictions (a process called inference) for real-time or batch data. Amazon SageMaker deploys your model on auto-scaling clusters of Amazon SageMaker ML instances that are spread across multiple availability zones to deliver both high performance and high availability. Amazon SageMaker also includes built-in A/B testing capabilities to help you test your model and experiment with different versions to achieve the best results.

Amazon SageMaker takes away the heavy lifting of machine learning, so you can build, train, and deploy machine learning models quickly and easily.

Learn more about Amazon SageMaker Ground Truth Pricing

Get started with Amazon SageMaker Ground Truth with no upfront commitments or long-term contracts. For more details, check out the Amazon SageMaker Ground Truth pricing page.
