Amazon SageMaker Ground Truth Plus

General

Q: What is Amazon SageMaker Ground Truth Plus?

Amazon SageMaker Ground Truth Plus allows you to easily create high-quality training datasets without having to build labeling applications or manage labeling workforces on your own. Once you provide data along with labeling requirements, SageMaker Ground Truth Plus sets up the data labeling workflows and manages them on your behalf, in accordance with your requirements. From there, an expert workforce that is trained on a variety of machine learning (ML) tasks performs the data labeling. Ground Truth Plus uses ML techniques, including active learning, pre-labeling, and machine validation, which increase the quality of the output dataset and decrease data labeling costs. Ground Truth Plus also provides transparency into your data labeling operations and quality management: you can review the progress of training datasets across multiple projects, track project metrics such as daily throughput, inspect labels for quality, and provide feedback on the labeled data. Ground Truth Plus can be used for a variety of use cases, including computer vision, natural language processing, and speech recognition.

Q: Why should I use Amazon SageMaker Ground Truth Plus?

To train a machine learning (ML) model, data scientists need large, high-quality, labeled datasets. As ML adoption grows, labeling needs increase. This forces data scientists to spend weeks building data labeling workflows and managing a data labeling workforce, which slows down innovation and increases cost. So that they can spend their time building, training, and deploying ML models, data scientists typically task other in-house teams, consisting of data operations managers and program managers, with producing high-quality training datasets. However, these teams typically don't have access to the skills required to deliver high-quality training datasets, which affects ML results.

Amazon SageMaker Ground Truth Plus makes it easy for data scientists as well as business managers, such as data operations managers and program managers, to create high-quality training datasets by removing the undifferentiated heavy lifting associated with building data labeling applications and managing the labeling workforce. All you do is share data along with labeling requirements and Ground Truth Plus sets up and manages your data labeling workflow, based on these requirements. From there, an expert workforce that is trained on a variety of ML tasks performs data labeling. You don't even need deep ML expertise or knowledge of workflow design and quality management to use Ground Truth Plus.

Q: How do I get started with Amazon SageMaker Ground Truth Plus?

To get started with Amazon SageMaker Ground Truth Plus, please complete the project requirement form. Our team will reach out to you to discuss your data labeling project.

Q: How does Amazon SageMaker Ground Truth Plus help me manage my training datasets?

Amazon SageMaker Ground Truth Plus provides increased transparency into data labeling operations and quality management. For example, SageMaker Ground Truth Plus provides a project view, which you can use to monitor progress of the training dataset across different projects. In addition, a real-time metrics dashboard allows you to track detailed project metrics, including daily throughput. SageMaker Ground Truth Plus also provides a user interface that allows you to inspect labels for quality, and provide real-time feedback. Finally, with streaming mode, you can get same-day or same-hour label turnaround for certain types of workloads.
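
To make the daily throughput metric mentioned above concrete, here is a minimal sketch that counts completed labels per calendar day. The function name and the timestamp list are hypothetical; this only illustrates the idea behind the metric and is not how the dashboard is implemented.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, List


def daily_throughput(completed_at: List[str]) -> Dict[str, int]:
    """Count completed labels per calendar day from ISO 8601 timestamps."""
    counts = Counter(
        datetime.fromisoformat(ts).date().isoformat() for ts in completed_at
    )
    # Sort by date so the result reads chronologically.
    return dict(sorted(counts.items()))
```

For example, two labels completed on one day and one on the next would yield a count of 2 for the first date and 1 for the second.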

Q: How does Amazon SageMaker Ground Truth Plus help with increasing the accuracy of my training datasets?

Ground Truth Plus uses multiple techniques to increase the accuracy of the training dataset:

  • ML Techniques: Ground Truth Plus uses ML techniques, including active learning, pre-labeling, and machine validation, which increase the quality of the output dataset and decrease data labeling costs. The multi-step labeling workflow uses active learning models to select which items to label, which reduces cost, and pre-labeling models to label the selected data in advance, which reduces human effort. Machine validation then identifies potential errors, which are sent for an additional round of human review. Catching human errors in this way significantly improves label quality.
  • Intuitive Labeling Interface: Ground Truth Plus provides assistive labeling features such as (1) snapping, which snaps an imperfect 3D cuboid so that it tightly covers the enclosed object, and (2) auto-segmentation, which completes an object mask from just four extreme-point clicks.
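
As a rough illustration of the machine validation idea described above, the sketch below flags a human label for a second review when a model disagrees with it with high confidence. The function name, the 0.9 threshold, and the data layout are assumptions made for illustration; they are not the service's actual implementation.

```python
from typing import Dict, List, Tuple


def flag_for_review(
    human_labels: Dict[str, str],
    model_predictions: Dict[str, Tuple[str, float]],
    min_confidence: float = 0.9,  # hypothetical threshold
) -> List[str]:
    """Return item IDs whose human label conflicts with a confident model prediction."""
    flagged = []
    for item_id, human_label in human_labels.items():
        prediction = model_predictions.get(item_id)
        if prediction is None:
            continue  # the model has no opinion on this item
        model_label, confidence = prediction
        if model_label != human_label and confidence >= min_confidence:
            flagged.append(item_id)
    return flagged
```

Items returned by this function would go back to human reviewers; everything else is accepted as labeled.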

Q: What is the difference between SageMaker Ground Truth and SageMaker Ground Truth Plus?

SageMaker Ground Truth Plus is a fully managed turnkey service in which AWS experts set up and manage your workflows and an external workforce of data labelers. It comes with a guaranteed SLA on quality and label delivery timelines, as well as custom pricing. SageMaker Ground Truth is a self-serve option in which customers set up their own workflows, choose from prebuilt labeling UIs or develop their own, and manage their own internal workforce. They can also source the workforce from Mechanical Turk or a vendor in the AWS Marketplace. Pricing in SageMaker Ground Truth follows the public pricing schedule.

Data Privacy

Q: How does Amazon SageMaker Ground Truth Plus help protect and secure my data?

By default, Amazon SageMaker Ground Truth Plus encrypts data stored in an Amazon S3 bucket at rest and in transit. In addition, access to your data is controlled using AWS Identity and Access Management (IAM). Your data is stored in an independent AWS account, and an Amazon S3 bucket is created for your project. Amazon SageMaker Ground Truth Plus does not store or make copies of your data outside the AWS environment created for you. AWS logs and audits all access to your data using Amazon S3 access logging and AWS CloudTrail.

Q: Who has access to my content that is processed and stored by Amazon SageMaker Ground Truth Plus?

Authorized AWS employees and the expert workforce labeling your data will have access to your content processed by Amazon SageMaker Ground Truth Plus. The expert workforce views and labels your data through the secure SageMaker Ground Truth worker portal. Access through the worker portal allows workers only to view and label the data; they can’t modify or delete it. Your trust, your privacy, and the security of your content are our highest priorities. We implement appropriate technical and physical controls, including encryption at rest and in transit, designed to prevent unauthorized access to, or disclosure of, your content.

Q: Are data (images, text files, videos, etc.) inputs processed by Amazon SageMaker Ground Truth Plus stored, and how are they used by AWS?

Amazon SageMaker Ground Truth Plus stores the raw and processed content only for the duration of your projects and will delete content associated with your data labeling project upon request. Amazon SageMaker Ground Truth Plus uses your content solely to provide and maintain the service. Amazon SageMaker Ground Truth Plus never uses your content or any model trained on that content for the benefit of other customers.

Q: Is the content processed by Amazon SageMaker Ground Truth Plus moved outside the AWS region where I am using Amazon SageMaker Ground Truth Plus?

Any content processed by Amazon SageMaker Ground Truth Plus is encrypted and stored at rest in the AWS region where you are using Amazon SageMaker Ground Truth Plus. Unless you specify otherwise in any data localization requirements mutually agreed upon through a statement of work, your content may be accessed outside the AWS region your content is stored in to perform the labeling service.

Q: Can I request deletion of data (images, text files, videos, etc.) stored by Amazon SageMaker Ground Truth Plus?

Yes. You can request deletion of raw and processed data inputs associated with your data labeling project by contacting AWS Support.

Q: Do I still own my content that is processed and stored by Amazon SageMaker Ground Truth Plus?

Yes. You always retain ownership of your content, and we will only use your content with your consent.

Q: Can I process protected health information (PHI) data through Amazon SageMaker Ground Truth Plus?

No. Currently, Amazon SageMaker Ground Truth Plus is not a HIPAA-eligible service.

Workforce

Q: What is an expert workforce in Amazon SageMaker Ground Truth Plus?

With Ground Truth Plus, labeling is done by a highly skilled, diverse, and elastic workforce that is trained on machine learning tasks and can help meet a wide variety of your needs, including data security, privacy, and compliance. The workforce consists of two tiers: (1) Amazon workforce, which consists of workers who are employed and managed by Amazon, with Amazon owning the operations, quality, and turnaround time SLAs on your behalf; and (2) Vendor workforce, which consists of workers provided by a curated list of third-party vendors that specialize in data labeling services, with Amazon owning the quality and turnaround time SLAs on your behalf.

Q: Who decides which workforce tier will be used for my Amazon SageMaker Ground Truth Plus project?

You can decide the type of workforce to be used for your project. Unless you instruct us to use a specific workforce, we may use Amazon workforce, Vendor workforce or a combination of both workforces to help meet your project’s quality, turnaround time and security requirements.

Q: What changes is the vendor workforce implementing in light of COVID-19 that I need to be aware of?

In light of COVID-19, some service providers have implemented a remote work policy for the health and safety of their employees.

Q: What security standards is a vendor workforce required to meet?

Service providers are required to undergo a SOC 2 audit or obtain ISO 27001 certification on an annual basis, performed by an independent third-party auditor.

The SOC 2 report is a description of the service provider’s control environment based on the American Institute of Certified Public Accountants (AICPA) Trust Services Criteria - Security, Availability, Processing Integrity, Confidentiality, and Privacy.

The ISO 27001 certification is based on the ISO/IEC 27001 standard, published by the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC), which details requirements for establishing, implementing, maintaining, and continually improving an information security management system (ISMS).

In addition to independently obtaining SOC 2 or ISO 27001, service providers are required to maintain additional security controls, which are described below, to help keep your data secure.

Technology Controls:
Service providers are required to utilize the appropriate software to block any attempts made to download or copy files/data from their system and prevent unauthorized access to their systems. Service providers are also required to prohibit their workforce from storing or copying customer task-related data.

Network Security Controls:
We require the service provider’s network to be designed to prevent remote access to customer's task-related data. Further, peer-to-peer file sharing software is blocked on the provider's network, and the firewall should be designed in a way to provide high availability.

Employee Controls:
Service providers are required to ensure they have Non-Disclosure Agreements (NDAs) with their employees. Service providers are required to adopt stringent policies to prevent any information leakage and prevent employees from transmitting information by any means: paper, USBs, mobile phones, or any other media.

Physical Access Controls:
Service providers are required to maintain physical access control measures to prevent unauthorized access to their production site. These may include turnstiles with biometric authentication, employee badge identification, etc.

Q: How does AWS help a vendor workforce meet these security standards?

AWS requests that service providers furnish their SOC 2 or ISO 27001 certification reports before becoming a part of Amazon SageMaker Ground Truth Plus’s vendor workforce. AWS SOC reports and ISO certifications do not cover the vendor workforce.

Amazon SageMaker Ground Truth

General

Q: What is Amazon SageMaker Ground Truth?

Amazon SageMaker Ground Truth makes it easy for you to efficiently and accurately label the datasets required for training machine learning systems. SageMaker Ground Truth can automatically label a portion of the dataset based on the labels created manually by human labelers. You can choose to use a crowdsourced Amazon Mechanical Turk workforce of over 500,000 labelers, your own employees, or one of the third-party data labeling service providers listed on AWS Marketplace, pre-screened by Amazon. SageMaker Ground Truth uses innovative algorithms and user experience (UX) techniques to improve the accuracy of human labeling. Over time, the model gets progressively better by continuously learning from the labels created by humans, which increases the share of data that can be labeled automatically.

Q: What is Automated Data Labeling?

Automated data labeling is labeling of data using machine learning. Amazon SageMaker Ground Truth will first select a random sample of data and send it to humans to be labeled. The results are then used to train a labeling model that attempts to label a new sample of raw data automatically. The labels are committed when the model can label the data with a confidence score that meets or exceeds a high threshold. Where the confidence score falls below this threshold, the data is sent to human labelers. Some of the data labeled by humans is used to generate a new training dataset for the labeling model, and the model is automatically retrained to improve its accuracy. This process repeats with each sample of raw data to be labeled. The labeling model becomes more capable of automatically labeling raw data with each iteration, and less data is routed to humans.
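
The routing step in the loop described above can be sketched as follows. This is a minimal illustration under stated assumptions: `route_batch`, the `predict` callable, and the 0.95 threshold are hypothetical names and values, not the service's actual algorithm or settings.

```python
from typing import Callable, Dict, List, Tuple


def route_batch(
    items: List[str],
    predict: Callable[[str], Tuple[str, float]],
    threshold: float = 0.95,  # hypothetical confidence threshold
) -> Tuple[Dict[str, str], List[str]]:
    """Split a batch into auto-labeled items and items that need human labelers."""
    auto_labeled: Dict[str, str] = {}
    needs_human: List[str] = []
    for item in items:
        label, confidence = predict(item)
        if confidence >= threshold:
            auto_labeled[item] = label  # commit the model's label
        else:
            needs_human.append(item)  # route to human labelers
    return auto_labeled, needs_human
```

In the full loop, the human-labeled items would feed back into retraining the labeling model before the next batch is routed.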

Using Amazon SageMaker Ground Truth

Q: Why should I use Amazon SageMaker Ground Truth?

Prior to building, training, and deploying machine learning models, you need data. Successful models are built on high-quality training data, and collecting and labeling the training datasets involves a lot of time and effort. To build the training datasets, human labelers need to evaluate a large number of images or other data types, and then identify and label particular objects in each data type. These labeling tasks are distributed across many human labelers, adding significant overhead and cost. If there are incorrect labels, the system will learn from the bad information and make inaccurate predictions.

Amazon SageMaker Ground Truth solves this problem by making it easy to efficiently perform highly accurate data labeling using data stored in Amazon S3, using a combination of automated data labeling and human-performed labeling.

Q: How do I get started with Amazon SageMaker Ground Truth?

Amazon SageMaker Ground Truth provides a managed experience where you can set up an entire data labeling job in just a few steps. To get started, sign in to the AWS Management Console and navigate to the SageMaker console. From there, select Labeling jobs under Ground Truth to create a labeling job. First, as part of the labeling job creation flow, you provide a pointer to the S3 bucket that contains the dataset to be labeled. Ground Truth offers templates for common labeling tasks, where you only need to make a few choices and provide minimal instructions on how your data should be labeled. Alternatively, you can create your own custom template. As the last step of creating a labeling job, you select one of the three human workforce options: (1) a public crowdsourced workforce, (2) a curated set of third-party data labeling service providers, or (3) your own workers. You also have the option to enable automated data labeling.
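
Ground Truth reads the dataset through an input manifest, a JSON Lines file in which each line references one data object (for S3 objects, under the `source-ref` key). The sketch below builds such a manifest as text. The helper name, bucket name, and object keys are hypothetical and used only for illustration.

```python
import json
from typing import Iterable


def build_manifest(s3_uris: Iterable[str]) -> str:
    """Return JSON Lines text with one {"source-ref": ...} object per data object."""
    return "".join(json.dumps({"source-ref": uri}) + "\n" for uri in s3_uris)


# Hypothetical bucket and keys, for illustration only.
manifest_text = build_manifest(
    [
        "s3://example-bucket/images/1.jpg",
        "s3://example-bucket/images/2.jpg",
    ]
)
```

The resulting text would be uploaded to S3 and referenced when the labeling job is created.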

Q: How are my training datasets managed using Amazon SageMaker Ground Truth?

Amazon SageMaker Ground Truth manages the metadata, associated labels, and a taxonomy of your labels and datasets. You can easily use the AWS SDK through a SageMaker Notebook or the Ground Truth console within the SageMaker console to query and manage your datasets and labels. Visit the Amazon SageMaker Ground Truth documentation for more information.

Q: How does Amazon SageMaker Ground Truth help with increasing the accuracy of my training datasets?

Amazon SageMaker Ground Truth offers the following features to help you increase the accuracy of data labeling performed by humans:

(a) Annotation consolidation: This counteracts the error and bias of individual workers by sending each data object to multiple workers and then consolidating their responses (called “annotations”) into a single label. An annotation consolidation algorithm compares the annotations: it first detects and disregards outlier annotations, and then performs a weighted consolidation of the remaining annotations, assigning higher weights to more reliable ones. The output is a single label for each object.
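
As a simplified stand-in for the weighted consolidation step, the sketch below uses a weighted majority vote over categorical labels. It is not the consolidation algorithm Ground Truth uses, and outlier detection is not shown; the function name and the per-worker reliability weights are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def consolidate(
    annotations: List[Tuple[str, str]],
    worker_weights: Dict[str, float],
    default_weight: float = 1.0,
) -> str:
    """Weighted majority vote over (worker_id, label) pairs for one data object."""
    if not annotations:
        raise ValueError("at least one annotation is required")
    totals: Dict[str, float] = defaultdict(float)
    for worker_id, label in annotations:
        totals[label] += worker_weights.get(worker_id, default_weight)
    # Highest total weight wins; ties are broken alphabetically for determinism.
    return min(totals, key=lambda label: (-totals[label], label))
```

With equal weights this reduces to a plain majority vote; a single highly reliable worker can outweigh two less reliable ones.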

(b) Annotation interface best practices: These are features of the annotation interfaces that enable workers to perform their tasks more accurately. Human workers are prone to error and bias, and well-designed interfaces improve worker accuracy. One best practice is to display brief instructions along with good and bad label examples in a fixed side panel. Another best practice is to darken the area outside the bounding box boundary while workers are drawing the bounding box on an image.

Q: How does Amazon SageMaker Ground Truth ensure that my data is protected and secure?

By default, Amazon SageMaker Ground Truth encrypts your data at rest and in transit. In addition, access to your data can be controlled using AWS Identity and Access Management (IAM). Ground Truth does not store or make copies of your data outside of your AWS environment, and your data remains in your control. Further, Ground Truth supports compliance standards such as the General Data Protection Regulation (GDPR), and provides comprehensive logging and auditing capabilities using Amazon CloudWatch and AWS CloudTrail. Visit the Amazon SageMaker Ground Truth documentation for more information.
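
As one hedged example of controlling data access with IAM, the sketch below builds a policy document that grants read-only access to a single S3 prefix. The function name, bucket name, and prefix are hypothetical, and a real labeling job role would typically need additional permissions (for example, to write output), so treat this only as an illustration of scoping access.

```python
def read_only_policy(bucket: str, prefix: str) -> dict:
    """Build an IAM policy document allowing read-only access to one S3 prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListDatasetPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                # Restrict listing to keys under the dataset prefix.
                "Condition": {"StringLike": {"s3:prefix": f"{prefix}/*"}},
            },
            {
                "Sid": "ReadDatasetObjects",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


# Hypothetical bucket and prefix, for illustration only.
policy = read_only_policy("example-labeling-bucket", "raw-images")
```

The dictionary can be serialized with `json.dumps` and attached to a role or user through the IAM console or API.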

Q: How do I access a human workforce using Amazon SageMaker Ground Truth?

From SageMaker Ground Truth, you can choose any of three workforce options: (1) a public crowdsourced workforce through Amazon Mechanical Turk; (2) third-party data labeling service providers available through AWS Marketplace; or (3) your own employees. Visit the Amazon SageMaker Ground Truth documentation for more information.

Using Third Party Data Labeling Service Providers

Q: Can Amazon SageMaker Ground Truth data labeling service providers process confidential data?

Yes, Amazon SageMaker Ground Truth data labeling service providers can process confidential data. The Standard Service Agreement between AWS customers and the third party data labeling service provider contains some basic protections for your confidential information. Please review those terms before sharing any confidential information with the service provider. The terms are located on the listing page for the service provider on AWS Marketplace.

Q: I am working with a third-party service provider through AWS Marketplace. What changes are service providers implementing in light of COVID-19 that I need to be aware of?

In light of the rapidly evolving impact of COVID-19, some service providers have temporarily implemented a remote work policy for the health and safety of their employees. During this time, security standards, including SOC 2 compliance and the additional security controls outlined in the FAQ below, may not be applicable to the affected service providers. Affected service providers have updated their AWS Marketplace listings to reflect this, and will not process customer data from remote work environments without explicit customer consent.

Q: What security standards are Amazon SageMaker Ground Truth data labeling service providers required to meet?

Data labeling service providers are required to complete a SOC 2 audit on an annual basis. The SOC 2 report is a description of the service provider’s control environment based on the American Institute of Certified Public Accountants (AICPA) Trust Services Criteria - Security, Availability, Processing Integrity, Confidentiality, and Privacy.

In addition to SOC 2, service providers are required to maintain these additional security controls to help keep customer data secure.

Technology Controls:
Service providers are required to utilize the appropriate software to block any attempts made to download or copy files/data from their system and prevent unauthorized access to their systems. Service providers are also required to prohibit their workforce from storing or copying customer task-related data.

Network Security Controls:
We require the service provider’s network to be designed to prevent remote access to customer's task-related data. Further, peer-to-peer file sharing software is blocked on the provider's network, and the firewall should be designed in a way to provide high availability.

Employee Controls:
Service providers are required to ensure they have Non-Disclosure Agreements (NDAs) with their employees. Service providers are required to adopt stringent policies to prevent any information leakage and prevent employees from transmitting information by any means: paper, USBs, mobile phones, or any other media.

Physical Access Controls:
Service providers are required to maintain physical access control measures to prevent unauthorized access to their production site. These may include turnstiles with biometric authentication, employee badge identification, etc.

Q: How does AWS help ensure service providers meet these security standards?

AWS requests that service providers furnish their SOC 2 certification reports prior to being listed in the marketplace and confirms:

Authenticity (whether the service provider's auditor is certified by the AICPA);

Report period (SOC 2 certification validity date); and

Production site (the physical site where the service provider workforce will work on Amazon SageMaker Ground Truth labeling tasks).

Q: What is the frequency of review of service provider security standards?

The security standards from every service provider are reviewed annually to ensure they meet the mandatory requirements.

Q: Are there any exceptions to the AWS review?

No. If the service provider fails to meet security standards, then their listing will be removed from AWS Marketplace. De-listing will be completed within 24 hours and all active customers will be notified by email.

Q: If the service provider offers data labeling services through multiple production sites, do all sites need to go through the review process?

Yes, all sites need to meet the required security standards.

Q: What happens if there is a data breach at the service provider production site?

The service provider will inform AWS and affected customers within 24 hours of detecting any actual or suspected unauthorized access, collection, acquisition, use, transmission, disclosure, corruption, or loss of customer information. The service provider will remedy each security incident promptly and provide AWS and affected customers written details about the internal investigation.

Pricing and Availability

Q: How much does Amazon SageMaker Ground Truth cost?

Please see the SageMaker Ground Truth pricing page for the current pricing information.

Q: In which AWS regions is Amazon SageMaker Ground Truth available?

The AWS Region Table lists all the AWS regions where Amazon SageMaker Ground Truth is currently available.

Synthetic data generation

Q: How can I generate labeled synthetic data?

Amazon SageMaker Ground Truth can generate labeled synthetic data on your behalf. You specify your synthetic image requirements or provide 3D assets and baseline images, such as computer-aided design (CAD) images, and AWS digital artists create images from scratch or use customer-provided assets. The generated images imitate pose and placement of objects, include object or scene variations, and optionally add specific inclusions, such as scratches, dents, and other alterations, eliminating the time-consuming process of collecting data or the need to damage parts to acquire images. SageMaker Ground Truth can generate hundreds of thousands of synthetic images that are automatically labeled with high accuracy.

Q: Why should I use labeled synthetic data?

Sourcing data for training machine learning (ML) models takes significant time and effort. For some types of data, such as rare or highly variable scenarios, data gathering can be expensive or even impossible. For example, identifying manufacturing defects requires a large quantity of images. In addition, ML models need to be trained to recognize scenarios that do not frequently occur, such as rare defects. To identify rare defects, ML models need images of defects; however, because these events occur infrequently, this data is often manually created, which can require damage to expensive parts. Finally, images need to be manually labeled.

Using SageMaker Ground Truth, you can generate synthetic data that is automatically labeled, reducing the time and expense involved in gathering and labeling training data. You can then use synthetic data to train ML models across a wide range of computer vision use cases, such as object, anomaly, and defect detection.

Q: How does SageMaker Ground Truth generate labeled synthetic data?

There is a three-step process to generate labeled synthetic data. First, you provide 3D assets, baseline images, and/or image requirements. Second, digital artists convert those inputs to 3D assets, adding inclusions such as scratches, dents, and textures. Third, SageMaker Ground Truth generates synthetic images and automatically labels them.

Q: Can I use SageMaker Ground Truth to generate labeled synthetic data if I don’t have images or 3D assets?

Yes, there is a 3D asset library of more than 1 million objects that can be used to support the creation of synthetic data on your behalf. Alternatively, you can use a small set of pre-labeled images to create new synthetic datasets. If you have background images or examples of the data you need, that can expedite the creation of highly accurate synthetic data.

Generative AI

Q: How can I use Amazon SageMaker Ground Truth Plus to build my generative AI applications?

SageMaker Ground Truth Plus helps you generate high-quality datasets to customize and align foundation models with human preferences. Amazon SageMaker Ground Truth generates two types of labeled datasets: demonstration data and preference data.

In demonstration data, a data annotator completes a task (such as writing questions and answers, or summarizing text) that simulates and demonstrates how a model would interact with a human. The labeled dataset is then used to fine-tune the model in a process known as supervised fine-tuning (SFT).

In preference data, a human annotator gives direct feedback and guidance on content that a model has generated, or on simulated model data. For example, the annotator might rank text responses from a large language model according to specific dimensions such as accuracy, relevance, or writing clarity. One fine-tuning method that uses preference data is called reinforcement learning from human feedback (RLHF).
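
A ranking produced by an annotator is often expanded into pairwise comparisons before it is used for preference-based fine-tuning. The sketch below shows one way to do that. The function name and the `prompt`, `chosen`, and `rejected` field names are assumptions borrowed from common practice, not a format defined by Ground Truth Plus.

```python
from itertools import combinations
from typing import Dict, List


def ranking_to_pairs(prompt: str, ranked_responses: List[str]) -> List[Dict[str, str]]:
    """Expand a best-to-worst ranking into (chosen, rejected) preference pairs."""
    # combinations() preserves input order, so the first element of each
    # pair is always ranked higher than the second.
    return [
        {"prompt": prompt, "chosen": better, "rejected": worse}
        for better, worse in combinations(ranked_responses, 2)
    ]
```

A ranking of n responses yields n(n-1)/2 pairs, so three ranked responses produce three preference pairs.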

Q: What generative AI use cases can Amazon SageMaker Ground Truth Plus support?

Amazon SageMaker Ground Truth Plus allows you to generate datasets for large language models (LLMs), text-to-image models, and text-to-video models. For large language models, data annotators can create demonstration datasets for supervised fine-tuning, including question and answer pairs, text summaries, text reworked for red-teaming, or text with a changed style and voice. Annotators can also create preference datasets for RLHF by ranking LLM responses to ensure chatbots are aligned with human preferences. For text-to-image and text-to-video models, data annotators can create rich caption datasets. These datasets are then used to train models to generate images and videos that are more closely aligned with the user’s original text input. Data annotators can also generate preference datasets containing images and videos that are ranked along customer-specified dimensions, such as specific aesthetic attributes. You can also request a new task type not already covered, and our team will work with you to create a workflow that meets your needs.

Q: Why is human feedback important for foundation models?

Humans are typically both the requester and consumer of content in generative AI applications. It is therefore critical that humans teach foundation models how to respond correctly according to users’ prompts. By fine-tuning and customizing models with labeled data, data annotators can simulate the style, length, and accuracy of how a model should interact with users. For example, to create a chatbot, data annotators teach the model how to respond to questions and provide answers by training it on human-written questions and answers. Data annotators also rank the different chatbot responses based on their alignment with human preferences to teach the model how to write according to human intent and values, which can be done through reinforcement learning from human feedback (RLHF).
 
