How can I troubleshoot Amazon SageMaker Ground Truth labeling errors?


I want to troubleshoot Amazon SageMaker Ground Truth labeling errors. -or- My SageMaker workers are idle. -or- It's taking a long time for tasks to show up for my SageMaker workers.

Resolution

SageMaker Ground Truth first sends a batch of 10 tasks to your SageMaker workers for annotation. This batch verifies that the labeling job is correctly configured. Then, Ground Truth sends larger batches of tasks to workers based on the MaxConcurrentTaskCount value.

MaxConcurrentTaskCount defines the maximum number of data objects that can be labeled by human workers at the same time. If you use the console, this parameter is set to 1,000. If you use CreateLabelingJob, you can set this parameter to any integer between 1 and 1,000, inclusive.

After Ground Truth receives the labels, it processes the labels with a consolidation AWS Lambda function. With this function, the final annotations are written to the manifest file or Amazon Simple Notification Service (Amazon SNS) output. Then, Ground Truth loops back to read another batch of tasks based on the MaxConcurrentTaskCount value from the input manifest file or Amazon SNS topic.
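The batch flow described above can be sketched in a few lines of Python. This is a simplified model for reasoning about how many batches a job needs, not Ground Truth's actual scheduler. The function name plan_batches and its parameters are hypothetical.

```python
from typing import List


def plan_batches(total_objects: int,
                 max_concurrent_task_count: int,
                 first_batch_size: int = 10) -> List[int]:
    """Return the batch sizes in the order they would be sent to workers.

    Simplified model: one small validation batch, then batches capped at
    MaxConcurrentTaskCount until every data object has been sent.
    """
    if total_objects < 0:
        raise ValueError("total_objects must not be negative")
    if not 1 <= max_concurrent_task_count <= 1000:
        raise ValueError("MaxConcurrentTaskCount must be between 1 and 1,000")

    batches: List[int] = []
    remaining = total_objects

    # The first batch checks that the labeling job is configured correctly.
    first = min(first_batch_size, remaining)
    if first:
        batches.append(first)
        remaining -= first

    # Later batches are limited by MaxConcurrentTaskCount.
    while remaining > 0:
        size = min(max_concurrent_task_count, remaining)
        batches.append(size)
        remaining -= size
    return batches


print(plan_batches(2500, 1000))  # [10, 1000, 1000, 490]
```

Because each batch must be consolidated before the next one is read, a smaller MaxConcurrentTaskCount means more batches but a shorter wait for each one.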

Troubleshoot task latency and idle workers

  • Be sure that the MaxConcurrentTaskCount value is set to a size that enables workers to complete the entire batch within the given TaskAvailabilityLifetimeInSeconds. The maximum value for this parameter is 1,000.
  • Be sure that NumberOfHumanWorkersPerDataObject is set to a value that fits your use case. For example, if the number of workers per object to label is set to 3, then each object needs to be labeled by three workers. If two of the workers finish the current batch, the next batch isn't assigned until the third worker has finished their batch. If a private worker notices that a job disappears from the portal, the worker might have finished one batch and is idle while they wait for a new batch to be available.
  • Be sure that the TaskAvailabilityLifetimeInSeconds is set to a value that fits your use case. This value represents the total time that the tasks are available to the workers. The maximum value that you can set for this parameter is 864,000 seconds (10 days). It's a best practice to split your input dataset into multiple jobs and point them to the same work team under either of the following conditions:
      • The number of objects in the labeling job is high.
      • Your job failed because the wait time exceeded the TaskAvailabilityLifetimeInSeconds value.
  • Be sure that TaskTimeLimitInSeconds is set to a value that fits your use case. If you need to control the time taken by workers to complete a task to make sure that tasks are annotated and the next batch is assigned, consider setting an appropriate value for this time limit.
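The checks in the preceding list can be run against a HumanTaskConfig before you call CreateLabelingJob. The following is a minimal sketch that uses the limits stated in this article. The function check_human_task_config, the active_workers and seconds_per_task inputs, and the time estimate are assumptions for illustration. The estimate assumes that all workers label in parallel at the same pace.

```python
import math
from typing import Dict, List

MAX_CONCURRENT_TASK_COUNT = 1_000        # limit stated in this article
MAX_TASK_AVAILABILITY_SECONDS = 864_000  # 10 days


def check_human_task_config(config: Dict[str, int],
                            active_workers: int,
                            seconds_per_task: int) -> List[str]:
    """Return warnings for settings that can leave workers idle."""
    warnings: List[str] = []
    batch_size = config["MaxConcurrentTaskCount"]
    workers_per_object = config["NumberOfHumanWorkersPerDataObject"]
    lifetime = config["TaskAvailabilityLifetimeInSeconds"]
    time_limit = config["TaskTimeLimitInSeconds"]

    if not 1 <= batch_size <= MAX_CONCURRENT_TASK_COUNT:
        warnings.append("MaxConcurrentTaskCount must be between 1 and 1,000.")
    if lifetime > MAX_TASK_AVAILABILITY_SECONDS:
        warnings.append("TaskAvailabilityLifetimeInSeconds exceeds 864,000.")
    if seconds_per_task > time_limit:
        warnings.append("A typical task takes longer than TaskTimeLimitInSeconds.")

    # Each object needs labels from this many workers, so a smaller
    # team can never finish a batch.
    if active_workers < 1 or active_workers < workers_per_object:
        warnings.append("Fewer active workers than NumberOfHumanWorkersPerDataObject.")
        return warnings

    # Rough estimate (assumption): work is spread evenly across workers.
    tasks_per_worker = math.ceil(batch_size * workers_per_object / active_workers)
    estimated_seconds = tasks_per_worker * seconds_per_task
    if estimated_seconds > lifetime:
        warnings.append(
            f"A batch needs about {estimated_seconds} seconds, but tasks "
            f"expire after {lifetime} seconds. Lower MaxConcurrentTaskCount "
            "or raise TaskAvailabilityLifetimeInSeconds."
        )
    return warnings


example_config = {
    "MaxConcurrentTaskCount": 1000,
    "NumberOfHumanWorkersPerDataObject": 3,
    "TaskAvailabilityLifetimeInSeconds": 3600,
    "TaskTimeLimitInSeconds": 300,
}
for warning in check_human_task_config(example_config, active_workers=5,
                                       seconds_per_task=60):
    print(warning)
```

With five workers at one minute per task, a batch of 1,000 objects labeled three times each takes about 36,000 seconds, so the example prints the expiry warning. Reducing MaxConcurrentTaskCount to 50 brings the estimate to 1,800 seconds and clears it.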

Troubleshoot labeling errors

Check permissions

Be sure that you have the right permissions to create a labeling job, access input data, and access the Amazon Simple Storage Service (Amazon S3) bucket for output data. For more information, see Step 1: Before you begin.

Be sure of the following:

  • The Amazon S3 bucket is in the same Region as the Ground Truth labeling job.
  • The bucket has a CORS policy attached. For more information, see CORS permission requirement.
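Both bucket conditions can be checked programmatically. The sketch below works on plain values so that it runs offline. In practice, the inputs would likely come from the boto3 S3 client calls get_bucket_location and get_bucket_cors. The helper names and the sample CORS rule are assumptions for illustration, so confirm the exact rule against the CORS permission requirement page.

```python
from typing import Dict, List, Optional


def bucket_region(location_constraint: Optional[str]) -> str:
    """Map a LocationConstraint value to a Region name.

    S3 reports no LocationConstraint for buckets in us-east-1.
    """
    return location_constraint or "us-east-1"


def cors_allows_get(cors_rules: List[Dict]) -> bool:
    """Return True if any rule allows GET from at least one origin."""
    return any(
        "GET" in rule.get("AllowedMethods", []) and rule.get("AllowedOrigins")
        for rule in cors_rules
    )


def check_bucket(location_constraint: Optional[str],
                 cors_rules: List[Dict],
                 job_region: str) -> List[str]:
    """Return a list of problems found with the bucket setup."""
    problems: List[str] = []
    region = bucket_region(location_constraint)
    if region != job_region:
        problems.append(
            f"Bucket is in {region}, but the labeling job is in {job_region}."
        )
    if not cors_allows_get(cors_rules):
        problems.append("No CORS rule allows GET requests.")
    return problems


# Sample rule, assumed for illustration only.
sample_rules = [{
    "AllowedHeaders": [],
    "AllowedMethods": ["GET"],
    "AllowedOrigins": ["*"],
    "ExposeHeaders": ["Access-Control-Allow-Origin"],
}]
print(check_bucket(None, sample_rules, "us-east-1"))  # []
```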

Check the output manifest file

Check the output manifest file in the S3 bucket that you specified to store the output files. In this output dataset, you can see the metadata for any failed annotations that might have led to failed labeling jobs.

Example:

{"source-ref":"s3://sagemaker-output-labeling-bucket-example/example.jpeg","example-metadata":{"retry-count":1,"failure-reason":"ClientError: Annotation tasks expired.  Probable Reasons are 1) TaskAvailabilityLifetimeInSeconds parameter is too small.  2) Reward is too low for workers to work on the task.  3) If you use a custom html template, your template may be broken.  4) Data (image/video/text) sent for annotation is broken or too big, preventing completion.  5) All workers declined the tasks.","human-annotated":"true"}}

Workers are allowed to decline tasks because of unclear instructions, input data being broken (not displaying correctly), or some other issue with the task. If all workers decline, the object is marked as expired and not sent to any other workers.
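For a large output manifest, it can help to extract the failed records with a script. The following sketch assumes that the manifest is in JSON Lines format and that failure details appear under a key ending in "-metadata", as in the preceding example. The function name find_failed_annotations is hypothetical, and the sample record uses a shortened failure reason.

```python
import json
from typing import Dict, Iterable, List


def find_failed_annotations(manifest_lines: Iterable[str]) -> List[Dict]:
    """Collect records whose metadata contains a failure-reason."""
    failures: List[Dict] = []
    for line_number, line in enumerate(manifest_lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        for key, value in record.items():
            # Metadata keys look like "<label-attribute-name>-metadata".
            if (key.endswith("-metadata") and isinstance(value, dict)
                    and "failure-reason" in value):
                failures.append({
                    "line": line_number,
                    "source": record.get("source-ref", record.get("source")),
                    "label_attribute": key[:-len("-metadata")],
                    "reason": value["failure-reason"],
                })
    return failures


sample = [
    '{"source-ref":"s3://sagemaker-output-labeling-bucket-example/example.jpeg",'
    '"example-metadata":{"retry-count":1,'
    '"failure-reason":"ClientError: Annotation tasks expired.",'
    '"human-annotated":"true"}}'
]
for failure in find_failed_annotations(sample):
    print(failure["line"], failure["source"], failure["reason"])
```

To use the sketch on a real job, download the output manifest and pass the open file object to find_failed_annotations.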

You can monitor whether workers decline, submit, or return a task using Amazon CloudWatch Events. For more information, see Monitor labeling job status.

Check the input manifest file

Be sure that the input manifest file meets all the listed JSON object requirements. For more information, see Use an input manifest file.
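A quick local check can catch common input manifest problems before you create the job. The sketch below covers only a few basic rules: each line is a JSON object, each object has a source-ref or source key, and source-ref is an S3 URI. It doesn't cover every requirement in the documentation. The function name validate_input_manifest is hypothetical, and skipping blank lines is an assumption of this sketch.

```python
import json
from typing import Iterable, List, Tuple


def validate_input_manifest(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, problem) pairs for lines that fail basic checks."""
    problems: List[Tuple[int, str]] = []
    for line_number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # assumption: blank lines are skipped, not reported
        try:
            record = json.loads(text)
        except json.JSONDecodeError as error:
            problems.append((line_number, f"not valid JSON: {error.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((line_number, "line is not a JSON object"))
            continue
        if "source-ref" not in record and "source" not in record:
            problems.append((line_number, "missing source-ref or source key"))
            continue
        source_ref = record.get("source-ref")
        if source_ref is not None and not str(source_ref).startswith("s3://"):
            problems.append((line_number, "source-ref is not an S3 URI"))
    return problems


sample = [
    '{"source-ref": "s3://example-bucket/image1.jpg"}',
    '{"source": "Text to label"}',
    '{"image": "s3://example-bucket/image2.jpg"}',
    '{"source-ref": "s3://example-bucket/image3.jpg"',
]
for line_number, problem in validate_input_manifest(sample):
    print(f"line {line_number}: {problem}")
```

In this sample, lines 1 and 2 pass, line 3 is reported for a missing key, and line 4 is reported as invalid JSON because of the missing closing brace.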


Related information

Create a labeling job

Control the flow of data objects sent to workers

Monitor labeling jobs

AWS OFFICIAL Updated a year ago