AWS Machine Learning Blog

Gamify Amazon SageMaker Ground Truth labeling workflows via a bar chart race

Labeling is an indispensable stage of data preprocessing in supervised learning. Amazon SageMaker Ground Truth is a fully managed data labeling service that makes it easy to build highly accurate training datasets for machine learning. Ground Truth helps improve the quality of labels through annotation consolidation and audit workflows. Ground Truth is easy to use, can reduce your labeling costs by up to 70% using automatic labeling, and provides options to work with labelers inside and outside of your organization.

This post explains how you can use Ground Truth partial labeling data loaded in Amazon Simple Storage Service (Amazon S3) to gamify labeling workflows. The core of the gamification approach is to create a bar chart race that shows the progress of the labeling workflow and highlights the evolution of completed labels per worker. The bar chart race can be sent to workers periodically (such as daily or weekly). We present options to create and send your bar chart manually or automatically.

This gamification approach to Ground Truth labeling workflows can help you:

  • Speed up labeling
  • Reduce labeling delays through continuous monitoring
  • Increase user engagement and user satisfaction

We have successfully adopted this solution with a healthcare and life sciences customer. The labeling job owner kept the internal labeling team engaged by sending a bar chart race daily, and the labeling job was completed 20% faster than planned.

Option 1: Manual chart creation

The first option for gamifying your Ground Truth labeling workflow via a bar chart race is to create a SageMaker notebook instance, fetch the partial labeling data, parse it, and create the bar chart race manually. You then save the chart to Amazon S3 and send it to the workers. The following diagram shows this workflow.

To create your bar chart race manually, complete the following steps:

  1. Create a Ground Truth labeling job and specify the S3 location where the labeling data is continuously written (a sketch follows this step).
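Labeling jobs are typically created in the SageMaker console, but the following is a rough boto3 sketch for a single-label image classification job with a private workforce. Every name, ARN, and S3 path below is a placeholder, and the built-in pre-annotation and consolidation Lambda ARNs shown are the us-east-1 ones (they vary by Region):
import boto3

sm = boto3.client('sagemaker')

sm.create_labeling_job(
    LabelingJobName='example-labeling-job',
    LabelAttributeName='label',
    InputConfig={'DataSource': {'S3DataSource': {
        'ManifestS3Uri': 's3://example-sagemaker-gt/input/manifest.json'}}},
    # Worker responses accumulate under
    # <S3OutputPath>/<job-name>/annotations/worker-response/iteration-1/
    OutputConfig={'S3OutputPath': 's3://example-sagemaker-gt/'},
    RoleArn='arn:aws:iam::123456789012:role/GroundTruthExecutionRole',
    HumanTaskConfig={
        'WorkteamArn': 'arn:aws:sagemaker:us-east-1:123456789012:workteam/private-crowd/example-team',
        'UiConfig': {'UiTemplateS3Uri': 's3://example-sagemaker-gt/template.liquid'},
        # Built-in Lambdas for single-label image classification (us-east-1)
        'PreHumanTaskLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:PRE-ImageMultiClass',
        'AnnotationConsolidationConfig': {
            'AnnotationConsolidationLambdaArn':
                'arn:aws:lambda:us-east-1:432418664414:function:ACS-ImageMultiClass'},
        'TaskTitle': 'Classify images',
        'TaskDescription': 'Choose the correct class for each image',
        'NumberOfHumanWorkersPerDataObject': 1,
        'TaskTimeLimitInSeconds': 300,
    },
)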
  2. Create a SageMaker notebook instance.
    1. Attach the appropriate AWS Identity and Access Management (IAM) role to allow read access to the S3 bucket containing the outputs of the Ground Truth labeling job (a minimal sketch follows).
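For a quick start, you could attach the AWS managed AmazonS3ReadOnlyAccess policy to the notebook's execution role, as in the following minimal sketch; the role name is a placeholder, and in production you should scope the policy down to the labeling bucket:
import boto3

iam = boto3.client('iam')

# Attach the broad AWS managed read-only S3 policy to the execution role
iam.attach_role_policy(
    RoleName='ExampleSageMakerExecutionRole',  # placeholder role name
    PolicyArn='arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess',
)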
  3. Create a notebook using the conda_python3 kernel, then install the required dependencies. You can run the following commands from a terminal after activating the appropriate environment:
$cd /home/ec2-user/SageMaker/
$source activate python3
$pip install bar_chart_race
$pip install ffmpeg-python
$sudo su -
$cd /usr/local/bin
$mkdir ffmpeg
$cd ffmpeg
$wget https://www.johnvansickle.com/ffmpeg/old-releases/ffmpeg-4.2.1-amd64-static.tar.xz
$tar xvf ffmpeg-4.2.1-amd64-static.tar.xz
$mv ffmpeg-4.2.1-amd64-static/ffmpeg .
$ln -s /usr/local/bin/ffmpeg/ffmpeg /usr/bin/ffmpeg
$exit
  4. Import the required packages via the following code:
import boto3
import json
import pandas as pd 
import numpy as np
  5. Set up the SageMaker notebook instance to access the S3 bucket containing the Ground Truth labeling data (for this post, we use the bucket example-sagemaker-gt):
s3 = boto3.client('s3')
bucket_name = 'example-sagemaker-gt'
prefix = 'annotations/worker-response/iteration-1/'
  6. Analyze the partial Ground Truth labeling data:
s3_res = boto3.resource('s3')
paginator = boto3.client('s3').get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=bucket_name, Prefix=prefix)

# Collect the submission time and the anonymous Cognito identifier (sub)
# of each worker response
times = []
subs = []
for page in pages:
    for work in page['Contents']:
        content_object = s3_res.Object(bucket_name, work['Key'])
        file_content = content_object.get()['Body'].read().decode('utf-8')
        json_content = json.loads(file_content)
        times.append(json_content['answers'][0]['submissionTime'])
        subs.append(json_content['answers'][0]['workerMetadata']['identityData']['sub'])

# Map each worker's sub to an anonymized display name
sub_map = {s: f'Name {i}' for i, s in enumerate(np.unique(subs))}
  7. Convert the partial Ground Truth labeling data into a DataFrame and derive the date and hour fields:
df = pd.DataFrame({'times': times, 'subs': subs})
df['subs'] = df['subs'].map(sub_map)
df['date'] = pd.to_datetime(df.times).dt.date
df['hours'] = pd.to_datetime(df.times).dt.strftime('%Y-%m-%d %H:30')
  8. Extract the labeling occurrences per worker and calculate the cumulative sum:
df['count'] = 1
counts_per_sub_per_date = df.groupby(['hours','subs'])['count'].count().unstack()
counts_per_sub_per_date_cum = counts_per_sub_per_date.fillna(0).cumsum()
  9. Create the bar chart race video:
import bar_chart_race as bcr

bcr.bar_chart_race(
    df=counts_per_sub_per_date_cum,
    filename='barchart.mp4',  # write the video locally so you can upload it to Amazon S3
    orientation='h',
    sort='desc',
    #n_bars=len(counts_per_sub.columns),
    fixed_order=False,
    fixed_max=True,
    steps_per_period=5,
    interpolate_period=False,
    label_bars=True,
    bar_size=.95,
    period_label={'x': .99, 'y': .25, 'ha': 'right', 'va': 'center'},
    #period_fmt='%B %d, %Y',
    period_summary_func=lambda v, r: {'x': .99, 'y': .18,
                                      's': f'Total labels: {v.sum():,.0f}',
                                      'ha': 'right', 'size': 8, 'family': 'Courier New'},
    perpendicular_bar_func='median',
    period_length=50,
    figsize=(5, 3),
    dpi=144,
    cmap='dark12',
    title='Who is going to be the top labeller?',
    title_size='',
    bar_label_size=7,
    tick_label_size=7,
    shared_fontdict={'family' : 'Helvetica', 'color' : '.1'},
    scale='linear',
    writer=None,
    fig=None,
    bar_kwargs={'alpha': .7},
    filter_column_colors=False)  
  10. Upload the bar chart race output to an S3 bucket (see the sketch following these steps).
  11. Email this file to your workers.
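For the last two steps, a minimal sketch could look like the following; the object key barchart/barchart.mp4 is an arbitrary choice, and the presigned URL is an optional convenience so workers can view the video without needing S3 permissions:
import boto3

s3 = boto3.client('s3')
bucket_name = 'example-sagemaker-gt'  # same bucket as in step 5

# Upload the locally rendered video (see the filename argument in step 9)
s3.upload_file('barchart.mp4', bucket_name, 'barchart/barchart.mp4')

# Optionally, generate a presigned URL (valid for up to 7 days) to email to workers
url = s3.generate_presigned_url(
    'get_object',
    Params={'Bucket': bucket_name, 'Key': 'barchart/barchart.mp4'},
    ExpiresIn=7 * 24 * 3600,
)
print(url)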

Option 2: Automatic chart creation

Option 2 requires no manual intervention; the bar chart races are sent automatically to the workers at a fixed interval (such as every day or every week). We provide a completely serverless solution, where the computing is done through AWS Lambda. The advantage of this approach is that you don't need to deploy any computing infrastructure (the SageMaker notebook instance in the first option). The steps involved are as follows:

  1. A Lambda function is triggered at a fixed time interval (for example, by an Amazon EventBridge scheduled rule) and generates the bar chart race by replicating the steps from Option 1. External dependencies, such as ffmpeg, are installed as Lambda layers.
  2. The bar chart races are saved to Amazon S3.
  3. Each update to the video in Amazon S3 triggers a message to Amazon Simple Notification Service (Amazon SNS).
  4. Amazon SNS sends an email to subscribers.

The following diagram illustrates this architecture.
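You can wire these services together in the AWS console or programmatically. The following is a minimal sketch, assuming the function is named barchart-race and the SNS topic already exists; all names, ARNs, and the email address are placeholders:

import boto3

events = boto3.client('events')
lam = boto3.client('lambda')
s3 = boto3.client('s3')
sns = boto3.client('sns')

# (1) Invoke the Lambda function once a day with an EventBridge scheduled rule
rule = events.put_rule(Name='daily-barchart', ScheduleExpression='rate(1 day)')
lam.add_permission(
    FunctionName='barchart-race',      # placeholder function name
    StatementId='allow-eventbridge',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule['RuleArn'],
)
function_arn = lam.get_function(FunctionName='barchart-race')['Configuration']['FunctionArn']
events.put_targets(Rule='daily-barchart', Targets=[{'Id': '1', 'Arn': function_arn}])

# (3) Publish to the SNS topic whenever a new video lands under barchart/
s3.put_bucket_notification_configuration(
    Bucket='YourBucketHere',
    NotificationConfiguration={'TopicConfigurations': [{
        'TopicArn': 'arn:aws:sns:us-east-1:123456789012:barchart-topic',  # placeholder
        'Events': ['s3:ObjectCreated:*'],
        'Filter': {'Key': {'FilterRules': [{'Name': 'prefix', 'Value': 'barchart/'}]}},
    }]},
)

# (4) Email subscribers
sns.subscribe(
    TopicArn='arn:aws:sns:us-east-1:123456789012:barchart-topic',
    Protocol='email',
    Endpoint='worker@example.com',
)

Note that the SNS topic's access policy must allow Amazon S3 to publish to it before the bucket notification can be created, and each email subscriber must confirm the subscription before receiving messages.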

The following is the code for the Lambda function:

import boto3
import json 
import os

import numpy as np 
import pandas as pd

from matplotlib import pyplot as plt
from matplotlib import animation
import bar_chart_race as bcr

# point this to the path in your lambda layer
plt.rcParams['animation.ffmpeg_path'] = '/opt/ffmpeg/bin/ffmpeg'

s3_res = boto3.resource('s3')


bucket_name = 'YourBucketHere'
prefix = 'GTFolder/annotations/worker-response/iteration-1/'

def lambda_handler(event, context):
    
    # Optional debug output: environment, working directory, and layer contents
    print(os.environ)
    print(os.getcwd())
    print(os.listdir('/opt/'))
    
    paginator = boto3.client('s3').get_paginator('list_objects_v2')
    pages = paginator.paginate(Bucket=bucket_name, Prefix=prefix)
        
    times = []
    subs = []
    for page in pages:
        for work in page['Contents']:
            content_object = s3_res.Object(bucket_name, work['Key'])
            file_content = content_object.get()['Body'].read().decode('utf-8')
            json_content = json.loads(file_content)
            times.append(json_content['answers'][0]['submissionTime'])
            subs.append(json_content['answers'][0]['workerMetadata']['identityData']['sub'])
        
    # this is where one would map back to the real names of the labelers, possibly
    # using Cognito for sub -> Name correspondence
    
    sub_map = { s: f'Name {i}' for i,s in enumerate(np.unique(subs))}
    
    df = pd.DataFrame({'times':times,'subs':subs})
    df["subs"] = df["subs"].map(sub_map)
    df['date'] = pd.to_datetime(df.times).dt.date
    df['hours'] = pd.to_datetime(df.times).dt.strftime('%Y-%m-%d %H:30')
    df['count']=1
    
    counts_per_sub_per_date = df.groupby(['hours','subs'])['count'].count().unstack()
    counts_per_sub_per_date_cum = counts_per_sub_per_date.fillna(0).cumsum()
    
    # Cap the number of periods (here 100) to keep rendering time bounded
    bcr.bar_chart_race(df=counts_per_sub_per_date_cum.iloc[:100],
        filename='/tmp/barchart.mp4',
        orientation='h',
        sort='desc',
        #n_bars=len(counts_per_sub.columns),
        fixed_order=False,
        fixed_max=True,
        steps_per_period=5,
        interpolate_period=False,
        label_bars=True,
        bar_size=.95,
        period_label={'x': .99, 'y': .25, 'ha': 'right', 'va': 'center'},
        #period_fmt='%B %d, %Y',
        period_summary_func=lambda v, r: {'x': .99, 'y': .18,
                                          's': f'Total labels: {v.sum():,.0f}',
                                          'ha': 'right', 'size': 8, 'family': 'Courier New'},
        perpendicular_bar_func='median',
        period_length=50,
        figsize=(5, 3),
        dpi=144,
        cmap='dark12',
        title='Who is going to be the top labeller?',
        title_size='',
        bar_label_size=7,
        tick_label_size=7,
        shared_fontdict={'family' : 'Helvetica', 'color' : '.1'},
        scale='linear',
        writer=None,
        fig=None,
        bar_kwargs={'alpha': .7},
        filter_column_colors=False)  
    
    boto3.client('s3').upload_file('/tmp/barchart.mp4', bucket_name, 'barchart/barchart.mp4')
    
    return {
        'statusCode': 200,
        'body': json.dumps('Bar chart race uploaded to Amazon S3')
    }
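The function anonymizes workers as Name 0, Name 1, and so on. As the comment in the code suggests, if your private workforce is backed by an Amazon Cognito user pool, you could instead map each sub back to a user name, as in the following sketch (the user pool ID is a placeholder):

import boto3

cognito = boto3.client('cognito-idp')

def sub_to_username(user_pool_id, sub):
    """Look up the Cognito user name for a given sub."""
    users = cognito.list_users(
        UserPoolId=user_pool_id,
        Filter=f'sub = "{sub}"',
    )['Users']
    return users[0]['Username'] if users else sub

# Example: replace the anonymized map in the handler
# sub_map = {s: sub_to_username('us-east-1_EXAMPLE', s) for s in np.unique(subs)}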

Clean up

When you finish this exercise, remove your resources with the following steps:

  1. Delete your notebook instance.
  2. Stop your Ground Truth job.
  3. Optionally, delete the SageMaker execution role.
  4. Optionally, empty and delete the S3 bucket.
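If you prefer to clean up programmatically, the following sketch covers the first two steps; the job and instance names are placeholders:

import boto3

sm = boto3.client('sagemaker')

# Stop the labeling job if it is still running
sm.stop_labeling_job(LabelingJobName='example-labeling-job')

# Stop the notebook instance, wait for it to stop, then delete it
sm.stop_notebook_instance(NotebookInstanceName='example-notebook')
sm.get_waiter('notebook_instance_stopped').wait(NotebookInstanceName='example-notebook')
sm.delete_notebook_instance(NotebookInstanceName='example-notebook')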

Conclusions

This post demonstrated how to use Ground Truth partial labeling data loaded in Amazon S3 to gamify labeling workflows by periodically creating a bar chart race. In our experience, engaging workers with a bar chart race sparks friendly competition, speeds up labeling, and increases user engagement and satisfaction.

Get started today! You can learn more about Ground Truth and kick off your own labeling and gamification processes by visiting the SageMaker console.


About the Authors

Daniele Angelosante is a Senior Engagement Manager with AWS Professional Services. He is passionate about AI/ML projects and products. In his free time, he likes coffee, sport, soccer, and baking.

Andrea Di Simone is a Data Scientist in the Professional Services team based in Munich, Germany. He helps customers to develop their AI/ML products and workflows, leveraging AWS tools. He enjoys reading, classical music and hiking.

Othmane Hamzaoui is a Data Scientist working in the AWS Professional Services team. He is passionate about solving customer challenges using Machine Learning, with a focus on bridging the gap between research and business to achieve impactful outcomes. In his spare time, he enjoys running and discovering new coffee shops in the beautiful city of Paris.