AWS News Blog

New – Label Videos with Amazon SageMaker Ground Truth

Launched at AWS re:Invent 2018, Amazon SageMaker Ground Truth is a capability of Amazon SageMaker that makes it easy to annotate machine learning datasets. Customers can efficiently and accurately label image, text, and 3D point cloud data with built-in workflows, or any other type of data with custom workflows. Data samples are automatically distributed to a workforce (private, third-party, or MTurk), and annotations are stored in Amazon Simple Storage Service (S3). Optionally, automated data labeling may also be enabled, reducing both the time required to label the dataset and the associated costs.

As models become more sophisticated, AWS customers are increasingly applying machine learning prediction to video content. Autonomous driving is perhaps the best-known use case, as safety demands that road conditions and moving objects be correctly detected and tracked in real time. Video prediction is also a popular application in sports, where tracking players or racing vehicles makes it possible to compute the statistics that fans are so fond of. Healthcare organizations also use video prediction to identify and track anatomical objects in medical videos. Manufacturing and logistics companies do the same to track objects on the assembly line, parcels, and more. The list goes on, and amazing applications keep popping up in many different industries.

Of course, this requires building and labeling video datasets, where objects of interest need to be labeled manually. At 30 frames per second, one minute of video translates to 1,800 individual images, so the amount of work can quickly become overwhelming. In addition, specific tools have to be built to label images, manage workflows, and so on. All this work takes valuable time and resources away from an organization’s core business.

AWS customers have asked us for a better solution, and today I’m very happy to announce that Amazon SageMaker Ground Truth now supports video labeling.

Customer use case: the National Football League
The National Football League (NFL) has already put this new feature to work. Says Jennifer Langton, SVP of Player Health and Innovation, NFL: “At the National Football League (NFL), we continue to look for new ways to use machine learning (ML) to help our fans, broadcasters, coaches, and teams benefit from deeper insights. Building these capabilities requires large amounts of accurately labeled training data. Amazon SageMaker Ground Truth was truly a force multiplier in accelerating our project timelines. We leveraged the new video object tracking workflow in addition to other existing computer vision (CV) labeling workflows to develop labels for training a computer vision system that tracks all 22 players as they move on the field during plays. Amazon SageMaker Ground Truth reduced the timeline for developing a high quality labeling dataset by more than 80%”.

Courtesy of the NFL, here are a couple of predicted frames, showing helmet detection in a Seattle Seahawks video. This particular video has 353 frames. This first picture is frame #100.

[Image: object tracking, frame #100]

This second picture is frame #110.

[Image: object tracking, frame #110]

Introducing Video Labeling
With the addition of video task types, customers can now use Amazon SageMaker Ground Truth for:

  • Video clip classification
  • Video multi-frame object detection
  • Video multi-frame object tracking

The multi-frame task types support multiple labels, so that you may label different object classes present in the video frames. You can create labeling jobs to annotate frames from scratch, as well as adjustment jobs to review and fine-tune frames that have already been labeled. These jobs may be distributed either to a private workforce, or to a vendor workforce that you select on AWS Marketplace.

Using the built-in GUI, workers can then easily label and track objects across frames. Once they’ve annotated a frame, they can use an assistive labeling feature to predict the location of bounding boxes in the next frame, as you will see in the demo below. This significantly simplifies labeling work, saves time, and improves the quality of annotations. Last but not least, work is saved automatically.

Preparing Input Data for Video Object Detection and Tracking
As you would expect, input data must be located in S3. You may bring either video files, or sequences of video frames.

The first option is the simplest, as Amazon SageMaker Ground Truth includes a tool that automatically extracts frames from your video files. Optionally, you can sample frames (1 in ‘n’) in order to reduce the amount of labeling work. The extraction tool also builds a manifest file describing sequences and frames. You can learn more about it in the documentation.
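
To get a feel for how much 1-in-n sampling reduces the workload, here is a minimal Python sketch. It is not the Ground Truth extraction tool: the function name sample_frames and the choice to always keep the first frame are assumptions made for illustration.

```python
from typing import List


def sample_frames(frames: List[str], n: int) -> List[str]:
    """Keep one frame out of every n, starting with the first frame.

    Hypothetical helper: it only illustrates the arithmetic of sampling.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    return frames[::n]


# One minute of video at 30 frames per second is 1,800 frames.
all_frames = [f"frame{i:04d}.jpg" for i in range(1, 1801)]
sampled = sample_frames(all_frames, 5)
print(len(all_frames), "->", len(sampled))  # 1800 -> 360
```

With n set to 5, workers would label 360 frames instead of 1,800.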

The second option requires two steps: extracting frames, and building the manifest file. Extracting frames can easily be performed with ffmpeg, the popular open-source tool. Here’s how you could convert the first 60 seconds of a video to a frame sequence.

$ ffmpeg -ss 00:00:00.00 -t 00:01:00.00 -i basketball.mp4 frame%04d.jpg
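
If you would rather drive the extraction from a script, the same command can be assembled in Python. This is only a sketch: build_ffmpeg_command and extract_frames are hypothetical helper names, the start and duration are passed as plain seconds (which ffmpeg also accepts), and ffmpeg is assumed to be installed and on your PATH.

```python
import shlex
import subprocess
from typing import List


def build_ffmpeg_command(video: str, seconds: int,
                         pattern: str = "frame%04d.jpg") -> List[str]:
    """Build the argument list that extracts the first `seconds` of a video."""
    # -ss and -t placed before -i apply to the input file.
    return ["ffmpeg", "-ss", "0", "-t", str(seconds), "-i", video, pattern]


def extract_frames(video: str, seconds: int) -> None:
    """Run ffmpeg; raises CalledProcessError if the command fails."""
    subprocess.run(build_ffmpeg_command(video, seconds), check=True)


# Print the command instead of running it, so the sketch works without ffmpeg.
cmd = build_ffmpeg_command("basketball.mp4", 60)
print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.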

Each frame sequence should be uploaded to S3 under a different prefix, for example s3://my-bucket/my-videos/sequence1, s3://my-bucket/my-videos/sequence2, and so on, as explained in the documentation.

Once you have uploaded your frame sequences, you may then either bring your own JSON files to describe them, or let Ground Truth crawl your sequences and build the JSON files and the manifest file for you automatically. Please note that a video sequence cannot be longer than 2,000 frames, which corresponds to a little over a minute of video (about 67 seconds) at 30 frames per second.
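
For longer videos, one way to stay under the 2,000-frame limit is to cut the frame list into consecutive chunks before uploading each chunk under its own prefix. The sketch below shows the idea; split_into_sequences is a hypothetical helper, not part of Ground Truth.

```python
from typing import List

MAX_FRAMES_PER_SEQUENCE = 2000  # limit stated in the text above


def split_into_sequences(frames: List[str],
                         max_frames: int = MAX_FRAMES_PER_SEQUENCE) -> List[List[str]]:
    """Cut a frame list into consecutive chunks of at most max_frames frames."""
    if max_frames < 1:
        raise ValueError("max_frames must be a positive integer")
    return [frames[i:i + max_frames] for i in range(0, len(frames), max_frames)]


# 150 seconds of video at 30 frames per second is 4,500 frames.
frames = [f"frame{i:04d}.jpg" for i in range(1, 4501)]
sequences = split_into_sequences(frames)
print([len(s) for s in sequences])  # [2000, 2000, 500]
```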

Each sequence should be described by a simple sequence file:

  • A sequence number, an S3 prefix, and a number of frames.
  • A list of frames: number, file name, and creation timestamp.

Here’s an example of a sequence file.

{
  "version": "2020-06-01",
  "seq-no": 1,
  "prefix": "s3://jsimon-smgt/videos/basketball",
  "number-of-frames": 1800,
  "frames": [
    {"frame-no": 1, "frame": "frame0001.jpg", "unix-timestamp": 1594111541.71155},
    {"frame-no": 2, "frame": "frame0002.jpg", "unix-timestamp": 1594111541.711552},
    {"frame-no": 3, "frame": "frame0003.jpg", "unix-timestamp": 1594111541.711553},
    {"frame-no": 4, "frame": "frame0004.jpg", "unix-timestamp": 1594111541.711555},
. . .
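
If you bring your own sequence files, a few lines of Python can generate them. The sketch below reuses the field names from the example above. Everything else is an assumption: build_sequence_file is a hypothetical helper, the bucket name is a placeholder, and the timestamps are derived from a made-up start time and frame rate, since the example does not say how they were produced.

```python
import json
from typing import Any, Dict, List


def build_sequence_file(seq_no: int, prefix: str, frame_names: List[str],
                        start_time: float, fps: float = 30.0) -> Dict[str, Any]:
    """Describe one frame sequence using the fields shown in the example."""
    frames = [
        {
            "frame-no": index + 1,  # frame numbers start at 1
            "frame": name,
            # Assumption: timestamps are evenly spaced from start_time.
            "unix-timestamp": start_time + index / fps,
        }
        for index, name in enumerate(frame_names)
    ]
    return {
        "version": "2020-06-01",
        "seq-no": seq_no,
        "prefix": prefix,
        "number-of-frames": len(frames),
        "frames": frames,
    }


names = [f"frame{i:04d}.jpg" for i in range(1, 4)]
sequence = build_sequence_file(1, "s3://my-bucket/my-videos/sequence1",
                               names, start_time=1594111541.0)
print(json.dumps(sequence, indent=2))
```

The resulting JSON document would then be saved (for example as seq1.json) and uploaded to S3 alongside the frames.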

Finally, the manifest file should point at the sequence files you’d like to include in the labeling job. Here’s an example.

{"source-ref": "s3://jsimon-smgt/videos/seq1.json"}
{"source-ref": "s3://jsimon-smgt/videos/seq2.json"}
. . .
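
The manifest is a JSON Lines file: one JSON object per line, each with a source-ref key. A minimal sketch for generating it could look like this; build_manifest is a hypothetical helper, and the S3 URIs are placeholders.

```python
import json
from typing import List


def build_manifest(sequence_file_uris: List[str]) -> str:
    """Return manifest text with one {"source-ref": ...} object per line."""
    return "".join(
        json.dumps({"source-ref": uri}) + "\n" for uri in sequence_file_uris
    )


manifest = build_manifest([
    "s3://my-bucket/my-videos/seq1.json",
    "s3://my-bucket/my-videos/seq2.json",
])
print(manifest, end="")
```

You would write this text to a file and upload it to S3 before creating the labeling job.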

Just like for other task types, the augmented manifest is available in S3 once labeling is complete. It contains annotations and labels, which you can then feed to your machine learning training job.

Labeling Videos with Amazon SageMaker Ground Truth
Here’s a sample video where I label the first ten frames of a sequence. You can see a screenshot below.

I first use the Ground Truth GUI to carefully label the first frame, drawing bounding boxes for basketballs and basketball players. Then, I use the “Predict next” assistive labeling tool to predict the location of the boxes in the next nine frames, applying only minor adjustments to some boxes. Although this was my first try, I found the process easy and intuitive. With a little practice, I could certainly go much faster!

Getting Started
Now, it’s your turn. You can start labeling videos with Amazon SageMaker Ground Truth today in the following regions:

  • US East (N. Virginia), US East (Ohio), US West (Oregon),
  • Canada (Central),
  • Europe (Ireland), Europe (London), Europe (Frankfurt),
  • Asia Pacific (Mumbai), Asia Pacific (Singapore), Asia Pacific (Seoul), Asia Pacific (Sydney), Asia Pacific (Tokyo).

We’re looking forward to reading your feedback. You can send it through your usual support contacts, or in the AWS Forum for Amazon SageMaker.

- Julien
Julien Simon

As an Artificial Intelligence & Machine Learning Evangelist for EMEA, Julien focuses on helping developers and enterprises bring their ideas to life.