AWS for M&E Blog
New – Streamline media analysis tasks with Amazon Rekognition Video
Amazon Rekognition Video is a machine learning (ML) based service that can analyze videos to identify objects, people, faces, text, scenes, and activities, as well as detect any inappropriate content. Starting today, you can streamline media analysis tasks by automating the detection of black frames, end credits, shot changes, and color bars using Amazon Rekognition Video. By automating these tasks, you can reduce the time, effort, and costs associated with workflows like video ad insertion, content operations, and content production.
Challenges with media analysis
Viewers are watching more content than ever, with Over-The-Top (OTT) and Video-On-Demand (VOD) platforms in particular providing a rich selection of content choices anytime, anywhere, and on any screen. Media customers have told us that with proliferating content volumes, they face challenges in preparing and managing content, tasks that are crucial to providing a high-quality viewing experience and to monetizing content effectively. Today, companies use large teams of trained human workforces to perform tasks such as finding where the end credits begin in an episode, choosing the right spots to insert ads, or breaking up videos into smaller clips for better indexing. These manual processes are expensive, slow, and cannot scale to keep up with the volume of content being produced, licensed, and retrieved from archives daily.
Introducing Amazon Rekognition Video for media analysis
Amazon Rekognition Video makes it easy to automate these operational media analysis tasks by providing fully managed, purpose-built APIs powered by ML. Using these APIs, you can easily analyze large volumes of videos stored in Amazon S3, detect markers such as black frames or shot changes, and get SMPTE (Society of Motion Picture and Television Engineers) timecodes and timestamps for each detection – without requiring any machine learning experience. Returned SMPTE timecodes are frame accurate, which means that Amazon Rekognition Video provides the exact frame number when it detects a relevant segment of video, and handles various video frame rate formats under the hood. Using the frame accurate metadata from Amazon Rekognition Video, you can either automate certain tasks completely, or significantly reduce the review workload of trained human operators, so that they can focus on more creative work. This enables you to perform tasks such as content preparation, ad insertion, and adding ‘binge-markers’ to content at scale in the cloud. With Amazon Rekognition Video, you pay only for what you use. There are no minimum fees, licenses, or upfront commitments.
Key features
Let us look at each media analysis feature, common use cases, and some sample detections returned by Amazon Rekognition Video. For this section, we are using clips from Big Buck Bunny (2008) and Tears of Steel (2013), two open-source films made by the Blender Institute, and distributed under Creative Commons License 3.0.
Black frames detection: Videos often contain a short duration of empty black frames with no audio that are used as cues to insert advertisements, or to demarcate the end of a program segment such as a scene or the opening credits. With Amazon Rekognition Video, you can detect such black frame sequences to automate ad insertion, package content for VOD, and demarcate various program segments or scenes. Black frames with audio (such as fade outs or voiceovers) are considered as content and not returned.
End credits detection: Amazon Rekognition Video helps you automatically identify the exact frames where the closing credits start and end for a movie or TV show. With this information, you can generate markers for interactive viewer prompts such as ‘Next Episode’ in VOD applications, or find out the last frame of program content in a video. Amazon Rekognition Video is trained to handle a wide variety of end credit styles ranging from simple rolling credits to more challenging credits alongside content, and excludes opening credits automatically.
Shot detection: A shot is a series of interrelated consecutive pictures taken contiguously by a single camera and representing a continuous action in time and space. With Amazon Rekognition Video, you can detect the start, end, and duration of each shot, as well as a count of all the shots in a piece of content. Shot metadata can be used for applications such as creating promotional videos from selected shots, generating a set of preview thumbnails that avoid transitional content between shots, and inserting ads in spots that don’t disrupt the viewer experience, for example by avoiding the middle of a shot when someone is speaking.
Color bars detection: Amazon Rekognition Video allows you to detect sections of video that display SMPTE color bars, which are a set of colors displayed in specific patterns to ensure color is calibrated correctly on broadcast monitors, programs, and on cameras. This metadata is useful to prepare content for VOD applications by removing color bar segments from the content, or to detect issues such as loss of broadcast signals in a recording, when color bars are shown continuously as a default signal instead of content.
A typical timeline for a video asset in the media supply chain might look like the following (note the color bars at the beginning, the black frames throughout the video, and the end credits at the end). With Amazon Rekognition Video, you can detect each of these segments automatically and get frame accurate start and end timecodes.
How it Works
These media analysis features are available through the Amazon Rekognition Video segment detection API. This is an asynchronous API composed of two operations: StartSegmentDetection to start the analysis, and GetSegmentDetection to get the analysis results. Let us understand each of these operations in more detail.
Starting segment detection
StartSegmentDetection accepts an H.264 video stored in Amazon S3 along with input parameters, and returns a unique JobId if the request succeeds. We recommend using a 720p or 1080p ‘proxy’ version of your content for best results. If you have high resolution source files in formats like Apple ProRes or MXF, you can use AWS Elemental MediaConvert to transcode them to H.264 first. The following is an example request JSON for StartSegmentDetection:
{
    "Video": {
        "S3Object": {
            "Bucket": "test_files",
            "Name": "test_file.mp4"
        }
    },
    "ClientRequestToken": "SegmentDetectionToken",
    "NotificationChannel": {
        "SNSTopicArn": "arn:aws:sns:us-east-1:111122223333:AmazonRekognitionSegmentationTopic",
        "RoleArn": "arn:aws:iam::111122223333:role/RekVideoServiceRole"
    },
    "JobTag": "SegmentingVideo",
    "SegmentTypes": [
        "TECHNICAL_CUE",
        "SHOT"
    ],
    "Filters": {
        "TechnicalCueFilter": {
            "MinSegmentConfidence": 90.0
        },
        "ShotFilter": {
            "MinSegmentConfidence": 80.0
        }
    }
}
Black frames, color bars, and end credits are collectively called ‘Technical Cues’. By choosing different values for SegmentTypes, you can detect Technical Cues, shots, or both. In the example above, both Technical Cues and shots will be detected. Each detection also contains a prediction confidence score. By specifying MinSegmentConfidence filters, you can filter out detections that don’t meet your confidence threshold. For example, setting a 90% threshold for Technical Cues will filter out all results whose confidence is below 90%.
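If you prefer to start the analysis through an SDK rather than raw JSON, here is a minimal sketch using the AWS SDK for Python (boto3). It assumes boto3 is installed and credentials are configured; the bucket, object key, SNS topic, and IAM role values are the same placeholders used in the request above.

import boto3

# Minimal sketch: start segment detection for a video stored in Amazon S3.
# The bucket, key, SNS topic, and IAM role below are placeholders.
rekognition = boto3.client("rekognition")

response = rekognition.start_segment_detection(
    Video={"S3Object": {"Bucket": "test_files", "Name": "test_file.mp4"}},
    NotificationChannel={
        "SNSTopicArn": "arn:aws:sns:us-east-1:111122223333:AmazonRekognitionSegmentationTopic",
        "RoleArn": "arn:aws:iam::111122223333:role/RekVideoServiceRole",
    },
    JobTag="SegmentingVideo",
    SegmentTypes=["TECHNICAL_CUE", "SHOT"],
    Filters={
        "TechnicalCueFilter": {"MinSegmentConfidence": 90.0},
        "ShotFilter": {"MinSegmentConfidence": 80.0},
    },
)

job_id = response["JobId"]
print("Started segment detection job:", job_id)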
Getting segment detection results
Using the JobId obtained from the StartSegmentDetection call, you can now call GetSegmentDetection. This API takes the JobId and the maximum number of results you want per page, and returns the results for the requested analysis along with general metadata about the stored video. Here is what a GetSegmentDetection request looks like:
{
    "JobId": "270c1cc5e1d0ea2fbc59d97cb69a72a5495da75851976b14a1784ca90fc180e3",
    "MaxResults": 10,
    …
}
And here is a sample response from GetSegmentDetection:
"JobStatus": "SUCCEEDED",
"VideoMetadata": [
{
"Codec": "h264",
"DurationMillis": 478145,
"Format": "QuickTime / MOV",
"FrameRate": 24.0,
"FrameHeight": 360,
"FrameWidth": 636
}
],
"AudioMetadata": [
{
"Codec": "aac",
"DurationMillis": 478214,
"SampleRate": 44100,
"NumberOfChannels": 2
}
],
"Segments": [
{
"Type": "TECHNICAL_CUE",
"StartTimestampMillis": 121666,
"EndTimestampMillis": 471333,
"DurationMillis": 349667,
"StartTimecodeSMPTE": "00:02:01:16",
"EndTimecodeSMPTE": "00:07:51:08",
"DurationSMPTE": "00:05:49:16",
"TechnicalCueSegment": {
"Type": "EndCredits",
"Confidence": 84.85398864746094
}
},
{
"Type": "SHOT",
"StartTimestampMillis": 0,
"EndTimestampMillis": 29041,
"DurationMillis": 29041,
"StartTimecodeSMPTE": "00:00:00:00",
"EndTimecodeSMPTE": "00:00:29:01",
"DurationSMPTE": "00:00:29:01",
"ShotSegment": {
"Index": 0,
"Confidence": 87.50452423095703
}
},
],
"SelectedSegmentTypes": [
{
"Type": "SHOT",
"ModelVersion": "1.0"
},
{
"Type": "TECHNICAL_CUE",
"ModelVersion": "1.0"
}
]
}
As you can see, each detection contains the start, end, duration, and confidence for a segment type. Amazon Rekognition Video provides both frame accurate SMPTE timecodes and millisecond timestamps, and handles different frame rate standards such as integer (e.g. 25 fps), fractional (e.g. 23.976 fps), and drop-frame (e.g. 29.97 fps). Shots also include an Index field that counts the shots elapsed up to that point in the video.
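To consume these results programmatically, one approach is to poll GetSegmentDetection until the job finishes and then follow the NextToken pagination marker to collect every segment. The following is a minimal sketch in Python (boto3), continuing from the job_id obtained earlier; the polling loop is simplified for illustration, and in production you would typically rely on the Amazon SNS completion notification instead.

import time
import boto3

rekognition = boto3.client("rekognition")

def get_all_segments(job_id):
    # Wait for the asynchronous job to finish (simplified polling for illustration).
    while True:
        result = rekognition.get_segment_detection(JobId=job_id, MaxResults=10)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(10)
    if result["JobStatus"] != "SUCCEEDED":
        raise RuntimeError("Segment detection did not succeed: " + result["JobStatus"])

    # Page through the results using the NextToken marker.
    segments = list(result["Segments"])
    while "NextToken" in result:
        result = rekognition.get_segment_detection(
            JobId=job_id, MaxResults=10, NextToken=result["NextToken"]
        )
        segments.extend(result["Segments"])
    return segments

for segment in get_all_segments(job_id):
    if segment["Type"] == "TECHNICAL_CUE":
        label = segment["TechnicalCueSegment"]["Type"]  # e.g. BlackFrames, EndCredits, ColorBars
    else:
        label = "Shot " + str(segment["ShotSegment"]["Index"])
    print(label, segment["StartTimecodeSMPTE"], "to", segment["EndTimecodeSMPTE"])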
Customer stories
Customers have told us that they can use these new features for media analysis to simplify video ad insertion, content production, and content operation workflows. Following are some examples of how customers are deriving value from the features.
A+E Networks® is a collection of culture brands that includes A&E®, HISTORY®, Lifetime®, LMN™, FYI™, Vice TV and BIOGRAPHY®. We are in seven out of 10 American homes, cumulatively reach 335 million people worldwide and have 500+ million digital users.
“A+E Networks receives thousands of hours of new programming each year, with each file going through dozens of automated workflows to get to the right people at the right time. This automation is often hampered, however, by a key challenge – identifying where each segment within the file begins or ends. Our technicians must first view the video file and then manually enter every timecode to enable automated processes like transcode and quality control. With the metadata from Amazon Rekognition Video, we now have the ability to make quick, automated decisions on content as soon as it arrives. Knowing where segments start or stop with data-informed timecodes enables earlier media supply chain decisions – like what length to make a preliminary screener that starts from the first frame after color bars or slate, eliminating slugs and ending before credits. This has the potential to help us improve the quality of our output, save hundreds of work-hours each year, and respond quickly in a highly-dynamic content marketplace.”
Nomad is a cloud-native intelligent content management platform built on AWS serverless architecture, which seamlessly merges content and asset management with the power of AI/ML into one unified system.
“The Nomad Platform leverages video shot and segment level analysis for detecting, generating, and searching rich metadata for objects, persons, labels, dialogue and screen text. Analyzing the video and detecting the discrete shots accurately has been very challenging, and up to this point, we’ve used an in-house custom shot analyzer to separate the video into the searchable segments. With the new Amazon Rekognition Video features for media analysis, our shot detection accuracy has doubled, and we get the added benefit of detecting other segment types like black frames and end credits automatically. Higher shot detection accuracy and newly detectable segment types in the Nomad Platform allows us to greatly improve the user search experience and substantially reduce customer costs by avoiding additional metadata processing that was required previously.”
Promomii is an AI powered video logging and promo generation software company that helps creatives maximize the potential of their videos.
“Editors and producers in the broadcasting and creative video industry spend huge amounts of time going through large volumes of video footage to produce content. This process is monotonous, time-consuming and expensive. Promomii aims to streamline such labor-intensive work by providing accurate and thorough video analysis for our clients, so that they can allocate more resources towards creative work. By combining Amazon Rekognition Video features such as shot detection with PromoMii’s own algorithms, we can quickly and easily provide editors with the most interesting or valuable visual shots during their creative process and help them sell the content better in less time.”
Synchronized transforms passive, linear video into ‘Smart-Video’. Our artificial intelligence engine understands the content and context of a video and enriches it with metadata. This metadata then frees the video from linearity, making it fully interactive and as powerful as hypertext to meet the demands and expectations of the digital world.
“Today, television channels, driven by the demands of digital consumers, need to adapt traditional, long-form content produced for linear TV into short-form segments to satisfy online consumption. Segmenting and clipping content editorially is important for broadcasters so viewers can directly access the parts that are interesting to them. The Synchronized platform automates the full workflow required to segment, clip and distribute video content for broadcasters. However, accurate, automatic transformation of audiovisual content into editorial segments is an extremely complex task requiring layers of different techniques. But now, by combining Amazon Rekognition Video with our platform’s Smart-Segmentation service, we can significantly accelerate, streamline and automate the creation and delivery of clips accurately to TV editorial teams. They can then manipulate the segments without requiring specialists, and distribute them immediately. This process is not scalable if done manually. In addition, the ability to automatically detect end credits with Amazon Rekognition Video allows us to offer our customers a fully automated, turnkey solution to add features such as “Next Episode” buttons to their content catalogs.”
Getting Started
You can start using video segment detection APIs by downloading the latest AWS SDK. Please refer to our documentation for more details on the API and code samples.
If you want to visualize the results of media analysis or even try out other Amazon AI services like Amazon Transcribe with your own videos, don’t forget to check out the Media Insights Engine (MIE) – a serverless framework to easily generate insights and develop applications for your video, audio, text, and image resources, using AWS Machine Learning and Media services. You can easily spin up your own MIE instance using the supplied AWS CloudFormation template, and then use the sample application linked in the ‘Outputs’ tab of your AWS CloudFormation console to try out your own videos and visualize analysis results. Here is what the MIE sample application console looks like:
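As an alternative to launching the template from the console, the stack can also be created programmatically. Here is a minimal sketch using boto3 and AWS CloudFormation; the stack name is arbitrary and the TemplateURL is a placeholder that should point to the template supplied in the MIE documentation.

import boto3

cloudformation = boto3.client("cloudformation")

# Placeholder: point this at the CloudFormation template supplied with MIE.
TEMPLATE_URL = "https://example-bucket.s3.amazonaws.com/media-insights-engine.template"

cloudformation.create_stack(
    StackName="media-insights-engine",
    TemplateURL=TEMPLATE_URL,
    Capabilities=["CAPABILITY_NAMED_IAM", "CAPABILITY_AUTO_EXPAND"],
)

# Wait for stack creation, then print the outputs, which include the sample application link.
cloudformation.get_waiter("stack_create_complete").wait(StackName="media-insights-engine")
stack = cloudformation.describe_stacks(StackName="media-insights-engine")["Stacks"][0]
for output in stack.get("Outputs", []):
    print(output["OutputKey"], "=", output["OutputValue"])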
Conclusion and Additional Resources
In this blog, we introduced new Amazon Rekognition Video features for media analysis, discussed key benefits, saw some examples for the detection of black frames, end credits, shot changes and color bars, outlined how it works and how customers are using it, and provided some API examples. To learn more, you can read our documentation and check out the Media Insights Engine.