AWS for M&E Blog

Automate media content filtering with AWS

The recent boom in content creation, driven by the growth of social media platforms, has elevated the need for effective, automated content moderation tools. One area of particular concern is the detection and removal of inappropriate language from video content. In this blog post, we demonstrate a solution that automates inappropriate language filtering in video using Amazon Transcribe, AWS Lambda, and AWS Elemental MediaConvert. We also provide sample code built with the AWS Cloud Development Kit (AWS CDK) to deploy the solution.

How it works

Figure 1: Architecture diagram of the proposed solution workflow.

The figure above shows the architecture of the proposed solution, which operates as follows:

1/ A video file is uploaded to the Ingest bucket, an Amazon Simple Storage Service (Amazon S3) bucket. The upload triggers a Lambda function that creates a new MediaConvert processing job. MediaConvert extracts an audio-only file (audio-proxy) from the video and stores it in the Proxy S3 bucket.
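
The following is a minimal sketch of that first Lambda function using boto3; the role ARN, bucket names, and output settings are illustrative placeholders rather than the exact settings used in the sample code.

Python:

import boto3

def handler(event, context):
    # Derive the source file URI from the S3 event notification
    record = event["Records"][0]["s3"]
    source_uri = f"s3://{record['bucket']['name']}/{record['object']['key']}"

    # MediaConvert API calls go through an account-specific endpoint
    endpoint = boto3.client("mediaconvert").describe_endpoints()["Endpoints"][0]["Url"]
    mediaconvert = boto3.client("mediaconvert", endpoint_url=endpoint)

    # Create a job that outputs an audio-only proxy (AAC in an MP4 container)
    mediaconvert.create_job(
        Role="arn:aws:iam::111122223333:role/MediaConvertRole",  # placeholder
        Settings={
            "Inputs": [{
                "FileInput": source_uri,
                "AudioSelectors": {"Audio Selector 1": {"DefaultSelection": "DEFAULT"}},
            }],
            "OutputGroups": [{
                "OutputGroupSettings": {
                    "Type": "FILE_GROUP_SETTINGS",
                    "FileGroupSettings": {"Destination": "s3://proxy-bucket/audio/"},  # placeholder
                },
                "Outputs": [{
                    "ContainerSettings": {"Container": "MP4"},
                    "AudioDescriptions": [{
                        "CodecSettings": {
                            "Codec": "AAC",
                            "AacSettings": {
                                "Bitrate": 96000,
                                "CodingMode": "CODING_MODE_2_0",
                                "SampleRate": 48000,
                            },
                        },
                    }],
                }],
            }],
        },
    )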

2/ Upon completion of the MediaConvert job, another Lambda function is triggered to create a transcription job for the audio-proxy in Amazon Transcribe. Using the vocabulary filtering feature, Transcribe generates a transcription file in which inappropriate words are masked with three asterisks (***). The transcription file, formatted in JSON (JavaScript Object Notation), lists every transcript word with its start and end times in the audio. The Transcribe job also produces a subtitles file. Both the transcription and subtitles files are stored in the Proxy bucket.

Note: The accuracy of the solution relies on the accuracy of the Amazon Transcribe service and on the list of inappropriate words used in the vocabulary filter.
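
The transcription job itself is started through the Amazon Transcribe API. Here is a minimal sketch of that call using boto3; the job name, S3 locations, and vocabulary filter name are placeholders, and the language code would match your content.

Python:

import boto3

transcribe = boto3.client("transcribe")

transcribe.start_transcription_job(
    TranscriptionJobName="audio-proxy-filter-job",                      # placeholder
    Media={"MediaFileUri": "s3://proxy-bucket/audio/audio-proxy.mp4"},  # placeholder
    LanguageCode="en-US",
    OutputBucketName="proxy-bucket",                                    # placeholder
    Subtitles={"Formats": ["srt"]},
    Settings={
        "VocabularyFilterName": "inappropriate-words-en",               # placeholder
        "VocabularyFilterMethod": "mask",  # masks matched words with *** in the transcript
    },
)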

Here are examples of how masked and non-masked words are represented in the transcription file:

Example of the word “Welcome”:

JSON:

{"start_time":"8.859","end_time":"9.449","alternatives":[{"confidence":"0.999","content":"Welcome"}],"type":"pronunciation"}

Example of a masked inappropriate language word (***):

JSON:

{"start_time":"355.76","end_time":"356.2","alternatives":[{"confidence":"0.995","content":"***"}],"type":"pronunciation"}

3/ After the Transcribe job completes, another Lambda function is triggered to modify the audio-proxy file. The function searches the transcription file for occurrences of “***”, which represent masked inappropriate language, and uses the Pydub Python module to overwrite those occurrences in the audio based on their start and end times. In the sample code, we replace inappropriate words in the audio with a beep sound that we provide. However, the code can easily be updated to insert any other sound, or simply silence, to mute inappropriate language.
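
Here is a minimal sketch of that lookup over the transcript JSON shown above; the function name is illustrative.

Python:

import json

def masked_word_ranges(transcript_path):
    """Return (start, end) times, in seconds, of every masked word in the transcript."""
    with open(transcript_path) as f:
        items = json.load(f)["results"]["items"]
    return [
        (float(item["start_time"]), float(item["end_time"]))
        for item in items
        if item["type"] == "pronunciation" and item["alternatives"][0]["content"] == "***"
    ]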

Note: The Pydub module is licensed under the MIT License (MIT) as mentioned on the module’s page.

It’s worth mentioning that if no inappropriate language is found in the transcription file, the audio processing step is skipped. After inappropriate language is filtered from the audio, Lambda uploads the new, filtered copy of the audio file to the Proxy bucket.
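
The audio processing itself takes only a few Pydub calls. The sketch below splices a beep over each masked-word time range; the file paths, export format, and the helper from the previous sketch are assumptions for illustration.

Python:

from pydub import AudioSegment

def bleep_audio(audio_path, beep_path, ranges, output_path):
    audio = AudioSegment.from_file(audio_path)
    beep = AudioSegment.from_file(beep_path)
    for start_s, end_s in ranges:
        start_ms, end_ms = int(start_s * 1000), int(end_s * 1000)
        duration = end_ms - start_ms
        # Trim or pad the beep so it covers the masked word exactly
        if len(beep) >= duration:
            patch = beep[:duration]
        else:
            patch = beep + AudioSegment.silent(duration - len(beep))
        # Replace the original segment with the beep (use AudioSegment.silent(duration) to mute instead)
        audio = audio[:start_ms] + patch + audio[end_ms:]
    audio.export(output_path, format="wav")  # export format chosen for illustration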

4/ Lastly, the processing Lambda function creates a new MediaConvert job to put the final media asset together. The MediaConvert job joins the video from the source file uploaded in step 1, the subtitles produced by Amazon Transcribe in step 2, and the filtered audio file produced in step 3. In our sample code, and for demonstration purposes, MediaConvert produces an HTTP Live Streaming (HLS) asset that can be played back through the Amazon CloudFront content delivery network (CDN). However, the MediaConvert job settings can be updated to fit other workflows.
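
To illustrate how the pieces come together, here is a sketch of the Inputs portion of that final MediaConvert job. The S3 paths are placeholders; the actual job settings in the sample code also include an Apple HLS output group.

Python:

job_inputs = [{
    # Video comes from the original source file in the Ingest bucket
    "FileInput": "s3://ingest-bucket/source-video.mp4",
    "AudioSelectors": {
        "Audio Selector 1": {
            "DefaultSelection": "DEFAULT",
            # Use the filtered audio-proxy instead of the audio embedded in the source
            "ExternalAudioFileInput": "s3://proxy-bucket/audio/audio-proxy-filtered.wav",
        },
    },
    "CaptionSelectors": {
        "Captions Selector 1": {
            "SourceSettings": {
                "SourceType": "SRT",
                # Subtitles file produced by Amazon Transcribe in step 2
                "FileSourceSettings": {"SourceFile": "s3://proxy-bucket/subtitles/audio-proxy.srt"},
            },
        },
    },
}]

These inputs, combined with an HLS output group, would then be passed to MediaConvert’s create_job call as part of the Settings parameter.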

Sample code repository

The sample code for the solution is available on GitHub with a step-by-step guide that explains how to deploy the solution to your AWS account using AWS CDK.

The repository’s README file also provides guidance on how to create and configure vocabulary filters in Amazon Transcribe for the languages you need to support.

Testing the solution

For demonstration purposes, we assembled a list of some slightly objectionable words and created the vocabulary filter in Amazon Transcribe with that list:

jerk
freaky
friky
fricky
freaked
fricked
friked
freak
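
A vocabulary filter with this list can be created with a single API call. Here is a minimal sketch using boto3; the filter name is a placeholder.

Python:

import boto3

transcribe = boto3.client("transcribe")

transcribe.create_vocabulary_filter(
    VocabularyFilterName="demo-profanity-filter-en",  # placeholder name
    LanguageCode="en-US",
    Words=["jerk", "freaky", "friky", "fricky", "freaked", "fricked", "friked", "freak"],
)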

We then tested the solution with the open-source “Tears of Steel” movie [(CC) Blender Foundation | mango.blender.org]. Here is a short clip demo showing before and after filtering:

Additional improvements

Additional improvements being considered for the solution include:

  • Support for 5.1 audio and other audio formats; the code has been tested with stereo audio only.
  • Support for videos that require detection of multiple languages; currently, only one language per video file is filtered.
  • A configuration parameter to switch between beeping and muting inappropriate language in the audio.

If you are interested in any of the above improvements or have other requirements, please let us know by creating an issue in the GitHub project.

Conclusion

In this post, we presented a solution to filter inappropriate language from media content using Amazon Transcribe, AWS Lambda, and AWS Elemental MediaConvert. The solution includes AWS CDK sample code that can be deployed in your AWS account.

Additional resources

Amazon Transcribe Developer Guide: Using custom vocabulary filters to delete, mask, or flag words (https://docs.aws.amazon.com/transcribe/latest/dg/vocabulary-filtering.html)

AWS Media Blog: Use machine learning to filter user-generated content and protect your brand (https://aws.amazon.com/blogs/media/using-ml-to-filter-user-generated-content-to-protect-your-brand/)

Abbas Nemr

Abbas Nemr is a Senior Solutions Architect with the Amazon Web Services Telco team in Canada. Abbas specializes in media workloads including ad insertion, live video streaming, and video-on-demand (VOD) processing and automation.