AWS for M&E Blog

How to turn articles into videos using AWS Elemental MediaConvert and Amazon Polly

Digital media publishers are reinventing their user experiences to meet their customers’ needs. Especially when multitasking, on the go, or interacting with smart devices, readers are looking for alternative ways to consume content.

Infographics, photo stories, podcasts, audiobooks, visual narrations, and social media stories are only a few examples of emerging content types that adapt to readers’ more sophisticated expectations.

In a previous post on the AWS Machine Learning Blog, we showed how you can voice a blog post automatically with Trinity Audio’s WordPress plugin. Building on that, in this guide we show you how you can turn articles into visual narrations and provide teasers shareable on social media platforms by using Amazon Web Services (AWS).

Turn articles into visual narrations and provide teasers shareable on social media platforms

The sample article we’re going to use in this example

This is the result of processing an article to transform it into a social media story (shown here as an animated GIF)

This post is the first of a two-part series. In this first part, we look at how content is acquired, processed, and made available for consumption. In the second part, we look at how to extend the solution to monetize the content using AWS Elemental MediaTailor, a channel assembly and personalized ad insertion service.

Services used

Before diving deep into the solution, let’s get familiar with the services we use.

AWS CDK

The AWS Cloud Development Kit (AWS CDK) is an open-source software development framework you can use to define your cloud application resources using familiar programming languages. In this series, we use AWS CDK to define the infrastructure for our solution, using TypeScript.
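
The following is a minimal sketch, in TypeScript, of how the core resources of a stack like this could be declared with AWS CDK. The construct names shown here (such as PollyAssetStore and PollyMetadata) are illustrative placeholders, not necessarily the names used in the sample repository.

import { Stack, StackProps, RemovalPolicy } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';

export class PollyPreviewStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Asset store: article JSON, audio, images, and video outputs land here.
    new s3.Bucket(this, 'PollyAssetStore', {
      removalPolicy: RemovalPolicy.DESTROY,
      autoDeleteObjects: true,
    });

    // Metadata store: one item per ingested article.
    new dynamodb.Table(this, 'PollyMetadata', {
      partitionKey: { name: 'AssetId', type: dynamodb.AttributeType.STRING },
      billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
    });
  }
}

A minimal AWS CDK stack sketch with an asset bucket and a metadata table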

Amazon Polly

Amazon Polly is a service that turns text into lifelike speech, making it simple for you to create applications that talk and to build entirely new categories of speech-enabled products. Amazon Polly’s text-to-speech service uses advanced deep learning technologies to synthesize natural-sounding human speech. With dozens of lifelike voices across a broad set of languages, you can build speech-activated applications that work in many different countries. We use Amazon Polly to bring your text to life in the visual narration.
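
As a rough sketch of how the workflow could hand the article text to Amazon Polly, the following TypeScript snippet (AWS SDK for JavaScript v3) starts an asynchronous speech synthesis task that writes the resulting MP3 directly to Amazon S3. Bucket, prefix, and variable names are placeholders.

import { PollyClient, StartSpeechSynthesisTaskCommand } from '@aws-sdk/client-polly';

const polly = new PollyClient({});
const articleBodyText = '...'; // plain text extracted from the article

// Start an asynchronous task: Amazon Polly writes the MP3 to S3 when it finishes,
// and that upload drives the next step of the workflow.
const result = await polly.send(new StartSpeechSynthesisTaskCommand({
  Engine: 'neural',
  LanguageCode: 'en-US',                       // detected earlier with Amazon Comprehend
  VoiceId: 'Joanna',
  OutputFormat: 'mp3',
  OutputS3BucketName: 'my-asset-store-bucket', // placeholder bucket name
  OutputS3KeyPrefix: 'audio/full/',
  Text: articleBodyText,
}));

console.log('Synthesis task id:', result.SynthesisTask?.TaskId);

Starting an asynchronous Amazon Polly speech synthesis task that outputs to Amazon S3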

AWS Elemental MediaConvert

AWS Elemental MediaConvert is a file-based video transcoding service with broadcast-grade features that makes it simple to create video-on-demand content for broadcast and multiscreen delivery at scale. In this example, we use MediaConvert to assemble the visual contents and the audio generated by Amazon Polly. We also use MediaConvert to prepare the content for streaming by generating an HTTP Live Streaming (HLS) playlist that browsers and apps can stream.
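
The following is a heavily abbreviated sketch of submitting a MediaConvert job with the AWS SDK for JavaScript v3. The endpoint, role, S3 paths, and settings are placeholders, and the output definitions (codecs, bitrates, resolutions) are omitted; the actual job settings used by the sample are considerably more detailed.

import { MediaConvertClient, CreateJobCommand } from '@aws-sdk/client-mediaconvert';

// MediaConvert uses an account-specific endpoint, discoverable with the
// DescribeEndpoints API or from the console. All values below are placeholders.
const mediaConvert = new MediaConvertClient({
  endpoint: 'https://abcd1234.mediaconvert.eu-west-1.amazonaws.com',
});

await mediaConvert.send(new CreateJobCommand({
  Role: 'arn:aws:iam::123456789012:role/MediaConvertRole',
  Settings: {
    Inputs: [{
      FileInput: 's3://my-asset-store-bucket/input/base-video.mp4',
      AudioSelectors: {
        'Audio Selector 1': {
          // Replace the input audio with the narration generated by Amazon Polly.
          ExternalAudioFileInput: 's3://my-asset-store-bucket/audio/full/narration.mp3',
        },
      },
    }],
    OutputGroups: [{
      OutputGroupSettings: {
        Type: 'HLS_GROUP_SETTINGS',
        HlsGroupSettings: {
          Destination: 's3://my-asset-store-bucket/output/full/hls/',
          SegmentLength: 10,
          MinSegmentLength: 0,
        },
      },
      Outputs: [], // H.264 video and AAC audio renditions omitted for brevity
    }],
  },
}));

An abbreviated sketch of a MediaConvert CreateJob call targeting an HLS output group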

Amazon Comprehend

Amazon Comprehend is a natural-language processing service that uses machine learning to uncover valuable insights and connections in text. We use Amazon Comprehend to extract keywords and labels from the text of an article body and feed them to our ad decision server to provide contextual ads.
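
As a sketch of the two Amazon Comprehend calls this workflow relies on, the following detects the dominant language of the article (used to pick the Amazon Polly voice) and extracts key phrases (used for contextual ads). Variable names are placeholders.

import {
  ComprehendClient,
  DetectDominantLanguageCommand,
  DetectKeyPhrasesCommand,
} from '@aws-sdk/client-comprehend';

const comprehend = new ComprehendClient({});
const articleBodyText = '...'; // plain text extracted from the article

// Infer the dominant language so the right Amazon Polly voice can be selected.
const langResult = await comprehend.send(
  new DetectDominantLanguageCommand({ Text: articleBodyText })
);
const languageCode = langResult.Languages?.[0]?.LanguageCode ?? 'en';

// Extract key phrases to feed the ad decision server for contextual ads.
const phrasesResult = await comprehend.send(
  new DetectKeyPhrasesCommand({ Text: articleBodyText, LanguageCode: languageCode })
);
const keyPhrases = phrasesResult.KeyPhrases?.map((phrase) => phrase.Text) ?? [];

Detecting the article language and key phrases with Amazon Comprehend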

Other components

HLS

HTTP Live Streaming (HLS) is an HTTP adaptive bitrate (ABR) streaming communications protocol. To host and stream using HLS, you only need a conventional web server: the media files are split into downloadable segments and organized in playlists (M3U8 files).

The user’s player downloads the playlist first and then proceeds to download segments as the user progresses through the media. HLS supports the H.264 video codec and audio in AAC, MP3, AC-3, or EC-3, encapsulated in MPEG-2 transport streams. The following is an example of an M3U8 file:

#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:11
#EXT-X-MEDIA-SEQUENCE:1
#EXT-X-PLAYLIST-TYPE:VOD
#EXTINF:11,
mediafile_00001.ts
#EXTINF:10,
mediafile_00002.ts
#EXTINF:11,
mediafile_00003.ts
#EXTINF:10,
mediafile_00004.ts
#EXTINF:11,
mediafile_00005.ts
...

An extract from an M3U8 file describing an HLS playlist

WordPress content

For this code sample, we create our content using WordPress, a content management system (CMS) widely adopted by customers globally. We scrape content from the Accelerated Mobile Pages (AMP)–rendered webpage because it offers a more predictable HTML structure. For production systems, we recommend you use CMS APIs to gather the required content in a more robust way.
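
As a rough illustration of why the AMP rendering is convenient to scrape, the following TypeScript sketch fetches an AMP page and pulls out the image URLs and paragraph text with cheerio. The URL and selectors are simplified assumptions; the sample repository’s scraper is more thorough.

import * as cheerio from 'cheerio';

// Fetch the AMP-rendered article (URL is a placeholder).
const response = await fetch('https://example.wordpress.com/2021/04/29/my-article/amp/');
const html = await response.text();
const $ = cheerio.load(html);

// AMP pages expose images as <amp-img> elements with predictable attributes.
const imageUrls = $('amp-img')
  .map((_, el) => $(el).attr('src'))
  .get();

// Collect the article body as plain text from its paragraphs.
const articleBodyText = $('article p')
  .map((_, el) => $(el).text())
  .get()
  .join('\n');

A simplified sketch of scraping text and image URLs from an AMP-rendered article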

FFmpeg

IMPORTANT LEGAL NOTICE: This solution uses FFmpeg to analyze and manipulate the low-level visual and audio features of the uploaded media files. FFmpeg is a free and open-source software suite for handling video, audio, and other multimedia files and streams. FFmpeg is distributed under the GNU Lesser General Public License (LGPL). For more information about FFmpeg, please see the following link: https://www.ffmpeg.org/. Your use of this solution means you will use FFmpeg. If you do not want to use FFmpeg, do not use this solution.
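
As an example of the kind of manipulation FFmpeg performs in this solution, the following TypeScript sketch shells out to FFmpeg to cut a 30-second audio preview from the full narration. Paths are placeholders, and an FFmpeg binary is assumed to be available to the Lambda function (for example, through a Lambda layer).

import { execFile } from 'node:child_process';
import { promisify } from 'node:util';

const run = promisify(execFile);

// Cut a 30-second preview from the full Polly narration downloaded to /tmp.
await run('ffmpeg', [
  '-i', '/tmp/full-narration.mp3', // input: full narration
  '-t', '30',                      // keep only the first 30 seconds
  '-y',                            // overwrite the output if it already exists
  '/tmp/preview.wav',
]);

Cutting a 30-second audio preview with FFmpeg from a Node.js Lambda function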

Prerequisites

To follow along with this post, you need an AWS account.

Prior experience with AWS CDK, Python, and JavaScript is advised but not required.

Solution overview

The solution’s choreography diagram

In this post, we propose an event-driven choreographic workflow that ingests an article to produce a visual narrative by:

  • Voicing the article using Amazon Polly
  • Using the images attached to the post to produce a video slideshow

The workflow makes use of Amazon Comprehend to:

  • Infer the written language of the article so the workflow can choose the right voice for Amazon Polly to use to read the article
  • Extract key phrases and words that the workflow uses to produce a VMAP manifest to feature context-aware ads. For simplicity, we only use pre-roll and post-roll ad insertion, but you can learn how to find mid-roll opportunities in this previous post. We dive deep into the monetization aspects in the second post of this series, so stay tuned!

The input to the workflow is a URL to a WordPress AMP–rendered article. The workflow outputs the following:

  • A 30-second video teaser of your article, which you can share on social media: this is an MP4 file stored on Amazon Simple Storage Service (Amazon S3), an object storage service offering industry-leading scalability, data availability, security, and performance.
  • A full video narration of variable length (depending on how much text is in the article) in HLS format (M3U8 manifest file and related segments) stored in an Amazon S3 bucket
  • A set of text, media, and image files used to create the videos stored in an Amazon S3 bucket—more details to follow
  • A set of metadata generated by the consumption of the article, conveniently stored in a table in Amazon DynamoDB, a fully managed, serverless, key-value NoSQL database. The metadata generated is as follows:
{
    "AssetId": "2af6fa09-078a-410e-a84b-969c0182119f.json",
    "Engine": "neural",
    "PreviewVideoFile": "s3://pollypreviewsimplestack-pollyassetstore920ee247-1pvn3ec82d905/output/preview/2af6fa09-078a-410e-a84b-969c0182119f.mp4",
    "ArticlePath": "s3://pollypreviewsimplestack-pollyassetstore920ee247-1pvn3ec82d905/text/2af6fa09-078a-410e-a84b-969c0182119f.json",
    "PostProducedImagesS3Paths": [
        "s3://pollypreviewsimplestack-pollyassetstore920ee247-1pvn3ec82d905/image/output/2af6fa09-078a-410e-a84b-969c0182119f.json/ferrari_01.jpg.tga",
        "s3://pollypreviewsimplestack-pollyassetstore920ee247-1pvn3ec82d905/image/output/2af6fa09-078a-410e-a84b-969c0182119f.json/ferrari_04.jpg.tga",
        "s3://pollypreviewsimplestack-pollyassetstore920ee247-1pvn3ec82d905/image/output/2af6fa09-078a-410e-a84b-969c0182119f.json/ferrari_02.jpg.tga",
        "s3://pollypreviewsimplestack-pollyassetstore920ee247-1pvn3ec82d905/image/output/2af6fa09-078a-410e-a84b-969c0182119f.json/ferrari_05.jpg.tga"
    ],
    "Bucket": "pollypreviewsimplestack-pollyassetstore920ee247-1pvn3ec82d905",
    "Url": "https://giusedroid.wordpress.com/2021/04/29/a-brief-history-of-ferrari/amp/",
    "LanguageCode": "en-US",
    "FullVideoStream": "s3://pollypreviewsimplestack-pollyassetstore920ee247-1pvn3ec82d905/output/full/hls/2af6fa09-078a-410e-a84b-969c0182119f/template.m3u8",
    "FullNarration": "s3://pollypreviewsimplestack-pollyassetstore920ee247-1pvn3ec82d905/audio/full/2af6fa09-078a-410e-a84b-969c0182119f.json/.5678d4db-4116-4201-b5d7-529b6f2e42e0.mp3",
    "VoiceId": "Joanna",
    "FullNarrationDurationInSeconds": 212.297167,
    "AudioPreview": "s3://pollypreviewsimplestack-pollyassetstore920ee247-1pvn3ec82d905/audio/preview/2af6fa09-078a-410e-a84b-969c0182119f.json/.5678d4db-4116-4201-b5d7-529b6f2e42e0.wav",
    "ImagesURLs": [
        "https://giusedroid.files.wordpress.com/2021/04/ferrari_01.jpg",
        "https://giusedroid.files.wordpress.com/2021/04/ferrari_04.jpg",
        "https://giusedroid.files.wordpress.com/2021/04/ferrari_02.jpg",
        "https://giusedroid.files.wordpress.com/2021/04/ferrari_05.jpg"
    ]
}

This is a sample of the dataset created by the ingestion of an article. This is stored as an item in the Amazon DynamoDB table that is deployed by the solution.

Choreography pattern

To keep things simple for this post, we opted for a serverless, event-driven architecture following the choreography pattern. We use the Amazon S3 Event Notification feature, which lets you receive notifications when certain events happen in your S3 bucket, to drive the workflow. Each step in the choreography is invoked by a PUT event triggered by an upload under a specific path in our asset store. All of the functions that compose the workflow are invoked asynchronously, with the exception of the one invoked by Amazon API Gateway at the beginning of the workflow. Amazon API Gateway is a fully managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale. The following is a diagram illustrating the event sequence:

A visual representation of the workflow choreography

Here is a description of the workflow:

  • The workflow starts with an AWS Lambda function that ingests an article from its URL, produces a JSON document, and uploads it to Amazon S3.
  • This event invokes another AWS Lambda function, which asynchronously invokes Amazon Polly to start a speech synthesis task.
  • Once the synthesis task completes, Amazon Polly uploads an MP3 file to Amazon S3, which invokes a dedicated AWS Lambda function to extract a 30-second preview.
  • The next step downloads and preprocesses the images attached to the article.
  • The next step kicks off two parallel MediaConvert jobs.
  • Finally, when MediaConvert writes the media assets to Amazon S3, the last AWS Lambda function is invoked to update the metadata store with the paths to the assets.
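
Wiring one of these steps takes only a few lines of AWS CDK. The sketch below, with illustrative construct and prefix names, subscribes a Lambda function to PUT events under a given prefix of the asset bucket; every hop in the choreography follows the same pattern.

import * as s3 from 'aws-cdk-lib/aws-s3';
import * as s3n from 'aws-cdk-lib/aws-s3-notifications';
import * as lambda from 'aws-cdk-lib/aws-lambda';

// Inside the stack: invoke the synthesis function whenever an article JSON
// document is uploaded under the text/ prefix of the asset store.
declare const assetStore: s3.Bucket;
declare const startSynthesisFn: lambda.Function;

assetStore.addEventNotification(
  s3.EventType.OBJECT_CREATED_PUT,
  new s3n.LambdaDestination(startSynthesisFn),
  { prefix: 'text/' }
);

Subscribing a Lambda function to PUT events under a prefix of the asset bucket with AWS CDK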

Each step in the choreography writes updates to the Amazon DynamoDB table deployed by the solution. This table is used as an asset metadata management store, whereas the Amazon S3 bucket is meant for media files and complex metadata storage.
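
A minimal sketch of such an update, using the DynamoDB Document client from the AWS SDK for JavaScript v3, might look like the following. The table name and attribute values are placeholders taken from the sample item above.

import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, UpdateCommand } from '@aws-sdk/lib-dynamodb';

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Merge this step's outputs into the article's metadata item.
await ddb.send(new UpdateCommand({
  TableName: 'PollyMetadata', // placeholder table name
  Key: { AssetId: '2af6fa09-078a-410e-a84b-969c0182119f.json' },
  UpdateExpression: 'SET VoiceId = :v, FullNarrationDurationInSeconds = :d',
  ExpressionAttributeValues: { ':v': 'Joanna', ':d': 212.297167 },
}));

Updating the asset metadata item in Amazon DynamoDB after a step completes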

Get the guide

Download the comprehensive guide on how to build and operate this code sample here.

Conclusion

In this post, we shared an overview of how you can turn articles into videos by using a fully automated editorial workflow built using AWS services.

You can download a sample preview result here. This is the result of the ingestion of this article.

We invite you to clone this repository and start experimenting to tune the solution for your needs. Let us know in the comments what you think of this approach and feel free to use the repository issues section to raise requests or ask questions. Stay tuned for the next post, where we use some of the assets produced by this workflow to monetize our automatically generated content.

Giuseppe Battista

Giuseppe Battista is a Solutions Architect in the UK Media, Entertainment and Telco Team. With more than 10 years of experience as a software developer and architect, he helps customers in the publishing industry on their journey on AWS. You can find him on Twitter as @giusedroid.

Giuseppe Borasio

Giuseppe Borasio is a Sr. Solutions Architect at Amazon Web Services based in Italy and focused on Media & Entertainment. He has 18+ years of experience in the IT industry and is passionate about technology. In his spare time, Giuseppe is a big fan of music, traveling, and enjoying time with his family. You can find him on Twitter as @borasiog.