AWS Machine Learning Blog

How Süddeutsche Zeitung optimized their audio narration process with Amazon Polly

This is a guest post by Jakob Kohl, a Software Developer at the Süddeutsche Zeitung. Süddeutsche Zeitung is one of the leading quality dailies in Germany when it comes to paid subscriptions and unique users. Its website, SZ.de, reaches more than 15 million monthly unique users as of October 2021.

Thanks to smart speakers and podcasts, the audio industry has experienced a real boom in recent years. At Süddeutsche Zeitung, we’re constantly looking for new ways to make our diverse journalism even more accessible. As pioneers in digital journalism, we want to open up more opportunities for Süddeutsche Zeitung readers to consume articles. We started looking for solutions that could provide high-quality audio narration for our articles. Our ultimate goal was to launch a “listen to the article” feature.

In this post, we share how we optimized our audio narration process with Amazon Polly, a service that turns text into lifelike speech using advanced deep learning technologies.

Why Amazon Polly?

We believe that Vicki, the German neural Amazon Polly voice, is currently the best German voice on the market. Amazon Polly can also switch between languages within a text, so it correctly pronounces, for example, English movie titles and personal names from other languages (for an example, listen to the article Schall und Wahn on our website).

A big part of our infrastructure already runs on AWS, so using Amazon Polly was a perfect fit. We can combine Amazon Polly with the following components:

  • An Amazon Simple Notification Service (Amazon SNS) topic to which we can subscribe for articles. The articles are sent to this topic by the CMS whenever they’re saved by an editor.
  • An Amazon CloudFront distribution with Lambda@Edge to paywall premium articles, which we can reuse for audio versions of articles.

The Amazon Polly API is easy to use and well documented. It took us less than a week to get our proof of concept to work.

The challenge

Hundreds of new articles are published every day on SZ.de. After initial publication, they might get updated several times for various reasons—new paragraphs are added in news-driven articles, typos are fixed, teasers are changed, or metadata is optimized for search engines.

Generating speech for the initial publication of an article is straightforward, because the whole text needs to be synthesized. But how can we quickly generate the audio for updated versions of articles without paying twice for the same content? Our biggest challenge was to prevent sending the whole text to Amazon Polly repeatedly for every single update.

Our technical solution

Every time an editor saves an article, the new version of the article is published to an SNS topic. An AWS Lambda function is subscribed to this topic and called for every new version of an article. This function runs the following steps:

  1. Check if the new version of the article has already been completely synthesized. If so, the function stops immediately (this may happen when the only change is to metadata that doesn’t affect the audio).
  2. Convert the article into multiple SSML documents, roughly one for each text paragraph.
  3. For each SSML document, use calculated hashes to check whether it has already been synthesized to audio. For example:
    1. If an article is saved for the first time, all SSML documents must be synthesized.
    2. If a typo has been fixed in a single paragraph, only the SSML document for this paragraph must be re-synthesized.
    3. If a new paragraph is added to the article, only the SSML document for this new paragraph must be synthesized.
  4. Send all not-yet-synthesized SSML documents separately to Amazon Polly.
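Steps 2 and 3 can be sketched roughly as follows. This is a minimal illustration, not our production code: the function names, the SHA-256 hash, the 500 ms break, and the idea of keeping the already-synthesized hashes in a simple set are all assumptions made for the example.

```python
import hashlib
from xml.sax.saxutils import escape


def paragraph_to_ssml(paragraph: str) -> str:
    """Wrap one paragraph in a minimal SSML document with a trailing pause."""
    # The 500 ms break is an illustrative value, not taken from the article.
    return f'<speak><p>{escape(paragraph)}</p><break time="500ms"/></speak>'


def ssml_hash(ssml: str) -> str:
    """Content hash that identifies a synthesized fragment (assumed: SHA-256)."""
    return hashlib.sha256(ssml.encode("utf-8")).hexdigest()


def plan_synthesis(paragraphs, known_hashes):
    """Return the ordered fragment hashes and the SSML documents still to synthesize.

    known_hashes is a hypothetical set of hashes whose audio already exists.
    """
    ordered, pending = [], {}
    for paragraph in paragraphs:
        ssml = paragraph_to_ssml(paragraph)
        digest = ssml_hash(ssml)
        ordered.append(digest)
        if digest not in known_hashes:
            pending[digest] = ssml
    return ordered, pending
```

On the first save of an article, `known_hashes` is empty and every paragraph ends up in `pending`. After a typo fix in one paragraph, only that paragraph produces a new hash, so only one SSML document is pending. If `pending` is empty, the whole version has already been synthesized, which corresponds to the early exit in step 1.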

These checks help optimize performance and reduce cost by preventing the synthesis of an entire article multiple times. We avoid incurring additional charges due to minor changes such as a title edit or metadata adjustments for SEO reasons.
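Sending the remaining SSML documents to Amazon Polly (step 4) might look like the sketch below. It assumes the asynchronous `StartSpeechSynthesisTask` API, which writes its output directly to an S3 bucket; the function name, the bucket and prefix layout, and the MP3 output format are hypothetical choices for this example.

```python
def submit_fragments(polly_client, pending, bucket, prefix):
    """Start one asynchronous Polly task per SSML document.

    pending maps a fragment hash to its SSML text. Returns task IDs by hash.
    The key layout "<prefix>/<hash>" is an assumption for illustration.
    """
    task_ids = {}
    for digest, ssml in pending.items():
        response = polly_client.start_speech_synthesis_task(
            Engine="neural",
            LanguageCode="de-DE",
            VoiceId="Vicki",
            OutputFormat="mp3",
            TextType="ssml",
            Text=ssml,
            OutputS3BucketName=bucket,
            OutputS3KeyPrefix=f"{prefix}/{digest}",
        )
        task_ids[digest] = response["SynthesisTask"]["TaskId"]
    return task_ids
```

In a Lambda function, `polly_client` would typically be created once with `boto3.client("polly")`. Passing the client in as a parameter keeps the function easy to test with a stub.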

The following diagram illustrates the solution workflow.

After Amazon Polly synthesizes the SSML documents, the audio files are sent to an output bucket in Amazon Simple Storage Service (Amazon S3). A second Lambda function listens for object creation on that bucket, waits until all audio fragments of an article are complete, and merges them into a final audio file using FFmpeg from a Lambda layer. This final audio is sent to another S3 bucket, which is used as the origin in our CloudFront distribution. In CloudFront, we reuse the existing paywall for premium articles to protect the corresponding audio versions.
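The completeness check and the merge could be sketched as follows. The helper names and file paths are hypothetical, and the use of FFmpeg's concat demuxer with stream copy is an assumption; it works without re-encoding only if all fragments share the same codec parameters, and the simple quoting below assumes paths without single quotes.

```python
def missing_fragments(expected_hashes, available):
    """Return the hashes that have no audio yet; an empty list means merging can start."""
    return [h for h in expected_hashes if h not in available]


def concat_list_text(fragment_paths):
    """Build the contents of an FFmpeg concat list file, one fragment per line."""
    return "".join(f"file '{path}'\n" for path in fragment_paths)


def concat_command(list_path, output_path):
    """Return the FFmpeg command that merges the listed fragments without re-encoding."""
    return [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0",
        "-i", list_path,
        "-c", "copy",
        output_path,
    ]
```

The second Lambda function would write the list text to a temporary file, run the command (for example with `subprocess.run(cmd, check=True)`), and upload the result to the bucket behind CloudFront.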

Based on our freemium model, we provide a shortened audio version of premium articles. Non-subscribers are able to listen to the first paragraph for free, but are required to purchase a subscription to access the full article.
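Because the audio is stored as per-paragraph fragments, the shortened version can be assembled from existing pieces. The sketch below assumes a hypothetical separator sound file and a pre-synthesized subscription note; the function name and all paths are illustrative.

```python
def teaser_playlist(fragment_paths, separator_path, note_path):
    """Return the fragment order for the shortened version of a premium article:
    the first paragraph, a separator sound, then the subscription note."""
    if not fragment_paths:
        return []
    return [fragment_paths[0], separator_path, note_path]
```

The resulting list can be merged in the same way as the full article, so no paragraph has to be synthesized again for the teaser.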

Conclusion

Integration of Amazon Polly into our existing infrastructure was very straightforward. Our content requires minimal customization because we only include paragraphs and some additional breaks. The most challenging part was performance and cost optimization, which we achieved by splitting the article up into multiple SSML documents corresponding to paragraphs, checking for changes in each SSML document, and building the whole audio file by merging the fragments. With these optimizations, we are able to achieve the following:

  • Decrease the number of synthesized characters by at least 50% by only synthesizing real changes.
  • Reduce the time it takes for a change in the article text to appear in the audio because there is less audio to synthesize.
  • Add arbitrary audio files between paragraphs without re-synthesizing the whole article. For example, we can include a sound file in the shortened audio version of a premium article to separate the first paragraph from the ensuing note that a subscription is needed to listen to the full version.

In the first month after the launch of the “listen to the article” feature in our SZ.de articles, we received a lot of positive user feedback. We were able to reach almost 30,000 users during the first 2 months after launch. Of these users, approximately 200 converted to a paid subscription solely from listening to the teaser of an article behind our paywall. The “listen to the article” feature itself isn’t behind our paywall, but users can only listen to premium articles in full if they have a subscription. Our website also offers free articles without a paywall. In the future, we will expand the feature to other SZ platforms, especially our mobile news apps.


About the Author

Jakob Kohl is a Software Developer at the Süddeutsche Zeitung, where he enjoys working with modern technologies on an agile website team. He is one of the main developers of the “listen to an SZ article” feature. In his leisure time, he likes building wooden furniture, where technical and visual design is as important as in web development.