AWS Machine Learning Blog

How simpleshow uses Amazon Polly to voice stories in their explainer videos

More than ten years ago, simpleshow started to help their customers explain materials, ideas, and products by using three-minute animated explainer videos. These explainer videos use two hands and simple, black and white illustration to lead viewers through a story. Today, the company also provides mysimpleshow.com, a platform that allows anyone to produce high-quality explainer videos about virtually any topic. This platform is integrated with Amazon Polly, so anyone can use natural sounding voices for explainer videos, as long as transcripts are provided.

First I’ll tell you a bit more about simpleshow, and then I’ll show you how mysimpleshow is integrated with Amazon Polly.

Over the past ten years, simpleshow has scientifically proven the effectiveness of the explainer video format. simpleshow experts have helped customers present their topics in a simple and entertaining way in thousands of explainer videos.

The production of these videos requires many talents in the team:

  • Storytelling: Certified simpleshow concept writers create stories around basic facts.
  • Illustration: Talented artists illustrate objects and concepts at the right abstraction level.
  • Visualization: Storyboard artists and motion designers visualize the stories and animate them.
  • Voice: A network of professional speakers ensures the right tone.

The simpleshow team realized that explainer videos are a very versatile format, so they wanted to make the resource available to even more users in even more subject areas. Therefore, the simpleshow team created mysimpleshow.com, a platform that allows anyone to produce high-quality explainer videos about virtually any topic. mysimpleshow uses artificial intelligence (AI) and has an easy-to-use user interface..

The process at mysimpleshow is very simple:

  • First, users write their story. mysimpleshow provides guidance with templates and inspiration with sample stories that cover a broad selection of topics.
  • The text of the story is then analyzed by the artificial intelligence at the core of mysimpleshow—the Explainer Engine. The Explainer Engine uses natural language processing (NLP) to identify meaningful keywords, people, and places. Using Wikidata, the knowledge base behind Wikipedia, keyword terms are then generalized. For example, if the name of a tennis player or basketball player is present in the story, the Explainer Engine uses Wikidata to identify the profession of the person, as a result a tennis racket or a basketball is suggested as a suitable illustration.

    This means that even if an illustration hasn’t been created for the person in the story, the story is visualized in a highly fitting way. For location names in the story, the Explainer Engine finds out the number of inhabitants and offers a suitable skyline as an illustration.
  • At the click of a button, the Explainer Engine searches for the right image in the simpleshow database of all illustrations. All illustrations are tagged using a multi-tiered system.

simpleshow teams up with Amazon Polly

Spoken words are crucial for the transfer of knowledge in explainer videos. Most of the information is transmitted by voice. The illustrations and animations draw the user’s attention and support the storytelling for better understanding. As a result, the multisensory content is better retained than just a voice or animation alone.

mysimpleshow supports its users with another important component of an explainer video—it provides a computer-generated voice that reads the users’ stories. For reading the story, mysimpleshow uses Amazon Polly.

Why simpleshow uses Amazon Polly

simpleshow uses Amazon Polly for several reasons as the automated voice for explainer videos.

  • mysimpleshow is an AWS-based software as a service (SaaS), making extensive use of AWS Elastic Beanstalk, Amazon DynamoDB, Amazon Simple Workflow Service (SWF), Amazon Simple Queue Service (SQS), and other AWS services. The integration of mysimpleshow and Amazon Polly was straightforward.
  • With Amazon Polly, simpleshow was able to optimize the costs for text-to-speech. The team was able to significantly simplify maintenance and operations and improve scalability.
  • Amazon Polly supports many languages. mysimpleshow already exists in English and German. Amazon Polly voices are available for possible extension to many other languages.
  • The Amazon Polly voices are of high quality.
  • Amazon Polly allows customized pronunciation of words.

All of these reasons are important, but the last one especially stood out. The Amazon Polly pronunciation lexicons enable mysimpleshow to customize the pronunciation of words.

For example, in the German language, new words are often formed by putting together existing words. Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz (literally “beef labeling supervision duties delegation law”)[1] is a very long German compound word. Amazon Polly can handle these linguistic compounds very well on its own. However, in German, this type of word formation is also extended to words imported from other languages. For example, the name of the product—mysimpleshow— is a combination of the words my-simple-show. German-speaking users are advised to divide compounds derived from English words into the individual words. This usually significantly improves the pronunciation of these words.

Some code examples

The following code samples illustrate how mysimpleshow uses Amazon Polly.

mysimpleshow uses Simple Synthesis Markup Language (SSML) for requests to Amazon Polly. SSML gives the highest level of control over how the voices are rendered. In addition, the SSML representation is very useful for debugging purposes.

SSML

<speak><prosody volume="+20dB" rate="100%"><break time="500.0ms"/>This is Tom. He wants to buy a used car. So he starts browsing the internet.</prosody></speak>

In the first step, timings for the spoken words are requested from Amazon Polly. Timings for the keywords define when the illustration related to the keyword is placed in the scene. The spoken words basically define the timeline of the video. This is somewhat specific to Amazon Polly. Other TTS services may provide MP3 and timings together.

TTS Call Timings

import com.amazonaws.services.polly.AmazonPolly;
import com.amazonaws.services.polly.model.OutputFormat;
import com.amazonaws.services.polly.model.SynthesizeSpeechRequest;
import com.amazonaws.services.polly.model.SynthesizeSpeechResult;
import com.amazonaws.services.polly.model.TextType;
import com.mysimpleshow.backend.common.enums.Voice;
import com.mysimpleshow.backend.common.service.DictionaryService;

class TTSService {

    private DictionaryService dictionaryService;

    private AmazonPolly polly;

    public SynthesizeSpeechResult synthesizePolly(final String text, final Voice voice) {
        final SynthesizeSpeechRequest request = new SynthesizeSpeechRequest()
                .withOutputFormat(OutputFormat.Json)
                .withText(text)
                .withTextType(TextType.Ssml)
                .withVoiceId(voice.getVoiceId())
                .withLexiconNames(dictionaryService.getDictionaryNameForLocale(voice.getLocale());

        return polly.synthesizeSpeech(request);
    }
}

In the second step, the MP3 for the spoken words is generated. This is pretty much the same call as before – only the result is now MP3 instead of the JSON with the earlier timings.

TTS Call MP3

import com.amazonaws.services.polly.AmazonPolly;
import com.amazonaws.services.polly.model.OutputFormat;
import com.amazonaws.services.polly.model.SynthesizeSpeechRequest;
import com.amazonaws.services.polly.model.SynthesizeSpeechResult;
import com.amazonaws.services.polly.model.TextType;
import com.mysimpleshow.backend.common.enums.Voice;
import com.mysimpleshow.backend.common.service.DictionaryService;

class TTSService {

    private DictionaryService dictionaryService;

    private AmazonPolly polly;

    public SynthesizeSpeechResult synthesizePolly(final String text, final Voice voice) {
        final SynthesizeSpeechRequest request = new SynthesizeSpeechRequest()
            .withOutputFormat(OutputFormat.Mp3)
            .withText(text)
            .withTextType(TextType.Ssml)
            .withVoiceId(voice.getVoiceId())
            .withLexiconNames(dictionaryService.getDictionaryNameForLocale(voice.getLocale()));

        return polly.synthesizeSpeech(request);
    }
}

In the third and final step, the Amazon Polly voice, background music, more sounds, and the animation are combined into a finished video using ffmpeg.

ffmpeg -y -i <video-path> -i <sound-path> <output-path> -shortest

mysimpleshow can provide a great experience for our customers by using the variety of human voices Amazon Polly provides. In addition, using Amazon Polly is low cost; it provides an easy integration through its API; and it gives us the ability to customize voices. Amazon Polly has been a crucial AI service for mysimpleshow, and we look forward to innovating more with it.


About the Author

Hans-Christian Pahlig is Head of IT at simpleshow. He leads the development and the IT infrastructure teams. He is an internationally experienced IT expert in the media industry. As a math graduate, Hans-Christian comes from the school of thought where neural networks were conceived. If not on the job, he is inspired by the art and zen of landscape photography.