AWS Machine Learning Blog

Convert Your Text into an MP3 File with Amazon Polly and a Simple Python Script

Text-to-speech technology can turn any digital text into a multimedia experience, so people can listen to news, blog articles, or even a PDF document while multitasking or on the go. With Amazon Polly, you can convert your RSS feed or email to speech and store the synthesized speech as audio files.

Currently, the Amazon Polly console allows you to paste text that is up to 1500 characters long, choose your language and region, and choose a voice. Then you can listen to the converted text or download it as an MP3 file. You can also use the AWS Command Line Interface (AWS CLI) to perform the conversion from the command line.

Copy a small amount of text that you want to try, and then open the console.

1. Type Polly in the search box.

2. To try the Amazon Polly service, in the Plain text tab, paste your text and listen to the output.

If you want to convert long-form text, such as a book, to speech using Amazon Polly, you need to break up your text into chunks of at most 1500 characters. Wouldn’t it be ideal if a single command allowed you to input a text file of any size and have it automatically converted to an MP3 file?
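To make the chunking idea concrete, here is a minimal sketch of splitting long text into pieces that fit the 1500-character limit mentioned above. The function name split_into_chunks and the sentence-boundary rule are assumptions made for illustration; they are not part of Amazon Polly or of the script later in this post.

```python
import re

MAX_CHARS = 1500  # limit quoted in this post; adjust if your limit differs


def split_into_chunks(text, max_chars=MAX_CHARS):
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    # Split on whitespace that follows ., ! or ? so punctuation stays attached.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks = []
    current = ''
    for sentence in sentences:
        # Hard-split any single sentence that is itself longer than the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ''
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = sentence if not current else current + ' ' + sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk could then be synthesized separately. Splitting at sentence ends rather than at a fixed offset avoids cutting a word or sentence in half where two audio pieces meet.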

A quick search reveals two solutions: a cloud solution based on AWS Lambda and AWS Batch, and a Node.js-based solution. The first solution is a full-scale production system that requires a good understanding of AWS services and their configuration. The second solution requires the installation of Node.js. I was looking for a quick way to test the service locally from my computer with minimal configuration. Therefore, I decided to create a simple Python script that would allow me to input a text file of any size and would output an MP3 file.

The script relies on the AWS CLI tool and a standard Linux/Unix command: ‘cat’. The script could simply read a .txt file and pass it to the AWS CLI command for the conversion. However, I found that the MP3 output file lacked pauses between sentences and paragraphs. Therefore, I used Speech Synthesis Markup Language (SSML) rather than plain text. SSML is a markup language with various tags. For example, for a pause between sentences you can use the <break time="1s"/> tag, where 1s stands for a one-second pause.
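As a small illustration of the SSML approach, the sketch below wraps one line of plain text in a speak element followed by a break tag. The helper name line_to_ssml and its default pause lengths are assumptions chosen for this example. The XML escaping step is an addition that the post does not discuss; it is there because characters such as & and < would otherwise produce invalid SSML.

```python
from xml.sax.saxutils import escape


def line_to_ssml(line, sentence_pause='1s', paragraph_pause='2s'):
    """Wrap one line of plain text in SSML, adding a pause after it."""
    text = line.strip()
    if not text:
        # A blank line marks a paragraph boundary: emit only a longer pause.
        return '<speak><break time="{0}"/></speak>'.format(paragraph_pause)
    # escape() replaces &, < and > so they cannot be mistaken for SSML markup.
    return '<speak>{0}<break time="{1}"/></speak>'.format(
        escape(text), sentence_pause)
```

For example, line_to_ssml("Hello & goodbye.") yields a speak element containing the escaped text and a one-second break, while a blank line yields only a two-second break.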

Source code

# coding: utf-8
import subprocess
from xml.sax.saxutils import escape

cnt = 0
file_names = []

# Read the story line by line; each line becomes one temporary MP3 file
with open("story.txt", encoding="utf-8") as f:
    for line in f:
        text = line.strip()
        if not text:
            # A pause after a paragraph (blank line, any line-ending style)
            rendered = '<speak><break time="2s"/></speak>'
        else:
            # A pause after a sentence; escape &, < and > to keep the SSML valid
            rendered = ('<speak><amazon:effect name="drc">' + escape(text) +
                        '<break time="1s"/></amazon:effect></speak>')

        file_name = 'polly_out{0}.mp3'.format(cnt)
        cnt += 1
        # Passing the arguments as a list avoids shell quoting problems
        command = ['aws', 'polly', 'synthesize-speech', '--text-type', 'ssml',
                   '--output-format', 'mp3', '--voice-id', 'Salli',
                   '--text', rendered, file_name]
        file_names.append(file_name)
        print(' '.join(command))
        subprocess.call(command)

# Merge the temporary files into one MP3 file
print(' '.join(file_names))
execute_command = 'cat ' + ' '.join(file_names) + ' > result.mp3'
subprocess.call(execute_command, shell=True)

execute_command = 'rm ' + ' '.join(file_names)
print('Removing temporary files: ' + execute_command)
subprocess.call(execute_command, shell=True)

The script generates many MP3 files because each line of the input file produces its own MP3 file. To end up with only one MP3 file, I simply merged the files using the cat command (cat polly_out0.mp3 polly_out1.mp3>result.mp3). Note that the cat command simply appends each file to the previous one and won’t recreate metadata for the final MP3 file. The output MP3 file should still work on most modern audio players.
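The same byte-level concatenation can also be sketched in pure Python, which may be convenient on systems where cat is not available. This is an illustrative alternative, not what the script above does, and the helper name merge_files is hypothetical. Like cat, it appends the raw bytes of each part in order and does not rebuild any MP3 metadata.

```python
import shutil


def merge_files(part_names, output_name):
    """Append each part to output_name in order, like `cat a b > out`."""
    with open(output_name, 'wb') as out:
        for name in part_names:
            with open(name, 'rb') as part:
                # Copy the raw bytes of this part onto the end of the output.
                shutil.copyfileobj(part, out)
```

A call such as merge_files(['polly_out0.mp3', 'polly_out1.mp3'], 'result.mp3') would correspond to the cat command shown above.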

3. Open a text editor (Nano or Vim on Linux/Unix/macOS, or Notepad on Windows), type your story, and save the file as “story.txt”.

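4. Save the script as polly.py in the same directory as story.txt, and then type python polly.py in your terminal to convert your story file into an audio file. The output of the script allows you to track the execution, but you can comment out the print lines if you don’t need it.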

Now you can upload your newly generated MP3 file to Amazon Music and play it on Alexa!

In this post, I showed you how to use the Amazon Polly service to convert text to audio with a simple Python script. The script allows you to convert text of any length into an MP3 file. Additionally, I showed how you can use SSML to improve the quality of the speech. I hope that this post helps you enjoy your multimedia experience. If you have any questions, please use the comments section.


Additional Reading

For a different approach to the solution presented here, learn how to create audiobooks with Amazon Polly and AWS Batch.


About the Author

Dzidas Martinaitis is a Data Scientist for AWS EMEA. He applies Machine Learning and Data Science to derive insights from market trends, to better understand our customers’ changing needs, and to optimally allocate Sales team resources in providing tailored support to them. Outside of work, he is a co-organizer of the Data Science Luxembourg meetup group.