Create Audiobooks with Amazon Polly and AWS Batch

by Matthew McClean

Amazon Polly, one of AWS’s first AI services, turns text into lifelike speech. By enabling applications to speak, Amazon Polly makes it possible to develop new types of speech-enabled products. For example, many AWS customers have large documents, such as books or reports, that they’d like to convert to speech so they can listen to them when commuting. Others prefer to use audio to consume most written content.

Amazon Polly has two limitations that present challenges for large text-to-speech applications:

  • The maximum size of input text for the SynthesizeSpeech API is 1500 billed characters.
  • The SynthesizeSpeech API accepts a maximum of 80 requests per second, with a burst limit of 100.

This post describes the polly-batch-processor application, which handles text documents that exceed the maximum number of characters supported by Amazon Polly. Polly-batch-processor takes a large text document, breaks it into chunks, generates an audio file for each chunk with Amazon Polly, and consolidates the audio chunks into a single large MP3 file. AWS Batch processes the document asynchronously. To skip to the application and configuration steps, see the section Launching the application.

Polly-batch-processor also works for documents that contain many short prompts of fewer than 1500 characters each; for example, a document containing many short phrases, such as movie titles, that you want to synthesize into a single audio file. Polly-batch-processor generates each sentence asynchronously and in parallel, which reduces the time to create the audio file and helps avoid throttling by Amazon Polly.
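The parallel generation described above can be pictured as a small worker pool. The following is a minimal sketch, not the application's actual code: the thread pool, the default of 16 workers, and the helper names make_synthesizer and synthesize_all are assumptions made for illustration. The Amazon Polly client is supplied by the caller, for example boto3.client("polly").

```python
from concurrent.futures import ThreadPoolExecutor


def make_synthesizer(polly_client, voice_id="Joanna"):
    """Return a function that turns one text chunk into MP3 bytes (hypothetical helper)."""
    def synthesize(chunk):
        response = polly_client.synthesize_speech(
            Text=chunk, OutputFormat="mp3", VoiceId=voice_id
        )
        return response["AudioStream"].read()
    return synthesize


def synthesize_all(chunks, synthesize, max_workers=16):
    """Synthesize every chunk concurrently and return the audio in chunk order."""
    # Bounding the pool size caps the number of in-flight Amazon Polly requests.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, whatever order the calls finish in.
        return list(pool.map(synthesize, chunks))
```

Because the results come back in input order, the audio pieces can be joined in sequence afterwards.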

How it works

The following figure shows the application workflow:

  1. You upload a document in .txt format to an S3 bucket. The uploaded document can optionally include metadata, such as the Title and Author of the document, which is provided as S3 object tags. You can also optionally choose to not consolidate the MP3 chunks and to customize the voice that Amazon Polly uses by adding other S3 object tags.
  2. Uploading the document triggers a Lambda function that passes in the S3 bucket name and the key of the object that has just been uploaded.
  3. The Lambda function submits a new AWS Batch job to the AWS Batch queue, providing the S3 bucket name and object key as input parameters.
  4. The AWS Batch job downloads a copy of the document from S3. It also retrieves the S3 object tags, which can optionally contain the title and the author’s name. Polly-batch-processor splits the text into chunks separated by paragraphs. It detects paragraphs by searching for two consecutive line feed characters (\n\n). If a single paragraph exceeds the limit, the tool splits the paragraph into sentences separated by the dot (‘.’) character. Note: If a sentence contains an abbreviation (for example, St. Paul), the application might split it at the abbreviation. In that case, the intonation of the resulting speech will be incorrect.
  5. The AWS Batch job processes each text chunk in parallel. The number of chunks processed in parallel depends on the number of vCPUs configured in the AWS Batch job definition. For each chunk, the AWS Batch job calls the SynthesizeSpeech API, which returns an audio file in MP3 format. By default, the application uses the voice ID Joanna. To change this, add an S3 object tag with another voice ID to the original document. The application uses this value as the voice ID in the SynthesizeSpeech API call.
  6. By default, the AWS Batch job combines all of the audio chunks into a single MP3 file. It uploads the file to the same S3 bucket where the original document is stored. If you have applied an S3 object tag with the name consolidated and the value false, the AWS Batch job doesn’t consolidate the audio chunks into a single MP3 file. It uploads each chunk individually to the S3 bucket instead.
  7. The AWS Batch job publishes a message to an SNS topic that includes the URL of the MP3 file or the list of MP3 files.
  8. The URL of the MP3 file or the list of MP3 files is sent by email to the end user.
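The splitting rule in the workflow (paragraphs first, then sentences for over-long paragraphs) can be sketched as follows. This is an illustration under stated assumptions, not the application's source: the function names are hypothetical, and len() is only an approximation of the billed characters that Amazon Polly counts.

```python
MAX_CHARS = 1500  # SynthesizeSpeech input limit quoted in this post


def split_paragraph(paragraph):
    """Split a paragraph into sentences on the '.' character."""
    pieces = paragraph.split(".")
    # Every piece except the last was followed by a dot in the original text.
    sentences = [p.strip() + "." for p in pieces[:-1] if p.strip()]
    tail = pieces[-1].strip()
    if tail:
        sentences.append(tail)  # trailing text without a closing dot
    return sentences


def split_text(text, limit=MAX_CHARS):
    """Return chunks: whole paragraphs when they fit, single sentences otherwise."""
    chunks = []
    for paragraph in text.split("\n\n"):  # two line feeds mark a paragraph break
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        if len(paragraph) <= limit:
            chunks.append(paragraph)
        else:
            # Naive split: an abbreviation such as "St. Paul" is also cut here.
            chunks.extend(split_paragraph(paragraph))
    return chunks
```

A sentence that is itself longer than the limit is left whole in this sketch, so a production version would need one more fallback, such as splitting on commas or whitespace.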

Application components

To create the polly-batch-processor application, you use an AWS CloudFormation template. The template creates the following AWS resources:

  • A number of Lambda functions, including:
    • PollyBookSplitterFunction: Invoked by an object event created by S3, submits a new AWS Batch job for each new document that is uploaded to the S3 bucket.
    • CodeBuildTriggerFunction: Triggers building the Docker image that the AWS Batch job uses to synthesize a chunk of text into speech with Amazon Polly. CloudFormation calls this function as a custom resource.
    • AWSBatchComputeEnvFunction: Creates and deletes the AWS Batch compute environment as a CloudFormation custom resource.
    • AWSBatchJobQueueFunction: Creates and deletes the AWS Batch job queue as a CloudFormation custom resource.
    • AWSBatchJobDefinitionFunction: Registers and deregisters the AWS Batch job definition as a CloudFormation custom resource.
  • IAM service roles for AWS CodeBuild and AWS Batch.
  • IAM roles and a security group for the EC2 instances used by AWS Batch.
  • An S3 bucket, configured as an S3 website, that stores the documents to be converted as well as the generated audio and manifest files.
  • An Amazon ECR repository, named polly_document_processor, that stores the Docker image run by the AWS Batch jobs.
  • A CodeBuild project to build the Docker image used by AWS Batch to generate the audiobook. The AWS CodeBuild project pushes the built Docker image to the Amazon ECR repository named polly_document_processor.
  • A custom CloudFormation resource to trigger building the AWS CodeBuild project.
  • An AWS Batch compute environment that launches the EC2 resources needed to run the AWS Batch jobs.
  • An AWS Batch job queue where the Lambda function submits jobs.
  • An AWS Batch job definition that specifies the Docker image and other parameters that are used to execute the AWS Batch job.
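To make the role of PollyBookSplitterFunction more concrete, here is a rough sketch of how a function could turn an S3 event into AWS Batch jobs. It is not the function shipped with the template: the name submit_jobs and the job parameter names bucket and key are assumptions, and the AWS Batch client is passed in by the caller (in Lambda, typically boto3.client("batch")).

```python
import re
from urllib.parse import unquote_plus


def submit_jobs(event, batch_client, job_queue, job_definition):
    """Submit one AWS Batch job per S3 object in the event and return the job IDs."""
    job_ids = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        # Job names may contain only letters, digits, hyphens, and underscores.
        job_name = re.sub(r"[^A-Za-z0-9_-]", "-", key)[:128]
        response = batch_client.submit_job(
            jobName=job_name,
            jobQueue=job_queue,
            jobDefinition=job_definition,
            parameters={"bucket": bucket, "key": key},  # hypothetical parameter names
        )
        job_ids.append(response["jobId"])
    return job_ids
```

Passing the client in as an argument keeps the function easy to exercise with a stub, without AWS credentials.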

The following diagram shows the CloudFormation, AWS CodeBuild, and AWS Batch resources and their interactions:

AWS Batch resources

In the polly-batch-processor application, AWS Batch synthesizes text chunks into speech with Amazon Polly. Polly-batch-processor creates the following AWS Batch resources as custom CloudFormation resources:

  • AWS Batch compute environment: The AWS Batch compute environment manages the EC2 instances used to run the containerized batch jobs. The compute environment is mapped to an AWS Batch job queue. The AWS Batch scheduler takes jobs from the queue and schedules them to run on an EC2 host in the compute environment. To limit the number of concurrent calls to Amazon Polly, the maximum number of vCPUs is set to 128. To avoid launching unnecessary EC2 instances in advance, the desired number of vCPUs starts at 16, which equals the number of vCPUs for a single job. To avoid idle EC2 resources running in the compute environment when no jobs are in the queue, the minimum number of vCPUs is 0. The compute environment also specifies a security group and an IAM instance profile for the EC2 instances. After all of the jobs in the queue are successfully processed, AWS Batch automatically terminates the EC2 resources in your compute environment as they reach the end of the billing hour. To reduce costs, AWS Batch also supports running EC2 resources as Spot Instances. For more information on AWS Batch compute environments, see the AWS Batch User Guide.
  • AWS Batch job definition: An AWS Batch job definition specifies how the batch jobs should be run. Some of the attributes specified in a job definition include:
    • Which Docker image to use with the container in your job
    • How many vCPUs and how much memory to use with the container
    • The command that the container should run when it is started
    • Which, if any, environment variables should be passed to the container when it starts
    • Which, if any, data volumes should be used with the container
    • Which, if any, IAM role your job should use for AWS permissions

    In polly-batch-processor, each job requires 16 vCPUs and 2048 MiB of memory because the job processes its chunks in parallel, one chunk per vCPU. The Docker image is built by an AWS CodeBuild project that is defined in the CloudFormation template and triggered as a CloudFormation custom resource. The image is stored in Amazon ECR.

  • AWS Batch job queue: AWS Batch jobs are submitted to an AWS Batch job queue, where they reside until they can be scheduled to run on a compute resource. For each compute environment, there can be multiple job queues with different priorities. The AWS Batch scheduler runs the jobs with the higher priority first. In this case, there is only one job queue, so the CloudFormation template sets the queue priority to the arbitrary number 10.
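The numbers quoted for these three resources can be collected in one place. The sketch below only builds dictionaries in the shape that boto3's create_compute_environment (computeResources) and register_job_definition (containerProperties) accept. The function names, the "optimal" instance type, and the omission of fields such as the container command are assumptions; the template's real values may differ.

```python
JOB_VCPUS = 16        # vCPUs per job, one per chunk processed in parallel
JOB_MEMORY_MIB = 2048
MAX_VCPUS = 128       # upper bound that limits concurrent Amazon Polly calls
QUEUE_PRIORITY = 10   # arbitrary, because there is only one job queue


def compute_resources(subnets, security_group_ids, instance_role):
    """Build a computeResources dictionary for an EC2-based compute environment."""
    return {
        "type": "EC2",
        "minvCpus": 0,              # no idle instances when the queue is empty
        "desiredvCpus": JOB_VCPUS,  # enough capacity for a single job at start
        "maxvCpus": MAX_VCPUS,
        "instanceTypes": ["optimal"],  # assumption: let AWS Batch pick instance sizes
        "subnets": subnets,
        "securityGroupIds": security_group_ids,
        "instanceRole": instance_role,
    }


def container_properties(image, job_role_arn):
    """Build a containerProperties dictionary for the job definition."""
    return {
        "image": image,
        "vcpus": JOB_VCPUS,
        "memory": JOB_MEMORY_MIB,
        "jobRoleArn": job_role_arn,
    }


def max_concurrent_jobs():
    """Number of 16-vCPU jobs that fit under the 128-vCPU ceiling."""
    return MAX_VCPUS // JOB_VCPUS
```

With these settings, at most eight jobs run at once, which bounds the number of chunks being synthesized at any moment to 128.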

Launching the application

Because AWS Batch is not yet available in all AWS Regions, we use the AWS N. Virginia Region (us-east-1) for this application.

To use the application, you need a VPC with at least one public subnet and one private subnet. If you don’t have one yet, launch the CloudFormation template that sets up the VPC and other resources. The resulting stack, called PollyBatchNetwork, creates the VPC resources, including subnets, a NAT gateway, and an internet gateway. The CloudFormation template provides the VPC ID and subnets as output parameters of the stack.

Once you have a VPC with at least one public subnet and one private subnet, launch the CloudFormation stack that creates the AWS resources for the application. This template creates a stack named PollyBatchMaster with all of the necessary AWS resources for the polly-batch-processor application.

Provide the following input parameters to the stack:

  • An email address that can receive notifications when a new audiobook has been generated
  • The VPC and private subnets where the AWS Batch compute resources will run

The CloudFormation template provides the name of the S3 bucket where the documents to be converted to audio are copied and where the audiobook files are created.

You will receive an email from SNS asking you to confirm the subscription to the topic named PollyTopic.

That’s it. You’re now ready to convert text to audio.

Converting a document to an audiobook

Converting a document to an audiobook is as simple as copying the document to the application’s S3 bucket. The Lambda function begins processing the file as soon as it arrives there.

For example, the following AWS Command Line Interface (AWS CLI) command uploads a file called the-happy-prince.txt to the S3 bucket. The command also adds the author tag Oscar Wilde and the title tag The Happy Prince, encoded as URL query parameters:

aws s3api put-object --bucket <bucket-name> --key books/the-happy-prince.txt --body the-happy-prince.txt --tagging "author=Oscar%20Wilde&title=The%20Happy%20Prince"

After running this command, you receive an email notification with the URL of the consolidated audiobook.

To use other documents:

  • Encode them in UTF-8 format.
  • Save them in text format, with the .txt extension.
  • Upload them with the books/ prefix.
  • (Optional) Apply S3 Author and Title tags. These add the MP3 ID3 tags for the Artist and Title of the generated file.
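For other documents, the URL-encoded string passed to --tagging can be generated rather than typed by hand. This is a small sketch using only the Python standard library; the helper name tagging_string is hypothetical.

```python
from urllib.parse import quote, urlencode


def tagging_string(tags):
    """Encode a dict of S3 object tags as URL query parameters."""
    # quote_via=quote encodes a space as %20 instead of '+'.
    return urlencode(tags, quote_via=quote)


print(tagging_string({"author": "Oscar Wilde", "title": "The Happy Prince"}))
# prints: author=Oscar%20Wilde&title=The%20Happy%20Prince
```

The printed value is the same string used with --tagging in the upload command shown earlier.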

To prevent the application from consolidating the audio chunks into a single MP3 file, upload the document with the S3 object tag name consolidated and value false. The following CLI command is an example:

aws s3api put-object --bucket <bucket-name> --key books/the-happy-prince.txt --body the-happy-prince.txt --tagging "consolidated=false"

After running this command, you receive an email notification with the S3 bucket and key prefix of the audio chunks generated by the application.

By default, the application uses the voice Joanna when synthesizing the text to speech. You can customize the voice that Amazon Polly uses by adding another S3 object tag called voiceid and setting the value of this tag to the ID of the Amazon Polly voice that you want to use. For example, if you want to use the Polly voice called Brian, then upload the document with the following command:

aws s3api put-object --bucket <bucket-name> --key books/the-happy-prince.txt --body the-happy-prince.txt --tagging "voiceid=Brian&author=Oscar%20Wilde&title=The%20Happy%20Prince"
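On the processing side, the tags have to be turned into job settings with the defaults described above. The following sketch shows one way to do that. It assumes the tags arrive as the list of Key/Value dictionaries that S3's GetObjectTagging returns in TagSet. The function name job_options and the case-insensitive key matching are illustrative assumptions, not the application's documented behavior.

```python
def job_options(tag_set):
    """Derive synthesis settings from S3 object tags, applying the post's defaults."""
    # Assumption: tag keys are matched case-insensitively.
    tags = {tag["Key"].lower(): tag["Value"] for tag in tag_set}
    return {
        "voice_id": tags.get("voiceid", "Joanna"),  # Joanna is the default voice
        # Chunks are consolidated unless the tag is explicitly set to "false".
        "consolidate": tags.get("consolidated", "true").lower() != "false",
        "author": tags.get("author"),  # None when the tag is absent
        "title": tags.get("title"),
    }
```

For a document uploaded with only voiceid=Brian, this yields the Brian voice, consolidation enabled, and no author or title.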

For a list of Amazon Polly voices with their IDs, run the following CLI command:

aws polly describe-voices --output table

Enhancing the application

Polly-batch-processor works only on documents in UTF-8 text format. To enhance it, you could add support for different document formats, such as HTML, by using a tool like BeautifulSoup. 
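As a rough idea of what HTML support could involve, the sketch below extracts paragraph text from HTML and joins the paragraphs with two line feeds, which is the paragraph separator the application looks for. It uses html.parser from the Python standard library in place of BeautifulSoup so that it has no third-party dependency. The class and function names, and the choice of block-level tags, are assumptions.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect visible text, starting a new paragraph at block-level tags."""

    BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6"}
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = []
        self._skip_depth = 0

    def flush(self):
        # Collapse runs of whitespace and store the paragraph if it has text.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self.BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if not self._skip_depth:  # ignore script and style contents
            self._buffer.append(data)


def html_to_text(html):
    """Convert HTML to plain text with paragraphs separated by two line feeds."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep any text after the last block-level tag
    return "\n\n".join(parser.paragraphs)
```

The resulting text could then be saved as a UTF-8 .txt file and uploaded like any other document.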

Cleaning up

When you’re done with the application, delete the resources created with the CloudFormation template by deleting the PollyBatchMaster stack. 


If you want to convert documents containing more than 1500 characters to audiobooks, use the polly-batch-processor application. It uses AWS Batch to process a document in chunks asynchronously and in parallel, so you can quickly create audiobooks with Amazon Polly.

If you have any questions or suggestions, please comment below.

About the Author

Matt McClean is a Partner Solution Architect for AWS. He works with technology partners in the EMEA region, providing them with guidance on developing their solutions using AWS technologies, and is a specialist in machine learning. In his spare time, he is a passionate skier and cyclist.


