AWS Machine Learning Blog

Integrating Amazon Polly with legacy IVR systems by converting output to WAV format

Amazon Web Services (AWS) offers a rich stack of artificial intelligence (AI) and machine learning (ML) services that help automate several components of the customer service industry. Amazon Polly, an AI-powered text-to-speech service, enables you to automate and scale your interactive voice solutions, helping to improve productivity and reduce costs.

You might face implementation challenges when updating or modifying legacy interactive voice response (IVR) systems that don't support file formats such as MP3 and PCM. To minimize response latency, Amazon Polly synthesizes speech in real time and streams the result back to the customer in a streamable format (MP3, Ogg/Vorbis, or raw PCM samples) while the request is still being processed. The WAV audio format is not streamable in this way, because a WAV file begins with a header that records the length of the audio data. However, you can easily create a WAV file from a PCM stream generated by Amazon Polly at the end of synthesis, when all samples are collected and the length of the result can be calculated. This post shows you how to convert Amazon Polly output to a common audio format like WAV.
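As a minimal sketch of that last step, the following wraps already-collected PCM bytes in a WAV container using Python's standard wave module. The helper name pcm_to_wav_bytes and the synthetic silence used as input are illustrative assumptions, not part of Amazon Polly; the defaults assume 16-bit mono samples at 16000 Hz.

```python
import io
import struct
import wave


def pcm_to_wav_bytes(pcm_data, sample_rate=16000, channels=1, sample_width=2):
    """Wrap raw PCM samples in a WAV container and return the bytes.

    Hypothetical helper for illustration only.
    """
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)
        wav_file.setframerate(sample_rate)
        # The data-length field in the header is derived from these bytes,
        # which is why all samples have to be available first.
        wav_file.writeframes(pcm_data)
    return buffer.getvalue()


# One second of silence stands in for a real PCM stream.
pcm = b"\x00\x00" * 16000
wav_bytes = pcm_to_wav_bytes(pcm)

# Bytes 40-43 of the header written by the wave module hold the data length.
data_length = struct.unpack("<I", wav_bytes[40:44])[0]
print(len(pcm), data_length)
```

The two printed numbers match, which shows that the header can only be completed once the full length of the audio is known.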

Converting Amazon Polly file output to WAV

One of the challenges with legacy systems is that they may not support Amazon Polly output formats like MP3. Some legacy IVRs expect audio in WAV file format, which the Amazon Polly SynthesizeSpeech API call doesn't produce natively. Many of these applications are written in Python and Java.

The following sample code helps in situations where you need audio in WAV file format, which Amazon Polly doesn't support natively. It converts Amazon Polly output from PCM to WAV in Python for input given as either SSML or plain text.

#The following sample code snippet converts Amazon Polly output from PCM to WAV in Python for both SSML and non-SSML text

#Importing libraries
import sys
import wave
from contextlib import closing

import boto3
from botocore.exceptions import BotoCoreError, ClientError

#Initializing variables
CHANNELS = 1 #Polly's output is a mono audio stream
RATE = 16000 #Polly supports 16000Hz and 8000Hz output for PCM format
OUTPUT_FILE_IN_WAVE = "sample_SSML.wav" #WAV format output file name
WAV_SAMPLE_WIDTH_BYTES = 2 #Polly's output is a stream of 16-bit (2 bytes) samples

#Initializing Polly Client
polly = boto3.client("polly")

#Input text for conversion
INPUT = "<speak>Hi! I'm Matthew. Hope you are doing well. This is a sample PCM to WAV conversion for SSML. I am a Neural voice and have a conversational style. </speak>" # Input in SSML

WORD = "<speak>"
try:
    if WORD in INPUT: #Checking for SSML input
        #Calling Polly synchronous API with text type as SSML
        response = polly.synthesize_speech(Text=INPUT, TextType="ssml", OutputFormat="pcm", VoiceId="Matthew", SampleRate=str(RATE)) #the input to SampleRate is a string value
    else:
        #Calling Polly synchronous API with text type as plain text
        response = polly.synthesize_speech(Text=INPUT, TextType="text", OutputFormat="pcm", VoiceId="Matthew", SampleRate=str(RATE))
except (BotoCoreError, ClientError) as error:
    #The service returned an error, so there is no audio to convert
    print(error)
    sys.exit(1)

#Processing the response to audio stream: all PCM samples are collected first,
#because the WAV header needs the total length of the audio data
with closing(response["AudioStream"]) as stream:
    pcm_data = stream.read()

#Writing the collected PCM samples to a WAV file
with wave.open(OUTPUT_FILE_IN_WAVE, "wb") as wav_file:
    wav_file.setnchannels(CHANNELS)
    wav_file.setsampwidth(WAV_SAMPLE_WIDTH_BYTES)
    wav_file.setframerate(RATE)
    wav_file.writeframes(pcm_data)


The sample output of the preceding code is a WAV file named sample_SSML.wav.


Converting Amazon Polly output from PCM to WAV lets you use Amazon Polly with a legacy IVR that requires audio in WAV file format. Try this out for yourself and let us know how it goes in the comments!
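Before handing a converted file to an IVR, it can help to confirm that its header matches what the system expects. The following is one possible check using only Python's standard wave module; the helper name describe_wav and the demo file are assumptions made for illustration, and the demo writes its own short file of silence so that it runs without calling Amazon Polly.

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Return basic properties of a WAV file (hypothetical helper)."""
    with wave.open(path, "rb") as wav_file:
        frames = wav_file.getnframes()
        rate = wav_file.getframerate()
        return {
            "channels": wav_file.getnchannels(),
            "sample_width_bytes": wav_file.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_seconds": frames / rate,
        }


# Demo input: half a second of 16-bit mono silence at 8000 Hz.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)
    wav_file.setframerate(8000)
    wav_file.writeframes(b"\x00\x00" * 4000)

info = describe_wav(demo_path)
print(info)
```

Pointing describe_wav at a file produced from Amazon Polly PCM output should report one channel, a sample width of 2 bytes, and the sample rate you requested.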

You can further refine the synthesized speech using capabilities available in Amazon Polly, such as the SynthesizeSpeech request parameters, custom lexicons, escaping of reserved characters in SSML, and SSML tags that control volume, speaking rate, and pitch.
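As a rough sketch of two of these refinements, the following builds an SSML string that escapes reserved characters and wraps the text in a prosody tag for speaking rate and volume. The helper name build_ssml and its defaults are hypothetical; pitch is left out here because support for it can vary by voice, so check the Amazon Polly documentation for the voice you use.

```python
from xml.sax.saxutils import escape

# escape() handles &, < and > by default; quotes are added explicitly
# because they are also reserved characters in SSML.
SSML_EXTRA_ENTITIES = {'"': "&quot;", "'": "&apos;"}


def build_ssml(text, rate="medium", volume="medium"):
    """Return SSML with reserved characters escaped (hypothetical helper)."""
    safe_text = escape(text, SSML_EXTRA_ENTITIES)
    return (
        f'<speak><prosody rate="{rate}" volume="{volume}">'
        f"{safe_text}</prosody></speak>"
    )


ssml = build_ssml('Tom & Jerry say "hi"', rate="slow", volume="loud")
print(ssml)
```

The resulting string can be passed as the Text parameter of synthesize_speech with TextType set to "ssml".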


About the Author

Abhishek Soni is a Partner Solutions Architect at AWS. He works with customers to provide technical guidance for the best outcome of workloads on AWS.