AWS Machine Learning Blog
Integrating Amazon Polly with legacy IVR systems by converting output to WAV format
Amazon Web Services (AWS) offers a rich stack of artificial intelligence (AI) and machine learning (ML) services that help automate several components of the customer service industry. Amazon Polly, an AI-based text-to-speech service, enables you to automate and scale your interactive voice solutions, helping to improve productivity and reduce costs.
You might face common implementation challenges when updating or modifying legacy interactive voice response (IVR) systems that don’t support file formats such as MP3 and PCM. To minimize response latency, Amazon Polly synthesizes speech in real time and streams the results back to the customer in a streamable format (MP3, Ogg/Vorbis, or raw PCM samples) while the request is being processed. The WAV audio format is not streamable in this way, because its header records the length of the audio data. However, a WAV file can easily be created from a PCM stream generated by Amazon Polly at the end of synthesis, when all samples have been collected and the length of the result can be calculated. This post shows you how to convert Amazon Polly output to a common audio format like WAV.
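To illustrate why the length has to be known first, here is a minimal sketch (not the code from the original post) that builds the 44-byte header of a canonical PCM WAV file by hand. It assumes the PCM data is 16-bit signed little-endian mono, which is how Amazon Polly documents its PCM output; `pcm_to_wav_bytes` is a hypothetical helper name used only for illustration.

```python
import struct


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = 16000,
                     channels: int = 1, sample_width: int = 2) -> bytes:
    """Prepend a canonical 44-byte RIFF/WAVE header to raw PCM samples.

    Assumes uncompressed PCM (format tag 1). The defaults match the
    assumed Polly PCM layout: mono, 16-bit, 16 kHz.
    """
    data_size = len(pcm)                        # only known once all samples are collected
    byte_rate = sample_rate * channels * sample_width
    block_align = channels * sample_width
    header = struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF", 36 + data_size,                # RIFF chunk size depends on data length
        b"WAVE",
        b"fmt ", 16,                            # fmt chunk is 16 bytes for plain PCM
        1,                                      # audio format 1 = uncompressed PCM
        channels, sample_rate, byte_rate, block_align,
        sample_width * 8,                       # bits per sample
        b"data", data_size,                     # data chunk size, again the length
    )
    return header + pcm
```

Both size fields in the header are derived from `len(pcm)`, which is why the file can only be finalized after the whole stream has been read.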
Converting Amazon Polly file output to WAV
One of the challenges with legacy systems is that they may not support Amazon Polly output formats such as MP3. Some legacy IVRs expect audio in WAV file format, which the Amazon Polly SynthesizeSpeech API call doesn’t produce natively. Many of these applications are written in Python and Java.
The following sample code helps in situations where the audio must be in WAV file format, which Amazon Polly doesn’t support natively. The sample code converts Amazon Polly output from PCM to WAV in Python, for input given as either SSML or plain text.
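The original listing is not reproduced here, so what follows is a minimal sketch of the approach under stated assumptions, not the author's exact code. It assumes the `boto3` SDK is installed and AWS credentials are configured, that the PCM output is 16-bit mono, and that the voice `Joanna` is available; the function names `write_wav` and `synthesize_to_wav` are hypothetical.

```python
import wave


def write_wav(path, pcm, sample_rate=16000):
    """Wrap raw 16-bit mono PCM bytes in a WAV container."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)       # assumed: Polly PCM is mono
        wav_file.setsampwidth(2)       # assumed: 16-bit samples (2 bytes)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)      # header sizes are filled in on close


def synthesize_to_wav(text, path, voice_id="Joanna",
                      sample_rate=16000, ssml=False):
    """Request PCM audio from Amazon Polly and save it as a WAV file."""
    import boto3  # third-party AWS SDK; imported here so write_wav works without it

    polly = boto3.client("polly")
    response = polly.synthesize_speech(
        Text=text,
        TextType="ssml" if ssml else "text",
        OutputFormat="pcm",
        SampleRate=str(sample_rate),   # the API expects the rate as a string
        VoiceId=voice_id,              # assumed voice; choose one for your language
    )
    # Read the whole stream so the total length is known before writing.
    pcm = response["AudioStream"].read()
    write_wav(path, pcm, sample_rate)
```

A call might look like `synthesize_to_wav("Hello from Amazon Polly", "hello.wav")` for plain text, or `synthesize_to_wav("<speak>Hello <break time='300ms'/> world</speak>", "hello_ssml.wav", ssml=True)` for SSML. If your IVR expects telephony-rate audio, passing `sample_rate=8000` may be the better fit.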
The following is the sample output from the preceding code:
Conclusion
You can convert Amazon Polly output from PCM to WAV so that you can use Amazon Polly in your legacy IVR, enabling it to support WAV file format output. Try this out for yourself and let us know how it goes in the comments!
You can further refine the synthesized audio using other capabilities available in Amazon Polly, such as the options of the SynthesizeSpeech request, lexicon management, SSML (including its reserved characters), and control over volume, speaking rate, and pitch.
About the Author
Abhishek Soni is a Partner Solutions Architect at AWS. He works with customers to provide technical guidance for the best outcome of workloads on AWS.