AWS Machine Learning Blog

Amazon Polly releases new SSML Breath feature

Natural human speech frequently includes audible breathing sounds as a speaker inhales or exhales during normal speaking. For example, when we speak, we generally take a breath at major pauses.

Narrations produced by Text-to-Speech (TTS) engines without breathing sounds often lack the naturalness of a human narrator. Most TTS systems don’t include respiratory sounds in the speech output; they simply insert pauses between words or phrases. While such silent breaks might be sufficient for shorter segments of speech, inserting breath sounds can result in more natural-sounding speech, particularly for long-form narration.

Amazon Polly is a Text-to-Speech service that uses advanced deep learning technologies to synthesize speech that sounds like a human voice. Today, Amazon Polly releases a new Speech Synthesis Markup Language (SSML) Breath feature, which allows you to insert breath sounds at appropriate points to make your speech sound even more natural, as if the text were being narrated by a human speaker.

You can use the <amazon:breath> and <amazon:auto-breaths> tags, or a combination of both, to incorporate breath sounds into your speech output.

  • Manual mode: Use the <amazon:breath/> tag to manually set breathing noises. You simply place the tag in the input text wherever you want to insert a breath. You can customize the tag using the duration and volume attributes.
  • Automated mode: Use the <amazon:auto-breaths> tag to tell Amazon Polly to automatically create breathing noises at appropriate intervals. You can set the frequency of these intervals, as well as their volume and duration, to meet your needs. You place the tag at the beginning of the text you want to apply it to and close the tag at the end of the text.
  • Mixed mode: For the greatest amount of flexibility when creating breathing tags, you can combine automated mode breath tags with manual breath tags. This way you can have an automated breath pattern throughout the entire text while still ensuring that you have a breath noise in a specific location.
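
The three modes above can be sketched as SSML strings assembled in Python. This is only an illustration: the helper names (`breath`, `auto_breaths`, `speak`) are hypothetical and not part of any AWS SDK, the sample sentences are invented, and the attribute values are taken from the examples in this post.

```python
def breath(duration="medium", volume="medium"):
    """Manual mode: one self-closing breath tag (hypothetical helper)."""
    return f'<amazon:breath duration="{duration}" volume="{volume}"/>'


def auto_breaths(text, frequency="medium", volume="medium", duration="medium"):
    """Automated mode: wrap the text in an opening and closing tag."""
    return (
        f'<amazon:auto-breaths frequency="{frequency}" volume="{volume}" '
        f'duration="{duration}">{text}</amazon:auto-breaths>'
    )


def speak(body):
    """Wrap a fragment in the SSML root element."""
    # Note: plain text containing &, < or > would need XML escaping first.
    return f"<speak>{body}</speak>"


# Manual mode: a breath exactly where the tag is placed.
manual = speak(
    "Let me think. " + breath(duration="long", volume="soft") + " Yes, that works."
)

# Automated mode: Amazon Polly decides where the breaths go.
automated = speak(
    auto_breaths(
        "A longer passage of narration goes here.",
        frequency="low",
        volume="soft",
        duration="x-short",
    )
)

# Mixed mode: an automated pattern plus one breath at a chosen location.
mixed = speak(
    auto_breaths(
        "First sentence. "
        + breath(duration="medium", volume="loud")
        + " Second sentence.",
        frequency="low",
    )
)

for ssml in (manual, automated, mixed):
    print(ssml)
```

Each printed string can be pasted into the Amazon Polly console as SSML input.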

Listen to an audio sample that uses the mixed mode.

Voiced by Amazon Polly

<speak>
	<amazon:auto-breaths frequency="low" volume="soft" duration="x-short">Amazon Polly is a service that turns text into lifelike speech, for creating applications that talk, and building entirely new categories of speech-enabled products. Amazon Polly is a Text-to-Speech service, that uses advanced deep learning technologies to synthesize speech that sounds like a human. With dozens of lifelike voices, variety of languages, you can select the ideal voice and build speech-enabled applications that work in many different countries.</amazon:auto-breaths>
</speak>

The acoustic features of breath sounds and the duration between breaths can vary widely. Listeners often take breath sounds as unconscious cues for how to process the speech that they hear. For example, rapid breathing may indicate urgency, while slower and longer breathing may signal indecisiveness. Breath sounds can also help mark the breaks between intonational phrases.

Amazon Polly supports standard SSML tags such as <prosody>, which enables you to control the volume, rate, and pitch of the speech output. In the following examples, we demonstrate how you can use manual <amazon:breath> and <prosody> tags together to convey an emotional or dramatic tone in speech.

Scared Matthew:

<speak>
     <amazon:breath duration='medium' volume='x-loud'/><prosody rate='115%'> <prosody volume='x-loud'> Salli? <break time='300ms'/> </prosody> Is that you?</prosody>
</speak>

Uncertain Matthew:

<speak> 
     <prosody rate='50%'> I am not sure <amazon:breath duration='x-long' volume='soft'/> <break time='200ms'/>  I think I need to think about it. </prosody> 
</speak>

Breathless Salli:

<speak> 
     <amazon:breath duration='long' volume='x-loud'/><prosody rate='120%'> <prosody volume='loud'> Wow! <amazon:breath duration='long' volume='loud'/> </prosody> That was quite fast <amazon:breath duration='medium' volume='x-loud'/> I almost beat my personal best time on this track. </prosody> 
</speak>

Copy these examples, paste them into the Amazon Polly console, and give them a try! By incorporating breath sounds into speech output, Amazon Polly can provide more natural-sounding speech, particularly for long-form text narration.

Log in to the Amazon Polly console to try the SSML Breath feature, and visit the Amazon Polly documentation for more information on SSML tags.
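
Outside the console, the same SSML can be sent to Amazon Polly through the AWS SDK. The following is a minimal sketch, assuming the boto3 package is installed and AWS credentials and a region are already configured. The function names `build_request` and `synthesize_to_file`, the output file name, and the default voice are illustrative choices, not something this post prescribes.

```python
def build_request(ssml, voice_id="Matthew", output_format="mp3"):
    """Assemble keyword arguments for Polly's SynthesizeSpeech operation."""
    return {
        "Text": ssml,
        "TextType": "ssml",  # mark the input as SSML rather than plain text
        "VoiceId": voice_id,
        "OutputFormat": output_format,
    }


def synthesize_to_file(ssml, path, **kwargs):
    """Send the SSML to Amazon Polly and save the returned audio stream."""
    # Imported lazily so build_request can be used without the SDK installed.
    import boto3

    polly = boto3.client("polly")
    response = polly.synthesize_speech(**build_request(ssml, **kwargs))
    with open(path, "wb") as audio_file:
        audio_file.write(response["AudioStream"].read())


# The "Uncertain Matthew" example from this post, as a single string.
UNCERTAIN_MATTHEW = (
    "<speak>"
    "<prosody rate='50%'> I am not sure "
    "<amazon:breath duration='x-long' volume='soft'/> <break time='200ms'/> "
    "I think I need to think about it. </prosody>"
    "</speak>"
)

# Example call (requires AWS credentials, so it is left commented out):
# synthesize_to_file(UNCERTAIN_MATTHEW, "uncertain_matthew.mp3")
```

Keeping the request builder separate from the network call makes it easy to inspect what will be sent before any audio is synthesized.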


 

About the Author

Binny Peh is a Sr. Product Marketing Manager for AWS machine learning solutions. In her spare time, she indulges in too much television and is an aspiring foodie. Binny’s glass is always half-full, and she believes in the power of positive thinking.