AWS Machine Learning Blog

Customize pronunciations using Amazon Polly

Amazon Polly breathes life into text by converting it into lifelike speech. This empowers developers and businesses to create applications that can converse in real time, thereby offering an enhanced interactive experience. Text-to-speech (TTS) in Amazon Polly supports a variety of languages and locales, which enables you to perform TTS conversion according to your preferences. Multiple factors guide this choice, such as geographic location and language locales.

Amazon Polly uses advanced deep learning technologies to synthesize text to speech in real time in various output formats, such as MP3, Ogg Vorbis, JSON, or PCM, across standard and neural engines. Speech Synthesis Markup Language (SSML) support in Amazon Polly lets you customize speech further, including controlling speech rate and volume, adding pauses, emphasizing certain words or phrases, and more.

In today’s world, businesses continue to expand across multiple geographic locations, and they’re continuously looking for mechanisms to improve personalized end-user engagement. For instance, you may require accurate pronunciation of certain words in a specific style pertaining to different geographical locations. Your business may also need to pronounce certain words and phrases in certain ways depending on their intended meaning. You can achieve this with the help of SSML tags provided by Amazon Polly.

This post aims to assist you in customizing pronunciation when dealing with a truly global customer base.

Modify pronunciation using phonemes

A phoneme is the smallest unit of speech that distinguishes one word from another. The <phoneme> SSML tag in Amazon Polly customizes pronunciation using phonemes written in the International Phonetic Alphabet (IPA) or the Extended Speech Assessment Methods Phonetic Alphabet (X-SAMPA). X-SAMPA represents IPA using ASCII characters. The <phoneme> tag is fully supported by both the standard and neural TTS engines. For example, the word “lead” can be pronounced as the present tense verb or as the chemical element. We discuss this example later in this post.

International Phonetic Alphabet

The IPA is used to portray sounds across different languages. For a list of phonemes Amazon Polly supports, refer to Phoneme and Viseme Tables for Supported Languages.

By default, Amazon Polly chooses a pronunciation for each word. Let’s use the word “lead,” which is pronounced differently as the chemical element and as the verb. When we provide “lead” as input without any customizing SSML tags, Amazon Polly speaks it in the present tense verb form:

<speak>
The default pronunciation by Amazon Polly for L E A D is <break time = "300ms"/> lead,
which is the present tense form.
</speak>
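To hear a snippet like the one above outside the console, you could send it to Amazon Polly with the AWS SDK for Python. The following is a minimal sketch, not a definitive implementation: the helper names check_ssml and synthesize_ssml, the voice Joanna, and the output path are illustrative assumptions, and the call needs boto3 and AWS credentials to be configured.

```python
import xml.etree.ElementTree as ET


def check_ssml(ssml: str) -> str:
    """Return ssml unchanged, or raise ValueError if it is not
    well-formed XML with a <speak> root element."""
    try:
        root = ET.fromstring(ssml)
    except ET.ParseError as err:
        raise ValueError(f"SSML is not well-formed XML: {err}") from err
    if root.tag != "speak":
        raise ValueError(f"expected a <speak> root, found <{root.tag}>")
    return ssml


def synthesize_ssml(ssml: str, out_path: str,
                    voice_id: str = "Joanna", engine: str = "neural") -> None:
    """Send SSML to Amazon Polly and save the returned MP3 audio.

    Hypothetical helper: voice_id and engine defaults are assumptions.
    """
    import boto3  # imported lazily so check_ssml works without the SDK

    polly = boto3.client("polly")
    response = polly.synthesize_speech(
        Text=check_ssml(ssml),
        TextType="ssml",       # tell Polly the input is SSML, not plain text
        OutputFormat="mp3",
        VoiceId=voice_id,
        Engine=engine,
    )
    with open(out_path, "wb") as audio_file:
        audio_file.write(response["AudioStream"].read())


# The default-pronunciation example from this post as a Python string.
DEFAULT_LEAD = (
    "<speak>"
    "The default pronunciation by Amazon Polly for L E A D is "
    '<break time="300ms"/> lead, which is the present tense form.'
    "</speak>"
)
```

Calling synthesize_ssml(DEFAULT_LEAD, "lead.mp3") would write the audio to a local file. Checking well-formedness locally first catches an unclosed tag before the request is made.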

To get the pronunciation of the chemical element lead (which sounds the same as “led,” the past tense of the verb), we can use the <phoneme> tag with IPA or X-SAMPA. The following example uses IPA:

<speak>
This is the pronunciation using the
<say-as interpret-as="characters">IPA</say-as> attribute
in the <say-as interpret-as="characters">SSML</say-as> tag. 
The verb form for L E A D is <break time="150ms"/> lead.
The chemical element <break time="150ms"/><phoneme alphabet="ipa" ph="lɛd">lead</phoneme> 
<break time="300ms"/>also has an identical spelling.
</speak>
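If you generate SSML in code, building the <phoneme> tag with an XML library avoids hand-escaping mistakes. The following sketch assumes a hypothetical helper named phoneme; it uses Python's standard xml.etree.ElementTree module to produce the same tag as in the example above.

```python
import xml.etree.ElementTree as ET


def phoneme(word: str, ph: str, alphabet: str = "ipa") -> str:
    """Return a <phoneme> tag that pronounces word as the phonemes in ph.

    Hypothetical helper; only the two alphabets discussed in this post
    are accepted.
    """
    if alphabet not in ("ipa", "x-sampa"):
        raise ValueError(f"unsupported alphabet: {alphabet!r}")
    element = ET.Element("phoneme", {"alphabet": alphabet, "ph": ph})
    element.text = word
    # encoding="unicode" returns a str and keeps IPA characters as-is.
    return ET.tostring(element, encoding="unicode")


ssml = (
    "<speak>The chemical element "
    + phoneme("lead", "lɛd")
    + " also has an identical spelling.</speak>"
)
print(ssml)
```

ElementTree escapes any special characters in the attribute value and the word, so the result stays well-formed even when the phoneme string contains quotation marks.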

Modify pronunciation by specifying parts of speech

If we consider the same example of pronouncing “lead,” we can also differentiate between the chemical element and the verb by specifying the parts of speech using the <w> SSML tag.

The <w> tag allows us to customize pronunciation by specifying parts of speech. You can configure the pronunciation in terms of verb (present simple or past tense), noun, adjective, preposition, and determiner. See the following example:

<speak>
The word<p> <say-as interpret-as="characters">lead</say-as></p> 
may be interpreted as either the present simple form <w role="amazon:VB">lead</w>, 
or the chemical element <w role="amazon:SENSE_1">lead</w>.
</speak>
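The role values are easy to mistype, so a lookup table can help when SSML is generated in code. This is a sketch under stated assumptions: the dictionary keys and the helper name w are illustrative, and the role values are those listed in the Amazon Polly documentation for the <w> tag.

```python
from xml.sax.saxutils import escape

# Friendly names (hypothetical) mapped to Amazon Polly <w> role values.
ROLES = {
    "verb_present": "amazon:VB",
    "verb_past": "amazon:VBD",
    "noun": "amazon:NN",
    "adjective": "amazon:JJ",
    "preposition": "amazon:IN",
    "determiner": "amazon:DT",
    "default_sense": "amazon:DEFAULT",
    "alternate_sense": "amazon:SENSE_1",
}


def w(word: str, part_of_speech: str) -> str:
    """Wrap word in a <w> tag with the role for part_of_speech."""
    try:
        role = ROLES[part_of_speech]
    except KeyError:
        raise ValueError(f"unknown part of speech: {part_of_speech!r}") from None
    return f'<w role="{role}">{escape(word)}</w>'


print(w("lead", "verb_present"))     # the verb
print(w("lead", "alternate_sense"))  # the chemical element
```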

Additionally, you can use the <sub> tag to indicate the pronunciation of acronyms and abbreviations:

<speak>
Polly is an <sub alias="Amazon Web Services">AWS</sub> 
offering that provides a text-to-speech service. 
</speak>
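When a text contains many abbreviations, the <sub> tags can be added programmatically. The following sketch assumes a hypothetical helper, expand_abbreviations, that takes plain text and a dictionary of aliases; it matches whole words only and escapes the surrounding text.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def expand_abbreviations(text: str, aliases: dict) -> str:
    """Wrap each whole-word abbreviation found in text in a <sub> tag."""
    if not aliases:
        return escape(text)
    # Longest keys first so that longer abbreviations win over prefixes.
    keys = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    parts = []
    last = 0
    for match in pattern.finditer(text):
        parts.append(escape(text[last:match.start()]))
        abbreviation = match.group(1)
        alias = quoteattr(aliases[abbreviation])  # adds the quotes itself
        parts.append(f"<sub alias={alias}>{escape(abbreviation)}</sub>")
        last = match.end()
    parts.append(escape(text[last:]))
    return "".join(parts)


body = expand_abbreviations(
    "Polly is an AWS offering.", {"AWS": "Amazon Web Services"}
)
print(f"<speak>{body}</speak>")
```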

Extended Speech Assessment Methods Phonetic Alphabet

The X-SAMPA transcription scheme extends the various language-specific SAMPA phoneme sets into a single scheme that covers the full IPA character set.

The following snippet shows how you can use X-SAMPA to pronounce different variations of the word “lead”:

<speak>
This is the pronunciation using the X-SAMPA attribute, 
in the verb form <break time="1s"/> lead.
The chemical element <break time="1s"/> 
<phoneme alphabet='x-sampa' ph='lEd'>lead</phoneme> <break time="0.5s"/>
also has an identical spelling.
</speak>

The primary stress mark in IPA is ˈ (U+02C8). An apostrophe is often typed in its place, which can produce unexpected output. In X-SAMPA, the stress mark is the double quotation mark, so the ph attribute value must be enclosed in single quotation marks, and the alphabet attribute must be set to x-sampa. The following example uses the IPA stress mark:

<speak>
You say, <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>. 
</speak>

In the preceding example, the character ˈ marks the stressed syllable. The following example shows the same stress in X-SAMPA, where it is written as a double quotation mark:

<speak>
You say, <phoneme alphabet='x-sampa' ph='pI"kA:n'>pecan</phoneme>.
</speak>
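Both pitfalls described above can be handled in code. The sketch below makes two assumptions: that a plain apostrophe in an IPA string was meant as the stress mark ˈ, and that the helper names fix_ipa_stress and xsampa_phoneme are illustrative. It relies on quoteattr from Python's standard library, which chooses single quotation marks when the value contains a double quotation mark.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def fix_ipa_stress(ph: str) -> str:
    """Replace a typed apostrophe with the IPA primary stress mark (U+02C8).

    Assumption: every apostrophe in ph was intended as a stress mark.
    """
    return ph.replace("'", "\u02c8")


def xsampa_phoneme(word: str, ph: str) -> str:
    """Return an X-SAMPA <phoneme> tag with a safely quoted ph value."""
    return f"<phoneme alphabet='x-sampa' ph={quoteattr(ph)}>{escape(word)}</phoneme>"


tag = xsampa_phoneme("pecan", 'pI"kA:n')
print(tag)

# Parsing the tag back confirms the stress mark survived the quoting.
assert ET.fromstring(tag).get("ph") == 'pI"kA:n'
```

If the phoneme string ever contained both kinds of quotation mark, quoteattr would fall back to escaping the double quotation mark as an XML entity, which an XML parser reads back as the same value.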

Modify pronunciations using other SSML tags

You can use the <say-as> tag to control how certain types of text are spoken, such as spelling out a word character by character. It also handles digits, fractions, units, dates, times, addresses, telephone numbers, cardinals, and ordinals, and it can censor the text enclosed within the tag. For more information, refer to Controlling How Special Types of Words Are Spoken. Each of these is a value of the interpret-as attribute; this post calls them attributes for brevity. Let’s look at examples of these attributes.

Date

By default, Amazon Polly tries to interpret text based on its formatting. To have dates spoken in a specific order, such as month-day-year or day-month-year, you can use the date attribute together with the format attribute.

Without the date attribute, Amazon Polly provides the following output when speaking out dates:

<speak>
The default pronunciation when using date is 01-11-1996
</speak>

However, if you want the dates spoken in a specific format, the date attribute in the <say-as> tags helps customize the pronunciation:

<speak>
We will see the examples of different date formats using the date SSML tag.
The following date is written in the day-month-year format.
<say-as interpret-as="date" format="dmy">01-11-1995</say-as><break time="500ms"/>
The following date is written in the month-day-year format.
<say-as interpret-as="date" format="mdy">09-24-1995</say-as>
</speak>
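When dates come from application data rather than hand-written text, the format attribute and the digit order must agree. The following sketch assumes a hypothetical helper named say_date and covers only three of the format values that Amazon Polly documents.

```python
from datetime import date

# Polly format value -> matching strftime pattern (subset; an assumption
# of this sketch, other documented formats are omitted).
_DATE_PATTERNS = {
    "mdy": "%m-%d-%Y",
    "dmy": "%d-%m-%Y",
    "ymd": "%Y-%m-%d",
}


def say_date(value: date, fmt: str = "mdy") -> str:
    """Return a <say-as interpret-as="date"> tag for value in order fmt."""
    if fmt not in _DATE_PATTERNS:
        raise ValueError(f"unsupported date format: {fmt!r}")
    text = value.strftime(_DATE_PATTERNS[fmt])
    return f'<say-as interpret-as="date" format="{fmt}">{text}</say-as>'


print(say_date(date(1995, 11, 1), "dmy"))
print(say_date(date(1995, 9, 24), "mdy"))
```

Deriving both the attribute and the text from one table keeps them from drifting apart, which matters for ambiguous dates such as 01-11-1995.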

Cardinal

This attribute represents a number in its cardinal format. For example, 124456 is pronounced “one hundred twenty four thousand four hundred fifty six”:

<speak> 
The following number is pronounced in its cardinal form.
<say-as interpret-as="cardinal">124456</say-as>
</speak>

Ordinal

This attribute represents a number in its ordinal format. Without the ordinal attribute, the number is pronounced as a regular cardinal number:

<speak>
The following number is pronounced in its default form 
without the use of any SSML attribute in the say as tag - 1242 
</speak>

If we want to pronounce 1242 as “one thousand two hundred forty second,” we can use the ordinal attribute:

<speak>
The following number is pronounced in its ordinal form.
<say-as interpret-as="ordinal">1242</say-as>
</speak>

Digits

The digits attribute speaks each digit of a number individually. For example, “1234” is pronounced as “one two three four”:

<speak>
The following number is pronounced as individual digits.
<say-as interpret-as="digits">1242</say-as>
</speak>

Fraction

The fraction attribute is used to customize the pronunciations in the fractional form:

<speak> 
The following are examples of pronunciations when 
<prosody volume="loud"> fraction</prosody>
is used as an attribute in the say-as tag. 
<break time="500ms"/>Seven one by two is pronounced as
<say-as interpret-as="fraction">7+1/2</say-as>
whereas three by twenty is pronounced as <say-as interpret-as="fraction">3/20</say-as>
</speak>
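The Amazon Polly documentation describes two fraction syntaxes: numerator/denominator, such as 3/20, and whole+numerator/denominator for mixed numbers. The following sketch converts a Python Fraction into that text; the helper name say_fraction is an assumption, and whole numbers are rejected because they are better spoken as cardinals.

```python
from fractions import Fraction


def say_fraction(value: Fraction) -> str:
    """Return a <say-as interpret-as="fraction"> tag for a positive value."""
    if value <= 0:
        raise ValueError("this sketch handles positive fractions only")
    whole, rest = divmod(value, 1)  # e.g. 15/2 -> (7, Fraction(1, 2))
    if rest == 0:
        raise ValueError("whole numbers have no fractional part")
    if whole:
        text = f"{whole}+{rest.numerator}/{rest.denominator}"
    else:
        text = f"{rest.numerator}/{rest.denominator}"
    return f'<say-as interpret-as="fraction">{text}</say-as>'


print(say_fraction(Fraction(15, 2)))
print(say_fraction(Fraction(3, 20)))
```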

Time

The time attribute speaks a duration in minutes and seconds:

<speak>
Polly also supports customizing pronunciation in terms of minutes and seconds. 
For example, <say-as interpret-as="time">2'42"</say-as>
</speak>
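Durations are often stored as a number of seconds. The following sketch turns such a value into the minutes-and-seconds notation used in the example above; the helper name say_duration is hypothetical, and writing single-digit seconds without a leading zero is an unverified assumption.

```python
def say_duration(total_seconds: int) -> str:
    """Return a <say-as interpret-as="time"> tag for a duration in seconds."""
    if total_seconds < 0:
        raise ValueError("duration must not be negative")
    minutes, seconds = divmod(total_seconds, 60)
    # An apostrophe marks minutes and a double quotation mark marks seconds.
    return f'<say-as interpret-as="time">{minutes}\'{seconds}"</say-as>'


print(say_duration(162))
```

The double quotation mark is safe here because it appears in element text, not inside an attribute value.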

Expletive

The expletive attribute censors the text enclosed within the tags:

<speak> 
The value that is going to be censored is
<say-as interpret-as="expletive">this is not good</say-as>
You should have heard the beep sound.
</speak>

Telephone

You can use the telephone attribute to speak a number as a telephone number instead of as standalone digits or a cardinal number:

<speak>
The telephone number is 
<say-as interpret-as="telephone">1800 3000 9009</say-as>
</speak>

Address

The address attribute customizes how a street address is pronounced:

<speak> 
The address is<break time="1s"/>
<say-as interpret-as="address">440 Terry Avenue North, Seattle
WA 98109 USA</say-as>
</speak>
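All of the <say-as> examples in this section share one shape, so a single helper can generate them. The following sketch uses the hypothetical names say_as and speak; the set of accepted values mirrors the list in the Amazon Polly documentation, and restricting the format attribute to dates is an assumption based on the examples in this post.

```python
from typing import Optional
from xml.sax.saxutils import escape

# interpret-as values documented for Amazon Polly.
INTERPRET_AS = {
    "characters", "spell-out", "cardinal", "number", "ordinal", "digits",
    "fraction", "unit", "date", "time", "address", "expletive", "telephone",
}


def say_as(text: str, interpret_as: str, fmt: Optional[str] = None) -> str:
    """Wrap text in a <say-as> tag after validating the interpret-as value."""
    if interpret_as not in INTERPRET_AS:
        raise ValueError(f"unknown interpret-as value: {interpret_as!r}")
    if fmt is not None and interpret_as != "date":
        raise ValueError("this sketch allows format only with date")
    fmt_attr = f' format="{fmt}"' if fmt else ""
    return f'<say-as interpret-as="{interpret_as}"{fmt_attr}>{escape(text)}</say-as>'


def speak(*parts: str) -> str:
    """Join SSML fragments with spaces inside a <speak> root."""
    return "<speak>" + " ".join(parts) + "</speak>"


print(speak("The telephone number is", say_as("1800 3000 9009", "telephone")))
print(speak(say_as("1242", "ordinal"), "and", say_as("1242", "digits")))
```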

Lexicons

We’ve looked at some of the SSML tags readily available in Amazon Polly. Other use cases might require a higher degree of control over pronunciation, and lexicons provide it. You can use lexicons when certain words need to be pronounced in a way that is uncommon in that language.

Another use case for lexicons is with the use of numeronyms, which are abbreviations formed with the help of numbers. For example, Y2K is pronounced as the “year 2000.” You can use lexicons to customize these pronunciations.

Amazon Polly supports lexicon files in .pls and .xml formats. For more information, see Managing Lexicons.
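As an illustration of the Y2K case, the following sketch builds a small lexicon in the W3C Pronunciation Lexicon Specification (PLS) format and uploads it. The helper names build_lexicon and upload_lexicon and the lexicon name y2k are assumptions, and the upload needs boto3 and AWS credentials to be configured.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"


def build_lexicon(aliases: dict, lang: str = "en-US") -> str:
    """Return a PLS document mapping each grapheme to a spoken alias."""
    lexemes = "\n".join(
        "  <lexeme>\n"
        f"    <grapheme>{escape(grapheme)}</grapheme>\n"
        f"    <alias>{escape(alias)}</alias>\n"
        "  </lexeme>"
        for grapheme, alias in aliases.items()
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<lexicon version="1.0" xmlns="{PLS_NS}" '
        f'alphabet="ipa" xml:lang="{lang}">\n'
        f"{lexemes}\n"
        "</lexicon>\n"
    )


def upload_lexicon(name: str, content: str) -> None:
    """Store the lexicon in Amazon Polly under name (hypothetical helper)."""
    import boto3  # imported lazily so build_lexicon works without the SDK

    boto3.client("polly").put_lexicon(Name=name, Content=content)


content = build_lexicon({"Y2K": "year 2000"})
print(content)
```

After upload_lexicon("y2k", content), a synthesize_speech request can apply the lexicon by passing LexiconNames=["y2k"].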

Conclusion

Amazon Polly SSML tags can help you customize pronunciation in a variety of ways. We hope that this post gives you a head start into the world of speech synthesis and powers your applications to provide more lifelike human interactions.


About the Authors

Abilashkumar P C is a Cloud Support Engineer at AWS. He works with customers providing technical troubleshooting guidance, helping them achieve their workloads at scale. Outside of work, he loves driving, following cricket, and reading.

Abhishek Soni is a Partner Solutions Architect at AWS. He works with customers to provide technical guidance for the best outcome of workloads on AWS.