AWS Contact Center

SSML in Amazon Connect Contact Flows

For many contact center technologists, the ability to manage voice prompts by using text-to-speech has become a vital component of an ACD platform. The troublesome task of managing recordings has been replaced with the agility of real-time prompt changes through editing a text box. While text-to-speech is convenient, a robotic-sounding voice rarely meets the caller’s expectations. Fortunately, SSML (Speech Synthesis Markup Language) is a tool that can add a personal touch to a text-to-speech prompt.

SSML customizes your speech prompts by inserting elements that adjust aspects such as speed, volume, and pronunciation. These functions are managed through “tags,” which identify the action to be taken and the associated attribute. To hear the changes made to a prompt, you can navigate to the Amazon Polly service from the AWS Management Console, or modify a prompt within a contact flow. It’s also possible to work with SSML through the AWS Command Line Interface (AWS CLI); however, this article focuses on the use of SSML in Amazon Connect. Amazon Connect SSML functionality uses the Amazon Polly service, so the Amazon Polly documentation on SSML also applies to SSML prompts within Amazon Connect contact flows. Currently, the following SSML tags are supported in Amazon Connect:

·       speak
·       p
·       say-as
·       break
·       phoneme
·       sub
·       lang
·       prosody
·       w
·       mark
·       s
·       amazon:effect name="whispered"

There are two areas of speech that influence the quality of the spoken word: the pace and the inflection. In this post, we explain the fundamentals of using SSML and the application of SSML tags that address these qualities: break and prosody. Additionally, understanding the use of the say-as tag assists in reading back numerical values that have context, such as a phone number or a fraction.

SSML Fundamentals

When using tags, you insert the opening tag immediately before the word or words that you want affected. Every opening tag must also have a matching closing tag after the affected text, which is the element name preceded by a forward slash (for example, </speak>).

The first step in using SSML in your contact center IVR prompts is to enclose the text in <speak> tags. This identifies the text that is enhanced by SSML. The <speak> element is the root element of all SSML text.

<speak>Thank you for calling our customer support center. </speak>

With most SSML tags, you insert the desired behavior with an element. Also, you define the associated attribute of the element in the same bracket. For example, if the desired behavior is to change the pitch of a word, the user would first insert the element of prosody. Then, specify the attribute to be modified and the value of the modification, inside of quotation marks.

<prosody pitch="x-high">

In the previous example, prosody is the element to be inserted, pitch is the attribute to be modified, and the value of the pitch is x-high.
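
The tag shown above is only the opening half. As a minimal sketch of how it fits into a complete prompt (the prompt wording here is illustrative), the <prosody> element wraps the affected word, is closed with </prosody>, and sits inside the <speak> root element:

```xml
<speak>
    Thank you for calling. Your call is
    <prosody pitch="x-high">very</prosody>
    important to us.
</speak>
```

Only the word between the opening and closing <prosody> tags is spoken at the higher pitch; the rest of the prompt uses the default voice settings.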

Managing the Pace

Pause

One of the most common use cases for SSML is inserting a pause in a prompt. To do this, insert the <break> element in the text. The length of the pause can be defined numerically, in milliseconds or seconds, or relative to the natural pauses that occur at a comma, period, or paragraph break.

Element
<break time="XXXms"/>
<break strength="XXX"/>

Attributes & Values

time
·       Milliseconds (ms): maximum duration is 10000ms
·       Seconds (s): maximum duration is 10s

strength
·       none: removes the default pause, such as the one after a period
·       x-weak: same strength as none
·       weak: same length as the pause after a comma
·       medium: same strength as weak
·       strong: same length as the pause after a period
·       x-strong: same length as the pause after a paragraph

Example:

<speak>Thank you for calling our customer support center. <break time="2s"/> To route your call properly, <break time="100ms"/> please listen to your menu options.</speak>

Because the <break> element does not wrap any text, it has no separate closing tag. Instead, the forward slash is placed at the end of the same tag, as in <break time="2s"/>.
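
The example above uses only the time attribute. As a sketch of the strength attribute (the prompt wording is illustrative), the following prompt removes the default pause between two short menu options and adds a paragraph-length pause before the closing sentence:

```xml
<speak>
    Press one for sales.<break strength="none"/> Press two for billing.
    <break strength="x-strong"/>
    To hear these options again, please stay on the line.
</speak>
```

Relative strengths are convenient when you want pauses that match the rhythm of the selected voice instead of a fixed duration.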

 

Modifying Speed, Volume, and Pitch

A single element, <prosody>, adjusts the speed, volume, and pitch of one or many words. Insert the <prosody> element along with the attribute to be adjusted and the value for that attribute.

Element
<prosody volume="XXX">
<prosody rate="XXX">
<prosody pitch="XXX">

Attributes & Values

volume
·       x-soft
·       soft
·       medium
·       loud
·       x-loud
·       +ndB: increases the volume by n decibels (+6dB is 6dB above the current level)
·       -ndB: decreases the volume by n decibels (-4dB is 4dB below the current level)

rate
·       x-slow
·       slow
·       medium
·       fast
·       x-fast
·       n%: percentage of the default rate (200% is twice the default rate of medium)

pitch
·       x-low
·       low
·       medium
·       high
·       x-high
·       +n%: raises the pitch by n percent (+5% is 5% above the default pitch of medium, or above the last assigned pitch)
·       -n%: lowers the pitch by n percent (-5% is 5% below the default pitch of medium, or below the last assigned pitch)

When modifying any of the prosody attributes, the most efficient way to evaluate the effect of a value change is to use the Amazon Polly interface from the AWS Management Console.

Example:

<speak><amazon:auto-breaths><prosody pitch="high">Hi!</prosody> Thank you for calling our customer support center. To route your call properly, please select from the following options. <prosody rate="75%">Press one for sales. Press two for billing.</prosody></amazon:auto-breaths></speak>

In this example, we add three elements to affect the prompt: <amazon:auto-breaths> around the entire prompt, a <prosody> pitch change in the greeting, and a <prosody> rate change in the menu. To fine-tune the prompt, use the Amazon Polly interface from the AWS Management Console and adjust the values associated with the two prosody elements.
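
The example above changes one attribute per <prosody> element. As a further sketch (the prompt wording and digits are illustrative), a single <prosody> element can carry more than one attribute, which is useful for reading back important details slowly and clearly:

```xml
<speak>
    Your reference number is
    <prosody rate="slow" volume="loud">4 4 7 2</prosody>.
    <prosody volume="soft">Please keep this number private.</prosody>
</speak>
```

As with any prosody change, it is worth listening to the result in the Amazon Polly console before placing the prompt in a contact flow, because the effect of each value varies by voice.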

 

Say As

One of the challenges that users encounter with text-to-speech is getting unique characters, words, and numbers to be read in a particular context. For example, you might want dates to be read as dates rather than as a string of numbers. While the application of the <say-as> tag can be complex, it’s a necessity when you include data components that need to be spoken contextually.

The main attribute of the <say-as> tag is interpret-as. When interpret-as is set to date, the tag can also take a format attribute.

Element
<say-as interpret-as="XXXXX">[text to be interpreted]</say-as>

Attributes & Values

characters or spell-out
Spells out each letter of the text
<say-as interpret-as="characters">S3</say-as>

cardinal or number
States the number as a cardinal number; for example, 5245 reads as “five thousand, two hundred, forty-five”
<say-as interpret-as="cardinal">5245</say-as>

ordinal
Reads the number as an ordinal number; for example, 52 reads as “fifty-second”
<say-as interpret-as="ordinal">52</say-as>

digits
Reads out each digit individually
<say-as interpret-as="digits">5264</say-as>

fraction
Reads the numbers as a fraction. To accommodate a mixed number, such as 3 ½, insert a “+” between the whole number and the fraction.
<say-as interpret-as="fraction">7/8</say-as>
<say-as interpret-as="fraction">4+3/4</say-as>

unit
Reads back a measurement. The value is a number followed by the unit of measurement, without any spaces.
<say-as interpret-as="unit">4+1/2feet</say-as>

date
Reads back dates written in one of the following formats:
·       mdy: Month-day-year
·       dmy: Day-month-year
·       ymd: Year-month-day
·       md: Month-day
·       dm: Day-month
·       ym: Year-month
·       my: Month-year
·       d: Day
·       m: Month
·       y: Year
·       yyyymmdd: Year-month-day. Inserting question marks results in those parts of the date being skipped.
<say-as interpret-as="date" format="mdy">04-25-1967</say-as>
<say-as interpret-as="date">????0422</say-as>

time
Reads back the numerical text as a duration, in minutes and seconds
<say-as interpret-as="time">5'36"</say-as>

address
Interprets the text as part of a street address
<say-as interpret-as="address">1215 Park Ave</say-as>
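
As a sketch of how several <say-as> values might be combined in one read-back prompt, the following example reuses the sample values from the table above. The prompt wording and the phone number are illustrative. The telephone value is not covered in the table above, but it is listed among the interpret-as values in the Amazon Polly Developer Guide, so confirm its behavior in the Amazon Polly console before relying on it.

```xml
<speak>
    Your confirmation code is
    <say-as interpret-as="spell-out">S3</say-as>
    followed by
    <say-as interpret-as="digits">5264</say-as>.
    <break time="500ms"/>
    Your appointment is on
    <say-as interpret-as="date" format="mdy">04-25-1967</say-as>.
    We will call you back at
    <say-as interpret-as="telephone">2025551212</say-as>.
</speak>
```

Each <say-as> element wraps only the value that needs special handling, so the surrounding sentence is still read normally.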

Conclusion

SSML is a robust language that is changing the way we interact with our customers. While SSML offers many other elements for enhancing text-to-speech functionality, the break, prosody, and say-as tags can easily be added to contact flows for a more natural-sounding voice. For more information about SSML tags supported in Amazon Polly, see the Amazon Polly Developer Guide.