Using Amazon Polly to Deliver Health Care for People with Long-Term Conditions

This is a guest post by Michael Wray, senior software architect at Inhealthcare. Founded in 2012, Inhealthcare has created a digital infrastructure which supports remote home monitoring for the entire UK population.

With an aging population that continues to grow, healthcare is being changed forever. Are we ready for it? Which cost-effective technologies can we use to meet the ever-increasing demands on healthcare-related services?

With the right technology, many healthcare needs can be met remotely, and the National Health Service (NHS) in the UK is already doing this. Although remote healthcare is far from widespread, innovative organizations are realizing that low-cost digital health solutions can deliver significant efficiencies at scale.

Despite being a dinosaur in the communications space, automated telephony can be a perfect communication channel to deploy services at scale because nearly everyone can use it, even if they do not have access to the internet or own a smartphone. And for many older people, the telephone is a piece of technology they are comfortable with and confident using.

In this post, we highlight how Inhealthcare has enabled NHS healthcare providers to use Amazon Polly for remote communications. We show how Amazon Polly is used at design time with our call script design tools to design and simulate automated telephone calls. We also illustrate how protocols are built into automated telephone call scripts, how telephone calls are placed, and how speech synthesized by Amazon Polly is streamed down the telephone line.

Inhealthcare provides a digital health platform that specializes in delivering care in the UK outside hospital walls. The Inhealthcare platform connects to established healthcare software systems and enables clinical protocols and pathways to be modeled, created, tested, executed, and monitored. An important factor in delivering services remotely is choosing an appropriate communication method. Apps, wearables, and web access suit some people, but many individuals struggle to use these technologies. For them, simpler alternatives such as text messaging or automated telephony work better. As a platform provider, we support all of these communication channels, but in this post we focus on how we use Amazon Polly with automated telephony.

IVR

IVR (interactive voice response) has been around for ages, so nearly everybody knows how to use it. You may have met it as the speaking clock helping you set your watch, or as a nuisance call asking about the recent injury you didn't have. Either way, like most people, you have experienced IVR. This matters when delivering healthcare on a national basis, because the service must be simple and inclusive. IVR enables two-way communication. The computer speaks to the person using a synthesized voice, and the person responds with dual-tone multi-frequency (DTMF) signals. These are the tones you hear when you press the buttons on a telephone keypad.
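Each key on a telephone keypad is signaled as the sum of two sine tones. One tone comes from a low-frequency (row) group and the other from a high-frequency (column) group. The following Java sketch is illustrative only and is not part of the Inhealthcare platform; the class name `Dtmf` is hypothetical. It encodes the standard keypad frequency table and generates a key's dual tone as 16-bit PCM samples.

```java
/** Hypothetical helper: standard DTMF keypad frequencies and tone generation (illustrative only). */
final class Dtmf {
    private static final int[] ROW_HZ = {697, 770, 852, 941};        // low-frequency group
    private static final int[] COLUMN_HZ = {1209, 1336, 1477, 1633}; // high-frequency group
    private static final String[] KEYPAD = {"123A", "456B", "789C", "*0#D"};

    private Dtmf() {}

    /** Returns {lowHz, highHz} for a keypad key such as '5' or '#'. */
    static int[] frequencies(char key) {
        char k = Character.toUpperCase(key);
        for (int row = 0; row < KEYPAD.length; row++) {
            int col = KEYPAD[row].indexOf(k);
            if (col >= 0) {
                return new int[] {ROW_HZ[row], COLUMN_HZ[col]};
            }
        }
        throw new IllegalArgumentException("Not a DTMF key: " + key);
    }

    /** Synthesizes a key's dual tone as signed 16-bit PCM samples at the given sample rate. */
    static short[] tone(char key, int durationMillis, int sampleRate) {
        int[] hz = frequencies(key);
        short[] samples = new short[sampleRate * durationMillis / 1000];
        for (int n = 0; n < samples.length; n++) {
            double t = (double) n / sampleRate;
            // Each sine peaks at 1, so the sum peaks at 2; scaling by 0.25 keeps the mix at half of full range.
            double mixed = Math.sin(2 * Math.PI * hz[0] * t) + Math.sin(2 * Math.PI * hz[1] * t);
            samples[n] = (short) Math.round(0.25 * Short.MAX_VALUE * mixed);
        }
        return samples;
    }
}
```

For example, `Dtmf.frequencies('5')` returns 770 Hz and 1336 Hz. In a setup like the one described here, DTMF detection is typically handled by the telephony layer (Asterisk and the SIP trunk). Call-handling code would then usually receive only the decoded digit.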

How it works

The Inhealthcare platform includes the digital pathway engine, which automatically manages and orchestrates remote communications. The integrated development environment (IDE) provides the tooling to design and build clinical pathways and protocols, which are published to the digital pathway engine. The call script designer, an element of the IDE, is used for constructing automated telephone calls.

At the appropriate time, and adhering to a clinical protocol that has been published to the digital pathway engine, a message is sent to the Voice Messaging System (VMS), a micro-service responsible for managing telephone calls. A phone call could last anywhere from a few seconds to several minutes, depending on the complexity of its call script. It is the responsibility of the VMS to interpret the call script, manage the state of a phone call, and report the state back to the digital pathway engine. In progressing through the call script, the VMS queues up commands for the Telephony Interface Manager (TIM) to execute. The first command is to place a call. This is done using Asterisk, an open source PBX system that is configured to connect to a remote SIP trunk provider. SIP (Session Initiation Protocol) is a protocol commonly used by telephony systems.
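The VMS-to-TIM hand-off described above works like a producer/consumer queue. The sketch below is a minimal illustration under that assumption. The command kinds (`PLACE_CALL`, `PLAY_PROMPT`, `COLLECT_DIGITS`, `HANG_UP`) and the class names are hypothetical, not Inhealthcare's actual interfaces. The TIM side is a stub; a real implementation would drive Asterisk at that point.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical command kinds the VMS might queue for the TIM (illustrative only). */
enum TimCommandKind { PLACE_CALL, PLAY_PROMPT, COLLECT_DIGITS, HANG_UP }

final class TimCommand {
    final TimCommandKind kind;
    final String argument; // phone number, prompt text, digit count, or empty

    TimCommand(TimCommandKind kind, String argument) {
        this.kind = kind;
        this.argument = argument;
    }

    @Override
    public String toString() {
        return kind + "(" + argument + ")";
    }
}

/** VMS side: turns call-script steps into commands on a queue that the TIM consumes. */
final class VoiceMessagingSketch {
    // Unbounded queue, so add() never rejects a command.
    private final BlockingQueue<TimCommand> timQueue = new LinkedBlockingQueue<>();

    void startCall(String phoneNumber) {
        // The first command for any call is to place it.
        timQueue.add(new TimCommand(TimCommandKind.PLACE_CALL, phoneNumber));
    }

    void ask(String promptText, int maxDigits) {
        timQueue.add(new TimCommand(TimCommandKind.PLAY_PROMPT, promptText));
        timQueue.add(new TimCommand(TimCommandKind.COLLECT_DIGITS, Integer.toString(maxDigits)));
    }

    void endCall() {
        timQueue.add(new TimCommand(TimCommandKind.HANG_UP, ""));
    }

    /** TIM stub: takes every queued command in order and records what it would execute. */
    List<String> drainForTim() {
        List<TimCommand> batch = new ArrayList<>();
        timQueue.drainTo(batch);
        List<String> executed = new ArrayList<>();
        for (TimCommand command : batch) {
            executed.add(command.toString());
        }
        return executed;
    }
}
```

Because the queue decouples the two services, the VMS can keep interpreting the script while the TIM works through telephony commands at its own pace.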

After the call is established, the VMS steps through the call script. Information is delivered as synthesized speech retrieved from Amazon Polly, and the call recipient responds by pressing buttons on their telephone keypad (DTMF codes). For the call to feel like a natural conversation, Amazon Polly must respond quickly. Delays and dead air frustrate callers and make them more likely to hang up.

Before using Amazon Polly, we used a locally hosted text-to-speech (TTS) engine. Initially we were concerned that Amazon Polly might not respond quickly enough, but we have found its latency to be very low. Amazon Polly is also cost-effective. A locally hosted TTS engine can consume significant CPU and RAM, and with Amazon Polly this is no longer our concern. Its pay-as-you-go pricing is simple and based on usage. A sensible caching strategy reduces costs further. Using a simple algorithm, we split the text into sentences, and if a sentence has already been synthesized, its audio is retrieved directly from a local cache. We currently see a cache hit more than 80% of the time.
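The sentence-level caching strategy can be sketched as follows. The post does not say how sentences are split or which cache is used. This sketch therefore assumes `java.text.BreakIterator` for splitting and an unbounded `ConcurrentHashMap` as the cache. The `synthesizer` function is a hypothetical stand-in for the Amazon Polly call.

```java
import java.io.ByteArrayOutputStream;
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Function;

/** Caches synthesized audio per sentence so repeated sentences skip Amazon Polly entirely. */
final class SentenceAudioCache {
    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();
    private final Function<String, byte[]> synthesizer; // hypothetical wrapper around the Polly call
    private final LongAdder hits = new LongAdder();
    private final LongAdder misses = new LongAdder();

    SentenceAudioCache(Function<String, byte[]> synthesizer) {
        this.synthesizer = synthesizer;
    }

    /** Splits text into trimmed, non-empty sentences using the JDK's sentence BreakIterator. */
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.UK);
        it.setText(text);
        List<String> result = new ArrayList<>();
        for (int start = it.first(), end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    /** Returns audio for the whole text by concatenating per-sentence clips. */
    byte[] audioFor(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String sentence : sentences(text)) {
            byte[] audio = cache.get(sentence);
            if (audio != null) {
                hits.increment();
            } else {
                // computeIfAbsent synthesizes each distinct sentence once, even under concurrency;
                // simultaneous first requests for the same sentence may each be counted as a miss.
                misses.increment();
                audio = cache.computeIfAbsent(sentence, synthesizer);
            }
            out.write(audio, 0, audio.length);
        }
        return out.toByteArray();
    }

    /** Fraction of sentences served from the cache so far. */
    double hitRatio() {
        long h = hits.sum();
        long total = h + misses.sum();
        return total == 0 ? 0.0 : (double) h / total;
    }
}
```

Amazon Polly's PCM output has no header, so clips for consecutive sentences can be joined byte for byte. A production cache would also need a size bound or an eviction policy, which this sketch omits.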

Monitoring

Amazon Polly metrics are integrated with Amazon CloudWatch out of the box, so it is easy to configure monitors and alarms that track Amazon Polly performance. However, this tells only part of the story. We also run our own monitoring, based on the Coda Hale Metrics library, to track values such as full round-trip times and cache hits. These are currently reported to New Relic, but they could just as easily be sent to Amazon CloudWatch. As the following graph shows, Amazon Polly latency is typically about 50 ms.
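The round-trip and cache-hit measurements above can be sketched without any library. This is a dependency-free stand-in, not the Coda Hale (now Dropwizard) Metrics API the team uses. With that library you would instead obtain a timer and meters from a `MetricRegistry` through `registry.timer(...)` and `registry.meter(...)`. The class and method names here are hypothetical.

```java
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

/** Hypothetical, dependency-free stand-in for a round-trip timer plus cache hit/miss meters. */
final class TtsStats {
    private final LongAdder calls = new LongAdder();
    private final LongAdder totalNanos = new LongAdder();
    private final LongAdder cacheHits = new LongAdder();
    private final LongAdder cacheMisses = new LongAdder();

    /** Times one full round trip, for example from request to Amazon Polly until the audio bytes arrive. */
    <T> T timeRoundTrip(Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            totalNanos.add(System.nanoTime() - start);
            calls.increment();
        }
    }

    void recordCacheHit() {
        cacheHits.increment();
    }

    void recordCacheMiss() {
        cacheMisses.increment();
    }

    /** Mean round-trip time in milliseconds, or 0 if nothing has been timed yet. */
    double meanRoundTripMillis() {
        long n = calls.sum();
        return n == 0 ? 0.0 : totalNanos.sum() / (double) n / 1_000_000.0;
    }

    /** Fraction of lookups served from the cache, or 0 if there were none. */
    double cacheHitRatio() {
        long hits = cacheHits.sum();
        long total = hits + cacheMisses.sum();
        return total == 0 ? 0.0 : hits / (double) total;
    }
}
```

Measuring the whole round trip in the application, rather than relying only on CloudWatch's Amazon Polly metrics, also captures network and client-side overhead. That overhead is what the caller actually hears as delay.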

Throttling

Amazon Polly throttles both the number of concurrent requests and the number of requests per second. If you exceed these limits, the request fails with an exception. To stay within them, we give the group of threads that performs speech synthesis a configurable pool size. Although we can process many concurrent calls, we delegate speech synthesis to a small thread group fed by an in-memory blocking queue.

Thread pool for TTS

The following Java code limits the number of concurrent connections to Amazon Polly to 10.

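import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
private final int workerThreads = 10; // maximum concurrent requests to Amazon Polly
private final ExecutorService executorService = Executors.newFixedThreadPool(workerThreads); // extra tasks wait in the pool's in-memory queue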
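The thread-pool pattern above lets many call-handling threads hand speech work to a few synthesis workers. The following sketch expands it. `TtsWorkerPool` and its `synthesizer` function (a stand-in for the Amazon Polly call) are hypothetical names. The peak-concurrency counter exists only to make the bound observable.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical: funnels speech requests from many concurrent calls through a small, fixed pool. */
final class TtsWorkerPool implements AutoCloseable {
    private final ExecutorService workers;
    private final Function<String, byte[]> synthesizer; // hypothetical wrapper around the Polly call
    private final AtomicInteger inFlight = new AtomicInteger();
    private final AtomicInteger peakInFlight = new AtomicInteger();

    TtsWorkerPool(int workerThreads, Function<String, byte[]> synthesizer) {
        // newFixedThreadPool queues excess tasks in an unbounded in-memory LinkedBlockingQueue.
        this.workers = Executors.newFixedThreadPool(workerThreads);
        this.synthesizer = synthesizer;
    }

    /** Queues a synthesis task without blocking the caller. */
    Future<byte[]> submit(String text) {
        return workers.submit(() -> {
            int now = inFlight.incrementAndGet();
            peakInFlight.accumulateAndGet(now, Math::max); // record the highest concurrency seen
            try {
                return synthesizer.apply(text);
            } finally {
                inFlight.decrementAndGet();
            }
        });
    }

    /** Blocks the calling (call-handling) thread until its audio is ready. */
    byte[] synthesize(String text) {
        try {
            return submit(text).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for speech", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("speech synthesis failed", e.getCause());
        }
    }

    int peakConcurrency() {
        return peakInFlight.get();
    }

    @Override
    public void close() {
        workers.shutdown(); // lets queued tasks finish, accepts no new ones
    }
}
```

However many telephone calls are active, at most `workerThreads` requests reach Amazon Polly at once. The rest wait in the executor's queue instead of failing with a throttling error.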

To ensure that the overall request rate stays within the limit, we use the RateLimiter class from the Google Guava project.

Rate limiting

The following Java code limits requests to Amazon Polly to 20 per second.

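import com.google.common.util.concurrent.RateLimiter;
private final double maxRatePerSecond = 20.0;
private final RateLimiter rateLimiter = RateLimiter.create(maxRatePerSecond);
// Call before each request to Amazon Polly: blocks until a permit is available and returns the seconds spent waiting
double waitedSeconds = rateLimiter.acquire();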
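The following sketch shows what a rate limiter does, as a dependency-free stand-in for Guava's RateLimiter. It is not Guava's algorithm. Guava's default limiter can also store unused permits and allow short bursts. This sketch (the class name `EvenRateLimiter` is hypothetical) simply spaces permits evenly at intervals of `1 / permitsPerSecond`. At 20 requests per second, that is one permit every 50 ms.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for Guava's RateLimiter: spaces permits evenly, with no bursting. */
final class EvenRateLimiter {
    private final long intervalNanos;
    private long nextFreeNanos; // guarded by this

    EvenRateLimiter(double permitsPerSecond) {
        if (!(permitsPerSecond > 0)) {
            throw new IllegalArgumentException("permitsPerSecond must be positive");
        }
        this.intervalNanos = (long) (TimeUnit.SECONDS.toNanos(1) / permitsPerSecond);
        this.nextFreeNanos = System.nanoTime();
    }

    /** Blocks until the next permit is due and returns the seconds spent waiting. */
    double acquire() {
        long waitNanos;
        synchronized (this) {
            long now = System.nanoTime();
            // Reserve the earliest free slot (nanoTime values are compared by subtraction).
            long slot = (nextFreeNanos - now > 0) ? nextFreeNanos : now;
            nextFreeNanos = slot + intervalNanos;
            waitNanos = slot - now;
        }
        sleepUninterruptibly(waitNanos); // sleep outside the lock so other callers can reserve slots
        return waitNanos / 1e9;
    }

    private static void sleepUninterruptibly(long nanos) {
        boolean interrupted = false;
        long deadline = System.nanoTime() + nanos;
        try {
            for (long remaining = nanos; remaining > 0; remaining = deadline - System.nanoTime()) {
                try {
                    TimeUnit.NANOSECONDS.sleep(remaining);
                } catch (InterruptedException e) {
                    interrupted = true; // keep waiting; the flag is restored below
                }
            }
        } finally {
            if (interrupted) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```

When combined with the fixed thread pool, each worker would call `acquire()` immediately before sending its request. Both the concurrency limit and the rate limit then apply.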

Speech Synthesis Markup Language (SSML)

The input to Amazon Polly can be raw text or SSML. We use SSML because it gives greater control over how the speech is synthesized. We don't use many of these controls yet, but we expect to use them more and to integrate them into the call script design tool. For example, we could slow down the speech rate. We do use the 'x-loud' prosody volume so that the speech is easy to hear, because many of our listeners are elderly. We also request pulse-code modulation (PCM) output at an 8 kHz sample rate, which is what telephony typically expects. Our favored voice is Brian, a British English voice, which early feedback identified as the preferred choice.
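Because the request text is embedded in SSML, any `&`, `<`, or `>` in it would make the XML invalid, for example a prompt containing "fish & chips". The following minimal sketch escapes the text and wraps it in a `prosody` element. The `Ssml` class is hypothetical. The `rate` attribute shows how speech could be slowed, which the post mentions only as a future possibility. Amazon Polly's `prosody` tag accepts values such as `x-loud` for volume and `slow` for rate.

```java
/** Hypothetical helper that builds SSML for Amazon Polly from untrusted plain text. */
final class Ssml {
    private Ssml() {}

    /** Escapes the five XML special characters so the text cannot break the SSML markup. */
    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Wraps escaped text in speak/prosody; volume and rate are SSML values such as "x-loud" and "slow". */
    static String speak(String text, String volume, String rate) {
        return "<speak><prosody volume=\"" + volume + "\" rate=\"" + rate + "\">"
                + escape(text) + "</prosody></speak>";
    }
}
```

Escaping at this boundary means call-script authors can type ordinary text without knowing any SSML.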

Calling Amazon Polly

The following Java code requests the synthesized speech from Amazon Polly.

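import com.amazonaws.services.polly.model.*;
import com.amazonaws.util.IOUtils;
import java.io.InputStream;
// 'cloud' is an AmazonPolly client (AWS SDK for Java 1.x); 'text' must already be XML-escaped for SSML
final SynthesizeSpeechRequest createSpeechRequest = new SynthesizeSpeechRequest();
createSpeechRequest.setText("<speak><prosody volume='x-loud'>" + text + "</prosody></speak>");
createSpeechRequest.setTextType(TextType.Ssml);
createSpeechRequest.setVoiceId(VoiceId.Brian);
createSpeechRequest.setOutputFormat(OutputFormat.Pcm);
createSpeechRequest.setSampleRate("8000"); // 8 kHz, as telephony expects
final SynthesizeSpeechResult createSpeech = cloud.synthesizeSpeech(createSpeechRequest);
final byte[] byteArray;
try (InputStream audio = createSpeech.getAudioStream()) {
    byteArray = IOUtils.toByteArray(audio); // signed 16-bit, mono, little-endian PCM
}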
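Amazon Polly returns PCM as signed 16-bit, mono, little-endian samples. At the 8 kHz sample rate used here, that is 8,000 samples × 2 bytes = 16,000 bytes per second of audio. This figure is handy for estimating prompt durations or cache sizes. The hypothetical helper below decodes the returned bytes into samples and computes a clip's duration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.ShortBuffer;

/** Hypothetical helper for the raw PCM bytes returned by Amazon Polly. */
final class PcmClip {
    static final int BYTES_PER_SAMPLE = 2; // signed 16-bit, mono

    private PcmClip() {}

    /** Playback length in milliseconds of raw 16-bit mono PCM at the given sample rate. */
    static long durationMillis(byte[] pcm, int sampleRate) {
        long samples = pcm.length / BYTES_PER_SAMPLE;
        return samples * 1000L / sampleRate;
    }

    /** Decodes little-endian 16-bit PCM bytes into samples (a trailing odd byte is ignored). */
    static short[] toSamples(byte[] pcm) {
        ShortBuffer buffer = ByteBuffer.wrap(pcm).order(ByteOrder.LITTLE_ENDIAN).asShortBuffer();
        short[] samples = new short[buffer.remaining()];
        buffer.get(samples);
        return samples;
    }
}
```

For example, a 16,000-byte clip at 8,000 Hz lasts 1,000 ms.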

The call script design tool

We also use Amazon Polly while designing a call script, because it lets us hear immediately how the telephone call will sound. The call script design tool (shown in the following diagram) streamlines this by simulating the telephone call. We watch, listen, and interact as the simulated call steps through the call script. The softphone keypad on the right is used to interact with the simulated call, and the transcript is printed above it. These simulations can also be packaged into automated regression packs to verify and audit the flows through the call script.

Domain Specific Language (DSL)

A simple DSL is used to describe the call script. In the preceding diagram, the panel on the right gives a visual representation of the flow through the call script. The DSL allows branching, conditional logic, and variables to be used, so flows can be as simple or complex as needed.
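The DSL itself is Inhealthcare's own, so the following is not its syntax. As a rough illustration, assume a call script can be modeled as a graph of prompt nodes with keypad branches. This Java sketch (all names and prompt text are hypothetical) builds a script that asks for 1 or 2 and simulates a call from a string of key presses, returning a transcript. A simulation like this could be replayed with fixed inputs as a regression check.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, minimal call-script graph: prompt nodes that branch on a single key press. */
final class CallScript {
    static final class Node {
        final String prompt;
        final Map<Character, String> branches = new HashMap<>(); // key -> next node id

        Node(String prompt) {
            this.prompt = prompt;
        }
    }

    private final Map<String, Node> nodes = new HashMap<>();
    private final String startId;

    CallScript(String startId) {
        this.startId = startId;
    }

    CallScript node(String id, String prompt) {
        nodes.put(id, new Node(prompt));
        return this;
    }

    CallScript branch(String fromId, char key, String toId) {
        nodes.get(fromId).branches.put(key, toId);
        return this;
    }

    /** Simulates a call: plays prompts, consumes key presses in order, and returns the transcript. */
    List<String> simulate(String keyPresses) {
        List<String> transcript = new ArrayList<>();
        String current = startId;
        int next = 0;
        while (true) {
            Node node = nodes.get(current);
            transcript.add("SYSTEM: " + node.prompt);
            if (node.branches.isEmpty()) {
                break; // terminal node: the call ends
            }
            if (next >= keyPresses.length()) {
                transcript.add("SYSTEM: no response, hanging up");
                break;
            }
            char key = keyPresses.charAt(next++);
            transcript.add("CALLER: " + key);
            current = node.branches.getOrDefault(key, current); // unrecognized key repeats the prompt
        }
        return transcript;
    }
}
```

Here a caller pressing 9 and then 1 hears the question twice before the "yes" branch is taken. A regression pack would assert on exactly this kind of transcript.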

The snippet of DSL below would read out synthesized speech to ask for a response of either 1 or 2 on the telephone keypad.

In the call script designer, clicking a node on the diagram automatically produces its synthesized speech. Hearing exactly how something will sound in the live production system during the design stage is a powerful feature. It gives immediate feedback, which leads to a much faster and more effective design process.

Conclusion

Despite the old-fashioned nature of the telephone, it offers an efficient, safe, and cost-effective way to enable home-care communication. Many people with long-term conditions have been using telephone-based digital health services for over three years, and they enjoy the freedom these services give them. We are committed to supporting remote communication for the UK's aging population across all channels. In Amazon Polly's text-to-speech service, we have found a low-latency, low-cost way to scale automated telephony.

Updated April 24, 2018 – The previous disclaimer included in this post has been removed because Amazon Polly is now HIPAA compliant.