Building a Reliable Text-to-Speech Service with Amazon Polly

This is a guest post by Yiannis Philipopoulos, a Software Developer at Bandwidth. In Yiannis’ words: “Bandwidth’s solutions are shaping the future of how we connect with voice and messaging for mobile apps and large-scale, enterprise-level solutions. At the core of Bandwidth’s business-grade Communications Platform as a Service (CPaaS) offering are communication APIs that allow companies to launch and scale next-generation apps and solutions using the nation’s largest VoIP network.”

Text-to-speech (TTS) technology is evolving rapidly. Thanks to machine learning, computers’ ability to disambiguate text and combine individual sounds into natural-sounding whole words has improved dramatically. Although Amazon Polly provides excellent TTS at low cost, many still use older TTS technologies because they believe that upgrading to a new system isn’t worth the effort.

Bandwidth’s customers use TTS primarily to vocalize menus, reminders, and order information. Bandwidth’s API lets customers quickly purchase telephone numbers, send texts, make calls, and create static or dynamic voice messages. In this post, I show how Bandwidth integrated Amazon Polly to provide on-demand TTS capabilities. I also offer some simple suggestions for taking advantage of the fact that Amazon Polly lets you cache results.

I explain how to use Amazon Polly to most effectively meet our customers’ needs. To illustrate, I have provided demo code for a sample TTS service that you can run right out of the box. The demo code calls Amazon Polly using many of the improvements I discuss in this post. For caveats on using the demo service, see the README file.

The service

The workflow for the Bandwidth API is straightforward. When the Bandwidth API receives a request for TTS, it directs the request to our new internal service. The internal service first checks for a cached response. On a cache miss, it directs the request to Amazon Polly or, if Amazon Polly can’t complete the request for any reason, to a lower-priority TTS vendor. Finally, we store the successful speech result in our cache.
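As a rough illustration of this flow, the following sketch checks a cache, tries vendors in priority order, and stores the first successful result. It is not the service's actual code: the class name, the in-memory map standing in for the cache, and the modeling of each vendor as a function returning an Optional are all assumptions made to keep the example self-contained.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class TtsWorkflowSketch {
    // Hypothetical in-memory stand-in for the cache. A real key would also
    // include the voice and audio format, not just the text.
    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();

    // Vendors in priority order. Each returns audio bytes, or empty if it
    // could not serve the request.
    private final List<Function<String, Optional<byte[]>>> vendors;

    TtsWorkflowSketch(List<Function<String, Optional<byte[]>>> vendors) {
        this.vendors = vendors;
    }

    Optional<byte[]> synthesize(String text) {
        // 1. Serve from the cache when possible.
        byte[] cached = cache.get(text);
        if (cached != null) {
            return Optional.of(cached);
        }
        // 2. Otherwise try each vendor in turn until one succeeds.
        for (Function<String, Optional<byte[]>> vendor : vendors) {
            Optional<byte[]> audio = vendor.apply(text);
            if (audio.isPresent()) {
                // 3. Store the successful result for later requests.
                cache.put(text, audio.get());
                return audio;
            }
        }
        return Optional.empty();
    }
}
```

With Amazon Polly first in the list and the secondary vendor second, a repeated request for the same text never reaches either vendor after the first success.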

Why (primarily) Amazon Polly?

Why did we decide to use Amazon Polly as our primary TTS engine? We reviewed the most important requirements for our use case: uptime, choice of voices, speed, interpretation, cost, and caching. Amazon Polly delivers on all these requirements.

Of course, uptime is an important requirement for any vendor. Issues with our previous vendor’s uptime drove us to investigate new solutions, and made building a redundant system extremely important.

With our previous (now secondary) vendor, we gave customers access to a variety of voices. Our new vendor had to match our previous selection of voices. Amazon Polly offers at least one male and one female voice in every language we support, and has a variety of voices in English.

Speed was another important factor. Because menus and the exchange of live information are a big part of our use case, the ability to start streaming a response back to customers as soon as possible is critical. Nobody likes waiting on a phone menu.

When dealing with text generated by customers, contextual interpretation of input is also important. Although Amazon Polly’s option to use SSML to guarantee outcomes is an excellent feature, it’s impossible to know the intent of all of the text that our customers send us. A service that can successfully disambiguate English (for example, by reading “live” correctly in “I live at this address” as opposed to “we broadcast live”) considerably improves the user experience. This is a feature that Amazon Polly diligently supports. In our testing, we found a single case where Amazon Polly did not generate the expected speech in response to input. The Amazon Polly team was eager to hear about that case. Now the audio plays as expected.

Cost and caching go together. Caching is another feature that made Amazon Polly very appealing. Although providing customers the ability to quickly convert dynamic messages to voice is important, many messages are frequently repeated. Our previous vendor did not allow caching of responses, so our API had to call the vendor for every request. Being able to cache static messages can reduce cost significantly. At Bandwidth, we’re seeing a 78% cache hit rate, which means that only about 22% of requests need to reach a TTS vendor.

Integrating Amazon Polly

Integrating Amazon Polly into your service comes down to getting an audio stream from Amazon Polly. I will show you how to get that stream and share some lessons learned from the integration.

Getting audio from Amazon Polly

Adding a new vendor to a multi-vendor TTS system is a surprisingly small part of the development effort required to build a full system. The following Java code is all that’s required to get an audio stream from Amazon Polly:

   public Optional<VendorResponse> textToSpeech(final Vendor vendor, final String text, final VoiceNames voice, final AudioFormats audioFormat) {
       final SynthesizeSpeechRequest request = new SynthesizeSpeechRequest();
       request.setOutputFormat(audioFormat.getAwsOutputFormat());
       request.setText(text);
       // The sample rate is passed as a String in the AWS SDK for Java
       request.setSampleRate("16000");
       request.setVoiceId(voice.getAwsVoiceId());

       try {
           final SynthesizeSpeechResult synthesizeSpeechResult = client.synthesizeSpeech(request);
           return Optional.of(new VendorResponse(synthesizeSpeechResult.getAudioStream(), Vendor.POLLY));
       } catch (final Exception e) {
           // Pass the exception as well so the stack trace is logged
           LOG.error(e.getMessage(), e);
           return Optional.empty();
       }
   }

How do you use the audio stream best? We have two objectives to fulfill with our stream:

To start streaming the audio back as soon as possible
To cache the response

You have options for streaming the InputStream from Amazon Polly to the OutputStream for your clients. The simplest is to use IOUtils.copy(). However, one of our requirements, which is probably a common requirement for TTS, is to make sure clients get their voice results back as quickly as possible. We ended up opting for a more explicit approach, as shown in the following code:

private VendorResponse writeOutputStream(final VendorResponse vendorResponse, final OutputStream outputStream) {
   try (InputStream vendorInputStream = vendorResponse.getInputStream(); BufferedInputStream bufferedInputStream = new BufferedInputStream(vendorInputStream)) {
       boolean failedOutput = false;
       byte[] buffer = new byte[1024];
       int len;
       LOG.debug("Ready to write to output stream");
       while ((len = bufferedInputStream.read(buffer)) != -1) {
           if (!failedOutput) {
               try {
                   outputStream.write(buffer, 0, len);
                   outputStream.flush();
               } catch (IOException io) {
                   LOG.error("Failed to write to output stream due to exception:\n" + io.getMessage());
                   failedOutput = true;
               }
           }
       }
       LOG.debug("Output stream has been completely written");
       vendorResponse.setStreamEndTime(System.currentTimeMillis());
       vendorResponse.setSuccessful(true);
   } catch (final IOException e) {
       LOG.error(e.getMessage(), e);
       vendorResponse.setSuccessful(false);
   } finally {
       IOUtils.closeQuietly(outputStream);
   }

   return vendorResponse;
}

This code flushes the relatively small buffer every time that it fills. This ensures that the bytes are available to the client as soon as possible, which makes a huge difference with longer texts.
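The two objectives, streaming early and caching, can be combined in a single pass over the vendor stream. The following is a minimal sketch using only the JDK, not the demo's actual implementation: the class and method names are hypothetical, and the choice to keep reading after a client write fails (so that the cached copy is still complete) is an assumption about why the loop above continues after a failed write.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

final class TeeStreamSketch {
    /**
     * Copies vendor audio to the client in small chunks, flushing after each
     * chunk, and returns a full copy of the bytes that can be cached.
     */
    static byte[] streamAndCapture(InputStream vendorAudio, OutputStream client) {
        ByteArrayOutputStream captured = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        boolean clientFailed = false;
        int len;
        try {
            while ((len = vendorAudio.read(buffer)) != -1) {
                // Always keep a copy for the cache.
                captured.write(buffer, 0, len);
                if (!clientFailed) {
                    try {
                        client.write(buffer, 0, len);
                        client.flush(); // make the bytes available right away
                    } catch (IOException e) {
                        // The client went away; keep draining the vendor
                        // stream so the captured copy is complete.
                        clientFailed = true;
                    }
                }
            }
        } catch (IOException e) {
            // A failed read means the audio is incomplete and must not be cached.
            throw new UncheckedIOException("Vendor stream failed", e);
        }
        return captured.toByteArray();
    }
}
```

Holding the whole response in memory is reasonable for short utterances; for very long audio, a temporary file may be a better capture target.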

Lesson learned: reuse your client

While working with Amazon Polly, we hit one major roadblock that might also trip up other developers. Your most important, and perhaps least obvious, task is to make sure that your Amazon Polly client is being reused. In our initial implementation, which used code similar to the preceding example, we generated a new default client with each request, grabbed the stream, and threw the client away. This approach is tempting: client creation is so quick that keeping the client around as state seems unnecessary.

During testing, we saw that with larger strings, our stream would abruptly terminate. We mistakenly created the client, grabbed the stream, and then lost all references to the client. For sufficiently large streams with a lot of traffic, the garbage collector came along, cleaned up the client, and broke the stream. Make sure that your client stays alive until the stream is fully processed. Without thorough testing, we might have missed this issue.
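One simple way to keep the client alive is to create it once and hold it in a field of a long-lived service object. The sketch below shows only that pattern: SpeechClient is a hypothetical stand-in for the Amazon Polly client so the example runs without the AWS SDK, and the class and method names are assumptions.

```java
import java.io.InputStream;
import java.util.function.Supplier;

final class ReusedClientSketch {
    /** Hypothetical stand-in for the Amazon Polly client. */
    interface SpeechClient {
        InputStream synthesize(String text);
    }

    // Created once and held in a field, so the client stays strongly
    // reachable for as long as the service itself is alive.
    private final SpeechClient client;

    ReusedClientSketch(Supplier<SpeechClient> clientFactory) {
        this.client = clientFactory.get();
    }

    InputStream textToSpeech(String text) {
        // No per-request client creation: every call shares one instance.
        return client.synthesize(text);
    }
}
```

In a dependency-injection setup, the same effect is usually achieved by registering the client as a singleton.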

Note: If your service needs to process very large strings (over the 1500 character Amazon Polly limit), check out this blog post on batching Amazon Polly requests.
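The linked post describes its own batching approach. Purely as an illustration of the idea, the following sketch splits long text into pieces no longer than a given limit, breaking at whitespace where it can. The class name and the whitespace-based strategy are assumptions; a production splitter would likely prefer sentence boundaries and would need to treat SSML markup carefully.

```java
import java.util.ArrayList;
import java.util.List;

final class TextChunkerSketch {
    /** Splits text into pieces of at most maxChars characters each. */
    static List<String> split(String text, int maxChars) {
        if (maxChars < 1) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> chunks = new ArrayList<>();
        String remaining = text.trim();
        while (remaining.length() > maxChars) {
            // Look backward from the limit for the last whitespace character.
            int cut = -1;
            for (int i = maxChars; i > 0; i--) {
                if (Character.isWhitespace(remaining.charAt(i))) {
                    cut = i;
                    break;
                }
            }
            // No whitespace in the window: fall back to a hard cut.
            if (cut == -1) {
                cut = maxChars;
            }
            chunks.add(remaining.substring(0, cut).trim());
            remaining = remaining.substring(cut).trim();
        }
        if (!remaining.isEmpty()) {
            chunks.add(remaining);
        }
        return chunks;
    }
}
```

Each chunk can then be sent as its own request and the resulting audio streams played back in order.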

Caching

We set up our cache in Amazon Simple Storage Service (Amazon S3). There’s lots of information on using Amazon S3, so I will highlight the key points needed for this service.

Setting up the S3 bucket

After you have an S3 bucket, set up a cache directory. For our demo, we call it demo-cache. In the Amazon S3 console, choose the bucket, then choose Management, Lifecycle, and Add Lifecycle rule. Then add a new Delete rule on the demo-cache prefix. We settled on a 7-day auto-delete rule. We didn’t arrive at this number scientifically. We felt that it balanced keeping entries long enough to get the expected volume of cache hits against the cost of storing unnecessary or stale data. Setting up your S3 bucket is one of the few setup tasks that our demo requires.

Writing the cache to S3

Writing to an S3 bucket is simple and well documented. However, you need to make sure that all of your cache writes are valid. The simplest way to make sure that your writes are corruption-free is to use the Amazon S3 MD5 capabilities. Although this method is much more verbose than simply calling putObject(), the following code uses Amazon S3’s MD5 checksum capabilities to verify that your files were stored accurately.

public void putCacheResponse(final String text, final VoiceNames voice, final AudioFormats audioFormat, final VendorResponse vendorResponse) {
   LOG.info(String.format("Attempting to cache entry for [%s] from vendor %s", StringUtils.truncate(text, 100), vendorResponse.getVendor().toString()));
   final String keyString = createKeyString(text, voice, audioFormat);
   final byte[] responseBytes = vendorResponse.getOutputStream().toByteArray();
   final byte[] responseMD5 = getMD5DigestForBytes(responseBytes);
   final String responseMD5Base64 = DatatypeConverter.printBase64Binary(responseMD5);

   try {
       final ObjectMetadata metadata = new ObjectMetadata();
       metadata.setContentLength(responseBytes.length);
       final PutObjectResult result = s3Client.putObject(s3Bucket, keyString, new ByteArrayInputStream(responseBytes), metadata);
       // Compare from the known non-null side, because getContentMd5() can return null
       if (responseMD5Base64.equals(result.getContentMd5())) {
           LOG.info(String.format("Entry for [%s] from vendor %s successfully added to S3 with key %s", StringUtils.truncate(text, 100), vendorResponse.getVendor().name(), keyString));
       } else {
           LOG.info(String.format("MD5 sum mismatch. Expected [%s] but received [%s]", responseMD5Base64, result.getContentMd5()));
           s3Client.deleteObject(s3Bucket, keyString);
       }
   } catch (Exception e) {
       LOG.error("Error occurred while trying to cache an utterance.", e);
       try {
           // Send a delete so a corrupt instance doesn't remain in S3.
           s3Client.deleteObject(s3Bucket, keyString);
       } catch (AmazonClientException e2) {
           // Don't want to duplicate logs of non-transient S3 issues.
       }
       throw e;
   }
}

private String createKeyString(final String text, final VoiceNames voice, final AudioFormats audioFormat) {
   // prefix/voice/md5.output_format
   return String.format("%s/%s/%s.%s", cachePrefix, voice.name(),
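       // Use a fixed locale and charset so every host derives the same key for the same text
       DatatypeConverter.printBase64Binary(getMD5DigestForBytes(text.toLowerCase(Locale.ROOT).getBytes(StandardCharsets.UTF_8))),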
       audioFormat.getAwsOutputFormat().toString());
}

private byte[] getMD5DigestForBytes(byte[] bytes) {
   try {
       return MessageDigest.getInstance("MD5").digest(bytes);
   } catch (NoSuchAlgorithmException e) {
       LOG.error("MD5 algorithm not found.");
   }
   return new byte[0];
}
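To make the key scheme concrete, here is a self-contained sketch of deriving a cache key with the same prefix/voice/hash.format layout. It is a variation rather than the demo's code: it assumes URL-safe Base64 from java.util.Base64 instead of DatatypeConverter, so that the hash can never contain a `/` and add an unintended path segment to the S3 key. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.Locale;

final class CacheKeySketch {
    /** Builds a key of the form prefix/voice/hash.format. */
    static String cacheKey(String prefix, String voice, String format, String text) {
        // Normalize case so "Hello" and "hello" share one cache entry.
        byte[] normalized = text.toLowerCase(Locale.ROOT).getBytes(StandardCharsets.UTF_8);
        // URL-safe Base64 uses '-' and '_' instead of '+' and '/'.
        String hash = Base64.getUrlEncoder().withoutPadding().encodeToString(md5(normalized));
        return String.format("%s/%s/%s.%s", prefix, voice, hash, format);
    }

    private static byte[] md5(byte[] bytes) {
        try {
            return MessageDigest.getInstance("MD5").digest(bytes);
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

Failing loudly when the digest is unavailable avoids the situation where every text silently maps to the same key.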

Fallback

Because this post is about setting up a resilient service, it would be incomplete if we didn’t provide information on setting up a fallback vendor. The good news is that there are many opportunities to reuse code. Because all vendors will, at some point, need to give you an InputStream, it’s easy to abstract most of the business logic that you want to apply around your vendors. To see how little code is required per vendor, see our repository.

To control vendor failover, we decided to go with Netflix’s Hystrix. Although our code still iterates over vendors on failure, by adding a few simple ConfigurationManager properties per vendor and breaking code into HystrixCommand classes, our service can respond to poor vendor health almost effortlessly.

After a vendor fails to service a specified number of requests, the vendor is removed from rotation. After a specified period of time, Hystrix retries the vendor. If the vendor still fails to service requests, it is removed again.
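The behavior described above is the circuit breaker pattern. The following sketch illustrates the state machine in a deliberately simplified form and is not how Hystrix is implemented: Hystrix trips on an error percentage over a rolling window once a request volume threshold is reached, whereas this example simply counts consecutive failures. The class name, the injected clock, and the thresholds are assumptions.

```java
import java.util.function.LongSupplier;

final class CircuitBreakerSketch {
    private final int failureThreshold;
    private final long sleepWindowMillis;
    private final LongSupplier clock; // injected so the timing is testable
    private int consecutiveFailures = 0;
    private long openedAt = -1; // -1 means the circuit is closed

    CircuitBreakerSketch(int failureThreshold, long sleepWindowMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMillis = sleepWindowMillis;
        this.clock = clock;
    }

    /** True if the vendor may be called: closed circuit, or sleep window elapsed. */
    synchronized boolean allowRequest() {
        return openedAt < 0 || clock.getAsLong() - openedAt >= sleepWindowMillis;
    }

    /** A success closes the circuit and puts the vendor back in rotation. */
    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        openedAt = -1;
    }

    /** Enough failures open the circuit; a failed retry restarts the sleep window. */
    synchronized void recordFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = clock.getAsLong();
        }
    }
}
```

A caller checks allowRequest() before contacting a vendor, skips to the next vendor when it returns false, and reports the outcome with recordSuccess() or recordFailure().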

Here’s a taste of how little effort is needed. We simply add the following properties for each vendor:

@PostConstruct
public void init() {
   final AbstractConfiguration configInstance = ConfigurationManager.getConfigInstance();

   vendors.forEach(vendor -> {
       configInstance.setProperty("hystrix.command." + vendor.name() + ".circuitBreaker.enabled", "true");
       configInstance.setProperty("hystrix.command." + vendor.name() + ".circuitBreaker.requestVolumeThreshold", "10");
       configInstance.setProperty("hystrix.command." + vendor.name() + ".metrics.rollingStats.timeInMilliseconds", "60000");

       // Timeout requests after 2 seconds
       configInstance.setProperty("hystrix.command." + vendor.name() + ".execution.isolation.thread.timeoutInMilliseconds", "2000");

       // Don't attempt to re-close the circuit for at least 30 seconds
       configInstance.setProperty("hystrix.command." + vendor.name() + ".circuitBreaker.sleepWindowInMilliseconds", "30000");
   });
}

Then we break our work into HystrixCommand objects:

public class VendorCommand extends HystrixCommand<VendorResponse> {
   // Fields omitted
   protected VendorCommand(final Vendor vendor, final String text, final VoiceNames voiceName,
                           final AudioFormats audioFormat, final VendorSao vendorSao) {
       super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("Text2Speech"))
                   .andCommandKey(HystrixCommandKey.Factory.asKey(vendor.name())));
        // Fields
   }

   @Override
   protected VendorResponse getFallback() {
       LOG.warn("Returning fallback for " + vendor.name());
       // Return null so we try the next vendor in the list
       return null;
   }

   @Override
   protected VendorResponse run() {
       // Removed run code for brevity
   }
}

For brevity, I omitted details on how the class runs. You can see the full classes in the repository. Adding the preceding properties and using this simple runnable class structure allows our service to gracefully take vendors in and out of rotation as they experience issues without developer intervention. This simple framework, which abstracts a lot of state and developer effort, simplifies managing multiple vendors.

Conclusion

Migrating to a new service can be daunting. In this post, I show how Bandwidth built a small, reliable, and fast TTS application that is backed by Amazon Polly. We are now using high-quality voices at low cost. The developer effort required to build this new service was surprisingly minimal.

I’ve also shown how to build your own service and provided a demo service that you can use out of the box to start. If you’re still using older TTS software with low-quality voices, I hope you’re now asking yourself “Why?”

Additional Reading

Take your skills to the next level. Learn how to build your own text-to-speech applications with Amazon Polly.