Build Your Own Text-to-Speech Applications with Amazon Polly

by Tomasz Stachlewski

In general, speech synthesis isn’t easy. You can’t assume that an application that simply reads each letter of a sentence will produce output that makes sense. A few common challenges for text-to-speech applications include:

  • Words that are written the same way, but that are pronounced differently: I live in Las Vegas. vs. This presentation broadcasts live from Las Vegas.
  • Text normalization: disambiguating abbreviations, acronyms, and units. For example, St. can be expanded as street or saint.
  • Converting text to phonemes in languages with a complex mapping between spelling and sound, such as English: tough, through, though. In this example, similar parts of different words are pronounced differently depending on the word and context.
  • Foreign words (déjà vu), proper names (François Hollande), slang (ASAP, LOL), etc.
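To make the normalization challenge concrete, here is a deliberately naive sketch that expands St. by looking only at the neighboring words. The function name and the heuristic are assumptions made for illustration; this is not how Amazon Polly works.

```python
def expand_st(sentence):
    """Naively expand 'St.' using only the neighboring words (toy heuristic)."""
    tokens = sentence.split()
    expanded = []
    for i, token in enumerate(tokens):
        if token != "St.":
            expanded.append(token)
            continue
        previous_token = tokens[i - 1] if i > 0 else ""
        next_token = tokens[i + 1] if i + 1 < len(tokens) else ""
        if previous_token[:1].isupper():
            # A capitalized word before "St." suggests a street name.
            expanded.append("Street")
        elif next_token[:1].isupper():
            # A capitalized word after "St." suggests a saint's name.
            expanded.append("Saint")
        else:
            # No clue either way: leave the abbreviation untouched.
            expanded.append(token)
    return " ".join(expanded)
```

The heuristic handles "Turn left on Main St. near St. Patrick's Cathedral" correctly, but it turns "Visit St. Louis" into "Visit Street Louis" because the sentence-initial word is capitalized. Simple rules like this break quickly, which is why context-aware normalization is hard.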

Amazon Polly provides speech synthesis functionality that overcomes those challenges, allowing you to focus on building applications that use text-to-speech instead of addressing interpretation challenges.

Amazon Polly turns text into lifelike speech. It lets you create applications that talk naturally, enabling you to build entirely new categories of speech-enabled products. Amazon Polly is an Amazon AI service that uses advanced deep learning technologies to synthesize speech that sounds like a human voice. It currently includes 47 lifelike voices in 24 languages, so you can select the ideal voice and build speech-enabled applications that work in many different countries.

In addition, Amazon Polly delivers the consistently fast response times required to support real-time, interactive dialog. You can cache and save Polly’s audio files for offline replay or redistribution. (In other words, what you convert and save is yours. There are no additional text-to-speech charges for using the speech.) Polly is also easy to use. You simply send the text that you want to convert into speech to the Amazon Polly API. Amazon Polly immediately returns the audio stream to your application so that your application can play it directly or store it in a standard audio file format, such as MP3.
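As a minimal sketch of that request-and-store flow, the function below sends text to Polly and writes the returned audio stream to an MP3 file. It assumes a Boto 3 Polly client is passed in; the function name, the default voice (Joanna), and the output path are choices made for this illustration.

```python
def synthesize_to_mp3(polly_client, text, voice_id="Joanna", path="speech.mp3"):
    """Send text to Polly and store the returned audio stream as an MP3 file.

    polly_client is assumed to be a Boto 3 client, e.g. boto3.client("polly").
    Returns the number of audio bytes written.
    """
    response = polly_client.synthesize_speech(
        Text=text,
        OutputFormat="mp3",
        VoiceId=voice_id,
    )
    # AudioStream is a file-like object; read it fully before saving.
    audio = response["AudioStream"].read()
    with open(path, "wb") as audio_file:
        audio_file.write(audio)
    return len(audio)
```

With AWS credentials configured, a call might look like synthesize_to_mp3(boto3.client("polly"), "Hello from Polly"). Passing the client in as an argument keeps the function easy to exercise with a stand-in object.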

In this blog post, we create a basic, serverless application that uses Amazon Polly to convert text to speech. The application has a simple user interface that accepts text in many different languages and then converts it to audio files, which you can play from a web browser. We’ll use blog posts, but you can use any type of text. For example, you can use the application to read recipes while you are preparing a meal, or news articles or books while you’re driving or riding a bike.

The application’s architecture

The following diagram shows the application architecture. It uses a serverless approach, which means that we don’t need to work with servers: no provisioning, no patching, no scaling. The cloud automatically takes care of this, allowing us to focus on our application.

The application provides two methods – one for sending information about a new post, which should be converted into an MP3 file, and one for retrieving information about the post (including a link to the MP3 file stored in an S3 bucket). Both methods are exposed as RESTful web services through Amazon API Gateway. Let’s look at how the interaction works in the application.


AI Tech Talk: How to Get the Most Out of Amazon Polly, a Text-to-Speech Service

by Victoria Kouyoumjian


Although there are many ways to optimize the speech generated by Amazon Polly’s text-to-speech voices, new customers may find it challenging to quickly learn how to apply the most effective enhancements in each situation. This webinar shows customers the ways in which they can modify the speech output, and shares insider tips to help them get the most out of the Polly service. It provides a comprehensive overview of the tools and techniques available for modifying Polly speech output, including SSML tags, lexicons, and punctuation. Other topics include recommendations for streamlining the process of applying these techniques, and how to provide feedback that the Polly team can use to continually improve the quality of voices for you.

Learning Objectives

  • Build a simple speech-enabled app with Polly’s text-to-speech voices.
  • Learn about the complete set of available SSML tags, and how you can apply them in order to modify and enhance your speech output.
  • Learn how you can override the default Polly pronunciation for specific words, by creating a lexicon of these words, along with the pronunciation that matches your needs.
  • Learn about how you can use punctuation to modify the way text is spoken by Polly voices.
  • Get insider tips on the best speech optimization techniques to apply to each of the most common speech production concerns.
  • Discover ways to streamline the process of getting the most out of Polly voices through SSML tags and lexicons.
  • Find out the best way to submit your feedback on Polly voices, pronunciation, and the available feature set, so that we can continue to improve this service for you!
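As a small taste of the SSML techniques listed above, the following sketch wraps sentences in a prosody tag and separates them with break tags. The helper name and default values are assumptions for illustration; the resulting string would be sent to Polly as SSML rather than plain text (with Boto 3, by setting TextType to "ssml").

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET


def build_ssml(sentences, pause_ms=500, rate="slow"):
    """Join sentences into one SSML document with a pause between them."""
    pause = f'<break time="{pause_ms}ms"/>'
    # Escape characters such as & and < so that the markup stays well formed.
    body = pause.join(escape(sentence) for sentence in sentences)
    ssml = f'<speak><prosody rate="{rate}">{body}</prosody></speak>'
    # Parsing raises ParseError if the document is not well-formed XML.
    ET.fromstring(ssml)
    return ssml
```

For example, build_ssml(["Tom & Jerry", "The end."], pause_ms=300) produces a speak element in which the ampersand is escaped and a 300 ms pause separates the two sentences.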

Monday, March 27, 2017 9:00 AM PDT – 10:00 AM PDT

Learn More and Register

Updated AWS CloudFormation Deep Learning Template Adds New Features and Capabilities

by Naveen Swamy

AWS CloudFormation, which creates and configures Amazon Web Services resources with a template, simplifies the process of setting up a distributed deep learning cluster. The AWS CloudFormation Deep Learning template uses the latest updated Amazon Deep Learning AMI (which provides Apache MXNet, TensorFlow, Caffe, Theano, Torch, and CNTK frameworks) to launch a cluster of EC2 instances and other AWS resources needed to perform distributed deep learning. AWS CloudFormation creates all resources in the customer account.

We’ve updated the AWS CloudFormation Deep Learning template with additional capabilities, including automation that dynamically adjusts the cluster to the maximum number of available worker instances when an instance fails to provision (perhaps because a limit was reached). The template also lets you choose between GPU and CPU instance types, and adds support for running your cluster under either Ubuntu or Amazon Linux. We’ve also added the ability to provision a new Amazon EFS file system, or attach an existing one, to your cluster so that you can easily share code, data, logs, and results.

To learn more, visit the AWS Labs – Deep Learning GitHub repo and follow the tutorial, where we show how easy it is to run distributed training on AWS using the MXNet and TensorFlow frameworks.

Use Amazon Rekognition to Build an End-to-End Serverless Photo Recognition System

by Vladimir Budilov

Imagine you work for a marketing agency that has tens of thousands of stock images. You find that many images don’t have descriptive file names and others are completely mislabeled. You don’t want to spend hours and hours relabeling them and moving them around to different folders. But what if you could find the images you need without relying on metadata?  In this blog post, we will review an end-to-end solution to show you how to do this using Amazon Rekognition.

Amazon Rekognition is a service that makes it easy to add image analysis to your applications. With Rekognition, you can detect objects, scenes, and faces in images. You can also search and compare faces. Rekognition’s API lets you quickly add sophisticated deep learning-based visual search and image classification to your applications.

In this post, we’ll focus on searching for objects and scenes in images. A future post will focus on searching for faces.

The solution requires three general steps:

  1. Adding images
  2. Searching for images
  3. Removing images

Below is a general outline of each step so that you can get a mental picture of how the process works. These sections are followed by a script that automates these steps for you so that you can quickly try out the use case.

Adding an image

First, you authenticate with Amazon Cognito. Then you upload the image to a bucket on Amazon S3 using the AWS CLI or a custom app. The S3 bucket has an ObjectCreated event that triggers an AWS Lambda function, passing it the object key and bucket name. The function calls Rekognition’s object and scene detection API. After Rekognition returns the labels describing the picture, the Lambda function saves the image’s S3 location, along with the retrieved metadata, to an Amazon Elasticsearch domain.
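The core of such a Lambda function might look like the sketch below. It assumes a Boto 3 Rekognition client is passed in, and it only builds the document to be stored; the function name, the document fields, and the confidence threshold are assumptions for illustration, and the Elasticsearch indexing step is left out.

```python
from urllib.parse import unquote_plus


def document_for_s3_event(event, rekognition_client, min_confidence=75.0):
    """Build a searchable document from an S3 ObjectCreated event.

    rekognition_client is assumed to be boto3.client("rekognition").
    """
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = unquote_plus(record["s3"]["object"]["key"])
    response = rekognition_client.detect_labels(
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        MinConfidence=min_confidence,
    )
    return {
        "s3_location": f"s3://{bucket}/{key}",
        "labels": [label["Name"] for label in response["Labels"]],
    }
```

A real handler would pass the returned document to whatever client it uses to write to the Elasticsearch domain.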



AWS Collaborates With the National Science Foundation to Foster Innovation

by Sanjay Padhi


Amazon Web Services and the National Science Foundation (NSF) are collaborating to foster innovation in big data research. Under the AWS Research Initiative (ARI) program, AWS and NSF will respectively support innovative research in the field of Big Data. With the advancements of techniques and technologies such as cloud-based Artificial Intelligence, Machine Learning, Big Data analytics and High-Performance Computing, ARI BIGDATA grants will help researchers maximize the value of their NSF grants to accelerate the pace of innovation. A total of $26.5 million from the NSF, in addition to $3M in cloud credits from AWS, will be awarded to between 27 and 35 projects, for a span of 3 to 4 years, subject to availability of funding.

Research in foundational and innovative applications can benefit from AWS’s latest services including deep learning tools, frameworks for both supervised and unsupervised learning, natural language understanding, automatic speech recognition, visual search and image recognition, text-to-speech (TTS), and various other machine learning (ML) technologies. AWS’s latest FPGA processors, Amazon EC2 F1 instances, can be extremely useful in streaming and resident big data analytics, by providing processing speeds that are order(s) of magnitude higher than those of traditional computational and/or graphics processors.

Researchers in universities and colleges, nonprofits, non-academic organizations, and state and local government can apply. Requests for AWS cloud credits must adhere to a 70-30 split in funding between the requested NSF funds and the requested cloud services, respectively. The submission window for applications is March 15 – 22, 2017.

To learn how to apply, see the NSF’s submission page. Researchers may submit applications via the NSF FastLane System or

Month in Review: February 2017

by Derek Young

The AWS AI Blog launched in February! Take a look at our summaries below and learn, comment, and share. Thanks for reading!


Welcome to the New AWS AI Blog!
If you ask 100 people for the definition of “artificial intelligence,” you’ll get at least 100 answers, if not more. At AWS, we define it as a service or system which can perform tasks that usually require human-level intelligence such as visual perception, speech recognition, decision making, or translation. In this post, learn more about the new AWS AI Blog.

The AWS Deep Learning AMI, Now with Ubuntu
In this blog post, learn how to get started with the AWS Deep Learning AMI for Ubuntu that is now available in the AWS Marketplace in addition to the Amazon Linux version. The AWS Deep Learning AMI lets you run deep learning in the Cloud, at any scale.

Classify a Large Number of Images with Amazon Rekognition and AWS Batch
Many AWS customers who have images stored in S3 want to query for objects and scenes detected by Rekognition. This post describes how to use AWS services to create an application called bucket-rekognition-backfill that cost effectively gets and stores the Rekognition labels (objects, scenes, or concepts) from images stored in an S3 bucket, and processes images that are added to, updated, and deleted from the bucket.

AWS Podcast #175: Artificial Intelligence with Dr. Matt Wood
Matt Wood sat down with Simon Elisha from the AWS Podcast to talk about the emerging world of artificial intelligence. In addition to speaking about Amazon Lex, Amazon Polly, Amazon Rekognition, and Apache MXNet, they also do a little reminiscing about days gone by.

Building Better Bots (Part 1) and Building Better Bots (Part 2)
Amazon Lex is a service that allows developers to build conversational interfaces for voice and text into applications. With Amazon Lex, the same deep learning technologies that power Amazon Alexa are now available to any developer, so you can quickly and easily build sophisticated, natural language conversational bots (chatbots). Amazon Lex’s advanced deep learning technology provides automatic speech recognition (ASR), for converting speech to text, and natural language understanding (NLU), to recognize the intent of text, so you can build applications with a highly engaging user experience.


Predicting Customer Churn with Amazon Machine Learning

by Denis V. Batalov

Note: This post has a companion talk that was delivered at AWS re:Invent 2016.

Losing customers is costly for any business. Identifying unhappy customers early on gives you a chance to offer them incentives to stay. This post describes using machine learning (ML) for the automated identification of unhappy customers, also known as customer churn prediction. ML models rarely give perfect predictions though, so my post is also about how to incorporate the relative costs of prediction mistakes when determining the financial outcome of using ML.

I use an example of churn that is familiar to all of us–leaving a mobile phone operator.  Seems like I can always find fault with my provider du jour! And if my provider knows that I’m thinking of leaving, it can offer timely incentives–I can always use a phone upgrade or perhaps have a new feature activated–and I might just stick around. Incentives are often much more cost effective than losing and reacquiring a customer.
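To see why the relative costs of mistakes matter, consider a rough sketch of the arithmetic. The dollar amounts are hypothetical, and the model assumes, as a simplification, that every customer who receives an incentive stays.

```python
def campaign_cost(true_pos, false_pos, false_neg,
                  incentive_cost=100.0, churn_cost=500.0):
    """Total cost when incentives go to every customer flagged as a churner.

    incentive_cost and churn_cost are hypothetical figures for illustration.
    """
    # Flagged customers (correctly or not) all receive the incentive.
    incentives = (true_pos + false_pos) * incentive_cost
    # Churners the model missed leave and must be replaced.
    missed = false_neg * churn_cost
    return incentives + missed


def do_nothing_cost(true_pos, false_neg, churn_cost=500.0):
    """Cost of offering no incentives: every actual churner leaves."""
    return (true_pos + false_neg) * churn_cost
```

With 40 churners caught, 10 loyal customers flagged by mistake, and 5 churners missed, campaign_cost(40, 10, 5) is 7,500, compared with do_nothing_cost(40, 5) of 22,500. An imperfect model can still pay off as long as a missed churner costs much more than an unnecessary incentive.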

Churn dataset

Mobile operators have historical records on which customers ultimately ended up churning and which continued using the service. We can use this historical information to construct an ML model of one mobile operator’s churn using a process called training. After training the model, we can pass the profile information of an arbitrary customer (the same profile information that we used to train the model) to the model, and have the model predict whether this customer is going to churn. Of course, we expect the model to make mistakes–after all, predicting the future is tricky business! But I’ll also show how to deal with prediction errors.

The dataset I use is publicly available and was mentioned in the book “Discovering Knowledge in Data” by Daniel T. Larose. It is attributed by the author to the University of California Irvine Repository of Machine Learning Datasets, and can be downloaded from the author’s website in .csv format.

By modern standards, it’s a relatively small dataset, with only 3,333 records, where each record uses 21 attributes to describe the profile of a customer of an unknown US mobile operator. The attributes are:

  • State: the US state in which the customer resides, indicated by a two-letter abbreviation; for example, OH or NJ
  • Account Length: the number of days that this account has been active
  • Area Code: the three-digit area code of the corresponding customer’s phone number
  • Phone: the remaining seven-digit phone number
  • Int’l Plan: whether the customer has an international calling plan: yes/no
  • VMail Plan: whether the customer has a voice mail feature: yes/no
  • VMail Message: presumably the average number of voice mail messages per month
  • Day Mins: the total number of calling minutes used during the day
  • Day Calls: the total number of calls placed during the day
  • Day Charge: the billed cost of daytime calls
  • Eve Mins, Eve Calls, Eve Charge: the minutes used, number of calls, and billed cost for calls placed during the evening
  • Night Mins, Night Calls, Night Charge: the minutes used, number of calls, and billed cost for calls placed during nighttime
  • Intl Mins, Intl Calls, Intl Charge: the minutes used, number of calls, and billed cost for international calls
  • CustServ Calls: the number of calls placed to Customer Service
  • Churn?: whether the customer left the service: true/false

The last attribute, Churn?, is known as the target attribute–the attribute that we want the ML model to predict. Because the target attribute is binary, our model will be performing binary prediction, also known as binary classification.

Amazon Machine Learning

The simplest way to build the model is to use the AWS ML service, Amazon Machine Learning (Amazon ML), using the binary classification model. First, I prepare and analyze the training dataset, then I create the model, and finally I evaluate the model to decide whether I can use it.

Preparing the training data

With Amazon ML, I need to first create a datasource representing the training dataset. I upload the corresponding CSV file to Amazon S3 and use the resulting S3 URL in constructing the datasource. (To learn how to do this, see AWS documentation for Using Amazon S3 with ML.)

Before doing that, I changed the column names in the .csv file slightly, eliminating special characters and replacing spaces with underscores. I also removed the trailing ‘.’ from each line so that Amazon ML can recognize False and True as proper binary values.
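The cleanup described above can be sketched in a few lines. This is one possible way to do it, not the script used for the post; the function names are made up, and it assumes that no field contains an embedded comma.

```python
import re


def clean_column_name(name):
    """Replace spaces with underscores and drop other special characters."""
    name = name.strip().replace(" ", "_")
    return re.sub(r"[^0-9A-Za-z_]", "", name)


def clean_line(line):
    """Strip the line ending and a trailing '.', if present."""
    line = line.rstrip("\r\n")
    return line[:-1] if line.endswith(".") else line


def clean_csv_lines(lines):
    """Clean the header names and remove the trailing '.' from every row."""
    cleaned = [clean_line(line) for line in lines]
    if not cleaned:
        return []
    header = ",".join(clean_column_name(col) for col in cleaned[0].split(","))
    return [header] + cleaned[1:]
```

For example, the header "Int'l Plan" becomes "Intl_Plan", "Churn?" becomes "Churn", and a row ending in "False." ends in "False".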

I use the Create ML Model wizard in the Amazon ML console to create the datasource. I specify that the first line of the .csv file contains the column names. Generally, Amazon ML automatically infers the data types of attributes, distinguishing between Binary, Categorical, Numeric, and Text attributes. You can correct incorrectly inferred types.
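If you prefer to state the attribute types explicitly instead of relying on inference, you can describe them in a schema. The sketch below builds such a schema as a Python dictionary. The field names follow the Amazon ML data schema format as best I understand it, so treat them as assumptions to verify against the documentation; the helper name and the example attribute types are illustrative only.

```python
import json

ALLOWED_TYPES = {"BINARY", "CATEGORICAL", "NUMERIC", "TEXT"}


def build_schema(attribute_types, target):
    """Build a data schema dictionary from {attribute name: attribute type}."""
    if target not in attribute_types:
        raise ValueError(f"target {target!r} is not a known attribute")
    for name, attribute_type in attribute_types.items():
        if attribute_type not in ALLOWED_TYPES:
            raise ValueError(f"unsupported type {attribute_type!r} for {name!r}")
    return {
        "version": "1.0",
        "targetAttributeName": target,
        "dataFormat": "CSV",
        # The first line of the .csv file contains the column names.
        "dataFileContainsHeader": True,
        "attributes": [
            {"attributeName": name, "attributeType": attribute_type}
            for name, attribute_type in attribute_types.items()
        ],
    }


def schema_as_json(attribute_types, target):
    """Serialize the schema so that it can be stored or submitted as text."""
    return json.dumps(build_schema(attribute_types, target), indent=2)
```

For instance, build_schema({"State": "CATEGORICAL", "Account_Length": "NUMERIC", "Churn": "BINARY"}, target="Churn") marks Churn as the target attribute and lists the three attributes with their types.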


Deep Learning AMI release v2.0 now Available for Amazon Linux

by Dominic Divakaruni

You can now use upgraded versions of Apache MXNet, TensorFlow, CNTK, and Caffe on the AWS Deep Learning AMI v2.0 for Amazon Linux, which also includes Keras and is available in the AWS Marketplace.

The Deep Learning AMI v2.0 for Amazon Linux is designed to continue to provide a stable, secure, and high-performance execution environment for deep learning applications running on Amazon EC2. The latest MXNet release (v0.9.3) included with this AMI v2.0 adds several enhancements, including a faster new image processing API that enables parallel processing, improved multi-GPU performance, and support for new operators. This AMI includes the following framework versions: MXNet: v0.9.3; TensorFlow: v1.0.0; Caffe: rc5; Theano: rel 0.8.2; Keras: 1.2.2; CNTK: v2.0 beta12.0; and Torch: master branch.

The Deep Learning AMI includes Jupyter notebooks with Python 2.7 and Python 3.4 kernels; the Matplotlib, Scikit-image, CppLint, Pylint, pandas, Graphviz, and Bokeh Python packages; Boto and Boto 3; and the AWS CLI. The DL AMI also comes packaged with the Anaconda 2 and Anaconda 3 data science platforms.

AI Tech Talk: Introducing Amazon Lex – A Service for Building Voice or Text Chatbots

by Victoria Kouyoumjian


Amazon Lex is a service for building conversational interfaces into any application using voice and text. Lex provides the advanced deep learning functionalities of automatic speech recognition (ASR) for converting speech to text, and natural language understanding (NLU) to recognize the intent of the text, to enable you to build applications with highly engaging user experiences and lifelike conversational interactions.

Learning Objectives:

  • Learn about the capabilities and features of Amazon Lex
  • Learn about the benefits of Amazon Lex
  • Learn about the different use cases
  • Learn how to get started using Amazon Lex
  • Learn how to build a bot!

Who Should Attend: Developers who want to build chatbots

Monday, March 6, 2017 9:00 AM PST – 10:00 AM PST

Learn More and Register

Building Better Bots Using Amazon Lex (Part 2)

by Niranjan Hira and Harshal Pimpalkhute

In Part 1 we reviewed some elementary bot design considerations, built the Amazon Lex CoffeeBot chatbot, and used the Amazon Lex Test console to confirm that CoffeeBot reacted to text input as expected.  In this post, we make some more design decisions to take CoffeeBot to the next level, including voice interaction.

Note: The code for Part 1 and Part 2 is located in our GitHub repo.


Let’s review where we left off in our CoffeeBot development.  The user started by asking for a small mocha.  Amazon Lex determined that the two required slots had been filled and presented the confirmation prompt with its first response.  Then the user changed her mind and Amazon Lex politely accepted the change.


Now consider that confirmation prompt, “You’d like me to order a large mocha. Is that right?” The familiar web check-out model might tempt us to try a prompt that suggests the user “continue shopping.” Something like, “We’re ready to order your large mocha. Would you like something else?”