AWS Machine Learning Blog

Using Amazon Polly to Provide Real-Time Home Monitoring Alerts

This is a guest blog post by Siva K. Syamala, Senior Developer from Y-cam Solutions. In their own words, “Y-cam is a provider of high quality security video solutions, our vision is to make smart home security easy and accessible to all.”

Home security is an important part of home automation and the Internet of Things. Y-cam Solutions Limited, using Amazon as a backbone, has delivered a smart security system that can be monitored and controlled from anywhere in the world with a smartphone. To improve alerts, notifications, and control of the system, Y-cam uses Amazon Polly to provide a first-class AI service in which the user interacts with the security system through speech.

How our service works

When the alarm is triggered, we notify our customers with a voice call through Twilio. After the call is established, Twilio steps through the TwiML instructions and starts streaming synthesized speech retrieved from Amazon Polly to the customer. Call recipients respond by pressing buttons on their mobile phone keypad (DTMF codes). Depending on the DTMF codes, our service takes the specified action and returns the next TwiML instructions, which again point to synthesized speech from Amazon Polly. For the call to sound like a realistic conversation, it’s essential that Amazon Polly responds quickly. Delays and waiting can cause frustration and increase the likelihood of the recipient hanging up.
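The call flow above can be sketched roughly as follows. This is a minimal illustration, not Y-cam's actual implementation: the class name `AlarmCallTwiml`, the `Action` values, and the keypad mapping are all hypothetical. It relies only on the standard TwiML `<Gather>` and `<Play>` verbs and on the fact that Twilio posts the pressed keys back in a `Digits` parameter.

```java
import java.util.Map;

/** Hypothetical helper that builds TwiML for an alarm call; names and mappings are illustrative only. */
final class AlarmCallTwiml {

    /** Actions a recipient might choose; the real set of actions is not given in this post. */
    enum Action { DISARM, CALL_CONTACT, REPEAT }

    // Assumed keypad mapping, for illustration only.
    private static final Map<String, Action> DIGIT_ACTIONS = Map.of(
            "1", Action.DISARM,
            "2", Action.CALL_CONTACT);

    private AlarmCallTwiml() {}

    /** Escapes XML special characters so URLs with query strings remain valid inside TwiML. */
    static String escapeXml(String value) {
        return value.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Plays the Polly-generated MP3 and waits for one keypress, which Twilio posts to callbackUrl. */
    static String promptTwiml(String audioUrl, String callbackUrl) {
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<Response>"
                + "<Gather numDigits=\"1\" action=\"" + escapeXml(callbackUrl) + "\" method=\"POST\">"
                + "<Play>" + escapeXml(audioUrl) + "</Play>"
                + "</Gather>"
                + "</Response>";
    }

    /** Maps the value of the Digits parameter to an action; missing or unknown input repeats the prompt. */
    static Action actionFor(String digits) {
        if (digits == null) {
            return Action.REPEAT;
        }
        return DIGIT_ACTIONS.getOrDefault(digits, Action.REPEAT);
    }
}
```

In such a setup, the webhook that answers the call would return `AlarmCallTwiml.promptTwiml(audioUrl, callbackUrl)`, and the handler behind `callbackUrl` would pass the posted `Digits` value to `actionFor` before returning the next TwiML document.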

Below is a sample audio clip of a phone call to a customer when an alarm is triggered.

 

Architecture

 

Calling Amazon Polly

The following Java code requests synthesized speech from Amazon Polly and stores it in an S3 bucket.

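// Imports for the AWS SDK for Java 1.x (IOUtils is assumed to be the SDK's com.amazonaws.util.IOUtils).
import java.io.ByteArrayInputStream;
import java.io.IOException;
import com.amazonaws.services.polly.model.OutputFormat;
import com.amazonaws.services.polly.model.SynthesizeSpeechRequest;
import com.amazonaws.services.polly.model.SynthesizeSpeechResult;
import com.amazonaws.services.s3.model.CannedAccessControlList;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;
import com.amazonaws.util.IOUtils;

public String convertTextToSpeech(final String text, final String polyVoiceId) {
	log.info("Converting " + text + " to speech");
	// Create the speech synthesis request.
	SynthesizeSpeechRequest synthesizeSpeechRequest = new SynthesizeSpeechRequest()
		.withText(text)
		.withVoiceId(polyVoiceId)
		.withOutputFormat(OutputFormat.Mp3);

	// Get the synthesized speech audio stream.
	SynthesizeSpeechResult synthesizeSpeechResult = awsPollyClient.synthesizeSpeech(synthesizeSpeechRequest);

	// Read the whole audio stream into memory so the content length is known before uploading.
	final byte[] bytes;
	try {
		bytes = IOUtils.toByteArray(synthesizeSpeechResult.getAudioStream());
	} catch (IOException e) {
		// Without audio there is nothing to upload, so fail here rather than with a NullPointerException below.
		log.error("Could not get bytes from the audio stream " + e.getMessage());
		throw new IllegalStateException("Could not read the audio stream from Amazon Polly", e);
	}

	// Store the audio in S3 as a publicly readable MP3 file so that Twilio can fetch it.
	ObjectMetadata omd = new ObjectMetadata();
	omd.setContentType(synthesizeSpeechResult.getContentType());
	omd.setContentLength(bytes.length);
	ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(bytes);
	String fileName = getRandomString();
	final PutObjectRequest s3Put = new PutObjectRequest(pollySpeechBucket, fileName, byteArrayInputStream, omd)
		.withCannedAcl(CannedAccessControlList.PublicRead);

	amazonS3Client.putObject(s3Put);

	// S3URL is defined elsewhere in the class (not shown in this post).
	return S3URL;
}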
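Because the method above stores each clip under a random file name, every call synthesizes the speech again. Since response time matters so much during a call, one possible refinement (an assumption for illustration, not something described in this post) is to derive the S3 key from the voice and the text, so that fixed prompts can be reused. The class `SpeechCacheKey` below is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper that derives a stable S3 key for a given voice and prompt text. */
final class SpeechCacheKey {

    private SpeechCacheKey() {}

    /** Returns the hex-encoded SHA-256 hash of voiceId and text, followed by ".mp3". */
    static String forPrompt(String voiceId, String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            digest.update(voiceId.getBytes(StandardCharsets.UTF_8));
            // Zero byte as a separator so ("ab", "c") and ("a", "bc") produce different keys.
            digest.update((byte) 0);
            digest.update(text.getBytes(StandardCharsets.UTF_8));

            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex + ".mp3";
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256, so this should not happen.
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }
}
```

With such a key, the service could first check whether the object already exists (for example with the `doesObjectExist` method of the AWS SDK for Java 1.x S3 client) and call Amazon Polly only for prompts it has not synthesized before. Whether this pays off depends on how often the same prompt text recurs.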

 

Why Amazon Polly?

Before using Amazon Polly, we used a different TTS provider that had an unrealistic voice and problems with scalability. A robotic voice makes for a poor customer experience; we wanted a voice that sounds natural and human. Amazon Polly gave us a simple, flexible, natural-sounding, and scalable text-to-speech solution at low cost. It also supports a range of voices and languages, and it processes requests in milliseconds, so our customers don’t have to wait long for a response.

Future developments

We are planning to use Amazon Lex in the future, so that customers can issue spoken commands to their home security system instead of pressing DTMF codes. Amazon Lex provides the deep learning functionalities of automatic speech recognition (ASR) for converting speech to text, and natural language understanding (NLU) to recognize the intent of the text. The goal is to enable a full-voice user interface for our users.