A/B Testing at Scale – Amazon Machine Learning Research

by Guy Ernest

This week, Amazon presented an academic paper at KDD 2017, the prestigious machine learning and big data conference. The paper shows Amazon’s research into tools that help us measure customers’ satisfaction and better learn how we can implement ideas that delight them. Specifically, we show an efficient bandit algorithm for multivariate testing, where one seeks to find an optimal series of actions with as little experimental effort as possible. One application of this research, for example, is optimizing the layout of a web page.

Please check out this fun, three-minute video that explains the paper and how the ideas are applied within Amazon. Also, it won the KDD 2017 Audience Appreciation Award!

Download the paper, “An Efficient Bandit Algorithm for Realtime Multivariate Optimization.”

Apache MXNet Release Candidate Introduces Support for Apple’s Core ML and Keras v1.2

by Cynthya Peranandam

Apache MXNet is an effort undergoing incubation at the Apache Software Foundation (ASF). Last week, the MXNet community introduced a release candidate for MXNet v0.11.0, its first as an incubating project, and the community is now voting on whether to accept this candidate as a release. It includes the following major feature enhancements:

  • A Core ML model converter that allows you to train deep learning models with MXNet and then deploy them easily to Apple devices
  • Support for Keras v1.2 that enables you to use the Keras interface with MXNet as the runtime backend when building deep learning models

The v0.11.0 release candidate also includes additional feature updates, performance enhancements, and fixes as outlined in the release notes.

Run MXNet models on Apple devices using Core ML (developer preview)

This release includes a tool that you can use to convert MXNet deep learning models to Apple’s Core ML format. Core ML is a framework that application developers can use for deploying machine learning models onto Apple devices with minimal memory footprint and power consumption. It uses the Swift programming language and is available on the Xcode integrated development environment (IDE). It allows developers to interact with machine learning models like any other Swift object class.

With this conversion tool, you now have a fast pipeline for your deep learning enabled applications. Move from scalable and efficient distributed model training in the cloud using MXNet to fast runtime inference on Apple devices. This developer preview of the Core ML model converter includes support for computer vision models. For more details about the converter, see the incubator-mxnet GitHub repo.


Build Your Own Face Recognition Service Using Amazon Rekognition

by Christian Petters

Amazon Rekognition is a service that makes it easy to add image analysis to your applications. It’s based on the same proven, highly scalable, deep learning technology developed by Amazon’s computer vision scientists to analyze billions of images daily for Amazon Prime Photos. Facial recognition enables you to find similar faces in a large collection of images.

In this post, I’ll show you how to build your own face recognition service by combining the capabilities of Amazon Rekognition and other AWS services, like Amazon DynamoDB and AWS Lambda. This enables you to build a solution to create, maintain, and query your own collections of faces, be it for the automated detection of people within an image library, building access control, or any other use case you can think of.

If you want to get started quickly, launch this AWS CloudFormation template now. For the manual walkthrough, make sure that you replace resource names with your own values.

How it works

The following figure shows the application workflow. It’s separated into two main parts:

  • Indexing (blue flow) is the process of importing images of faces into the collection for later analysis.
  • Analysis (black flow) is the process of querying the collection of faces for matches within the index.


Before we can start to index the faces of our existing images, we need to prepare a couple of resources.

We start by creating a collection within Amazon Rekognition. A collection is a container for persisting faces detected by the IndexFaces API. You might choose to create one container to store all faces or create multiple containers to store faces in groups.
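As a rough sketch of this step, the following Python uses boto3 (the AWS SDK for Python, assumed to be installed and configured with credentials). The collection, bucket, and image names are placeholders. CreateCollection and IndexFaces are the Rekognition APIs involved; the small helper sanitizes S3 keys for use as an ExternalImageId, which allows only letters, digits, underscore, period, hyphen, and colon.

```python
import re

def safe_external_id(key):
    """Rekognition's ExternalImageId allows only letters, digits, and the
    characters _ . - : so replace anything else (such as '/' or spaces in an
    S3 key) with underscores."""
    return re.sub(r"[^a-zA-Z0-9_.\-:]", "_", key)

def create_collection(collection_id, region="us-east-1"):
    """Create a Rekognition collection to hold indexed faces."""
    import boto3  # imported lazily so the sketch loads without the AWS SDK
    client = boto3.client("rekognition", region_name=region)
    return client.create_collection(CollectionId=collection_id)

def index_face(collection_id, bucket, key, region="us-east-1"):
    """Detect the faces in an S3-hosted image and add them to the collection."""
    import boto3
    client = boto3.client("rekognition", region_name=region)
    return client.index_faces(
        CollectionId=collection_id,
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        ExternalImageId=safe_external_id(key),  # map matches back to the image
    )
```

IndexFaces returns the face IDs it detected; the post's DynamoDB table would map each face ID back to the person or image it came from.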

Your use case will determine the indexing strategy for your collection, as follows: 

  • Face match. You might want to find a match for a face within a collection of faces (as in our current example). Face match supports a variety of use cases, such as whitelisting a group of people for a VIP experience, blacklisting to identify bad actors, or supporting logging scenarios. In those cases, you would create a single collection that contains a large number of faces or, in the case of the logging scenario, one collection for a certain time period, such as a day. 
  • Face verification. In cases where a person claims to be of a certain identity, and you are using face recognition to verify the identity (for example, for access control or authentication), you would actually create one collection per person. You would store a variety of face samples per person to improve the match rate. This also enables you to extend the recognition model with samples of different appearances, for example, where a person has grown a beard. 
  • Social tagging. In cases where you might like to automatically tag friends within a social network, you would employ one collection per application user. 


Estimating the Location of Images Using MXNet and Multimedia Commons Dataset on AWS EC2

by Jaeyoung Choi and Kevin Li

This is a guest post by Jaeyoung Choi of the International Computer Science Institute and Kevin Li of the University of California, Berkeley. This project demonstrates how academic researchers can leverage our AWS Cloud Credits for Research Program to support their scientific breakthroughs.

Modern mobile devices can automatically assign geo-coordinates to the pictures that you take. However, most images on the web still lack this location metadata. Image geo-location is the process of estimating the location of an image and applying a location label. Depending on the size of your dataset and how you pose the problem, the assigned location label can range from the name of a building or landmark to an actual geo-coordinate (latitude, longitude).

In this post, we show how to use a pre-trained model created with Apache MXNet to geographically categorize images. We use images from a dataset that contains millions of Flickr images taken around the world. We also show how to map the result to visualize it.

Our approach

The approaches to image geo-location can be divided into two categories: image-retrieval-based search approaches and classification-based approaches. (This blog post compares two state-of-the-art approaches in each category.)

Recent work by Weyand et al. posed image geo-location as a classification problem. In this approach, the authors subdivided the surface of the earth into thousands of geographic cells and trained a deep neural network with geo-tagged images. For a less technical description of their experiment, see this article.

Because the authors did not release their training data or their trained model, PlaNet, to the public, we decided to train our own image geo-locator. Our setup for training the model is inspired by the approach described in Weyand et al., but we changed several settings.

We trained our model, LocationNet, using MXNet on a single p2.16xlarge instance with geo-tagged images from the AWS Multimedia Commons dataset.

We split training, validation, and test images so that images uploaded by the same person do not appear in multiple sets. We used Google’s S2 Geometry Library to create classes with the training data. The model converged after 12 epochs, which took about 9 days with the p2.16xlarge instance. A full tutorial with a Jupyter notebook is available on GitHub.
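To make the evaluation concrete: geo-location accuracy is typically reported as the great-circle distance between the predicted cell's center and the photo's true coordinates, with accuracy given at several distance thresholds (as in the Placing Task). A minimal, framework-independent sketch:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def accuracy_at_km(errors_km, threshold_km):
    """Fraction of test images whose prediction error is within the threshold."""
    return sum(e <= threshold_km for e in errors_km) / len(errors_km)
```

For example, a prediction whose cell center lies 20 km from the true location counts as correct at the 25 km threshold but not at the 1 km threshold.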

The following table compares the setups used to train and test LocationNet and PlaNet.

                       LocationNet                              PlaNet
Dataset source         Multimedia Commons                       Images crawled from the web
Training set           33.9 million images                      91 million images
Validation set         1.8 million images                       34 million images
S2 cell partitioning   t1=5000, t2=500 → 15,527 cells           t1=10,000, t2=50 → 26,263 cells
Model                  ResNet-101                               GoogleNet
Optimization           SGD with momentum and LR schedule        Adagrad
Training time          9 days, 12 epochs, on 16 NVIDIA K80      2.5 months on 200 CPU cores
                       GPUs (one p2.16xlarge EC2 instance)
Framework              MXNet                                    DistBelief
Test set               Placing Task 2016 test set               2.3 million geo-tagged Flickr images
                       (1.5 million Flickr images)


Analyze Emotion in Video Frame Samples Using Amazon Rekognition on AWS

by Cyrus Wong

This guest post is by AWS Community Hero Cyrus Wong. Cyrus is a Data Scientist at the Hong Kong Vocational Education (Lee Wai Lee) Cloud Innovation Centre. He has achieved all 7 AWS Certifications and enjoys sharing his AWS knowledge with others through open-source projects, blog posts, and events.

HowWhoFeelInVideo is an application that analyzes faces detected in sampled video clips to interpret the emotion or mood of the subjects. It identifies faces, analyzes the emotions displayed on those faces, generates corresponding Emoji overlays on the video, and logs emotion data. The application accomplishes all of this within a serverless architecture using Amazon Rekognition, AWS Lambda, AWS Step Functions, and other AWS services.

HowWhoFeelInVideo was developed as part of a research project at the Hong Kong Vocational Education (Lee Wai Lee) Cloud Innovation Centre.  The project is focused on childcare, elder care, and community services. However, emotion analysis can be used in many areas, including rehabilitative care, nursing care, and applied psychology. My initial focus has been on applying this technology to the classroom.

In this post, I explain how HowWhoFeelInVideo works and how to deploy and use it.

How it works

Teachers, myself included, can use HowWhoFeelInVideo to get an overall measure of a student’s mood (e.g., happy, calm, or confused) while taking attendance. The instructor can use this data to adjust his or her focus and approach to enhance the teaching experience. This research project is just beginning. I will update this post after I receive additional results.

To use HowWhoFeelInVideo, a teacher sets up a basic classroom camera to take each student’s attendance using face identification. The camera also captures how students feel during class. Teachers can also use HowWhoFeelInVideo to prevent students from falsely reporting attendance.

Architecture and design

HowWhoFeelInVideo is a serverless application built using AWS Lambda functions. Five of the Lambda functions are included in the HowWhoFeelInVideo state machine. AWS Step Functions streamlines coordinating the components of distributed applications and microservices using visual workflows. This simplifies building and running multi-step applications.

The HowWhoFeelInVideo state machine starts with the startFaceDetectionWorkFlowLambda function, which is triggered by an Amazon S3 PUT object event. startFaceDetectionWorkFlowLambda passes the following information into the execution:

{
    "bucket": "howwhofeelinvideo",
    "key": "Test2.mp4"
}
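For orientation, a Step Functions state machine is defined in Amazon States Language (JSON). The following is a hypothetical sketch of how such a workflow chains Lambda functions; the state names, account ID, and function names are illustrative, not the application's actual definition:

```json
{
  "Comment": "Illustrative video-analysis workflow (names are placeholders)",
  "StartAt": "ExtractFrames",
  "States": {
    "ExtractFrames": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:extractFramesLambda",
      "Next": "DetectEmotions"
    },
    "DetectEmotions": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:detectEmotionsLambda",
      "Next": "Done"
    },
    "Done": { "Type": "Succeed" }
  }
}
```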


Exploiting the Unique Features of the Apache MXNet Deep Learning Framework with a Cheat Sheet

by Sunil Mallya

Apache MXNet (incubating) is a full-featured, highly scalable deep learning framework that supports creating and training state-of-the-art deep learning models. With it, you can create convolutional neural networks (CNNs), long short-term memory networks (LSTMs), and others. It supports a variety of languages, including, but not limited to, Python, Scala, R, and Julia.

In this post, we showcase some unique features that make MXNet a developer friendly framework in the AWS Cloud. For developers who prefer symbolic expression, we also provide a cheat sheet for coding neural networks with MXNet in Python. The cheat sheet simplifies onboarding to MXNet. It’s also a handy reference for developers who already use the framework.

Multi-GPU support in a single line of code

The ability to run on multiple GPUs is a core part of the MXNet architecture. All you need to do is pass a list of devices that you want to train the model on. By default, MXNet uses data parallelism to partition the workload over multiple GPUs. For example, if you have 3 GPUs, each one receives a copy of the complete model and trains it on one-third of each training data batch.

import mxnet as mx 
# Single GPU
module = mx.module.Module(context=mx.gpu(0))

# Train on multiple GPUs
module = mx.module.Module(context=[mx.gpu(i) for i in range(N)], ...)

Training on multiple computers

MXNet is a distributed deep learning framework designed to simplify training on multiple GPUs on a single server or across servers. To train across servers, you need to install MXNet on all computers, ensure that they can communicate with each other over SSH, and then create a file that contains the server IPs.

$ cat hosts        # one server IP or hostname per line (values are examples)
172.31.1.1
172.31.1.2
$ python ../../tools/launch.py -n 2 --launcher ssh -H hosts python train_mnist.py --network lenet --kv-store dist_sync

MXNet uses a key-value store to synchronize gradients and parameters between machines. To enable distributed training, make sure that MXNet is compiled with USE_DIST_KVSTORE=1.

Custom data iterators and iterating over data stored in Amazon S3

In MXNet, data iterators are similar to Python iterator objects, except that they return a batch of data as a DataBatch object that contains “n” training examples along with their corresponding labels. MXNet has prebuilt, efficient data iterators for common data types like NDArray and CSV. It also has a binary format for efficient I/O on distributed file systems, like HDFS. You can create custom data iterators by extending the mx.io.DataIter class. For information on how to implement this feature, see this tutorial.
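The iterator contract itself can be illustrated without MXNet. The following pure-Python sketch mirrors what a DataBatch-style iterator does, yielding successive slices of examples together with their labels; a real MXNet iterator would extend mx.io.DataIter and return NDArrays:

```python
def batch_iter(examples, labels, batch_size):
    """Yield (examples, labels) slices of at most batch_size items, mirroring
    the DataBatch contract of n training examples plus corresponding labels."""
    for i in range(0, len(examples), batch_size):
        yield examples[i:i + batch_size], labels[i:i + batch_size]
```

Note that the last batch may be smaller than batch_size; MXNet iterators handle this with padding, which this sketch omits.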

Amazon Simple Storage Service (Amazon S3) is a popular choice for customers who need to store large amounts of data at very low cost. In MXNet, you can create iterators that reference the data stored in Amazon S3 in RecordIO, ImageRecordIO, CSV, or NDArray formats without needing to explicitly download the data to disk.

# Bucket and record file names below are placeholders
data_iter = mx.io.ImageRecordIter(
    path_imgrec="s3://bucket-name/training-data/train.rec",
    data_shape=(3, 227, 227),
    batch_size=4)


Create a Serverless Solution for Video Frame Analysis and Alerting

by Moataz Anany

Imagine capturing frames off of live video streams, identifying objects within the frames, and then triggering actions or notifications based on the identified objects. Now imagine accomplishing all of this with low latency and without a single server to manage.

In this post, I present a serverless solution that uses Amazon Rekognition and other AWS services for low-latency video frame analysis. The solution is a prototype that captures a live video, analyzes its contents, and sends an alert when it detects a certain object. I walk you through the solution’s architecture and explain how the AWS services are integrated. I then give you the tools that you need to configure, build, and run the prototype. Finally, I show you the prototype in action.

Our use case

The prototype addresses a specific use case: alerting when a human appears in a live video feed from an IP security camera. At a high level, it works like this:

  1. A camera surveils a particular area, streaming video over the network to a video capture client.
  2. The client samples video frames and sends them to AWS services, where they are analyzed and stored with metadata.
  3. If Amazon Rekognition detects a certain object—in this case, a human—in the analyzed video frames, an AWS Lambda function sends an SMS alert through Amazon Simple Notification Service (Amazon SNS).
  4. After you receive an SMS alert, you will likely want to know what caused it. For that, the prototype displays sampled video frames with low latency in a web-based user interface.
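The sampling in step 2 boils down to keeping every Nth frame so that, say, a 30 fps camera stream is analyzed at roughly 1 frame per second. A sketch of that logic (the frame rates are illustrative, not the prototype's actual settings):

```python
def frame_stride(source_fps, target_fps):
    """How many frames to skip so the stream is sampled at ~target_fps."""
    if source_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    return max(1, round(source_fps / target_fps))

def sample_frames(frames, source_fps, target_fps):
    """Keep every Nth frame of the captured stream for analysis."""
    return frames[::frame_stride(source_fps, target_fps)]
```

Sampling at 1 fps instead of sending all 30 frames per second cuts Rekognition calls by 30x while still catching anything that stays in view for a second or more.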

How you define low latency depends on the nature of the application. Low latency can range from microseconds to a few seconds. If you use a camera for surveillance, as in our prototype, the time between the capture of unusual activity and the triggering of an alarm can be a few seconds and still be considered a low-latency response. That’s without special performance tuning.

Solution architecture 

To understand the solution’s architecture, let’s trace the journey of video frames.  In the following architecture diagram, an arrow represents a step done by an element in the architecture. An arrow starts at the element initiating the step. It ends at an element used in the step.


The AWS Deep Learning AMI for Ubuntu is Now Available with CUDA 8, Ubuntu 16, and the Latest Versions of Deep Learning Frameworks

by Cynthya Peranandam

The AWS Deep Learning AMI lets you build and scale deep learning applications in the cloud. The AMI comes pre-installed with popular deep learning frameworks so that you can train sophisticated, custom AI models, experiment with new algorithms, and learn new skills and techniques. The latest release of the AWS Deep Learning AMI Ubuntu Version includes several notable updates that help accelerate development of high-performance algorithms. The AMI now comes bundled with the Ubuntu 16.04 base image, NVIDIA CUDA 8 drivers, and the following deep learning framework versions:

  • MXNet v0.10.0.post1
  • TensorFlow v1.2.0
  • Theano 0.9.0
  • Caffe v1.0
  • Caffe2 v0.7.0
  • Keras 1.2.2
  • CNTK v2.0
  • Torch (master branch)

The AMI Ubuntu Version is now live in seven AWS Regions:

  • us-east-2 (Ohio)
  • us-east-1 (N. Virginia)
  • us-west-2 (Oregon)
  • us-west-1 (N. California)
  • ap-southeast-2 (Sydney)
  • ap-northeast-1 (Tokyo)
  • ap-northeast-2 (Seoul)

Both the Ubuntu and Amazon Linux versions of the AMI are now easier to locate in the EC2 console. As you configure your instance, you can choose the AMI from the Quick Start list in Step 1, which lists commonly used AMIs.


Train Neural Machine Translation Models with Sockeye

by Felix Hieber and Tobias Domhan

Have you ever wondered how you can use machine learning (ML) for translation? With our new framework, Sockeye, you can model machine translation (MT) and other sequence-to-sequence tasks. Sockeye, which is built on Apache MXNet, does most of the heavy lifting for building, training, and running state-of-the-art sequence-to-sequence models.

In natural language processing (NLP), many tasks revolve around solving sequence prediction problems. For example, in MT, the task is predicting a sequence of translated words, given a sequence of input words. Models that perform this kind of task are often called sequence-to-sequence models. Lately, deep neural networks (DNNs) have significantly advanced the performance of these models. Sockeye provides both a state-of-the-art implementation of neural machine translation (NMT) models and a platform to conduct NMT research.

Sockeye is built on Apache MXNet, a fast and scalable deep learning library. The Sockeye codebase leverages unique features from MXNet. For example, it mixes declarative and imperative programming styles through the symbolic and imperative MXNet APIs. It also uses data parallelism to train models on multiple GPUs.

In this post, we provide an overview of NMT, and then show how to use Sockeye to train a minimal NMT model with attention.

How sequence-to-sequence models with attention work

To understand what’s going on under the hood in Sockeye, let’s take a look at the neural network architecture widely used by academic groups and industry alike.

The network has three major components: the encoder, the decoder, and the attention mechanism. The encoder reads the source sentence one word at a time until the end of sentence (<EOS>) and produces a hidden representation of the sentence. The encoder is often implemented as a recurrent neural network (RNN), such as a long short-term memory (LSTM) network.

The decoder, which is also implemented as an RNN, produces the target sentence one word at a time, starting with a beginning-of-sentence symbol (<BOS>). It has access to the source sentence through an attention mechanism that generates a context vector. Using the attention mechanism, the decoder can decide which words are most relevant for generating the next target word. This way, the decoder has access to the entire input sentence at all times.
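In equations-as-code form, the attention step computes a softmax over alignment scores and takes the weighted sum of encoder hidden states as the context vector. This toy sketch (plain lists instead of tensors, and dot-product scoring as one common choice) is illustrative, not Sockeye's implementation:

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors (lists of floats)."""
    return sum(x * y for x, y in zip(a, b))

def attention_context(decoder_state, encoder_states, score=dot):
    """Softmax the scores between the decoder state and each encoder hidden
    state, then return the weights and their weighted sum (the context)."""
    scores = [score(decoder_state, h) for h in encoder_states]
    m = max(scores)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(encoder_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context
```

The weights show which source positions the decoder is "looking at" for the current target word; visualizing them per output word gives the familiar attention alignment plots.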

The next word that the network generates becomes an input to the decoder. The decoder produces the subsequent word based on the generated word and its hidden representation. The network continues generating words until it produces a special end-of-sentence symbol, <EOS>.
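The generation loop described above can be sketched as a toy greedy decoder; here next_word_scores stands in for the real network (decoder RNN plus attention), and the vocabulary is hypothetical:

```python
def greedy_decode(next_word_scores, bos="<BOS>", eos="<EOS>", max_len=10):
    """Toy decode loop: feed the previously generated word back in and pick
    the highest-scoring next word until <EOS> (or a length cap)."""
    output, word = [], bos
    while len(output) < max_len:
        scores = next_word_scores(word)      # stand-in for the RNN + attention
        word = max(scores, key=scores.get)   # greedy choice
        if word == eos:
            break
        output.append(word)
    return output
```

Real NMT systems, Sockeye included, usually replace the greedy choice with beam search, keeping the k best partial translations at each step instead of just one.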


Building a Reliable Text-to-Speech Service with Amazon Polly

by Yiannis Philipopoulos

This is a guest post by Yiannis Philipopoulos, a Software Developer at Bandwidth. In Yiannis’ words: “Bandwidth’s solutions are shaping the future of how we connect with voice and messaging for mobile apps and large-scale, enterprise-level solutions. At the core of Bandwidth’s business-grade Communications Platform as a Service (CPaaS) offering are communication APIs that allow companies to launch and scale next-generation apps and solutions using the nation’s largest VoIP network.”

Text-to-speech (TTS) technology is evolving rapidly. Thanks to machine learning, computers’ ability to disambiguate text and combine individual sounds into natural-sounding whole words has improved dramatically. Although Amazon Polly provides excellent TTS at low cost, many still use older TTS technologies because they believe that upgrading to a new system isn’t worth the effort.

Bandwidth’s customers use TTS primarily to vocalize menus, reminders, and order information. Bandwidth’s API lets customers quickly purchase telephone numbers, send texts, make calls, and create static or dynamic voice messages. In this post, I show how Bandwidth integrated Amazon Polly to provide on-demand TTS capabilities. I also offer some simple suggestions for leveraging Amazon Polly’s ability to cache results.

I explain how to use Amazon Polly to most effectively meet our customers’ needs. To illustrate, I have provided TTS sample service demo code that you can run right out of the box. The demo code calls Amazon Polly using many of the improvements I discuss in this post. For caveats about using the demo service, see the README file.

The service

The workflow for the Bandwidth API is straightforward. When the Bandwidth API receives a request for TTS, it directs the request to our new internal service. The internal service checks for a cached response; on a cache miss, it directs the request to Amazon Polly or, if Amazon Polly can’t complete the request for any reason, to a lower-priority TTS vendor. Finally, we store the successful speech result in our cache.
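The caching part of this workflow can be sketched as a small wrapper keyed on the voice and text. Here synthesize stands in for the Amazon Polly call (for example, boto3's Polly synthesize_speech); the class name and structure are mine, not Bandwidth's actual code:

```python
import hashlib

class TtsCache:
    """Cache synthesized audio keyed by (voice, text): check the cache first,
    call the synthesizer only on a miss, and store the result."""

    def __init__(self, synthesize):
        self._synthesize = synthesize  # callable(voice, text) -> audio bytes
        self._store = {}
        self.hits = 0
        self.misses = 0

    def _key(self, voice, text):
        # Hash voice + text so repeated static messages share one cache entry
        return hashlib.sha256(f"{voice}\x00{text}".encode("utf-8")).hexdigest()

    def get_speech(self, voice, text):
        key = self._key(voice, text)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._synthesize(voice, text)
        return self._store[key]
```

With the 78% hit rate reported below, roughly four out of five requests would be served from the store without a synthesis call.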

Why (primarily) Amazon Polly?

Why did we decide to use Amazon Polly as our primary TTS engine? We reviewed the most important requirements for our use case: uptime, choice of voices, speed, interpretation, cost, and caching. Amazon Polly delivers on all these requirements.

Of course, uptime is an important requirement for any vendor. Issues with our previous vendor’s uptime drove us to investigate new solutions, and made building a redundant system extremely important.

With our previous (now secondary) vendor, we gave customers access to a variety of voices. Our new vendor had to match our previous selection of voices. Amazon Polly offers at least one male and one female voice in every language we support, and has a variety of voices in English.

Speed was another important factor. Because menus and the exchange of live information are a big part of our use case, the ability to start streaming a response back to customers as soon as possible is critical. Nobody likes waiting on a phone menu.

When dealing with text generated by customers, contextual interpretation of input is also important. Although Amazon Polly’s option to use SSML to guarantee outcomes is an excellent feature, it’s impossible to know the intent of all the text that our customers send us. Having a service that can successfully disambiguate English, for example, to properly read “live” in the text “I live at this address,” as opposed to “we broadcast live,” considerably improves the user experience. Amazon Polly handles this well. In our testing, we found a single case where Amazon Polly did not generate the expected speech in response to input. The Amazon Polly team was eager to hear about that case, and the audio now plays as expected.

Cost and caching go together. Caching is another feature that made Amazon Polly very appealing. Although providing customers the ability to quickly convert dynamic messages to voice is important, many messages are frequently repeated. Our previous vendor did not allow caching of responses, which required our API to call the vendor for every request. Being able to cache static messages can reduce cost significantly. At Bandwidth, we’re seeing a 78% cache hit rate, resulting in far fewer requests.