AWS Machine Learning Blog

AWS Collaborates with Emory University to Develop Cloud-Based NLP Research Platform Using Apache MXNet

Natural Language Processing (NLP) is a research field in artificial intelligence that aims to enable computer programs to understand human (natural) language. Even if you don’t know much about NLP, there is a good chance that you use it daily. Whenever you type a word using the virtual keyboard on your phone, it suggests a list of words you might type next by analyzing what you have written so far. This technique, known as language modeling, is one of the core tasks in NLP: it estimates the probability of each candidate word following the ones you just typed. NLP has been adopted by many applications and has begun to make a real impact in the world.
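
To make the next-word-suggestion idea concrete, here is a toy Python sketch (our own illustration, not ELIT or production code) of a bigram language model that ranks candidate next words by how often they followed the previous word in a tiny sample corpus:

from collections import Counter, defaultdict

# Count, for every word, which words follow it in a toy corpus.
corpus = "i watched a movie last night and i watched a show".split()
following = defaultdict(Counter)
for prev, curr in zip(corpus, corpus[1:]):
    following[prev][curr] += 1

def suggest(word, k=3):
    # Return up to k next-word candidates with their estimated probabilities.
    counts = following[word]
    total = sum(counts.values())
    return [(w, c / total) for w, c in counts.most_common(k)]

print(suggest('watched'))  # [('a', 1.0)]

Real language models are trained on far larger corpora and use neural networks rather than raw counts, but the task is the same: predict the next word given the context.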

The Evolution of Language and Information Technology (ELIT) team is a group of NLP researchers at Emory University focused on bringing state-of-the-art NLP and machine learning technology to the research community. The primary focus of the ELIT project is to provide an end-to-end NLP pipeline that scales to big data analysis using the rich resources of the AWS Cloud. Unlike many other NLP frameworks, ELIT supports a web API, making it platform independent; researchers can enjoy large-scale computing anywhere, anytime. The ELIT project is also on GitHub. It was developed by the Emory NLP research group in active collaboration with the AWS MXNet team. In this blog post, we describe the ELIT platform and demonstrate its web API as well as its NLP visualization.

The ELIT research platform

With the recent advent of deep learning in NLP, machine-learning-based NLP models began demanding extensive computing power, making it difficult for researchers without powerful machines to leverage the latest techniques. Cloud computing platforms such as AWS give researchers on-demand access to vast computing resources to run those models, but they can be cumbersome for those who are not familiar with the cloud. The motivation behind the ELIT project is to provide a web service for NLP so that anyone with an internet connection can make a request to the service; no local installation or prior knowledge about cloud computing is required. The following are examples of how to leverage the ELIT platform for popular NLP tasks, such as sentiment analysis.

Sentiment analysis

Before we walk through the demo, we’ll briefly explain our approach to sentiment analysis, the task of classifying each document into one of three sentiments: negative, neutral, or positive. ELIT provides two convolutional neural network (CNN) models for sentiment analysis: one for social media (Twitter) and one for movie reviews. Given an input document, either a tweet or a movie review, the model first creates an input matrix by stacking the vector representation of each word. The input matrix is fed into convolution and pooling layers, and the convolved output is matched against an attention matrix that measures the intensity of each n-gram in the input document (in our case, n = [1, …, 5]). Finally, the attention output is fed into a softmax layer that predicts the probabilities of the negative, neutral, and positive sentiments for the input (see Shin et al., 2017 for more details about our CNN models).
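
To make the architecture concrete, the following is a minimal sketch in MXNet Gluon of a multi-window CNN classifier of this kind. It is illustrative only, not the ELIT implementation: it assumes pre-computed word vectors as input and omits the attention mechanism of Shin et al. (2017).

import mxnet as mx
from mxnet.gluon import nn

class SentimentCNN(nn.HybridBlock):
    # Convolutions over n-grams (n = 1, ..., 5), max-pooled and fed to a
    # softmax output layer. The attention mechanism is omitted for brevity.
    def __init__(self, num_classes=3, channels=100, **kwargs):
        super(SentimentCNN, self).__init__(**kwargs)
        with self.name_scope():
            self.convs = nn.HybridSequential()
            for n in range(1, 6):  # one convolution per n-gram size
                block = nn.HybridSequential()
                block.add(nn.Conv1D(channels, kernel_size=n, activation='relu'))
                block.add(nn.GlobalMaxPool1D())
                self.convs.add(block)
            self.out = nn.Dense(num_classes)  # negative, neutral, positive

    def hybrid_forward(self, F, x):
        # x: (batch, embedding_dim, num_words) -- the stacked word vectors
        return self.out(F.concat(*[conv(x) for conv in self.convs], dim=1))

net = SentimentCNN()
net.initialize()
doc = mx.nd.random.normal(shape=(1, 300, 20))   # a random 20-word "document"
print(mx.nd.softmax(net(doc)))                  # probabilities of 3 sentiments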

Demonstration

We’ll start with a screenshot from the ELIT demo page (http://demo.elit.cloud/):

On the top left, there is a text box containing the input text, “I watched “the Sound of Music” last night. The ending could have been better. It’s my favorite movie though.”:

On the top right, there are options for tokenization, sentence segmentation, and sentiment analysis with either the Twitter or the movie review model. Currently, the tokenization, segmentation, and sentiment analysis with the movie model are selected:

When you choose the Analyze button, the input text is sent to the ELIT server, which runs the selected NLP pipeline and returns the following output:

The ELIT sentiment visualizer color-codes the sentiment of each sentence, where red, green, and blue represent the negative, neutral, and positive sentiments, respectively. It also offers an option to depict which words contribute the most to those predictions. In the following example, words with smaller contributions are rendered with lower opacity, so they appear fainter:

It’s also possible to visualize the intensity levels of the words by scale. In the following example, words with higher contributions are represented by larger circles:

Of course, it’s possible to visualize the intensity using both the opacity and scale options:
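
The encoding in these screenshots can be summarized with a small Python sketch (our own illustration, not the ELIT visualizer’s code): the strongest sentence-level sentiment picks the color, and a word’s contribution drives its opacity and scale. The mapping values below are hypothetical.

def word_style(sentence_scores, contribution):
    # Color from the sentence-level sentiment: negative, neutral, positive.
    colors = ['red', 'green', 'blue']
    color = colors[sentence_scores.index(max(sentence_scores))]
    # Words with smaller contributions appear fainter and smaller.
    opacity = max(0.2, min(1.0, contribution))
    scale = 0.5 + contribution  # hypothetical mapping for circle size
    return {'color': color, 'opacity': opacity, 'scale': scale}

print(word_style([0.66, 0.18, 0.17], 0.9))
# {'color': 'red', 'opacity': 0.9, 'scale': 1.4}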

Web API

The NLP output can be retrieved through the web API using any programming language of your choice. The following simple Python code requests the NLP output for the input text in our example:

import requests

# Request tokenization, sentence segmentation, and sentiment analysis
# with the movie review model ('mov') for the raw input text.
r = requests.post('https://elit.cloud/public/decode/', data={
    'text': 'I watched “the Sound of Music” last night. '
            'The ending could have been better. '
            'It’s my favorite movie though.',
    'input_format': 'raw',
    'tokenize': 1,
    'segment': 1,
    'sentiment': 'mov'})

print(r.text)

Upon receiving the request, ELIT takes the input as raw text, runs the NLP pipeline for tokenization, sentence segmentation, and sentiment analysis with the movie model, and returns the output over HTTP. The last line prints the NLP output in JSON format:

[[{"tokens": ["I", "watched", "\u201c", "the", "Sound", "of", "Music", "\u201d", "last", "night", "."], "offsets": [[0, 1], [2, 9], [10, 11], [11, 14], [15, 20], [21, 23], [24, 29], [29, 30], [31, 35], [36, 41], [41, 42]]}, "sentiment": [0.352990984916687, 0.37940868735313416, 0.26760029792785645],
{"tokens": ["The", "ending", "could", "have", "been", "better", "."], "offsets": [[43, 46], [47, 53], [54, 59], [60, 64], [65, 69], [70, 76], [76, 77]], "sentiment": [0.6561509370803833, 0.17596498131752014, 0.16788406670093536]},
{"tokens": ["It", "\u2019s", "my", "favorite", "movie", "though", "."], "offsets": [[78, 80], [80, 82], [83, 85], [86, 94], [95, 100], [101, 107], [107, 108]], "sentiment-mov": [0.021425940096378326, 0.038874078541994095, 0.9397000074386597]}]]

The JSON output follows the format below:

  • documents: a list of documents → [document, …, document]
  • document: a list of sentences → [sentence, …, sentence]
  • sentence: a dictionary whose keys are {tokens, offsets, sentiment}
    • tokens: a list of tokens in the sentence.
    • offsets: a list of offsets indicating the positions of their corresponding tokens in the original text. Each offset is represented by a pair [begin, end] implying the beginning (inclusive) and the ending (exclusive) offsets of the token. The beginning of each document is set to 0 in these offsets.
    • sentiment: a list of [negative, neutral, positive] sentiment scores for the sentence.
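
As a quick illustration of consuming this format, the following Python snippet (our own sketch, reusing the r object from the request above) prints each sentence with its most likely sentiment label:

import json

labels = ['negative', 'neutral', 'positive']
for document in json.loads(r.text):
    for sentence in document:
        scores = sentence['sentiment']
        label = labels[scores.index(max(scores))]
        print(' '.join(sentence['tokens']), '->', label)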

For more details, visit https://elit.cloud/tutorial/decode/.

Decoding framework

Decoding is supported for both non-registered and registered users. For cost reasons, non-registered users are limited to HTTP requests with input text of at most 1 MB, whereas registered users can make HTTP requests of up to 10 MB. Additionally, registered users can make requests with files containing up to 1 GB of text. All requests go through an Elastic Load Balancer to ensure scalability. Once the web API server receives a request, it forwards the request to the NLP server, which generates the NLP output from the input text. If the requested text is greater than 10 MB, the NLP server saves the output to Amazon S3 storage. Finally, the output is sent back to the web API server, which stores the information in the database and sends the NLP output to the user.
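
As a rough illustration of this size-based routing, here is a minimal Python sketch of how a server might decide between returning output inline and saving it to S3. This is our own sketch under the 10 MB threshold described above, not ELIT’s implementation, and the bucket and key names are hypothetical:

import boto3

MAX_INLINE_BYTES = 10 * 1024 * 1024  # the 10 MB threshold described above

def deliver_output(input_text, output_json):
    # Small requests: return the NLP output directly to the web API server.
    if len(input_text.encode('utf-8')) <= MAX_INLINE_BYTES:
        return {'inline': output_json}
    # Large requests: persist the output to S3 and return its location.
    boto3.client('s3').put_object(
        Bucket='elit-nlp-output',       # hypothetical bucket name
        Key='results/output.json',      # hypothetical object key
        Body=output_json.encode('utf-8'))
    return {'s3_key': 'results/output.json'}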

Roadmap

ELIT currently supports three NLP tasks: tokenization, sentence segmentation, and sentiment analysis. By the second quarter of 2018, we plan to support most of the core NLP tasks, such as part-of-speech tagging, morphological analysis, named entity recognition, dependency parsing, semantic role labeling, and coreference resolution. ELIT will also provide an interface for training custom models. Since ELIT is an open-source project, we hope to attract community involvement to take the project forward. The following figure shows the projected project milestones:

Reference

Bonggun Shin, Timothy Lee, and Jinho D. Choi. Lexicon Integrated CNN Models with Attention for Sentiment Analysis. In Proceedings of the EMNLP Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, WASSA’17, 149-158, Copenhagen, Denmark, 2017.


About the Authors

Jinho Choi is an assistant professor of Computer Science at Emory University. He has been active in NLP research, especially on the optimization of core NLP tasks for robustness and scalability. He developed an open-source project called NLP4J, previously known as ClearNLP, which provides NLP components with state-of-the-art accuracy and speed and has been widely used in both academic and industrial research. Recently, he started a new project called “Character Mining” that aims to infer explicit and implicit contexts about individual characters in colloquial writing such as dialogs and emails.

Joseph Spisak leads AWS’ partner ecosystem focused on Artificial Intelligence and Machine Learning. He has more than 17 years of experience in deep tech, working for companies such as Amazon, Intel, and Motorola, focused mainly on video, machine learning, and AI. In his spare time, he plays ice hockey and reads sci-fi.

The ELIT Team at Emory University

Gary Lai is a Ph.D. student in Computer Science at Emory University. He is the founder of Jungllle Inc., which provides services for web, app, and API server development. He is the main research scientist of the ELIT project, focusing on the optimization of NLP components for large-scale computing as well as the development of the backend API and a scalable infrastructure based on Amazon Web Services (AWS).

Bonggun Shin is a Ph.D. student in Computer Science at Emory University. His research focuses on the development of deep learning algorithms in NLP, especially on designing interpretable document classification models so that the deep learning model is no longer a black box but becomes comprehensible. Such interpretation helps researchers understand the behavior of the statistical model, which makes the method more practical in real applications.

Tyler Angert is a senior undergraduate in Computer Science and Math at Emory University. With a background in art and graphic design, he uses computer science to visualize data and implement novel user experience concepts. Apart from NLP research, he is involved in mobile health research, studying pediatric asthma management and autism, and exploring augmented reality (AR) as a new tool in physical therapy.