Amazon Comprehend – Continuously Trained Natural Language Processing
Many years ago I was wandering through the University of Maryland CS Library and found a dusty old book titled What Computers Can’t Do, adjacent to its successor, What Computers Still Can’t Do. The second book was thicker, which made me realize that Computer Science was a worthwhile field to study. While preparing to write this post I found an archive copy of the first book and found an interesting observation:
Since a human being using and understanding a sentence in a natural language requires an implicit knowledge of the sentence’s context-dependent use, the only way to make a computer that could understand and translate a natural language may well be, as Turing suspected, to program it to learn about the world.
This was a very prescient observation and I’d like to tell you about Amazon Comprehend, a new service that actually knows (and is very happy to share) quite a bit about the world!
Introducing Amazon Comprehend
Amazon Comprehend analyzes text and tells you what it finds, starting with the language, from Afrikaans to Yoruba, with 98 more in between. It can identify different types of entities (people, places, brands, products, and so forth), extract key phrases, and determine sentiment (positive, negative, mixed, or neutral), all from text in English or Spanish. Finally, Comprehend’s topic modeling service extracts topics from large sets of documents for analysis or topic-based grouping.
The first four functions (language detection, entity categorization, sentiment analysis, and key phrase extraction) are designed for interactive use, with responses available in hundreds of milliseconds. Topic extraction works on a job-based model, with response times proportional to the size of the collection.
Comprehend is a continuously trained Natural Language Processing (NLP) service. Our team of engineers and data scientists continues to extend and refine the training data, with the goal of making the service increasingly accurate and more broadly applicable over time.
Exploring Amazon Comprehend
You can explore Amazon Comprehend using the Console and then build applications that make use of the Comprehend APIs. I’ll use the opening paragraph from my recent post on Direct Connect to exercise the Amazon Comprehend API Explorer. I simply paste the text into the box and click on Analyze:
Comprehend processes the text at lightning speed, highlights the entities that it identifies (as you can see above), and makes all of the other information available at a click:
Let’s look at each part of the results. Comprehend can detect many categories of entities in the text that I supply:
Here are all of the entities that were found in my text (they can also be displayed in list or raw JSON form):
Here are the first key phrases (the rest are available by clicking Show all):
Language and sentiment are simple and straightforward:
Ok, so those are the interactive functions. Let’s take a look at the batch ones! I already have an S3 bucket that contains several thousand of my older blog posts, an empty one for my output, and an IAM role that allows Comprehend to access both. I enter them and click on Create job to get started:
I can see my recent jobs in the Console:
The output appears in my bucket when the job is complete:
For demo purposes I can download the data and take a peek (in most cases I would feed it in to a visualization or analysis tool):
The topic-terms.csv file clusters related terms within a common topic number (first column). Here are the first 25 lines:
The doc-topics.csv file then indicates which files refer to the topics in the first file. Again, the first 25 lines:
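If you would rather process the two files in code than open them by hand, here is a minimal Python sketch that joins them and labels each document with the top terms of its strongest topic. The column names (topic, term, weight and docname, topic, proportion) are my assumption about the file layout, and the sample rows are made up for illustration rather than taken from my job output:

```python
import csv
import io
from collections import defaultdict

# Made-up stand-ins for the two output files; column names are assumed.
TOPIC_TERMS_CSV = """topic,term,weight
0,ec2,0.041
0,instance,0.032
1,s3,0.050
1,bucket,0.027
"""

DOC_TOPICS_CSV = """docname,topic,proportion
post-001.txt,0,0.81
post-001.txt,1,0.19
post-002.txt,1,1.0
"""


def load_topic_terms(text):
    """Map each topic number to its (term, weight) pairs, heaviest first."""
    topics = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        topics[int(row["topic"])].append((row["term"], float(row["weight"])))
    for terms in topics.values():
        terms.sort(key=lambda pair: pair[1], reverse=True)
    return dict(topics)


def dominant_topic_per_doc(text):
    """Map each document name to its (topic, proportion) with the largest proportion."""
    best = {}
    for row in csv.DictReader(io.StringIO(text)):
        doc = row["docname"]
        topic = int(row["topic"])
        share = float(row["proportion"])
        if doc not in best or share > best[doc][1]:
            best[doc] = (topic, share)
    return best


def label_documents(topic_terms_text, doc_topics_text, top_n=2):
    """Label each document with the top terms of its dominant topic."""
    topics = load_topic_terms(topic_terms_text)
    labels = {}
    for doc, (topic, _share) in dominant_topic_per_doc(doc_topics_text).items():
        labels[doc] = [term for term, _weight in topics.get(topic, [])[:top_n]]
    return labels
```

Calling label_documents(TOPIC_TERMS_CSV, DOC_TOPICS_CSV) returns a dictionary keyed by document name. To use real output, read the downloaded files into strings and pass those in instead, adjusting the column names if your files differ.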
Building Applications with Amazon Comprehend
In most cases you will be using the Amazon Comprehend API to add natural language processing to your own applications. Here are the principal interactive functions:
DetectDominantLanguage – Detect the dominant language of the text. Some of the other functions require you to provide this information, so call this function first.
DetectEntities – Detect entities in the text and return them in JSON form.
DetectKeyPhrases – Detect key phrases in the text and return them in JSON form.
DetectSentiment – Detect the sentiment in the text and return POSITIVE, NEGATIVE, NEUTRAL, or MIXED.
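To give you a feel for how these four calls fit together, here is a minimal Python sketch using the AWS SDK for Python (boto3). The helper names (make_client, analyze), the default region, and the shape of the returned dictionary are my own choices for illustration; the client methods and response keys are the boto3 equivalents of the functions listed above. It detects the language first and then passes the language code to the other three calls:

```python
def make_client(region_name="us-east-1"):
    """Create a Comprehend client (needs boto3 and AWS credentials).

    The region is an assumption; use one where Comprehend is available to you.
    """
    import boto3  # imported lazily so analyze() can be exercised with a stub client
    return boto3.client("comprehend", region_name=region_name)


def analyze(client, text):
    """Run the four interactive functions on a single piece of text."""
    # Step 1: find the dominant language, since the other calls need it.
    languages = client.detect_dominant_language(Text=text)["Languages"]
    language_code = max(languages, key=lambda lang: lang["Score"])["LanguageCode"]

    # Step 2: pass the detected language code to the remaining calls.
    entities = client.detect_entities(Text=text, LanguageCode=language_code)
    phrases = client.detect_key_phrases(Text=text, LanguageCode=language_code)
    sentiment = client.detect_sentiment(Text=text, LanguageCode=language_code)

    return {
        "language": language_code,
        "entities": [(e["Text"], e["Type"]) for e in entities["Entities"]],
        "key_phrases": [p["Text"] for p in phrases["KeyPhrases"]],
        "sentiment": sentiment["Sentiment"],
    }
```

With credentials in place, analyze(make_client(), "some text") returns a small summary dictionary. A production version would also check that the detected language is one that the entity, key phrase, and sentiment functions support before calling them.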
There are also four variants of these functions (each prefixed with Batch) that can process up to 25 documents in parallel. You can use them to build high-throughput data processing pipelines.
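Because each Batch call accepts at most 25 documents, a pipeline has to split larger collections into chunks. Here is a minimal Python sketch of that pattern using the batch sentiment call. The function name batch_sentiment and the default language code are my own assumptions, and client is expected to be a boto3 Comprehend client (or anything with the same batch_detect_sentiment method):

```python
BATCH_LIMIT = 25  # maximum number of documents per Batch call


def batch_sentiment(client, texts, language_code="en"):
    """Return one sentiment label per input text, or None where a document failed."""
    sentiments = [None] * len(texts)
    for start in range(0, len(texts), BATCH_LIMIT):
        chunk = texts[start:start + BATCH_LIMIT]
        response = client.batch_detect_sentiment(
            TextList=chunk, LanguageCode=language_code
        )
        # Index is the position within this chunk, so shift it by the chunk start.
        for result in response["ResultList"]:
            sentiments[start + result["Index"]] = result["Sentiment"]
        # Documents reported in response["ErrorList"] keep their None placeholder.
    return sentiments
```

The chunks are sent one after another here to keep the sketch short. For higher throughput you could submit them from several workers, keeping the service limits for your account in mind.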
Here are the functions that you can use to create and manage topic detection jobs:
StartTopicsDetectionJob – Create a job and start it running.
ListTopicsDetectionJobs – Get the list of current and recent jobs.
DescribeTopicsDetectionJob – Get detailed information about a single job.
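Here is a minimal Python sketch of how the start and describe functions could be combined to launch a topic detection job and wait for it to finish. The function name run_topics_job, the default topic count, the polling interval, and the choice of one document per file are my own assumptions; client is expected to be a boto3 Comprehend client, and the S3 locations and role ARN are whatever you set up for your own job:

```python
import time

# Job states after which polling can stop.
TERMINAL_STATES = {"COMPLETED", "FAILED", "STOPPED"}


def run_topics_job(client, input_s3_uri, output_s3_uri, role_arn,
                   number_of_topics=10, poll_seconds=60, sleep=time.sleep):
    """Start a topic detection job and poll until it reaches a terminal state."""
    job = client.start_topics_detection_job(
        InputDataConfig={"S3Uri": input_s3_uri, "InputFormat": "ONE_DOC_PER_FILE"},
        OutputDataConfig={"S3Uri": output_s3_uri},
        DataAccessRoleArn=role_arn,
        NumberOfTopics=number_of_topics,
    )
    job_id = job["JobId"]

    while True:
        description = client.describe_topics_detection_job(JobId=job_id)
        properties = description["TopicsDetectionJobProperties"]
        if properties["JobStatus"] in TERMINAL_STATES:
            return properties
        sleep(poll_seconds)  # the job is still submitted or in progress
```

The returned properties include the final JobStatus, so the caller can tell a completed job from a failed one before looking for output in the bucket. The sleep parameter exists only so that the polling loop can be exercised without real delays.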
Amazon Comprehend is available now and you can start building applications with it today!