AWS News Blog

Amazon Comprehend – Continuously Trained Natural Language Processing

Many years ago I was wandering through the University of Maryland CS Library and found a dusty old book titled What Computers Can’t Do, adjacent to its successor, What Computers Still Can’t Do. The second book was thicker, which made me realize that Computer Science was a worthwhile field to study. While preparing to write this post I tracked down an archive copy of the first book and came across an interesting observation:

Since a human being using and understanding a sentence in a natural language requires an implicit knowledge of the sentence’s context-dependent use, the only way to make a computer that could understand and translate a natural language may well be, as Turing suspected, to program it to learn about the world.

This was a very prescient observation and I’d like to tell you about Amazon Comprehend, a new service that actually knows (and is very happy to share) quite a bit about the world!

Introducing Amazon Comprehend
Amazon Comprehend analyzes text and tells you what it finds, starting with the language, from Afrikaans to Yoruba, with 98 more in between. It can identify different types of entities (people, places, brands, products, and so forth), detect sentiment (positive, negative, mixed, or neutral), and extract key phrases, all from text in English or Spanish. Finally, Comprehend’s topic modeling service extracts topics from large sets of documents for analysis or topic-based grouping.

The first four functions (language detection, entity categorization, sentiment analysis, and key phrase extraction) are designed for interactive use, with responses available in hundreds of milliseconds. Topic extraction works on a job-based model, with processing time proportional to the size of the collection.

Comprehend is a continuously trained Natural Language Processing (NLP) service. Our team of engineers and data scientists continues to extend and refine the training data, with the goal of making the service increasingly accurate and more broadly applicable over time.

Exploring Amazon Comprehend
You can explore Amazon Comprehend using the Console and then build applications that make use of the Comprehend APIs. I’ll use the opening paragraph from my recent post on Direct Connect to exercise the Amazon Comprehend API Explorer. I simply paste the text into the box and click on Analyze:

Comprehend processes the text at lightning speed, highlights the entities that it identifies (as you can see above), and makes all of the other information available at a click:

Let’s look at each part of the results. Comprehend can detect many categories of entities in the text that I supply:

Here are all of the entities that were found in my text (they can also be displayed in list or raw JSON form):

Here are the first key phrases (the rest are available by clicking Show all):

Language and sentiment are simple and straightforward:

OK, so those are the interactive functions. Let’s take a look at the batch ones! I already have an S3 bucket that contains several thousand of my older blog posts, an empty one for my output, and an IAM role that allows Comprehend to access both. I enter them and click on Create job to get started:

I can see my recent jobs in the Console:

The output appears in my bucket when the job is complete:

For demo purposes I can download the data and take a peek (in most cases I would feed it into a visualization or analysis tool):

$ aws s3 ls s3://comp-out/348414629041-284ed5bdd23471b8539ed5db2e6ae1a7-1511638148578/output/
2017-11-25 19:45:09     105308 output.tar.gz
$ aws s3 cp s3://comp-out/348414629041-284ed5bdd23471b8539ed5db2e6ae1a7-1511638148578/output/output.tar.gz .
download: s3://comp-out/348414629041-284ed5bdd23471b8539ed5db2e6ae1a7-1511638148578/output/output.tar.gz to ./output.tar.gz
$ gzip -d output.tar.gz
$ tar xf output.tar
$ ls -l
total 1020
-rw-r--r-- 1 ec2-user ec2-user 495454 Nov 25 19:45 doc-topics.csv
-rw-rw-r-- 1 ec2-user ec2-user 522240 Nov 25 19:45 output.tar
-rw-r--r-- 1 ec2-user ec2-user  20564 Nov 25 19:45 topic-terms.csv
$
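
If you would rather unpack the archive from a script than from the shell, the same step can be sketched in Python with the standard tarfile module. The function name extract_topic_output and its arguments are my own illustration, not part of Comprehend, and the sketch assumes the archive has already been copied from S3 to local disk:

```python
import tarfile
from pathlib import Path


def extract_topic_output(archive_path, dest_dir="."):
    """Unpack a downloaded output.tar.gz and return the names of its members.

    Hypothetical helper: it assumes the archive came from your own
    Comprehend job, so its contents are trusted.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:gz" reads the gzip-compressed tar in one pass, so there is no
    # intermediate output.tar file left behind.
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(dest)
    return sorted(names)
```

Calling extract_topic_output("output.tar.gz", "topics") should leave doc-topics.csv and topic-terms.csv in the topics directory.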

The topic-terms.csv file clusters related terms within a common topic number (first column). Here are the first 25 lines:

topic,term,weight
000,aw,0.0926182
000,week,0.0326755
000,announce,0.0268909
000,blog,0.0206818
000,happen,0.0143501
000,land,0.0140561
000,quick,0.0143148
000,stay,0.014145
000,tune,0.0140727
000,monday,0.0125666
001,cloud,0.0521465
001,quot,0.0292118
001,compute,0.0164334
001,aw,0.0245587
001,service,0.018017
001,web,0.0133253
001,video,0.00990734
001,security,0.00810732
001,enterprise,0.00626157
001,event,0.00566274
002,storage,0.0485621
002,datar,0.0279634
002,gateway,0.015391
002,s3,0.0218211
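
To make a file like this easier to read, you might group the terms by topic and sort them by weight. Here is a minimal sketch using only the Python standard library. The helper name top_terms is my own, and the sample is a handful of rows copied from the listing above (a real run would open topic-terms.csv instead):

```python
import csv
import io
from collections import defaultdict

# A few rows copied from the listing above, standing in for the real file.
TOPIC_TERMS_SAMPLE = """topic,term,weight
000,aw,0.0926182
000,week,0.0326755
001,cloud,0.0521465
001,quot,0.0292118
002,storage,0.0485621
"""


def top_terms(csv_file, limit=3):
    """Return {topic: [term, ...]} with each topic's heaviest terms first."""
    by_topic = defaultdict(list)
    for row in csv.DictReader(csv_file):
        by_topic[row["topic"]].append((float(row["weight"]), row["term"]))
    # Sort each topic's (weight, term) pairs in descending order and keep
    # only the term text for the first `limit` entries.
    return {
        topic: [term for _, term in sorted(pairs, reverse=True)[:limit]]
        for topic, pairs in by_topic.items()
    }
```

For example, top_terms(io.StringIO(TOPIC_TERMS_SAMPLE), limit=2) groups the sample rows into three topics with at most two terms each. Sorting explicitly is worthwhile because the rows within a topic are not strictly ordered by weight in the listing.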

The doc-topics.csv file then indicates which files refer to the topics in the first file. Again, the first 25 lines:

docname,topic,proportion
calillona_brows.html,015,0.577179
calillona_brows.html,062,0.129035
calillona_brows.html,003,0.128233
calillona_brows.html,071,0.125666
calillona_brows.html,076,0.039886
amazon-rds-now-supports-sql-server-2012.html,003,0.851638
amazon-rds-now-supports-sql-server-2012.html,059,0.061293
amazon-rds-now-supports-sql-server-2012.html,032,0.050921
amazon-rds-now-supports-sql-server-2012.html,063,0.036147
amazon-rds-support-for-ssl-connections.html,048,0.373476
amazon-rds-support-for-ssl-connections.html,005,0.197734
amazon-rds-support-for-ssl-connections.html,003,0.148681
amazon-rds-support-for-ssl-connections.html,032,0.113638
amazon-rds-support-for-ssl-connections.html,041,0.100379
amazon-rds-support-for-ssl-connections.html,004,0.066092
zipkeys_simplif.html,037,1.0
cover_art_appli.html,093,1.0
reverse-dns-for-ec2s-elastic-ip-addresses.html,040,0.359862
reverse-dns-for-ec2s-elastic-ip-addresses.html,048,0.254676
reverse-dns-for-ec2s-elastic-ip-addresses.html,042,0.237326
reverse-dns-for-ec2s-elastic-ip-addresses.html,056,0.085849
reverse-dns-for-ec2s-elastic-ip-addresses.html,020,0.062287
coming-soon-oracle-database-11g-on-amazon-rds-1.html,063,0.368438
coming-soon-oracle-database-11g-on-amazon-rds-1.html,041,0.193081
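
One simple way to use this file is to pick the single strongest topic for each document, which gives you a rough topic-based grouping. The following sketch assumes the three-column layout shown above. The helper name dominant_topics is hypothetical and the sample rows are copied from the listing:

```python
import csv
import io

# A few rows copied from the listing above, standing in for the real file.
DOC_TOPICS_SAMPLE = """docname,topic,proportion
calillona_brows.html,015,0.577179
calillona_brows.html,062,0.129035
amazon-rds-now-supports-sql-server-2012.html,003,0.851638
amazon-rds-now-supports-sql-server-2012.html,059,0.061293
zipkeys_simplif.html,037,1.0
"""


def dominant_topics(csv_file):
    """Map each document name to the (topic, proportion) pair with the largest proportion."""
    best = {}
    for row in csv.DictReader(csv_file):
        share = float(row["proportion"])
        name = row["docname"]
        # Keep the row only if it beats what we have seen for this document.
        if name not in best or share > best[name][1]:
            best[name] = (row["topic"], share)
    return best
```

The topic numbers returned here can then be looked up in topic-terms.csv to see which terms characterize each document.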

Building Applications with Amazon Comprehend
In most cases you will be using the Amazon Comprehend API to add natural language processing to your own applications. Here are the principal interactive functions:

DetectDominantLanguage – Detect the dominant language of the text. Some of the other functions require you to provide this information, so call this function first.

DetectEntities – Detect entities in the text and return them in JSON form.

DetectKeyPhrases – Detect key phrases in the text and return them in JSON form.

DetectSentiment – Detect the sentiment in the text and return POSITIVE, NEGATIVE, NEUTRAL, or MIXED.
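
Here is a rough sketch of how the four functions might be chained from Python, assuming the boto3 SDK, whose Comprehend client exposes them as detect_dominant_language, detect_entities, detect_key_phrases, and detect_sentiment. The analyze helper and the shape of its return value are my own illustration. The client is passed in as an argument so that the function does not need credentials to be imported or tested:

```python
def analyze(client, text):
    """Detect the language first, then reuse its code for the other three calls.

    `client` is assumed to be a boto3 Comprehend client (or anything that
    offers the same four methods).
    """
    languages = client.detect_dominant_language(Text=text)["Languages"]
    # Several candidate languages may come back; keep the highest-scoring one.
    language_code = max(languages, key=lambda lang: lang["Score"])["LanguageCode"]

    sentiment = client.detect_sentiment(Text=text, LanguageCode=language_code)
    entities = client.detect_entities(Text=text, LanguageCode=language_code)
    phrases = client.detect_key_phrases(Text=text, LanguageCode=language_code)

    return {
        "language": language_code,
        "sentiment": sentiment["Sentiment"],
        "entities": [(e["Text"], e["Type"]) for e in entities["Entities"]],
        "key_phrases": [p["Text"] for p in phrases["KeyPhrases"]],
    }


# Real usage (needs the third-party boto3 package, AWS credentials, and a region):
#   import boto3
#   print(analyze(boto3.client("comprehend"), "Amazon Comprehend is available now."))
```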

There are also four variants of these functions (each prefixed with Batch) that can process up to 25 documents in parallel. You can use them to build high-throughput data processing pipelines.
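
As an illustration of the batch variants, this sketch splits a longer list of documents into groups of 25 and sends each group to batch_detect_sentiment, the boto3 name for BatchDetectSentiment. The batch_sentiment helper, the default language code, and the decision to leave failed documents as None are my own assumptions:

```python
def batch_sentiment(client, texts, language_code="en", batch_size=25):
    """Return one sentiment label per input text, or None where no result came back.

    `client` is assumed to be a boto3 Comprehend client. The default
    batch_size matches the 25-document limit described above.
    """
    labels = [None] * len(texts)
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        response = client.batch_detect_sentiment(
            TextList=chunk, LanguageCode=language_code
        )
        # Each result carries the index of its document within this chunk,
        # so offset it by `start` to find its position in the full list.
        for item in response["ResultList"]:
            labels[start + item["Index"]] = item["Sentiment"]
    return labels
```

Documents that the service could not process are reported separately in the response's ErrorList, and in this sketch they simply stay as None in the returned list.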

Here are the functions that you can use to create and manage topic detection jobs:

StartTopicsDetectionJob – Create a job and start it running.

ListTopicsDetectionJobs – Get the list of current and recent jobs.

DescribeTopicsDetectionJob – Get detailed information about a single job.
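
Putting the job functions together, a script could start a job and then poll until it finishes. This is only a sketch, assuming the boto3 SDK (start_topics_detection_job and describe_topics_detection_job) and one document per file in the input bucket. The run_topics_job helper, its parameters, and the polling interval are hypothetical, and you would supply your own bucket URIs and role ARN:

```python
import time


def run_topics_job(client, input_s3_uri, output_s3_uri, role_arn,
                   poll_seconds=60, sleep=time.sleep):
    """Start a topic detection job, wait for it to finish, and return its properties.

    `client` is assumed to be a boto3 Comprehend client. `sleep` is
    injectable so the polling loop can be exercised without waiting.
    """
    job_id = client.start_topics_detection_job(
        InputDataConfig={"S3Uri": input_s3_uri, "InputFormat": "ONE_DOC_PER_FILE"},
        OutputDataConfig={"S3Uri": output_s3_uri},
        DataAccessRoleArn=role_arn,
    )["JobId"]

    while True:
        props = client.describe_topics_detection_job(JobId=job_id)[
            "TopicsDetectionJobProperties"
        ]
        # Stop polling once the job reaches a terminal state.
        if props["JobStatus"] in ("COMPLETED", "FAILED", "STOPPED"):
            return props
        sleep(poll_seconds)
```

The returned properties include the job status, so the caller can check for COMPLETED before fetching the output archive from the output bucket.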

Now Available
Amazon Comprehend is available now and you can start building applications with it today!

Jeff;