AWS Machine Learning Blog

Build Your Own Natural Language Models on AWS (no ML experience required)

At AWS re:Invent last year, we announced Amazon Comprehend, a natural language processing service that extracts key phrases, places, people’s names, brands, events, and sentiment from unstructured text. Comprehend, which is powered by sophisticated deep learning models trained by AWS, allows any developer to add natural language processing to their applications without requiring any machine learning skills.

Today we are excited to bring new customization features to Comprehend, which allow developers to extend it to identify natural language terms and classify text specific to their team, business, or industry.

Many customers tell us they have a surplus of data, specifically unstructured natural language text. You likely won’t have to look far inside your own organization before you find a treasure trove of potential information hiding inside reams of customer emails, support tickets, financial reports, product reviews, social media, or advertising copy. Finding the needle in this proverbial haystack is something machine learning is particularly good at: machine learning models can be extremely accurate at picking out specific items of interest in vast swathes of text (such as finding company names in analyst reports), and are sensitive to the sentiment hidden inside language (identifying negative reviews, or positive customer interactions with customer service agents).

While Comprehend has highly accurate models for finding generic terms (such as places and things), customers often want to extend this capability to identify more specific language, such as policy numbers or part codes. This usually involves starting from scratch and building new, specialized machine learning language models: annotating data, selecting algorithms, tuning parameters, optimizing models, and running them in production. Not only do these steps all require deep machine learning expertise, but they also represent “undifferentiated heavy lifting”, effort that many application developers would rather spend on building new features of their own.

Customize Amazon Comprehend (No ML Experience Required)

Today, we’re helping customers find more needles in more haystacks, with no machine learning skills required. Under the hood, Comprehend will do the heavy lifting to build, train, and host the customized machine learning models, and make those models available through a private API.

Custom Entities allows developers to customize Comprehend to identify terms that are specific to their domain. Comprehend will learn from a small private index of examples (for example, a list of policy numbers and text in which they are used), and train a private, custom model to recognize these terms in any other block of text. There are no servers to manage, and no algorithms to master.
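To make the "small private index of examples" a little more concrete, here is a minimal sketch in Python that turns a list of policy numbers into a CSV entity list. It is illustrative only: the `Text,Type` header reflects our understanding of the entity-list layout Comprehend accepts and should be checked against the current documentation, and the helper name `build_entity_list`, the sample policy numbers, and the `POLICY_NUMBER` type are all hypothetical.

```python
import csv
import io


def build_entity_list(terms, entity_type):
    """Return CSV text with one row per unique example term.

    Hypothetical helper: the "Text,Type" header is an assumption about
    the entity-list layout, not something stated in this post.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Text", "Type"])

    seen = set()
    for term in terms:
        term = term.strip()
        # Skip blanks and duplicates so each example appears once.
        if term and term not in seen:
            seen.add(term)
            writer.writerow([term, entity_type])
    return buf.getvalue()


if __name__ == "__main__":
    # Made-up policy numbers, for illustration only.
    print(build_entity_list(["POL-0001", "POL-0002", "POL-0001"], "POLICY_NUMBER"))
```

The resulting file, together with sample text in which the terms appear, is the kind of input you would hand to Comprehend for training; the upload and training calls themselves are left out here.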

Custom Classification allows developers to group documents into named categories. With as few as 50 examples, Comprehend will automatically train a custom classification model that can be used to categorize all your documents. You could group support emails by department, social media posts by product, or analyst reports by business unit. If you don’t have any examples, or your categories change frequently (which is common in social media), Comprehend can also classify based on just the content of the documents, using Topic Modeling.
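As a rough sketch of what preparing those examples might look like, the Python snippet below writes labeled documents to a two-column CSV (label first, then text) and checks that at least 50 examples are present. The two-column layout is our assumption about the training format, the 50-example floor is applied to the total because this post does not say whether it is per category, and the helper `build_training_csv` and the sample labels are hypothetical.

```python
import csv
import io
from collections import Counter

# "As few as 50 examples", per the post; applied to the total here (assumption).
MIN_EXAMPLES = 50


def build_training_csv(examples):
    """Turn (label, text) pairs into CSV text plus per-label counts.

    Hypothetical helper: one document per row, label in the first column.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    counts = Counter()

    for label, text in examples:
        # Collapse whitespace and newlines so each document stays on one row.
        writer.writerow([label, " ".join(text.split())])
        counts[label] += 1

    total = sum(counts.values())
    if total < MIN_EXAMPLES:
        raise ValueError(f"need at least {MIN_EXAMPLES} examples, got {total}")
    return buf.getvalue(), counts


if __name__ == "__main__":
    # Made-up support emails grouped by department, for illustration only.
    emails = [("BILLING", f"Question about invoice {i}") for i in range(30)]
    emails += [("SHIPPING", f"Where is order {i}?") for i in range(20)]
    csv_text, per_label = build_training_csv(emails)
    print(per_label)
```

Counting examples per label before training is a cheap way to spot categories that are badly under-represented.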

Customer Success with Amazon Comprehend

When it comes to understanding unstructured text in a specific domain, natural language doesn’t come much more specialized than in the legal profession. The “legalese” used in most legal documents is famous for its complex syntax, nomenclature, and structure. It’s a great example of where Comprehend Custom Entities can help; we worked with LexisNexis while developing these new capabilities, to extract legal entities from hundreds of millions of documents, with very high accuracy.

“We provide legal professionals with insightful research and analytics to help them make informed decisions,” said Rick McFarland, Chief Data Officer of LexisNexis. “Therefore, we are always looking for better ways to discover insights from legal documents. Thanks to Amazon Comprehend’s automatic machine learning, we can now build accurate custom entity recognition models without getting into the complexities associated with ML. The entities that we care about the most, such as judge and attorney, can be identified quickly from more than 200 million documents at greater than 92 percent accuracy.”

New Amazon Comprehend features are now Generally Available

Since the earliest days of AWS, our goal has been to take technology that has traditionally been within reach only of large, well-funded organizations, and to put it in the hands of all developers. Just as with services such as EC2 and RDS, doing this for machine learning means we need to continue to invent and simplify on behalf of our customers, across the machine learning stack. These new capabilities for Comprehend are a perfect reflection of this spirit; we’re excited to see what you build with them.


Dr. Matt Wood, General Manager of Artificial Intelligence, AWS