AWS Machine Learning Blog
Building a custom classifier using Amazon Comprehend
Amazon Comprehend is a natural language processing (NLP) service that uses machine learning (ML) to find insights and relationships in text. Amazon Comprehend identifies the language of the text; extracts key phrases, places, people, brands, or events; and understands how positive or negative the text is. For more information about everything Amazon Comprehend can do, see Amazon Comprehend Features.
You may need NLP capabilities tailored to your needs, without having to lead a research phase. These would allow you to recognize entity types and perform document classifications that are unique to your business, such as recognizing industry-specific terms and triaging customer feedback into different categories.
Amazon Comprehend is a perfect match for these use cases. In November 2018, Amazon Comprehend added the ability for you to train it to recognize custom entities and perform custom classification. For more information, see Build Your Own Natural Language Models on AWS (no ML experience required).
This post demonstrates how to build a custom text classifier that can assign a specific label to a given text. No prior ML knowledge is required.
| About this blog post | |
| --- | --- |
| Time to complete | 1 hour for the reduced dataset; 2 hours for the full dataset |
| Cost to complete | ~$50 for the reduced dataset; ~$150 for the full dataset. This includes training, inference, and model management; see Amazon Comprehend pricing for more details. |
| Learning level | Advanced (300) |
| AWS services | Amazon Comprehend, Amazon S3, AWS Cloud9 |
Prerequisites
To complete this walkthrough, you need an AWS account and access to create resources in AWS IAM, Amazon S3, Amazon Comprehend, and AWS Cloud9 within that account.
This post uses the Yahoo answers corpus cited in the paper Text Understanding from Scratch by Xiang Zhang and Yann LeCun. This dataset is available on the AWS Open Data Registry.
You can also use your own dataset. It is recommended that you train your model with up to 1,000 training documents for each label, and that you choose labels that are clear and don't overlap in meaning. For more information, see Training a Custom Classifier.
Solution overview
The walkthrough includes the following steps:
- Preparing your environment
- Creating an S3 bucket
- Setting up IAM
- Preparing data
- Training the custom classifier
- Gathering results
For more information about how to build a custom entity recognizer to extract information such as people and organization names, locations, time expressions, and numerical values from a document, see Build a custom entity recognizer using Amazon Comprehend.
Preparing your environment
In this post, you use the AWS CLI as much as possible to speed up the experiment.
AWS Cloud9 is a cloud-based integrated development environment (IDE) that lets you write, run, and debug your code with a browser. It includes a code editor, debugger, and terminal. AWS Cloud9 comes pre-packaged with essential tools for popular programming languages and the AWS CLI pre-installed, so you don’t need to install files or configure your laptop for this workshop.
Your AWS Cloud9 environment has access to the same AWS resources as the user with which you logged in to the AWS Management Console.
To prepare your environment, complete the following steps:
- On the console, under Services, choose AWS Cloud9.
- Choose Create environment.
- For Name, enter `CustomClassifier`.
- Choose Next step.
- Under Environment settings, change the instance type to t2.large.
- Leave other settings at their defaults.
- Choose Next step.
- Review the environment settings and choose Create environment.
It can take a few minutes for your environment to be provisioned and prepared. When the environment is ready, your IDE opens to a welcome screen, which contains a terminal prompt.
You can run AWS CLI commands in this prompt the same as you would on your local computer.
- To verify that your user is logged in, enter the following command:
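A minimal sketch, assuming the standard AWS CLI identity call:

```
# Returns the identity of the caller making the request
aws sts get-caller-identity
```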
You get the following output, which indicates your account and user information:
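With the call above, the output resembles the following (all values are illustrative):

```
{
    "UserId": "AIDACKCEVSQ6C2EXAMPLE",
    "Account": "123456789012",
    "Arn": "arn:aws:iam::123456789012:user/<your-user>"
}
```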
- Record the account ID to use in the next step.
Keep your AWS Cloud9 IDE open in a tab throughout this walkthrough.
Creating an S3 bucket
Use the account ID from the previous step to create a globally unique bucket name, such as `123456789012-comprehend`. Enter the following command in your AWS Cloud9 terminal prompt:
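A minimal sketch, assuming the `aws s3 mb` (make bucket) command; substitute your own account ID:

```
# Create the bucket (the name shown is illustrative)
aws s3 mb s3://123456789012-comprehend
```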
The output shows the name of the bucket you created:
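Assuming the `aws s3 mb` command above, the confirmation resembles:

```
make_bucket: 123456789012-comprehend
```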
Setting up IAM
To authorize Amazon Comprehend to read from and write to the bucket during training and inference, you must grant Amazon Comprehend access to the S3 bucket that you created. You do this by creating a data access role in your account that trusts the Amazon Comprehend service principal.
To set up IAM, complete the following steps:
- On the console, under Services, choose IAM.
- Choose Roles.
- Choose Create role.
- Select AWS service as the type of trusted entity.
- Choose Comprehend as the service that uses this role.
- Choose Next: Permissions.
The policy named `ComprehendDataAccessRolePolicy` is automatically attached.
- Choose Next: Review.
- For Role name, enter `ComprehendBucketAccessRole`.
- Choose Create role.
- Record the Role ARN.
You use this ARN when you launch the training of your custom classifier.
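For reference, the role's trust relationship lets the Amazon Comprehend service principal assume it. A minimal sketch of that trust policy document:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": { "Service": "comprehend.amazonaws.com" },
            "Action": "sts:AssumeRole"
        }
    ]
}
```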
Preparing data
In this step, you download the corpus and prepare the data to match Amazon Comprehend's expected formats for both training and inference. This post provides a script to help you with the data preparation for your dataset.
Alternatively, for even more convenience, you can download the already-prepared data by entering the following two commands:
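The exact source location from the original post isn't reproduced here; the commands take this general shape (the source bucket path is a hypothetical placeholder):

```
# Hypothetical source location -- replace with the artifact URL given in the post
aws s3 cp s3://<blog-artifacts-bucket>/comprehend-train.csv .
aws s3 cp s3://<blog-artifacts-bucket>/comprehend-test.csv .
```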
If you follow the preceding step, skip the next steps and go directly to the upload part at the end of this section.
If you want to go through the dataset preparation for this walkthrough, or if you are using your own data, follow the next steps:
Enter the following command in your AWS Cloud9 terminal prompt to download it from the AWS Open Data registry:
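A sketch, assuming the corpus lives in the fast.ai NLP datasets bucket on the Registry of Open Data (verify the current location on the registry page):

```
# Download the compressed Yahoo answers corpus from the public bucket
aws s3 cp s3://fast-ai-nlp/yahoo_answers_csv.tgz .
```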
You see a progress bar and then the following output:
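Assuming the command above, the confirmation resembles:

```
download: s3://fast-ai-nlp/yahoo_answers_csv.tgz to ./yahoo_answers_csv.tgz
```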
Uncompress it with the following command:
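Assuming the archive name above:

```
# Extract the corpus into the yahoo_answers_csv folder
tar xvzf yahoo_answers_csv.tgz
```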
You should delete the archive because you are limited in available space in your AWS Cloud9 environment. Use the following command:
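Again assuming the same archive name:

```
# Free up disk space in the AWS Cloud9 environment
rm -f yahoo_answers_csv.tgz
```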
You get a folder `yahoo_answers_csv`, which contains the following four files:
- classes.txt
- readme.txt
- test.csv
- train.csv
The files `train.csv` and `test.csv` contain the training samples as comma-separated values. There are four columns in them, corresponding to class index (1 to 10), question title, question content, and best answer. The text fields are escaped using double quotes ("), and any internal double quote is escaped by two double quotes (""). Newlines are escaped by a backslash followed by an "n" character, that is, "\n".
The following code is the overview of file content:
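The corpus lines themselves aren't reproduced here; schematically, each row follows this four-column layout (the values are placeholders, not real corpus content):

```
"5","<question title>","<question content>","<best answer>"
"2","<question title>","<question content>","<best answer>"
```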
The file `classes.txt` contains the available labels.

The `train.csv` file contains 1,400,000 lines and `test.csv` contains 60,000 lines. Amazon Comprehend uses between 10% and 20% of the documents submitted for training to test the custom classifier.
The following command indicates that the data is evenly distributed:
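One possible sketch of such a check (the exact command from the original post may differ):

```
# Count how often each class index appears in column 1; each of the
# 10 classes should appear 140,000 times in train.csv
cut -d',' -f1 yahoo_answers_csv/train.csv | sort | uniq -c
```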
You should train your model with up to 1,000 training documents for each label and no more than 1,000,000 documents.
With 20% of 1,000,000 used for testing, that is still plenty of data to train your custom classifier.
Use a shortened version of `train.csv` to train your custom Amazon Comprehend model, and use `test.csv` to perform your validation and see how well your custom model performs.
For training, the file format must conform to the following requirements:
- File must contain one label and one text per line – 2 columns
- No header
- UTF-8 format, with newline ("\n") line endings.
Labels must be uppercase. They can be multi-token, contain white space, consist of multiple words connected by underscores or hyphens, and can even contain a comma, as long as it is correctly escaped.
The following table contains the formatted labels proposed for the training.
| Index | Original | For training |
| --- | --- | --- |
| 1 | Society & Culture | SOCIETY_AND_CULTURE |
| 2 | Science & Mathematics | SCIENCE_AND_MATHEMATICS |
| 3 | Health | HEALTH |
| 4 | Education & Reference | EDUCATION_AND_REFERENCE |
| 5 | Computers & Internet | COMPUTERS_AND_INTERNET |
| 6 | Sports | SPORTS |
| 7 | Business & Finance | BUSINESS_AND_FINANCE |
| 8 | Entertainment & Music | ENTERTAINMENT_AND_MUSIC |
| 9 | Family & Relationships | FAMILY_AND_RELATIONSHIPS |
| 10 | Politics & Government | POLITICS_AND_GOVERNMENT |
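Putting the format rules and the label mapping together, a training line looks schematically like this (the text is a placeholder):

```
SPORTS,"<a properly escaped question and answer about sports>"
```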
When you want your custom Amazon Comprehend model to determine which label corresponds to a given text in an asynchronous way, the file format must conform to the following requirements:
- File must contain one text per line
- No header
- UTF-8 format, with newline ("\n") line endings.
This post includes a script to speed up the data preparation. Enter the following command to copy the script to your local AWS Cloud9 environment:
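The script's source location from the original post isn't reproduced here; the copy command takes this general shape (the source path is a hypothetical placeholder):

```
# Hypothetical source path -- replace with the script location given in the post
aws s3 cp s3://<blog-artifacts-bucket>/prepare_data.py .
```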
To launch data preparation, enter the following commands:
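A sketch, assuming the script is named prepare_data.py and that pandas is not yet installed in the environment:

```
# Install pandas for the Python 3 runtime bundled with AWS Cloud9
sudo python3 -m pip install pandas

# Run the data preparation script against the uncompressed corpus
python3 prepare_data.py
```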
This script is tied to the Yahoo corpus and uses the pandas library to format the training and testing datasets to match Amazon Comprehend's expected formats. You can adapt it to your own dataset or change the number of items in the training and validation datasets.
When the script is finished (it runs for approximately 11 minutes on a t2.large instance for the full dataset, and under 5 minutes for the reduced dataset), you have the following new files in your environment:
- comprehend-train.csv
- comprehend-test.csv
Upload the prepared data (either the one you downloaded or the one you prepared) to the bucket you created with the following commands:
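Assuming the bucket created earlier (substitute your own bucket name):

```
aws s3 cp comprehend-train.csv s3://123456789012-comprehend/
aws s3 cp comprehend-test.csv s3://123456789012-comprehend/
```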
Training the custom classifier
You are ready to launch the custom text classifier training. Enter the following command, and replace the role ARN and bucket name with your own:
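A sketch of the training call; the classifier name yahoo-answers matches the console view shown later, and the us-east-1 region is an assumption:

```
aws comprehend create-document-classifier \
    --document-classifier-name "yahoo-answers" \
    --language-code en \
    --input-data-config S3Uri=s3://123456789012-comprehend/comprehend-train.csv \
    --data-access-role-arn "arn:aws:iam::123456789012:role/ComprehendBucketAccessRole"
```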
You get the following output, which contains the ARN of the custom classifier:
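It resembles the following (account and region illustrative):

```
{
    "DocumentClassifierArn": "arn:aws:comprehend:us-east-1:123456789012:document-classifier/yahoo-answers"
}
```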
It is an asynchronous call. You can then track the training progress with the following command:
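Using the ARN returned above:

```
aws comprehend describe-document-classifier \
    --document-classifier-arn "arn:aws:comprehend:us-east-1:123456789012:document-classifier/yahoo-answers"
```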
You get the following output:
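While the model trains, the Status field reads TRAINING (response abridged):

```
{
    "DocumentClassifierProperties": {
        "DocumentClassifierArn": "arn:aws:comprehend:us-east-1:123456789012:document-classifier/yahoo-answers",
        "Status": "TRAINING",
        ...
    }
}
```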
When the training is finished, you get the following output:
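The Status switches to TRAINED and the response carries evaluation metrics. Abridged and illustrative, keeping only the recall value discussed below:

```
{
    "DocumentClassifierProperties": {
        "Status": "TRAINED",
        "ClassifierMetadata": {
            "NumberOfLabels": 10,
            "EvaluationMetrics": {
                "Recall": 0.72,
                ...
            }
        },
        ...
    }
}
```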
The training duration may vary; in this case, the training took approximately one hour for the full dataset (20 minutes for the reduced dataset).
The output for the training on the full dataset shows that your model has a recall of 0.72—in other words, it correctly identifies 72% of given labels.
The following screenshot shows the view from the console (Comprehend > Custom Classification > yahoo-answers).
Gathering results
You can now launch an inference job to test how the classifier performs. Enter the following commands:
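A sketch using the start-document-classification-job API; the InferenceOutput/ prefix is an assumption:

```
aws comprehend start-document-classification-job \
    --document-classifier-arn "arn:aws:comprehend:us-east-1:123456789012:document-classifier/yahoo-answers" \
    --input-data-config "S3Uri=s3://123456789012-comprehend/comprehend-test.csv,InputFormat=ONE_DOC_PER_LINE" \
    --output-data-config "S3Uri=s3://123456789012-comprehend/InferenceOutput/" \
    --data-access-role-arn "arn:aws:iam::123456789012:role/ComprehendBucketAccessRole"
```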
You get the following output:
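The response carries a job ID (value illustrative):

```
{
    "JobId": "a1b2c3d4e5f67890a1b2c3d4e5f67890",
    "JobStatus": "SUBMITTED"
}
```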
Just as with the training, the inference is asynchronous; you can check the progress of the newly launched job with the following command:
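Using the JobId returned above:

```
aws comprehend describe-document-classification-job \
    --job-id a1b2c3d4e5f67890a1b2c3d4e5f67890
```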
You get the following output:
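While the job runs, the response resembles the following (abridged; note the output location under OutputDataConfig):

```
{
    "DocumentClassificationJobProperties": {
        "JobId": "a1b2c3d4e5f67890a1b2c3d4e5f67890",
        "JobStatus": "IN_PROGRESS",
        "OutputDataConfig": {
            "S3Uri": "s3://123456789012-comprehend/InferenceOutput/.../output/output.tar.gz"
        },
        ...
    }
}
```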
When it is complete, `JobStatus` changes to `COMPLETED`. This takes a few minutes.
Download the results from the OutputDataConfig.S3Uri path with the following command:
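Substituting the S3Uri value from your own describe call (the path below is illustrative):

```
aws s3 cp s3://123456789012-comprehend/InferenceOutput/.../output/output.tar.gz .
```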
When you uncompress the output (tar xvzf output.tar.gz), you get a .jsonl file. Each line represents the result of the requested classification for the corresponding line of the document you submitted.
For example, the following code is one line from the predictions:
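A schematic of one prediction line, matching the score discussed next; the second and third labels and their scores are illustrative:

```
{"File": "comprehend-test.csv", "Line": "0", "Classes": [{"Name": "ENTERTAINMENT_AND_MUSIC", "Score": 0.9685}, {"Name": "EDUCATION_AND_REFERENCE", "Score": 0.0159}, {"Name": "FAMILY_AND_RELATIONSHIPS", "Score": 0.0102}]}
```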
This means that your custom model predicted with a 96.8% confidence score that the corresponding text was related to the ENTERTAINMENT_AND_MUSIC label.
Each line of results also provides the second and third most likely labels. You can use these scores to build your application logic, for example applying each label with a score above 40%, or changing the model if no single score is above 70%.
Summary
With the full dataset for training and validation, in less than two hours you used Amazon Comprehend to learn 10 custom categories, achieving a 72% recall on the test, and applied that custom model to 60,000 documents.
Try custom categories now from the Amazon Comprehend console. For more information, see Custom Classification. You can discover other Amazon Comprehend features and get inspiration from other AWS blog posts about how to use Amazon Comprehend beyond classification.
Amazon Comprehend can help you power your application with NLP capabilities in almost no time. Happy experimentation!
About the Author
Hervé Nivon is a Solutions Architect who helps startup customers grow their business on AWS. Before joining AWS, Hervé was the CTO of a company generating business insights for enterprises from commercial unmanned aerial vehicle imagery. Hervé has also served as a consultant at Accenture.