Posted On: Sep 21, 2021

Amazon Comprehend has launched a suite of features for Comprehend Custom that enable continuous model improvement: developers can now create new model versions, repeatedly test against specific test sets, and migrate new models to existing endpoints. Using AutoML, custom entity recognition lets you customize Amazon Comprehend to identify entities that are specific to your domain, and custom classification lets you build custom text classification models using your business-specific labels. Custom models can then be used to perform inference on text documents in both real-time and batch processing modes. Creating a custom model requires no machine learning experience. These features are described in detail below:

Improved Model Management - In most natural language processing (NLP) projects, models are retrained over time as new data is collected or as the documents processed at inference drift away from the training dataset. With model versioning and live endpoint updates, you can continuously retrain new model versions, compare accuracy metrics across versions, and update live endpoints with the best-performing model with a single click.

  • Model Versioning allows you to retrain newer versions of an existing model, making it easier to iterate and track changes in accuracy. Each new version can be identified with a unique version ID.
  • Active Endpoint Update lets you update an active synchronous endpoint with a new model, so you can deploy a new model version into production without any downtime.
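
The retrain, compare, and update workflow described above can be sketched in Python. This is a minimal illustration, not an official sample: it assumes a boto3 Comprehend client is passed in, and the request and response field names (`VersionName`, `DesiredModelArn`, `ClassifierMetadata`, `F1Score`) reflect our reading of the Comprehend API and should be checked against the current documentation. The helper names and the choice of F1 score as the comparison metric are hypothetical.

```python
def train_new_version(client, classifier_name, version_name, role_arn, train_s3_uri):
    """Start training a new version of an existing custom classifier.

    `client` is assumed to be a boto3 Comprehend client. Reusing the same
    classifier name with a new version name creates a new model version.
    """
    response = client.create_document_classifier(
        DocumentClassifierName=classifier_name,
        VersionName=version_name,
        DataAccessRoleArn=role_arn,
        InputDataConfig={"S3Uri": train_s3_uri},
        LanguageCode="en",
    )
    return response["DocumentClassifierArn"]


def pick_best_version(versions):
    """Return the ARN of the version with the highest F1 score.

    `versions` is a list of dicts shaped like the classifier properties
    returned when describing a trained model (assumed field names).
    Versions without metrics are treated as scoring 0.0.
    """
    def f1(props):
        metrics = props.get("ClassifierMetadata", {}).get("EvaluationMetrics", {})
        return metrics.get("F1Score", 0.0)

    return max(versions, key=f1)["DocumentClassifierArn"]


def promote_to_endpoint(client, endpoint_arn, model_arn):
    """Point an active synchronous endpoint at a different model version."""
    client.update_endpoint(EndpointArn=endpoint_arn, DesiredModelArn=model_arn)
```

In practice the client would typically come from `boto3.client("comprehend")`, and a new version would need to finish training before its metrics can be compared or the endpoint updated.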

Improved Control for Model Training/Evaluation - Data preparation and model evaluation are often the most tedious parts of any NLP project. Model evaluation and troubleshooting can be confusing without a clear indication of how the data was split between training and testing. You can now provide separate training and test datasets during model training. We also launched a new training mode that improves inference accuracy on long documents that span multiple paragraphs.

  • Customer Provided Test Dataset allows you to provide an optional test dataset during model training. Previously, you had to manually run an inference job against a test set to evaluate a model. As additional data is collected and new model versions are trained, evaluating every version against the same test dataset gives a fair comparison across model versions.
  • New Training Mode improves the accuracy of the entity recognizer model on long documents that contain multiple paragraphs. When training with CSV annotations, choosing the ONE_DOC_PER_FILE input format for long documents allows the model to learn more contextual embeddings, significantly improving model accuracy.
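
As a rough illustration of the two options above, the following Python sketch builds the input configuration for an entity recognizer training request that uses CSV annotations, the ONE_DOC_PER_FILE input format, and an optional customer-provided test dataset. The helper name and S3 paths are hypothetical, and the field names (`DataFormat`, `Documents`, `Annotations`, `TestS3Uri`, `InputFormat`) are our understanding of the Comprehend API rather than something stated in this announcement.

```python
def build_recognizer_input_config(entity_types, docs_uri, annotations_uri,
                                  test_docs_uri=None, test_annotations_uri=None):
    """Build an InputDataConfig dict for training a custom entity recognizer.

    Uses CSV annotations with one document per file, the format suggested
    for long, multi-paragraph documents. Test locations are optional, but
    must be supplied together.
    """
    if (test_docs_uri is None) != (test_annotations_uri is None):
        raise ValueError("Provide both test documents and test annotations, or neither.")

    documents = {"S3Uri": docs_uri, "InputFormat": "ONE_DOC_PER_FILE"}
    annotations = {"S3Uri": annotations_uri}

    if test_docs_uri is not None:
        # Keeping the same test set across versions makes metrics comparable.
        documents["TestS3Uri"] = test_docs_uri
        annotations["TestS3Uri"] = test_annotations_uri

    return {
        "DataFormat": "COMPREHEND_CSV",
        "EntityTypes": [{"Type": name} for name in entity_types],
        "Documents": documents,
        "Annotations": annotations,
    }
```

The resulting dictionary would be passed as `InputDataConfig` when creating the entity recognizer, for example through a boto3 Comprehend client's `create_entity_recognizer` call, along with a recognizer name, version name, role ARN, and language code.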

To learn more and get started, visit the Amazon Comprehend product page or our documentation.