Posted On: Dec 1, 2022

Amazon Comprehend announced single-step APIs that customers can now use to classify and extract entities of interest from PDF documents, Microsoft Word files, and images.

Amazon Comprehend is a natural language processing (NLP) service that provides pre-trained and custom APIs to derive insights from textual data. The new capability simplifies document processing by adding support for common document types, such as PDF documents, Microsoft Word files, and images, to the Amazon Comprehend Custom Classification and Custom Entity Recognition APIs. Previously, customers had to pre-process and flatten such documents into machine-readable text, which can reduce the quality of the document context. Now, with a single API call, customers can process scanned or digital semi-structured documents (such as PDFs, Microsoft Word documents, and images) in their native format, as well as plain-text documents, eliminating pre-processing overhead. Customers can use the new capability for both batch processing and real-time use cases.
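As an illustration of the single-call flow described above, here is a minimal Python sketch of a real-time classification request using the AWS SDK for Python (boto3). The announcement does not name specific operations or parameters; the sketch assumes the boto3 `classify_document` operation with its `Bytes` and `DocumentReaderConfig` parameters. The helper names `build_classify_request` and `classify_file`, the default region, and the endpoint ARN in the usage note are hypothetical placeholders.

```python
from pathlib import Path
from typing import Any, Dict


def build_classify_request(document_bytes: bytes, endpoint_arn: str) -> Dict[str, Any]:
    """Build keyword arguments for a real-time classification call.

    Assumption: the raw file (PDF, Word, or image) is passed via `Bytes`
    instead of pre-extracted `Text`, so no flattening step is needed.
    """
    if not document_bytes:
        raise ValueError("document_bytes must not be empty")
    return {
        "EndpointArn": endpoint_arn,  # ARN of a custom classifier endpoint
        "Bytes": document_bytes,      # the document in its native format
        # Assumed settings: how text is extracted from scanned or image input.
        "DocumentReaderConfig": {
            "DocumentReadAction": "TEXTRACT_DETECT_DOCUMENT_TEXT",
            "DocumentReadMode": "SERVICE_DEFAULT",
        },
    }


def classify_file(path: str, endpoint_arn: str, region: str = "us-east-1") -> Dict[str, Any]:
    """Read a local document and classify it with one API call."""
    # Imported lazily so the request builder works without the AWS SDK installed.
    import boto3

    client = boto3.client("comprehend", region_name=region)
    request = build_classify_request(Path(path).read_bytes(), endpoint_arn)
    return client.classify_document(**request)
```

Calling `classify_file("invoice.pdf", "<your-endpoint-arn>")` would require valid AWS credentials and a deployed custom classification endpoint. Custom entity recognition should follow the same pattern through `detect_entities`, which, to my understanding, accepts the same `Bytes` and `DocumentReaderConfig` parameters.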

Customers can process English-language documents for contextual entity recognition, and documents in German (de), English (en), Spanish (es), French (fr), Italian (it), and Portuguese (pt) for document classification. These capabilities are available in all AWS Regions where Amazon Comprehend is available. To learn more and get started, visit the Amazon Comprehend Intelligent Document Processing page, AWS News Blog, and our documentation.