Posted On: Mar 24, 2022

Amazon Comprehend now supports documents in image formats in addition to plain text, PDF, and Word. Customers can now use Comprehend custom entity recognition to extract entities from image files (JPG, PNG, TIFF), and can also run Comprehend directly on Amazon Textract JSON outputs to extract custom entities from documents. With this launch, customers can simplify their intelligent document processing (IDP) workflows by taking advantage of an out-of-the-box integration between Comprehend and Textract. The features are described in detail below:

Custom NER on image files - Amazon Comprehend previously launched custom entity recognition support for PDF and Word documents (see announcement for details). Starting today, customers can also use Comprehend to extract information from documents in image files (JPG, PNG, TIFF) to further support diverse document processing workflows. This feature removes the need to post-process OCR output before running entity extraction with Comprehend. Customers first annotate and train a custom entity recognition model on PDF documents. The trained custom entity recognition model leverages both the natural language and positional information (e.g., coordinates) of the text to accurately extract custom entities from PDF, Word, plain text, and now, image formats during inference. See documentation for more details.
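
As an illustration of inference on image files, the following is a minimal sketch (not an official sample) of the request parameters for an asynchronous Comprehend entity detection job using the AWS SDK for Python. The helper name `build_entities_job_request`, the bucket paths, and the ARNs are hypothetical placeholders; the assumption is that the input S3 prefix contains JPG, PNG, or TIFF files and that the chosen `DocumentReaderConfig` values suit your documents.

```python
from typing import Any, Dict


def build_entities_job_request(
    job_name: str,
    input_s3_uri: str,
    output_s3_uri: str,
    recognizer_arn: str,
    role_arn: str,
    language_code: str = "en",
) -> Dict[str, Any]:
    """Build keyword arguments for Comprehend's StartEntitiesDetectionJob.

    All argument values are supplied by the caller; nothing here is
    specific to a real account. The function only assembles a dict and
    makes no network calls.
    """
    return {
        "JobName": job_name,
        # ARN of a custom entity recognizer that has already been trained.
        "EntityRecognizerArn": recognizer_arn,
        "LanguageCode": language_code,
        # IAM role that lets Comprehend read the input and write the output.
        "DataAccessRoleArn": role_arn,
        "InputDataConfig": {
            "S3Uri": input_s3_uri,
            # Each file (image, PDF, Word, or text) is treated as one document.
            "InputFormat": "ONE_DOC_PER_FILE",
            # Assumption: let the service decide when to run text extraction,
            # using Textract DetectDocumentText when it does.
            "DocumentReaderConfig": {
                "DocumentReadAction": "TEXTRACT_DETECT_DOCUMENT_TEXT",
                "DocumentReadMode": "SERVICE_DEFAULT",
            },
        },
        "OutputDataConfig": {"S3Uri": output_s3_uri},
    }
```

With boto3 installed and credentials configured, the resulting dictionary could be submitted with `boto3.client("comprehend").start_entities_detection_job(**request)`.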

Custom NER on Textract JSON outputs - Starting today, customers can use their Textract DetectDocumentText or AnalyzeDocument JSON outputs as an input during Comprehend custom NER inference. By leveraging an existing Textract output, customers can further simplify their document processing workflows (saving time and money), and extend their workflows to extract custom entities from a broader set of documents. See documentation for more details.
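
To give a sense of what a Textract JSON output carries, the sketch below reads the text and bounding box of each `LINE` block from a Textract-style response. The `sample` document and the helper name `lines_with_boxes` are made up for illustration. The snippet does not call Comprehend, and it makes no claim about how Comprehend consumes the JSON internally; it only shows the kind of text-plus-coordinates information such an output contains.

```python
from typing import Any, Dict, List, Tuple


def lines_with_boxes(
    textract_output: Dict[str, Any],
) -> List[Tuple[str, Dict[str, float]]]:
    """Return (text, bounding_box) pairs for every LINE block.

    Textract responses contain a "Blocks" list; each block has a
    "BlockType" and, for text blocks, "Text" and "Geometry" fields.
    """
    results = []
    for block in textract_output.get("Blocks", []):
        if block.get("BlockType") == "LINE":
            box = block.get("Geometry", {}).get("BoundingBox", {})
            results.append((block.get("Text", ""), box))
    return results


# Hypothetical, heavily trimmed Textract-style output for illustration only.
sample = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "page-1"},
        {
            "BlockType": "LINE",
            "Id": "line-1",
            "Text": "Invoice Number: 12345",
            "Geometry": {
                "BoundingBox": {
                    "Left": 0.1, "Top": 0.05, "Width": 0.4, "Height": 0.03,
                }
            },
        },
        {"BlockType": "WORD", "Id": "word-1", "Text": "Invoice"},
    ]
}

extracted = lines_with_boxes(sample)
```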

To learn more and get started, visit the Amazon Comprehend product page.