Posted On: Sep 15, 2021
Amazon Comprehend, a natural-language processing (NLP) service that uses machine learning to uncover information in text, now allows you to extract custom entities from documents in a variety of formats (PDF, Word, plain text) and layouts (e.g., bullets, lists). This enables you to more easily extract insights and further automate your document processing workflows.
Prior to this announcement, you could use Amazon Comprehend only on plain text documents, which required you to flatten documents into machine-readable text, often reducing the quality of the context within the document. This new feature combines NLP with optical character recognition (OCR) to extract custom entities from your PDF, Word, and plain text documents using the same API, with no preprocessing required.
The new custom entity recognition feature utilizes the structural context of text (text placement within a page) combined with natural language context to extract custom entities from dense text, numbered lists, and bullets. This combination also allows customers to extract discontiguous or disconnected entities that aren’t part of the same span of text (for example, entities nested within a table). This new feature also removes the need for customers to build custom logic to convert PDF and Word files to flattened, plain text before using Comprehend. By natively supporting new document formats, Comprehend offers key benefits to customers in industries such as mortgage, finance, and insurance, where companies process diverse document formats and layouts. For example, mortgage companies can now process applications faster by extracting an applicant’s bank information, address, and co-signer name from documents such as scanned PDFs of bank statements, pay stubs, and employment verification letters.
To train a custom entity recognition model that can be used on your PDF, Word, and plain text documents, you first need to annotate PDF documents using a custom Amazon SageMaker Ground Truth annotation template provided by Amazon Comprehend. The custom entity recognition model leverages both the natural language and the positional information (e.g., coordinates) of the text to accurately extract custom entities that may previously have been lost when a document was flattened. For step-by-step details on how to annotate your documents, see Custom document annotation for extracting named entities in documents using Amazon Comprehend. Once you’ve finished annotating, you can train a custom entity recognition model and use it to extract custom entities from PDF and Word documents in batch (asynchronous) processing. To extract text and the spatial locations of that text from scanned PDF documents, Amazon Comprehend calls Amazon Textract on your behalf as a step before custom entity recognition. For details on how to train and use your model, see Extract custom entities from documents in their native format with Amazon Comprehend.
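As a rough illustration of this flow, the following boto3 sketch first creates a recognizer from a Ground Truth augmented manifest produced by the PDF annotation template, then starts an asynchronous entities detection job directly on native documents. The bucket names, IAM role ARN, entity types, and labeling-job attribute name are placeholders, and the exact InputDataConfig fields should be confirmed against the Amazon Comprehend API reference.

```python
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

# 1) Train a custom entity recognizer from a Ground Truth augmented manifest
#    (bucket names, role ARN, entity types, and attribute names are placeholders).
training = comprehend.create_entity_recognizer(
    RecognizerName="loan-docs-recognizer",
    LanguageCode="en",
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendDataAccessRole",
    InputDataConfig={
        "DataFormat": "AUGMENTED_MANIFEST",
        "EntityTypes": [
            {"Type": "BANK_NAME"},
            {"Type": "CO_SIGNER_NAME"},
        ],
        "AugmentedManifests": [
            {
                "S3Uri": "s3://my-bucket/annotations/output.manifest",
                "AttributeNames": ["my-labeling-job"],
                "DocumentType": "SEMI_STRUCTURED_DOCUMENT",
                "AnnotationDataS3Uri": "s3://my-bucket/annotations/",
                "SourceDocumentsS3Uri": "s3://my-bucket/source-pdfs/",
            }
        ],
    },
)
recognizer_arn = training["EntityRecognizerArn"]

# 2) Once the recognizer reaches the TRAINED status (in practice, poll
#    describe_entity_recognizer), start an asynchronous detection job on the
#    native PDF/Word files; Comprehend calls Amazon Textract behind the scenes
#    to extract the text and its positions from scanned PDFs.
job = comprehend.start_entities_detection_job(
    JobName="loan-docs-batch",
    LanguageCode="en",
    EntityRecognizerArn=recognizer_arn,
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendDataAccessRole",
    InputDataConfig={
        "S3Uri": "s3://my-bucket/incoming-documents/",
        "InputFormat": "ONE_DOC_PER_FILE",
    },
    OutputDataConfig={"S3Uri": "s3://my-bucket/comprehend-output/"},
)
print(job["JobId"])
```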
Custom entity recognition support for plain text, PDF, and Word documents is available directly via the AWS console and AWS CLI. To view a list of the supported AWS regions for both Comprehend and Textract, please visit the AWS Region Table for all AWS global infrastructure.
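For example, the same asynchronous detection job could be started from the AWS CLI along these lines; the ARNs and S3 paths below are placeholders.

```bash
aws comprehend start-entities-detection-job \
    --job-name loan-docs-batch \
    --language-code en \
    --entity-recognizer-arn arn:aws:comprehend:us-east-1:123456789012:entity-recognizer/loan-docs-recognizer \
    --data-access-role-arn arn:aws:iam::123456789012:role/ComprehendDataAccessRole \
    --input-data-config S3Uri=s3://my-bucket/incoming-documents/,InputFormat=ONE_DOC_PER_FILE \
    --output-data-config S3Uri=s3://my-bucket/comprehend-output/
```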
To learn more and get started, visit the Amazon Comprehend product page, the intelligent document processing page, or our documentation.