Amazon Textract

Amazon Textract is a machine learning (ML) service that uses OCR to automatically extract text, handwriting, and data from scanned documents such as PDFs.

Key Features

Optical Character Recognition (OCR)

Amazon Textract uses Optical Character Recognition (OCR) technology to automatically detect printed text, handwriting, and numbers in a scan or rendering of a document, such as a legal document or a scan of a book. 
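
As a rough illustration, the sketch below collects the detected lines of text from a response shaped like the output of the DetectDocumentText operation. The sample response is hand-written for this example rather than real service output, and extract_lines is a hypothetical helper name.

```python
# Hand-written sample in the shape of a DetectDocumentText response (assumed data).
sample_ocr_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1"},
        {"BlockType": "LINE", "Id": "l1", "Text": "Chapter One", "Confidence": 99.1},
        {"BlockType": "WORD", "Id": "w1", "Text": "Chapter", "Confidence": 99.3},
        {"BlockType": "WORD", "Id": "w2", "Text": "One", "Confidence": 98.9},
        {"BlockType": "LINE", "Id": "l2", "Text": "It was a bright day.", "Confidence": 97.4},
    ]
}


def extract_lines(response):
    """Return the text of every LINE block, in the order the blocks appear."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


print(extract_lines(sample_ocr_response))
```

In practice the response would come from the service itself, for example through the detect_document_text method of the boto3 Textract client. That call is left out here because it needs AWS credentials and a real document.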

Analyze Lending

The Analyze Lending API is a managed, preconfigured intelligent document processing API designed to extract information from loan packages.

Form Extraction

Amazon Textract enables you to detect key-value pairs in document images automatically so that you can retain the inherent context of the document without any manual intervention. A key-value pair is a set of linked data items. For instance, on a document the field “First Name” would be the key and “Jane” would be the value. This makes it easy to import the extracted data into a database or to pass it as a variable to an application.
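
The following sketch shows one way the “First Name” / “Jane” example might be read out of a form analysis response. It assumes the block layout in which a KEY_VALUE_SET block tagged KEY points to its VALUE block, and both point to their WORD children. The sample blocks are hand-written, and the helper names are hypothetical.

```python
# Hand-written sample blocks in the shape of a forms analysis response (assumed data).
sample_form_blocks = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "First"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Name"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Jane"},
]


def related_ids(block, rel_type):
    """Collect the Ids listed under every relationship of the given type."""
    ids = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == rel_type:
            ids.extend(rel["Ids"])
    return ids


def block_text(block, by_id):
    """Join the text of a block's WORD children."""
    children = [by_id[i] for i in related_ids(block, "CHILD")]
    return " ".join(c["Text"] for c in children if c["BlockType"] == "WORD")


def key_value_pairs(blocks):
    """Map each detected key's text to the text of its linked value."""
    by_id = {b["Id"]: b for b in blocks}
    pairs = {}
    for block in blocks:
        if block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", []):
            values = [block_text(by_id[i], by_id) for i in related_ids(block, "VALUE")]
            pairs[block_text(block, by_id)] = " ".join(values)
    return pairs


print(key_value_pairs(sample_form_blocks))
```

The resulting dictionary can be written to a database row or passed straight to application code.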

Table Extraction

Amazon Textract preserves the composition of data stored in tables during extraction. You can use this feature to automatically load the extracted data into a database using a pre-defined schema. 
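
As a sketch of how the table structure might be recovered, the code below rebuilds a row-and-column grid from CELL blocks, using their 1-based RowIndex and ColumnIndex values. The sample blocks are hand-written and the function names are hypothetical. Merged cells and multi-page tables are ignored here.

```python
# Hand-written sample blocks in the shape of a table analysis response (assumed data).
sample_table_blocks = [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
    {"Id": "c3", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "c4", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w4"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Item"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Price"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Pen"},
    {"Id": "w4", "BlockType": "WORD", "Text": "1.50"},
]


def child_blocks(block, by_id):
    """Return the blocks referenced by a block's CHILD relationships."""
    return [by_id[i]
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for i in rel["Ids"]]


def table_to_rows(table, by_id):
    """Rebuild a table as a list of rows, each a list of cell strings."""
    cells = [b for b in child_blocks(table, by_id) if b["BlockType"] == "CELL"]
    if not cells:
        return []
    n_rows = max(c["RowIndex"] for c in cells)
    n_cols = max(c["ColumnIndex"] for c in cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        words = [w["Text"] for w in child_blocks(cell, by_id) if w["BlockType"] == "WORD"]
        # RowIndex and ColumnIndex start at 1, so shift to 0-based list positions.
        grid[cell["RowIndex"] - 1][cell["ColumnIndex"] - 1] = " ".join(words)
    return grid


blocks_by_id = {b["Id"]: b for b in sample_table_blocks}
print(table_to_rows(blocks_by_id["t1"], blocks_by_id))
```

Each row in the returned grid lines up with the columns of a pre-defined schema, so it can be inserted into a database table with little extra work.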

Signature Detection

Amazon Textract provides the ability to detect signatures on any document or image, such as checks, loan application forms, and claims forms.  

Query-Based Extraction

Amazon Textract gives you the flexibility to specify the data you need to extract from documents using queries. You can specify the information you need in the form of natural language questions (e.g., “What is the customer name?”) and receive the exact information (e.g., “John Doe”) as part of the API response.
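
The sketch below illustrates both halves of a query: building the request parameters for a document analysis call with the QUERIES feature, and reading the answers back. It assumes that each QUERY block links to its QUERY_RESULT blocks through an ANSWER relationship. The sample blocks are hand-written, and both function names are hypothetical.

```python
def build_query_request(document_bytes, questions):
    """Assemble keyword arguments for a document analysis call that asks questions."""
    return {
        "Document": {"Bytes": document_bytes},
        "FeatureTypes": ["QUERIES"],
        "QueriesConfig": {"Queries": [{"Text": q} for q in questions]},
    }


def query_answers(blocks):
    """Map each question's text to the list of answer strings linked to it."""
    by_id = {b["Id"]: b for b in blocks}
    answers = {}
    for block in blocks:
        if block["BlockType"] != "QUERY":
            continue
        result_ids = [i
                      for rel in block.get("Relationships", [])
                      if rel["Type"] == "ANSWER"
                      for i in rel["Ids"]]
        answers[block["Query"]["Text"]] = [by_id[i]["Text"] for i in result_ids]
    return answers


# Hand-written sample blocks in the shape of a query response (assumed data).
sample_query_blocks = [
    {"Id": "q1", "BlockType": "QUERY",
     "Query": {"Text": "What is the customer name?"},
     "Relationships": [{"Type": "ANSWER", "Ids": ["r1"]}]},
    {"Id": "r1", "BlockType": "QUERY_RESULT", "Text": "John Doe", "Confidence": 98.0},
]

request = build_query_request(b"fake-bytes", ["What is the customer name?"])
print(request["QueriesConfig"])
print(query_answers(sample_query_blocks))
```

A question with no answer in the document simply maps to an empty list in this sketch.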

Handwriting Recognition

Many documents, such as medical intake forms or employment applications, contain both handwritten and printed text. Amazon Textract can extract printed text and handwriting from documents written in English with high confidence scores, whether it is free-form text or text embedded in tables and forms, including documents that mix typed and handwritten text.

Invoices and Receipts

Invoices and receipts come in many layouts, which makes manual data extraction at scale difficult and time consuming. Amazon Textract uses ML to understand the context of invoices and receipts and automatically extracts data such as vendor name, contact information, invoice number, items purchased, item prices, total amount, and payment terms to suit your business needs. It works on almost any invoice or receipt without the need for templates or configuration.
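
To make the shape of the extracted data more concrete, the sketch below flattens one expense document into a summary dictionary and a list of line items. It assumes the response layout in which summary fields and line item fields each carry a Type label and a ValueDetection text. The sample document is hand-written and the helper names are hypothetical.

```python
# Hand-written sample in the shape of one expense document (assumed data).
sample_expense_document = {
    "SummaryFields": [
        {"Type": {"Text": "VENDOR_NAME"}, "ValueDetection": {"Text": "Example Supplies"}},
        {"Type": {"Text": "INVOICE_RECEIPT_ID"}, "ValueDetection": {"Text": "INV-001"}},
        {"Type": {"Text": "TOTAL"}, "ValueDetection": {"Text": "$12.50"}},
    ],
    "LineItemGroups": [
        {"LineItems": [
            {"LineItemExpenseFields": [
                {"Type": {"Text": "ITEM"}, "ValueDetection": {"Text": "Pens"}},
                {"Type": {"Text": "PRICE"}, "ValueDetection": {"Text": "$12.50"}},
            ]},
        ]},
    ],
}


def field_map(fields):
    """Turn a list of expense fields into a {label: value text} dictionary."""
    return {f["Type"]["Text"]: f["ValueDetection"]["Text"] for f in fields}


def summarize_expense(document):
    """Return (summary fields, list of line items) for one expense document."""
    summary = field_map(document.get("SummaryFields", []))
    line_items = [
        field_map(item.get("LineItemExpenseFields", []))
        for group in document.get("LineItemGroups", [])
        for item in group.get("LineItems", [])
    ]
    return summary, line_items


summary, line_items = summarize_expense(sample_expense_document)
print(summary)
print(line_items)
```

If a label appears more than once in the same list, this sketch keeps only the last value, so a production version would likely need to handle duplicates explicitly.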

Identity Documents

Amazon Textract uses machine learning (ML) to understand the context of identity documents such as U.S. passports and driver’s licenses without the need for templates or configuration. You can automatically extract specific information such as date of expiry and date of birth, as well as intelligently identify and extract implied information such as name and address.  
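
A minimal sketch of reading identity fields follows. It assumes a response in which each identity document lists fields with a Type label and a ValueDetection text, and that fields not found on the document come back with empty text. The sample response is hand-written, and identity_fields is a hypothetical helper.

```python
# Hand-written sample in the shape of an identity document response (assumed data).
sample_id_response = {
    "IdentityDocuments": [
        {"DocumentIndex": 1,
         "IdentityDocumentFields": [
             {"Type": {"Text": "FIRST_NAME"},
              "ValueDetection": {"Text": "JANE", "Confidence": 99.0}},
             {"Type": {"Text": "DATE_OF_BIRTH"},
              "ValueDetection": {"Text": "01/15/1990", "Confidence": 98.2}},
             {"Type": {"Text": "EXPIRATION_DATE"},
              "ValueDetection": {"Text": "", "Confidence": 40.0}},
         ]},
    ]
}


def identity_fields(response):
    """Return one {field label: value} dictionary per identity document."""
    documents = []
    for doc in response.get("IdentityDocuments", []):
        fields = {}
        for field in doc.get("IdentityDocumentFields", []):
            value = field["ValueDetection"]["Text"]
            if value:  # skip fields that came back empty
                fields[field["Type"]["Text"]] = value
        documents.append(fields)
    return documents


print(identity_fields(sample_id_response))
```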

Bounding Boxes

All extracted data is returned with bounding box coordinates. A bounding box is a polygon frame that encompasses a piece of identified data, such as a single word, a line, a table, or even an individual cell within a table.
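
The sketch below shows how the returned geometry might be mapped onto an image. It assumes the coordinates are expressed as ratios of the page width and height (values between 0 and 1), so they have to be scaled by the image size to get pixels. The sample geometry and the function names are made up for illustration.

```python
# Hand-written sample geometry for one detected word (assumed data).
sample_geometry = {
    "BoundingBox": {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.125},
    "Polygon": [
        {"X": 0.25, "Y": 0.5},
        {"X": 0.75, "Y": 0.5},
        {"X": 0.75, "Y": 0.625},
        {"X": 0.25, "Y": 0.625},
    ],
}


def bbox_to_pixels(bounding_box, image_width, image_height):
    """Convert a ratio-based box to (left, top, right, bottom) in pixels."""
    left = round(bounding_box["Left"] * image_width)
    top = round(bounding_box["Top"] * image_height)
    right = round((bounding_box["Left"] + bounding_box["Width"]) * image_width)
    bottom = round((bounding_box["Top"] + bounding_box["Height"]) * image_height)
    return left, top, right, bottom


def polygon_to_pixels(polygon, image_width, image_height):
    """Convert ratio-based polygon points to (x, y) pixel tuples."""
    return [(round(p["X"] * image_width), round(p["Y"] * image_height)) for p in polygon]


print(bbox_to_pixels(sample_geometry["BoundingBox"], 800, 400))
print(polygon_to_pixels(sample_geometry["Polygon"], 800, 400))
```

The pixel tuple from bbox_to_pixels is in the left, top, right, bottom order that many imaging libraries accept for cropping or drawing rectangles.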

Adjustable Confidence Thresholds

When information is extracted from documents, Amazon Textract returns a confidence score for everything it identifies so that you can make informed decisions about how you want to use the results. 
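
One simple way to act on these scores is to split results at a threshold, as sketched below. The example assumes confidence values on a 0 to 100 scale. The threshold of 90 is an arbitrary choice for illustration, and the sample words and function name are hypothetical.

```python
# Hand-written sample words with confidence scores (assumed data).
sample_words = [
    {"BlockType": "WORD", "Text": "Total", "Confidence": 99.2},
    {"BlockType": "WORD", "Text": "$l2.40", "Confidence": 62.5},
    {"BlockType": "WORD", "Text": "due", "Confidence": 88.0},
]


def split_by_confidence(blocks, threshold):
    """Separate blocks into those at or above the threshold and those below it."""
    accepted, needs_review = [], []
    for block in blocks:
        # Blocks without a score are treated as 0 and routed to review.
        if block.get("Confidence", 0.0) >= threshold:
            accepted.append(block)
        else:
            needs_review.append(block)
    return accepted, needs_review


accepted, needs_review = split_by_confidence(sample_words, threshold=90.0)
print([b["Text"] for b in accepted])
print([b["Text"] for b in needs_review])
```

The right threshold depends on the application: a higher value sends more items to review, while a lower value accepts more results automatically.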

Built-in Human Review Workflow

Amazon Textract is directly integrated with Amazon Augmented AI (Amazon A2I), so you can easily implement human review of printed text and handwriting extracted from documents. Many text extraction applications require humans to review low-confidence predictions to ensure the results are correct. However, building human review systems can be time consuming and expensive because it involves implementing complex processes or “workflows”, writing custom software to manage review tasks and results, and in many cases, managing large groups of reviewers.

Amazon A2I provides built-in human review workflows for text extraction from documents, which allows predictions from Amazon Textract to be reviewed easily. You can choose a confidence threshold for your application, and all predictions with a confidence below the threshold are automatically sent to human reviewers for validation. You can also specify which key-value pairs should be sent for human review, and you can configure Amazon A2I to send randomly selected documents for human review.

With Amazon A2I, you can use a pool of reviewers within your own organization, or you can access the workforce of over 500,000 independent contractors who are already performing machine learning tasks through Amazon Mechanical Turk. You can also use workforce vendors that are pre-screened by AWS for quality and adherence to security procedures.

Additional Information

For additional information about service controls, security features and functionalities, including, as applicable, information about storing, retrieving, modifying, restricting, and deleting data, please see https://docs.aws.amazon.com/index.html. This additional information does not form part of the Documentation for purposes of the AWS Customer Agreement available at http://aws.amazon.com/agreement, or other agreement between you and AWS governing your use of AWS’s services.