Optical Character Recognition (OCR)
Amazon Textract uses Optical Character Recognition (OCR) technology to automatically detect printed text and numbers in a scan or rendering of a document, such as a legal document or a scan of a book.
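As a rough illustration of working with detected text, the sketch below pulls the line-level text out of a response. It assumes a response dictionary shaped like the one returned by the text-detection API (a `Blocks` list whose entries carry `BlockType` and `Text`); the `extract_lines` helper and the sample data are hypothetical and written by hand, not real service output.

```python
def extract_lines(response):
    """Return the text of every LINE block, in the order the blocks appear."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


# Hand-written sample shaped like a text-detection response (illustrative only).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1"},
        {"BlockType": "LINE", "Id": "l1", "Text": "Invoice 1042", "Confidence": 99.1},
        {"BlockType": "WORD", "Id": "w1", "Text": "Invoice", "Confidence": 99.4},
        {"BlockType": "WORD", "Id": "w2", "Text": "1042", "Confidence": 98.8},
    ]
}

lines = extract_lines(sample_response)
print(lines)
```

In a real application the dictionary would come from a call such as boto3's `detect_document_text`, which needs AWS credentials and is therefore left out of this sketch.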
Amazon Textract enables you to detect key-value pairs in document images automatically so that you can retain the inherent context of the document without any manual intervention. A key-value pair is a set of linked data items. For instance, on a form the field “First Name” would be the key and “Jane” would be the value. This makes it easy to import the extracted data into a database or to pass it to an application as a variable. With traditional OCR solutions, keys and values are extracted as simple text, and the relationship between them is lost unless hard-coded rules are written and maintained for each form.
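One way to turn the linked key and value items into a plain dictionary is sketched below. It assumes the form-analysis response layout in which `KEY_VALUE_SET` blocks are tagged `KEY` or `VALUE` in `EntityTypes`, a key points to its value through a `VALUE` relationship, and both point to their words through `CHILD` relationships. The helper names and the sample blocks are hypothetical.

```python
def block_text(block, blocks_by_id):
    """Join the text of the WORD blocks that are children of the given block."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def extract_key_values(response):
    """Map each detected key's text to the text of its linked value."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    pairs = {}
    for block in response["Blocks"]:
        is_key = (
            block["BlockType"] == "KEY_VALUE_SET"
            and "KEY" in block.get("EntityTypes", [])
        )
        if not is_key:
            continue
        value_parts = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                for value_id in rel["Ids"]:
                    value_parts.append(block_text(blocks_by_id[value_id], blocks_by_id))
        pairs[block_text(block, blocks_by_id)] = " ".join(value_parts)
    return pairs


# Hand-written sample for the "First Name" / "Jane" example (illustrative only).
sample_response = {
    "Blocks": [
        {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                           {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
        {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
         "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "First"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Name"},
        {"Id": "w3", "BlockType": "WORD", "Text": "Jane"},
    ]
}

form_fields = extract_key_values(sample_response)
print(form_fields)
```

The resulting dictionary can be inserted into a database row or passed to application code directly, which is the step that hard-coded per-form rules would otherwise have to cover.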
Amazon Textract preserves the composition of data stored in tables during extraction. This is helpful for documents that are largely composed of structured data, such as financial reports or medical records that have column names in the top row of the table followed by rows of individual entries. You can use this feature to automatically load the extracted data into a database using a predefined schema. For example, rows of item numbers and quantities in an inventory report retain their association, which makes it easy to increment item totals in an inventory management application.
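The sketch below shows one possible way to rebuild table rows as records keyed by the column names in the top row. It assumes the table-analysis response layout in which a `TABLE` block lists its `CELL` blocks as children and each cell carries 1-based `RowIndex` and `ColumnIndex` values. The helper names and the inventory sample are hypothetical, and merged cells are not handled.

```python
def cell_text(cell, blocks_by_id):
    """Join the text of the WORD blocks that are children of a cell."""
    words = []
    for rel in cell.get("Relationships", []):
        if rel["Type"] == "CHILD":
            words.extend(
                blocks_by_id[i]["Text"]
                for i in rel["Ids"]
                if blocks_by_id[i]["BlockType"] == "WORD"
            )
    return " ".join(words)


def table_to_records(table, blocks_by_id):
    """Turn a TABLE block into a list of dicts, using row 1 as column names."""
    grid = {}
    for rel in table.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for cell_id in rel["Ids"]:
            cell = blocks_by_id[cell_id]
            if cell["BlockType"] == "CELL":
                position = (cell["RowIndex"], cell["ColumnIndex"])
                grid[position] = cell_text(cell, blocks_by_id)
    if not grid:
        return []
    n_rows = max(row for row, _ in grid)
    n_cols = max(col for _, col in grid)
    header = [grid.get((1, col), "") for col in range(1, n_cols + 1)]
    return [
        {header[col - 1]: grid.get((row, col), "") for col in range(1, n_cols + 1)}
        for row in range(2, n_rows + 1)
    ]


# Hand-written sample: a two-column inventory table (illustrative only).
blocks = [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c11", "c12", "c21", "c22"]}]},
    {"Id": "c11", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "c12", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
    {"Id": "c21", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "c22", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w4"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Item"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Qty"},
    {"Id": "w3", "BlockType": "WORD", "Text": "A-100"},
    {"Id": "w4", "BlockType": "WORD", "Text": "3"},
]
blocks_by_id = {b["Id"]: b for b in blocks}

records = table_to_records(blocks_by_id["t1"], blocks_by_id)
print(records)
```

Each record keeps the item number and quantity together, so it maps onto a predefined schema without per-document parsing rules.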
All extracted data is returned with bounding box coordinates, which describe a polygon frame that encompasses each piece of identified data, such as a single word, a line, a table, or even an individual cell within a table. This is helpful for auditing where a word or number came from in the source document, or for guiding the user in document search systems that return scans of original documents as the search result. For example, when searching medical records for patient history details, users can see where in the source document a result appears and note that location for future searches.
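To highlight a result on the original scan, the returned coordinates usually need converting to pixels. The sketch below assumes the `Geometry.BoundingBox` layout in which `Left`, `Top`, `Width`, and `Height` are expressed as ratios of the page width and height; the `to_pixel_box` helper, the sample block, and the image size are hypothetical.

```python
def to_pixel_box(bounding_box, image_width, image_height):
    """Convert a ratio-based bounding box to (left, top, right, bottom) pixels."""
    left = bounding_box["Left"] * image_width
    top = bounding_box["Top"] * image_height
    right = (bounding_box["Left"] + bounding_box["Width"]) * image_width
    bottom = (bounding_box["Top"] + bounding_box["Height"]) * image_height
    return round(left), round(top), round(right), round(bottom)


# Hand-written sample block (illustrative only).
word_block = {
    "BlockType": "WORD",
    "Text": "Penicillin",
    "Geometry": {
        "BoundingBox": {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.125}
    },
}

# Assumed size of the scanned page image, in pixels.
pixel_box = to_pixel_box(word_block["Geometry"]["BoundingBox"], 800, 1000)
print(pixel_box)
```

The resulting tuple can be handed to an image library to draw a highlight rectangle over the matching region of the scan.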
Adjustable Confidence Thresholds
When information is extracted from documents, Amazon Textract returns a confidence score for everything it identifies so that you can make informed decisions about how you want to use the results. For instance, if you are extracting information from tax documents and want to ensure high accuracy, you can create business logic that flags any extracted information with a confidence score lower than 95% for human review. However, you may choose a lower threshold for other types of documents where an error has little to no negative consequence, such as processing resumes or digitizing archived documents.
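The review rule described above can be sketched as a small filter. It assumes each block carries a `Confidence` value on a 0 to 100 scale; the `split_by_confidence` helper, the default threshold, and the sample blocks are hypothetical.

```python
def split_by_confidence(blocks, threshold=95.0):
    """Split blocks into (accepted, needs_review) around a confidence threshold."""
    accepted, needs_review = [], []
    for block in blocks:
        if "Confidence" not in block:
            continue  # Skip blocks that report no confidence score.
        if block["Confidence"] >= threshold:
            accepted.append(block)
        else:
            needs_review.append(block)
    return accepted, needs_review


# Hand-written sample blocks from a tax form (illustrative only).
blocks = [
    {"BlockType": "WORD", "Text": "48,250.00", "Confidence": 99.2},
    {"BlockType": "WORD", "Text": "1O40", "Confidence": 87.5},
]

accepted, needs_review = split_by_confidence(blocks, threshold=95.0)
print([b["Text"] for b in needs_review])
```

For lower-stakes documents such as resumes, the same function can be called with a smaller `threshold` so that fewer items are routed to a reviewer.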