Optical character recognition
Amazon Textract uses optical character recognition (OCR) to automatically detect printed text, handwriting, and numbers in a scan or rendering of a document, such as a legal document or a scan of a book.
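As a rough illustration of working with OCR output, the sketch below pulls the detected lines out of a response. It assumes the response is a dictionary with a `Blocks` list whose entries carry `BlockType` and `Text`; the sample data is hand-written for illustration and is not real service output.

```python
def extract_lines(response):
    """Return the text of every LINE block, in the order the blocks appear."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


# Hand-written stand-in for a text detection response (illustrative only).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1"},
        {"BlockType": "LINE", "Id": "l1", "Text": "Lease Agreement", "Confidence": 99.1},
        {"BlockType": "WORD", "Id": "w1", "Text": "Lease", "Confidence": 99.3},
        {"BlockType": "WORD", "Id": "w2", "Text": "Agreement", "Confidence": 98.9},
    ]
}

lines = extract_lines(sample_response)
print(lines)
```

With the AWS SDK for Python, a response of this shape would typically come from `boto3.client("textract").detect_document_text(Document={"Bytes": image_bytes})`, where `image_bytes` is a placeholder for your own document.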
Form extraction
You can detect key-value pairs in document images automatically and retain the context without manual intervention. A key-value pair is a set of linked data items. For instance, in a document, the field “First Name” is the key and “Jane” is the value. This makes it easy to import the extracted data into a database or provide it as a variable in an application. With traditional OCR solutions, keys and values are extracted as simple text, and their relationship is lost unless hard-coded rules are written and maintained for each form.
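The link between a key and its value can be followed in code. The sketch below assumes a form analysis response in which `KEY_VALUE_SET` blocks are tagged `KEY` or `VALUE`, keys point to values through a `VALUE` relationship, and both point to their words through `CHILD` relationships. The helper names and the sample data are illustrative, not part of the service.

```python
def child_text(block, blocks_by_id):
    """Join the text of the WORD blocks a block points to via CHILD relationships."""
    words = []
    for relationship in block.get("Relationships", []):
        if relationship["Type"] != "CHILD":
            continue
        for child_id in relationship["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                words.append(child["Text"])
    return " ".join(words)


def extract_key_values(response):
    """Map each detected key to its value text by following VALUE relationships."""
    blocks_by_id = {block["Id"]: block for block in response["Blocks"]}
    pairs = {}
    for block in response["Blocks"]:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        value_ids = [
            value_id
            for relationship in block.get("Relationships", [])
            if relationship["Type"] == "VALUE"
            for value_id in relationship["Ids"]
        ]
        value = " ".join(child_text(blocks_by_id[v], blocks_by_id) for v in value_ids)
        pairs[child_text(block, blocks_by_id)] = value
    return pairs


# Hand-written sample mirroring the "First Name" / "Jane" example (illustrative only).
sample_response = {
    "Blocks": [
        {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                           {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
        {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
         "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "First"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Name"},
        {"Id": "w3", "BlockType": "WORD", "Text": "Jane"},
    ]
}

pairs = extract_key_values(sample_response)
print(pairs)
```

Because the result is a plain dictionary, it can be written to a database row or passed to application code without per-form parsing rules.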
Table extraction
Amazon Textract preserves the composition of data stored in tables during extraction. This is helpful for documents that are largely composed of structured data, such as financial reports or medical records with tables in columns and rows. You can automatically load the extracted data into a database using a predefined schema. For example, rows of item numbers and quantities in an inventory report will retain their association so an inventory management application can easily increment item totals.
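To show how row and column associations survive extraction, the sketch below rebuilds a table as a list of rows. It assumes a table analysis response in which a `TABLE` block lists its `CELL` blocks as children and each cell carries 1-based `RowIndex` and `ColumnIndex` values. The function name and the inventory sample are illustrative.

```python
def table_to_rows(table_block, blocks_by_id):
    """Rebuild a table as a list of rows using each cell's RowIndex and ColumnIndex."""
    cells = {}
    for relationship in table_block.get("Relationships", []):
        if relationship["Type"] != "CHILD":
            continue
        for cell_id in relationship["Ids"]:
            cell = blocks_by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [
                blocks_by_id[word_id]["Text"]
                for rel in cell.get("Relationships", [])
                if rel["Type"] == "CHILD"
                for word_id in rel["Ids"]
                if blocks_by_id[word_id]["BlockType"] == "WORD"
            ]
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
    if not cells:
        return []
    row_count = max(row for row, _ in cells)
    column_count = max(column for _, column in cells)
    # Indices are 1-based; missing cells become empty strings.
    return [
        [cells.get((row, column), "") for column in range(1, column_count + 1)]
        for row in range(1, row_count + 1)
    ]


def cell(cell_id, row, column, word_id):
    """Build a sample CELL block (helper for the illustrative data only)."""
    return {"Id": cell_id, "BlockType": "CELL", "RowIndex": row, "ColumnIndex": column,
            "Relationships": [{"Type": "CHILD", "Ids": [word_id]}]}


# Hand-written sample of a two-row inventory table (illustrative only).
blocks = [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    cell("c1", 1, 1, "w1"), cell("c2", 1, 2, "w2"),
    cell("c3", 2, 1, "w3"), cell("c4", 2, 2, "w4"),
    {"Id": "w1", "BlockType": "WORD", "Text": "Item"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Quantity"},
    {"Id": "w3", "BlockType": "WORD", "Text": "A-100"},
    {"Id": "w4", "BlockType": "WORD", "Text": "3"},
]
blocks_by_id = {block["Id"]: block for block in blocks}

rows = table_to_rows(blocks_by_id["t1"], blocks_by_id)
# Skip the header row and key quantities by item number.
quantities = {row[0]: int(row[1]) for row in rows[1:]}
print(rows, quantities)
```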
Query-based extraction
Amazon Textract gives you the flexibility to specify the data you need to extract from documents using queries. You can specify the information you need in the form of natural language questions (e.g., “What is the customer name?”) and receive the exact information (e.g., “John Doe”) as part of the API response. You do not need to know the data structure in the document (table, form, implied field, nested data) or worry about variations across document versions and formats. Textract Queries are pre-trained on a large variety of documents including paystubs, bank statements, W-2s, loan application forms, mortgage notes, claims documents, and insurance cards. The flexibility that Textract Queries provides reduces the need to implement post-processing, to rely on manual reviews of extracted data, or to train ML models.
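A minimal sketch of the query round trip follows. It assumes the request takes a `QueriesConfig` with a list of `Text`/`Alias` entries and that the response pairs each `QUERY` block with `QUERY_RESULT` blocks through an `ANSWER` relationship. The aliases, the second question, and the sample response are made up for illustration.

```python
def build_queries_config(questions):
    """Turn an {alias: question} mapping into a QueriesConfig request parameter."""
    return {
        "Queries": [
            {"Text": text, "Alias": alias} for alias, text in questions.items()
        ]
    }


def extract_answers(response):
    """Map each query alias to its answer text, or None if no answer was returned."""
    blocks_by_id = {block["Id"]: block for block in response["Blocks"]}
    answers = {}
    for block in response["Blocks"]:
        if block["BlockType"] != "QUERY":
            continue
        label = block["Query"].get("Alias") or block["Query"]["Text"]
        answers[label] = None
        for relationship in block.get("Relationships", []):
            if relationship["Type"] == "ANSWER":
                for answer_id in relationship["Ids"]:
                    answers[label] = blocks_by_id[answer_id]["Text"]
    return answers


config = build_queries_config({"CUSTOMER_NAME": "What is the customer name?"})

# Hand-written sample response (illustrative only); the second query has no answer.
sample_response = {
    "Blocks": [
        {"Id": "q1", "BlockType": "QUERY",
         "Query": {"Text": "What is the customer name?", "Alias": "CUSTOMER_NAME"},
         "Relationships": [{"Type": "ANSWER", "Ids": ["a1"]}]},
        {"Id": "a1", "BlockType": "QUERY_RESULT", "Text": "John Doe", "Confidence": 97.0},
        {"Id": "q2", "BlockType": "QUERY",
         "Query": {"Text": "What is the policy number?", "Alias": "POLICY_NUMBER"}},
    ]
}

answers = extract_answers(sample_response)
print(config, answers)
```

With the AWS SDK for Python, the config would typically be passed as `analyze_document(Document=..., FeatureTypes=["QUERIES"], QueriesConfig=config)`.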
Handwriting recognition
Many documents, such as medical intake forms and employment applications, include both handwritten and printed text. Amazon Textract can extract both from documents written in English with high confidence scores, whether the text is free-form or embedded in tables.
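When a document mixes the two, it can be useful to separate them after extraction. The sketch below assumes each `WORD` block carries a `TextType` of either `PRINTED` or `HANDWRITING`; the function name and the sample intake-form data are illustrative.

```python
def split_by_text_type(response):
    """Group detected words by whether they were printed or handwritten."""
    grouped = {"PRINTED": [], "HANDWRITING": []}
    for block in response.get("Blocks", []):
        if block.get("BlockType") == "WORD" and block.get("TextType") in grouped:
            grouped[block["TextType"]].append(block["Text"])
    return grouped


# Hand-written sample of an intake form field (illustrative only).
sample_response = {
    "Blocks": [
        {"BlockType": "WORD", "Text": "Allergies:", "TextType": "PRINTED"},
        {"BlockType": "WORD", "Text": "penicillin", "TextType": "HANDWRITING"},
        {"BlockType": "LINE", "Text": "Allergies: penicillin"},
    ]
}

grouped = split_by_text_type(sample_response)
print(grouped)
```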
Invoices and receipts
Invoices and receipts can have a wide variety of layouts, which makes it difficult and time-consuming to manually extract data at scale. Amazon Textract uses machine learning (ML) to understand the context of invoices and receipts and automatically extracts relevant data such as vendor name, invoice number, item prices, total amount, and payment terms.
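As a sketch of how the extracted invoice data might be consumed, the code below flattens the summary fields of each document into a dictionary. It assumes an expense analysis response with `ExpenseDocuments`, each holding `SummaryFields` that pair a normalized `Type` with a `ValueDetection`. The vendor and amounts in the sample are invented.

```python
def summarize_expense(response):
    """Return one {field type: value text} dictionary per expense document."""
    summaries = []
    for document in response.get("ExpenseDocuments", []):
        fields = {}
        for field in document.get("SummaryFields", []):
            field_type = field.get("Type", {}).get("Text")
            value = field.get("ValueDetection", {}).get("Text")
            if field_type and value is not None:
                # Later fields of the same type overwrite earlier ones in this sketch.
                fields[field_type] = value
        summaries.append(fields)
    return summaries


# Hand-written sample with invented values (illustrative only).
sample_response = {
    "ExpenseDocuments": [
        {"SummaryFields": [
            {"Type": {"Text": "VENDOR_NAME"}, "ValueDetection": {"Text": "Example Supply Co."}},
            {"Type": {"Text": "INVOICE_RECEIPT_ID"}, "ValueDetection": {"Text": "INV-0042"}},
            {"Type": {"Text": "TOTAL"}, "ValueDetection": {"Text": "$118.50"}},
            {"Type": {"Text": "PAYMENT_TERMS"}, "ValueDetection": {"Text": "Net 30"}},
        ]}
    ]
}

summaries = summarize_expense(sample_response)
print(summaries)
```

With the AWS SDK for Python, a response of this shape would typically come from `boto3.client("textract").analyze_expense(Document={"Bytes": image_bytes})`.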
Identity documents
Amazon Textract uses machine learning (ML) to understand the context of identity documents such as U.S. passports and driver’s licenses without the need for templates or configuration. You can automatically extract specific information such as date of expiry and date of birth, as well as intelligently identify and extract implied information such as name and address. Using Analyze ID, businesses providing ID verification services and those in finance, healthcare, and insurance can easily automate account creation, appointment scheduling, employment applications, and more by allowing customers to submit a picture or scan of their identity document.
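The sketch below shows one way to read the fields of an identity document. It assumes an Analyze ID response with `IdentityDocuments`, each holding `IdentityDocumentFields` that pair a `Type` with a `ValueDetection`, and that dates may include a `NormalizedValue`. The person and dates in the sample are invented.

```python
def extract_id_fields(response):
    """Return one {field type: value} dictionary per identity document.

    Prefers the normalized value (for example, an ISO-style date) when present.
    """
    documents = []
    for document in response.get("IdentityDocuments", []):
        fields = {}
        for field in document.get("IdentityDocumentFields", []):
            detection = field.get("ValueDetection", {})
            normalized = detection.get("NormalizedValue", {}).get("Value")
            fields[field["Type"]["Text"]] = normalized or detection.get("Text", "")
        documents.append(fields)
    return documents


# Hand-written sample with invented values (illustrative only).
sample_response = {
    "IdentityDocuments": [
        {"DocumentIndex": 1, "IdentityDocumentFields": [
            {"Type": {"Text": "FIRST_NAME"}, "ValueDetection": {"Text": "JANE"}},
            {"Type": {"Text": "DATE_OF_BIRTH"},
             "ValueDetection": {"Text": "01/02/1990",
                                "NormalizedValue": {"Value": "1990-01-02T00:00:00",
                                                    "ValueType": "Date"}}},
            {"Type": {"Text": "EXPIRATION_DATE"},
             "ValueDetection": {"Text": "01/02/2030",
                                "NormalizedValue": {"Value": "2030-01-02T00:00:00",
                                                    "ValueType": "Date"}}},
        ]}
    ]
}

id_documents = extract_id_fields(sample_response)
print(id_documents)
```

With the AWS SDK for Python, a response of this shape would typically come from `boto3.client("textract").analyze_id(DocumentPages=[{"Bytes": image_bytes}])`.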
Bounding boxes
All extracted data is returned with bounding box coordinates: polygon frames that encompass each piece of identified data, such as a word, a line, a table, or an individual cell within a table. This helps you audit where a word or number came from in the source document, and it guides you to the relevant text when a document search system returns scans of original documents. For example, when searching medical records for patient history details, you can easily find the source document and note it for future searches.
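To highlight a result on the original scan, the coordinates have to be mapped onto the image. The sketch below assumes the bounding box is expressed as `Left`, `Top`, `Width`, and `Height` ratios of the page dimensions; the function name and the image size are illustrative.

```python
def to_pixel_box(bounding_box, image_width, image_height):
    """Convert a ratio-based bounding box to (left, top, right, bottom) pixels."""
    left = bounding_box["Left"] * image_width
    top = bounding_box["Top"] * image_height
    right = left + bounding_box["Width"] * image_width
    bottom = top + bounding_box["Height"] * image_height
    return (round(left), round(top), round(right), round(bottom))


# Hand-written sample geometry for a single word (illustrative only).
word_block = {
    "BlockType": "WORD",
    "Text": "hypertension",
    "Geometry": {"BoundingBox": {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.125}},
}

pixel_box = to_pixel_box(word_block["Geometry"]["BoundingBox"], 1000, 800)
print(pixel_box)
```

The resulting tuple can be handed to an image library to draw a rectangle over the source scan.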
Adjustable confidence thresholds
When extracting information from documents, Amazon Textract returns a confidence score for everything it identifies so you can make informed decisions about how to use the results. For instance, if you extract information from tax records and want to ensure high accuracy, you can flag any item with a confidence score below 95% to be reviewed by a human. You can set a lower threshold for other documents where errors would have fewer negative consequences, such as when processing resumes or digitizing archived records.
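The thresholding described above could be applied like this. The sketch assumes each block carries a `Confidence` score on a 0 to 100 scale; the function name, the default threshold, and the sample tax-record values are illustrative.

```python
def flag_for_review(response, threshold=95.0, block_types=("WORD",)):
    """Return the blocks of the given types whose confidence is below the threshold."""
    return [
        block
        for block in response.get("Blocks", [])
        if block.get("BlockType") in block_types
        and block.get("Confidence", 0.0) < threshold
    ]


# Hand-written sample with invented scores (illustrative only).
sample_response = {
    "Blocks": [
        {"BlockType": "WORD", "Text": "Wages", "Confidence": 99.4},
        {"BlockType": "WORD", "Text": "48,2O0", "Confidence": 81.7},
        {"BlockType": "LINE", "Text": "Wages 48,2O0", "Confidence": 90.5},
    ]
}

# Strict threshold for tax records; a lower one for less sensitive documents.
strict = [block["Text"] for block in flag_for_review(sample_response, threshold=95.0)]
lenient = [block["Text"] for block in flag_for_review(sample_response, threshold=80.0)]
print(strict, lenient)
```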
Built-in human review workflow
Amazon Textract is directly integrated with Amazon Augmented AI (A2I) so you can easily implement human review of printed text and handwriting extracted from documents. Many text-extraction applications require humans to review low-confidence predictions to ensure the results are correct, but building human review systems can be time-consuming and expensive. Amazon A2I provides built-in human review workflows so you can review predictions easily. Choose a confidence threshold for your application, and all predictions with a confidence below the threshold are automatically sent to human reviewers for validation. You can also specify which key-value pairs should be sent for human review and configure A2I to send randomly selected documents for review as well. Use a pool of reviewers within your organization or access the workforce of over 500,000 independent contractors who are already performing ML tasks through Amazon Mechanical Turk. You can also use workforce vendors that are pre-screened by AWS for quality and adherence to security procedures. To learn more about implementing human review workflows, see the Amazon A2I website and Amazon A2I Integration with Amazon Textract in the developer guide.
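As a hedged sketch of wiring a document analysis request to a review workflow, the code below builds the `HumanLoopConfig` request parameter and checks whether a review was started. It assumes the confidence threshold itself is defined in the A2I flow definition rather than in this parameter. The loop name and the ARN are hypothetical placeholders, and the sample response is hand-written.

```python
def build_human_loop_config(loop_name, flow_definition_arn, content_classifiers=None):
    """Build the HumanLoopConfig parameter for a document analysis request."""
    config = {"HumanLoopName": loop_name, "FlowDefinitionArn": flow_definition_arn}
    if content_classifiers:
        config["DataAttributes"] = {"ContentClassifiers": list(content_classifiers)}
    return config


def human_loop_started(response):
    """Return True if the response reports that a human review loop was created."""
    output = response.get("HumanLoopActivationOutput", {})
    return bool(output.get("HumanLoopArn"))


# Hypothetical names and ARN, for illustration only.
human_loop_config = build_human_loop_config(
    loop_name="tax-form-review-001",
    flow_definition_arn="arn:aws:sagemaker:us-east-1:111122223333:flow-definition/example-flow",
    content_classifiers=["FreeOfPersonallyIdentifiableInformation"],
)

# Hand-written sample response fragment (illustrative only).
sample_response = {
    "HumanLoopActivationOutput": {
        "HumanLoopArn": "arn:aws:sagemaker:us-east-1:111122223333:human-loop/tax-form-review-001",
        "HumanLoopActivationReasons": ["ConditionsEvaluation"],
    }
}

started = human_loop_started(sample_response)
print(human_loop_config, started)
```

With the AWS SDK for Python, the config would typically be passed as `analyze_document(Document=..., FeatureTypes=["FORMS"], HumanLoopConfig=human_loop_config)`.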
Amazon Textract pricing
Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables. With Amazon Textract, you pay only for what you use, with no minimum fees and no upfront commitments. You are charged only for the pages you process, whether you extract text, text with tables, form data, or query answers, or process invoices and identity documents. See the FAQ for additional details about pages and acceptable use of Amazon Textract.