Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, layout elements, and data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables. All extracted data is returned with bounding box coordinates: polygon frames that encompass each piece of identified data, such as a word, a line, a table, or an individual cell within a table. Amazon Textract also returns a confidence score for everything it identifies so you can make informed decisions about how to use the results.
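As a rough illustration of how confidence scores and bounding boxes might be used, the Python sketch below filters detected lines by confidence and converts a ratio-based bounding box into pixel coordinates. The sample blocks are made-up data shaped like Textract `Block` objects, and the 90.0 threshold, page size, and helper names are assumptions for this example rather than recommended values.

```python
from typing import Dict, List, Tuple


def confident_lines(blocks: List[dict], min_confidence: float = 90.0) -> List[str]:
    """Return the text of LINE blocks whose confidence meets the threshold."""
    return [
        block["Text"]
        for block in blocks
        if block.get("BlockType") == "LINE"
        and block.get("Confidence", 0.0) >= min_confidence
    ]


def box_to_pixels(box: Dict[str, float], page_width: int,
                  page_height: int) -> Tuple[int, int, int, int]:
    """Convert a BoundingBox (ratios of page size) to (left, top, right, bottom) pixels."""
    left = round(box["Left"] * page_width)
    top = round(box["Top"] * page_height)
    right = round((box["Left"] + box["Width"]) * page_width)
    bottom = round((box["Top"] + box["Height"]) * page_height)
    return left, top, right, bottom


# Made-up sample data, not real service output.
SAMPLE_BLOCKS = [
    {"BlockType": "LINE", "Text": "Invoice 1001", "Confidence": 99.2,
     "Geometry": {"BoundingBox": {"Left": 0.25, "Top": 0.125,
                                  "Width": 0.5, "Height": 0.0625}}},
    {"BlockType": "LINE", "Text": "smudged note", "Confidence": 41.7,
     "Geometry": {"BoundingBox": {"Left": 0.25, "Top": 0.5,
                                  "Width": 0.25, "Height": 0.0625}}},
]

kept = confident_lines(SAMPLE_BLOCKS)
pixel_box = box_to_pixels(SAMPLE_BLOCKS[0]["Geometry"]["BoundingBox"], 1000, 800)
print(kept, pixel_box)
```

Low-confidence results do not have to be discarded; an application could instead route them to a manual review step.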
Amazon Textract lets you customize the pretrained Queries feature to improve extraction accuracy on your business-specific document types while you maintain control and ownership of your data. Through the AWS Management Console, you can upload as few as ten sample documents, annotate the data, and customize the pretrained Queries feature within a few hours.
Amazon Textract can extract layout elements such as paragraphs, titles, lists, headers, footers, and more from documents. Layout is a feature type in the Analyze Document API. You can use Layout as a stand-alone feature or in combination with other Analyze Document feature types.
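A minimal sketch of requesting the Layout feature type with boto3 and turning the result into a simple outline might look like the following. It assumes AWS credentials and a region are already configured; `analyze_layout` and `layout_outline` are helper names invented for this example, and the sample blocks are made-up data shaped like the service response.

```python
from typing import List, Tuple


def analyze_layout(image_bytes: bytes) -> List[dict]:
    """Call AnalyzeDocument with only the LAYOUT feature type (needs AWS access)."""
    import boto3  # imported lazily so the parsing helper below runs offline

    client = boto3.client("textract")
    response = client.analyze_document(
        Document={"Bytes": image_bytes},
        FeatureTypes=["LAYOUT"],
    )
    return response["Blocks"]


def layout_outline(blocks: List[dict]) -> List[Tuple[str, str]]:
    """Pair each LAYOUT_* block with the text of its child LINE blocks."""
    by_id = {block["Id"]: block for block in blocks}
    outline = []
    for block in blocks:
        block_type = block.get("BlockType", "")
        if not block_type.startswith("LAYOUT_"):
            continue
        child_ids = [
            child_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for child_id in rel["Ids"]
        ]
        # Only direct LINE children are joined; nested layout blocks
        # (for example, items inside a list) are reported on their own.
        text = " ".join(
            by_id[cid]["Text"]
            for cid in child_ids
            if by_id.get(cid, {}).get("BlockType") == "LINE"
        )
        outline.append((block_type, text))
    return outline


# Made-up sample data, not real service output.
LAYOUT_SAMPLE = [
    {"Id": "t1", "BlockType": "LAYOUT_TITLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["l1"]}]},
    {"Id": "p1", "BlockType": "LAYOUT_TEXT",
     "Relationships": [{"Type": "CHILD", "Ids": ["l2", "l3"]}]},
    {"Id": "l1", "BlockType": "LINE", "Text": "Quarterly Report"},
    {"Id": "l2", "BlockType": "LINE", "Text": "Revenue grew"},
    {"Id": "l3", "BlockType": "LINE", "Text": "in the third quarter."},
]

outline = layout_outline(LAYOUT_SAMPLE)
print(outline)
```

To combine Layout with other feature types, the same request could list several values in `FeatureTypes`, such as `["LAYOUT", "TABLES"]`.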
Optical character recognition
Amazon Textract OCR automatically detects printed and handwritten text in documents and images. Textract’s ML-powered OCR can recognize text in various fonts and styles, and it can also handle noisy or distorted text.
You can detect key-value pairs in document images automatically and retain the context without manual intervention. A key-value pair is a set of linked data items. For instance, in a document, the field “First Name” is the key and “Jane” is the value. This makes it easy to import the extracted data into a database or provide it as a variable in an application. With traditional OCR solutions, keys and values are extracted as simple text, and their relationship is lost unless hard-coded rules are written and maintained for each form.
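The key-to-value link is carried in the response as relationships between blocks. The sketch below, written against made-up sample data shaped like the output of the FORMS feature type, follows those relationships to rebuild a plain dictionary. The helper names are assumptions for this example.

```python
from typing import Dict, List


def _child_words(block: dict, by_id: Dict[str, dict]) -> str:
    """Join the WORD children of a block into a single string."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = by_id[child_id]
            if child["BlockType"] == "WORD":
                words.append(child["Text"])
    return " ".join(words)


def key_value_pairs(blocks: List[dict]) -> Dict[str, str]:
    """Map each KEY block's text to the text of its linked VALUE block(s)."""
    by_id = {block["Id"]: block for block in blocks}
    pairs = {}
    for block in blocks:
        if block["BlockType"] != "KEY_VALUE_SET":
            continue
        if "KEY" not in block.get("EntityTypes", []):
            continue
        value_ids = [
            value_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == "VALUE"
            for value_id in rel["Ids"]
        ]
        key_text = _child_words(block, by_id)
        pairs[key_text] = " ".join(_child_words(by_id[v], by_id) for v in value_ids)
    return pairs


# Made-up sample data, not real service output.
FORM_SAMPLE = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "First"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Name"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Jane"},
]

form_fields = key_value_pairs(FORM_SAMPLE)
print(form_fields)
```

A dictionary like this can be passed directly to a database insert or used as variables in an application, which is the workflow the paragraph above describes.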
Amazon Textract preserves the composition of data stored in tables during extraction. This is helpful for documents that are largely composed of structured data, such as financial reports or medical records with tables in columns and rows. You can automatically load the extracted data into a database using a predefined schema. For example, rows of item numbers and quantities in an inventory report will retain their association so an inventory management application can easily increment item totals.
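To show how row and column associations survive extraction, the following sketch rebuilds a table as a list of rows from cell blocks that carry 1-based `RowIndex` and `ColumnIndex` values. The sample blocks are made-up data shaped like the output of the TABLES feature type, and `table_rows` is a helper name invented here. Merged cells are not handled.

```python
from typing import Dict, List


def table_rows(table_block: dict, by_id: Dict[str, dict]) -> List[List[str]]:
    """Rebuild a TABLE block as a row-major grid of cell strings."""
    cells = {}
    for rel in table_block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for cell_id in rel["Ids"]:
            cell = by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [
                by_id[word_id]["Text"]
                for cell_rel in cell.get("Relationships", [])
                if cell_rel["Type"] == "CHILD"
                for word_id in cell_rel["Ids"]
                if by_id[word_id]["BlockType"] == "WORD"
            ]
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
    if not cells:
        return []
    row_count = max(row for row, _ in cells)
    column_count = max(column for _, column in cells)
    # Missing positions become empty strings so every row has equal length.
    return [
        [cells.get((row, column), "") for column in range(1, column_count + 1)]
        for row in range(1, row_count + 1)
    ]


def _cell(cell_id: str, row: int, column: int, word_id: str) -> dict:
    """Build a made-up CELL block for the sample below."""
    return {"Id": cell_id, "BlockType": "CELL", "RowIndex": row,
            "ColumnIndex": column,
            "Relationships": [{"Type": "CHILD", "Ids": [word_id]}]}


# Made-up sample data: a 2x2 inventory table.
TABLE_SAMPLE = [
    {"Id": "tbl", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    _cell("c1", 1, 1, "w1"), _cell("c2", 1, 2, "w2"),
    _cell("c3", 2, 1, "w3"), _cell("c4", 2, 2, "w4"),
    {"Id": "w1", "BlockType": "WORD", "Text": "Item"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Quantity"},
    {"Id": "w3", "BlockType": "WORD", "Text": "A-100"},
    {"Id": "w4", "BlockType": "WORD", "Text": "3"},
]

sample_by_id = {block["Id"]: block for block in TABLE_SAMPLE}
rows = table_rows(TABLE_SAMPLE[0], sample_by_id)
print(rows)
```

With the header row separated from the data rows, each remaining row maps onto a predefined schema, for example one item number and one quantity per inventory record.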
Amazon Textract provides the ability to detect signatures on any document or image. This makes it easy to automatically detect signatures on documents such as checks, loan application forms, and claims forms. The location of the signatures and associated confidence scores are included in the API response.
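One way to consume signature results is to collect the location and confidence of every detected signature, for instance to check that a claims form has been signed. The sketch below assumes the response contains blocks of type `SIGNATURE`; the sample data is made up, and the 80.0 threshold and helper name are arbitrary choices for this example.

```python
from typing import List


def signature_locations(blocks: List[dict], min_confidence: float = 80.0) -> List[dict]:
    """Return page, bounding box, and confidence for each SIGNATURE block."""
    found = []
    for block in blocks:
        if block.get("BlockType") != "SIGNATURE":
            continue
        if block.get("Confidence", 0.0) < min_confidence:
            continue
        found.append({
            "page": block.get("Page", 1),  # assume page 1 when no page is reported
            "box": block["Geometry"]["BoundingBox"],
            "confidence": block["Confidence"],
        })
    return found


# Made-up sample data, not real service output.
SIGNATURE_SAMPLE = [
    {"BlockType": "LINE", "Text": "Signature of applicant", "Confidence": 99.0},
    {"BlockType": "SIGNATURE", "Confidence": 93.5, "Page": 2,
     "Geometry": {"BoundingBox": {"Left": 0.5, "Top": 0.75,
                                  "Width": 0.25, "Height": 0.125}}},
]

signatures = signature_locations(SIGNATURE_SAMPLE)
is_signed = len(signatures) > 0
print(signatures, is_signed)
```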
Query based extraction
Amazon Textract lets you specify the data you need to extract from documents using queries. You state the information you need as a natural language question (e.g., “What is the customer name?”) and receive the exact information (e.g., “John Doe”) as part of the API response. You do not need to know the data structure in the document (table, form, implied field, nested data) or worry about variations across document versions and formats. Textract Queries are pretrained on a large variety of documents including paystubs, bank statements, W-2s, loan application forms, mortgage notes, claims documents, and insurance cards. This flexibility reduces the need for post-processing, manual review of extracted data, and custom ML model training.
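As a sketch of the query workflow, the code below builds the configuration that would be passed as `QueriesConfig` to an Analyze Document call with the QUERIES feature type, then reads answers back by following each query's answer relationship. The alias `CUSTOMER_NAME`, the helper names, and the sample blocks are assumptions for this example and are not taken from a real response.

```python
from typing import Dict, List, Optional


def build_queries_config(questions: Dict[str, str]) -> dict:
    """Turn {alias: question} into a QueriesConfig-shaped dictionary."""
    return {
        "Queries": [
            {"Text": question, "Alias": alias}
            for alias, question in questions.items()
        ]
    }


def query_answers(blocks: List[dict]) -> Dict[str, Optional[str]]:
    """Map each query alias (or its text) to the highest-confidence answer."""
    by_id = {block["Id"]: block for block in blocks}
    answers = {}
    for block in blocks:
        if block["BlockType"] != "QUERY":
            continue
        query = block["Query"]
        label = query.get("Alias") or query["Text"]
        best = None
        for rel in block.get("Relationships", []):
            if rel["Type"] != "ANSWER":
                continue
            for result_id in rel["Ids"]:
                result = by_id[result_id]
                if best is None or result["Confidence"] > best["Confidence"]:
                    best = result
        # A query with no answer relationship maps to None.
        answers[label] = best["Text"] if best else None
    return answers


config = build_queries_config({"CUSTOMER_NAME": "What is the customer name?"})

# Made-up sample data, not real service output.
QUERY_SAMPLE = [
    {"Id": "q1", "BlockType": "QUERY",
     "Query": {"Text": "What is the customer name?", "Alias": "CUSTOMER_NAME"},
     "Relationships": [{"Type": "ANSWER", "Ids": ["r1"]}]},
    {"Id": "r1", "BlockType": "QUERY_RESULT", "Text": "John Doe", "Confidence": 97.0},
    {"Id": "q2", "BlockType": "QUERY",
     "Query": {"Text": "What is the policy number?"}},
]

answers = query_answers(QUERY_SAMPLE)
print(config, answers)
```

Using an alias gives each answer a stable key, so downstream code does not depend on the exact wording of the question.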
Analyze Lending API is a managed, preconfigured intelligent document processing API that fully automates the extraction of information from loan packages. Customers simply upload their mortgage loan documents to the Analyze Lending API and its prebuilt machine learning models will classify and split the document package by document type.
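The classify-and-split result can be pictured as a summary that groups page numbers by document type. The sketch below assumes a summary dictionary with `DocumentGroups`, each holding a `Type` and a list of `SplitDocuments` with their `Pages`, as returned by the lending analysis summary operation. The type labels and page numbers in the sample are placeholders, and the helper name is invented for this example.

```python
from typing import Dict, List


def pages_by_document_type(summary: dict) -> Dict[str, List[List[int]]]:
    """Map each document type to the page lists of its split documents."""
    grouped = {}
    for group in summary.get("DocumentGroups", []):
        documents = [
            list(split.get("Pages", []))
            for split in group.get("SplitDocuments", [])
        ]
        grouped[group["Type"]] = documents
    return grouped


# Made-up sample data; the type labels are placeholders, not a list of
# the document types the service actually reports.
LENDING_SUMMARY_SAMPLE = {
    "DocumentGroups": [
        {"Type": "PAYSTUB",
         "SplitDocuments": [{"Index": 1, "Pages": [1, 2]},
                            {"Index": 2, "Pages": [3]}]},
        {"Type": "BANK_STATEMENT",
         "SplitDocuments": [{"Index": 1, "Pages": [4, 5, 6]}]},
    ],
    "UndetectedDocumentTypes": [],
}

package_index = pages_by_document_type(LENDING_SUMMARY_SAMPLE)
print(package_index)
```

Because loan packages are processed asynchronously, an application would typically start the analysis job, wait for it to complete, and only then fetch a summary like this one.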
Invoices and receipts
Invoices and receipts can have a wide variety of layouts, which makes it difficult and time-consuming to manually extract data at scale. Amazon Textract uses machine learning (ML) to understand the context of invoices and receipts and automatically extracts relevant data such as vendor name, invoice number, item prices, total amount, and payment terms.
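As an illustration, the sketch below flattens one expense document from an Analyze Expense response into a summary dictionary and a list of line items. The sample document is made-up data shaped like the response, and the helper name is an assumption for this example. When a field type appears more than once, only the first value is kept, which is a simplification.

```python
from typing import Dict, List, Tuple


def summarize_expense(expense_document: dict) -> Tuple[Dict[str, str], List[Dict[str, str]]]:
    """Return (summary fields, line items) keyed by normalized field type."""
    summary = {}
    for field in expense_document.get("SummaryFields", []):
        field_type = field.get("Type", {}).get("Text")
        value = field.get("ValueDetection", {}).get("Text")
        if field_type and value is not None:
            summary.setdefault(field_type, value)  # keep the first occurrence

    line_items = []
    for group in expense_document.get("LineItemGroups", []):
        for line_item in group.get("LineItems", []):
            line_items.append({
                field["Type"]["Text"]: field["ValueDetection"]["Text"]
                for field in line_item.get("LineItemExpenseFields", [])
                if "Type" in field and "ValueDetection" in field
            })
    return summary, line_items


def _field(field_type: str, value: str) -> dict:
    """Build a made-up expense field for the sample below."""
    return {"Type": {"Text": field_type}, "ValueDetection": {"Text": value}}


# Made-up sample data, not real service output.
EXPENSE_SAMPLE = {
    "SummaryFields": [
        _field("VENDOR_NAME", "Example Supply Co"),
        _field("INVOICE_RECEIPT_ID", "INV-1001"),
        _field("TOTAL", "42.00"),
        _field("PAYMENT_TERMS", "Net 30"),
    ],
    "LineItemGroups": [
        {"LineItems": [
            {"LineItemExpenseFields": [_field("ITEM", "Widget"),
                                       _field("PRICE", "42.00")]},
        ]},
    ],
}

expense_summary, expense_items = summarize_expense(EXPENSE_SAMPLE)
print(expense_summary, expense_items)
```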
Amazon Textract uses machine learning (ML) to understand the context of identity documents such as U.S. passports and driver’s licenses without the need for templates or configuration. You can automatically extract specific information such as date of expiry and date of birth, as well as intelligently identify and extract implied information such as name and address. Using Analyze ID, businesses providing ID verification services and those in finance, healthcare, and insurance can easily automate account creation, appointment scheduling, employment applications, and more by allowing customers to submit a picture or scan of their identity document.
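A minimal sketch of reading Analyze ID output follows. It assumes each identity document carries a list of fields with a normalized type name and a detected value, and it prefers a normalized value (for example, a date in a standard format) when one is present. The sample data is made up, and the 90.0 threshold and helper name are arbitrary choices for this example.

```python
from typing import Dict


def identity_fields(identity_document: dict, min_confidence: float = 0.0) -> Dict[str, str]:
    """Map field types such as FIRST_NAME to their detected values."""
    fields = {}
    for field in identity_document.get("IdentityDocumentFields", []):
        name = field.get("Type", {}).get("Text")
        detection = field.get("ValueDetection", {})
        if not name or detection.get("Confidence", 0.0) < min_confidence:
            continue
        normalized = detection.get("NormalizedValue", {}).get("Value")
        # Prefer the normalized form when present; fall back to raw text.
        fields[name] = normalized or detection.get("Text", "")
    return fields


# Made-up sample data, not real service output.
ID_SAMPLE = {
    "IdentityDocumentFields": [
        {"Type": {"Text": "FIRST_NAME"},
         "ValueDetection": {"Text": "Jane", "Confidence": 99.0}},
        {"Type": {"Text": "DATE_OF_BIRTH"},
         "ValueDetection": {"Text": "01/15/1990", "Confidence": 98.0,
                            "NormalizedValue": {"Value": "1990-01-15T00:00:00",
                                                "ValueType": "Date"}}},
        {"Type": {"Text": "ADDRESS"},
         "ValueDetection": {"Text": "123 Any Street", "Confidence": 62.0}},
    ],
}

trusted_fields = identity_fields(ID_SAMPLE, min_confidence=90.0)
print(trusted_fields)
```

Fields that fall below the threshold, such as the address in this sample, could be shown to the customer for confirmation instead of being dropped.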
Amazon Textract pricing
With Amazon Textract, you pay only for what you use. There are no minimum fees and no upfront commitments. Amazon Textract charges only for pages processed, whether you extract text, text with tables, form data, or queries, or process invoices and identity documents. See the FAQ for additional details about pages and acceptable use of Textract.
Get started with Amazon Textract with no upfront commitments or long-term contracts.