Posted On: Jul 27, 2021

Amazon Textract, a machine learning service that extracts text and structured data from any document or image, now offers specialized support for invoices and receipts. Until today, these important documents were difficult to process at scale because they do not follow set design rules and often require context to interpret correctly. For example, customers might need to extract the vendor name from the Amazon logo at the top of an invoice even though it is not labeled “Vendor: Amazon”. Now with Textract, customers can extract explicitly labeled data, implied data, and line items from an itemized list of goods or services on almost any invoice or receipt, without any templates or configuration.

Starting today, Amazon Textract adds the following capabilities for receipts and invoices:
1) Identifies the vendor name - Amazon Textract can find the vendor name on a receipt even if it is only indicated within a logo on the page, without an explicit label called “vendor”. It can also find and extract items, quantities, and prices that are not labeled with column headers for line items.
2) Enables consolidation of output from many documents - Textract normalizes key names and column headers into a standard taxonomy when extracting data from invoices and receipts. For example, it detects that “invoice no.”, “invoice number”, and “receipt #” are equivalent and outputs “INVOICE_RECEIPT_ID” for each, so that downstream applications can easily compare output from many documents.
3) Extracts line item details, even when the column headers are missing - Textract extracts line items, including the items, quantities, and prices of individual goods purchased, from an invoice or a receipt. If the table of line items does not include column headers, Textract now infers what the column headers are meant to be based on the table content.
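The normalized output described above can be consumed with a few lines of code. The following is a minimal Python sketch, assuming the AnalyzeExpense response is a dictionary with `ExpenseDocuments`, `SummaryFields`, and `LineItemGroups` keys; the `SAMPLE_RESPONSE` data and the `summarize_expense` helper are hypothetical and written by hand for illustration, not real service output.

```python
# Hand-written sample shaped like an AnalyzeExpense response.
# Illustrative only: a real response carries more keys (geometry, confidence, ...).
SAMPLE_RESPONSE = {
    "ExpenseDocuments": [
        {
            "SummaryFields": [
                # Vendor name inferred from a logo, so no label was detected.
                {"Type": {"Text": "VENDOR_NAME"},
                 "ValueDetection": {"Text": "Amazon"}},
                # "Invoice no." on the page is normalized to INVOICE_RECEIPT_ID.
                {"Type": {"Text": "INVOICE_RECEIPT_ID"},
                 "LabelDetection": {"Text": "Invoice no."},
                 "ValueDetection": {"Text": "12345"}},
            ],
            "LineItemGroups": [
                {"LineItems": [
                    {"LineItemExpenseFields": [
                        {"Type": {"Text": "ITEM"},
                         "ValueDetection": {"Text": "Notebook"}},
                        {"Type": {"Text": "QUANTITY"},
                         "ValueDetection": {"Text": "2"}},
                        {"Type": {"Text": "PRICE"},
                         "ValueDetection": {"Text": "7.98"}},
                    ]},
                ]},
            ],
        }
    ]
}


def _typed_value(field):
    """Return (normalized type, detected text) for one expense field."""
    field_type = field.get("Type", {}).get("Text")
    value = field.get("ValueDetection", {}).get("Text")
    return field_type, value


def summarize_expense(response):
    """Flatten each expense document into normalized summary fields and line items."""
    results = []
    for doc in response.get("ExpenseDocuments", []):
        summary = {}
        for field in doc.get("SummaryFields", []):
            field_type, value = _typed_value(field)
            if field_type:
                # Keep the first value seen for each normalized key.
                summary.setdefault(field_type, value)

        line_items = []
        for group in doc.get("LineItemGroups", []):
            for item in group.get("LineItems", []):
                row = {}
                for field in item.get("LineItemExpenseFields", []):
                    field_type, value = _typed_value(field)
                    if field_type:
                        row[field_type] = value
                line_items.append(row)

        results.append({"summary": summary, "line_items": line_items})
    return results


if __name__ == "__main__":
    for doc in summarize_expense(SAMPLE_RESPONSE):
        print(doc["summary"])
        print(doc["line_items"])
```

Because every document is keyed by the same normalized types, output from invoices with different layouts can be compared or merged without per-vendor mapping logic.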

Let’s hear from one of our customers:

Founded in 2010, Paymerang facilitates simple, secure, and profitable electronic supplier payments for businesses. "We help customers in several verticals simplify their accounts payable processes by eliminating routine tasks, paying their suppliers electronically, and earning cash rebates in the process," said Jason Losh, Director of Enterprise Platforms at Paymerang. "We use Amazon Textract, a HIPAA eligible service, to help our customers in the healthcare vertical automatically extract data from invoices without using custom logic to standardize the extracted information. By extracting and classifying data into a consistent set of standard fields, Amazon Textract helps us serve customers who use vendors that do not follow a common pattern for invoice layouts."

For more information on this feature, see the documentation and the blog post that describes how to use Textract for invoices and receipts with the new AnalyzeExpense API. For pricing details, see the pricing page.

AnalyzeExpense will be launched in waves: Asia Pacific (Singapore) on July 26th; Europe (Ireland) on July 27th; Asia Pacific (Sydney), US East (Ohio), and US West (Northern California) on July 28th; Europe (Frankfurt), Europe (London), and US East (N. Virginia) on July 29th; Asia Pacific (Seoul), Asia Pacific (Mumbai), Canada (Central), Europe (Paris), and US West (Oregon) on July 30th; and GovCloud (US-East) and GovCloud (US-West) on August 2nd.
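Because availability is staggered, a client needs to target a region where AnalyzeExpense has already launched. The following is a minimal Python sketch of such a call, assuming the AWS SDK for Python (boto3) is installed and credentials are configured; the helper names, bucket, and object key are hypothetical placeholders.

```python
def make_textract_client(region_name):
    """Create a Textract client for one region (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helper below can be used without boto3

    return boto3.client("textract", region_name=region_name)


def analyze_expense_from_s3(client, bucket, key):
    """Call AnalyzeExpense on an invoice or receipt image stored in Amazon S3."""
    return client.analyze_expense(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )


if __name__ == "__main__":
    # "eu-west-1" is Europe (Ireland); the bucket and key are placeholders.
    textract = make_textract_client("eu-west-1")
    response = analyze_expense_from_s3(textract, "my-bucket", "invoices/sample.png")
    print(len(response.get("ExpenseDocuments", [])), "expense document(s) found")
```

Passing the client in as an argument keeps the call easy to exercise with a stub in unit tests, without network access.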