Announcing specialized support for extracting data from invoices and receipts using Amazon Textract
Receipts and invoices are documents that are critical to small and medium businesses (SMBs), startups, and enterprises for managing their accounts payable processes. These types of documents are difficult to process at scale because they follow no set design rules, yet any individual customer encounters thousands of distinct types of these documents.
In this post, we show how you can use Amazon Textract’s new AnalyzeExpense API to extract line item details in addition to key-value pairs from invoices and receipts, which is a frequent request we hear from customers. Amazon Textract uses machine learning (ML) to understand the context of invoices and receipts, and automatically extracts specific information like vendor name, price, and payment terms. We walk you through processing an invoice or receipt using Amazon Textract and extracting a set of fields and line-item details. While AWS takes care of building, training, and deploying advanced ML models in a highly available and scalable environment, you take advantage of these models with simple-to-use API actions.
We cover the following topics in this post:
- How Amazon Textract processes invoices and receipts
- A walkthrough of the Amazon Textract console
- Anatomy of the Amazon Textract AnalyzeExpense API response
- How to process the response with the Amazon Textract parser library
- Sample solution architecture to automate invoice and receipts processing
- How to deploy and use the solution
Invoice and receipt processing using Amazon Textract
SMBs, startups, and enterprises process paper-based invoices and receipts as part of their accounts payable process to reconcile their goods received and for auditing purposes. Employees who submit expense reports also submit scans or images of the associated receipts. Companies try to standardize electronic invoicing, but some vendors only offer paper invoices, and some countries legally require paper invoices.
The peculiarities of invoices and receipts mean it’s also a difficult problem to solve at scale: invoices and receipts all look different, because each vendor designs its own documents independently. The labels are imperfect and inconsistent. Vendor name is often not explicitly labeled and has to be interpreted based on context. Other important information, such as customer number, customer ID, or account ID, is labeled differently from document to document.
To solve this problem, you can use Amazon Textract to process invoices and receipts at scale. Amazon Textract works with any style of invoice or receipt, with no templates or configuration required, and extracts data that can be tricky to capture from those documents, such as contact information, items purchased, and vendor name. That includes the line-item details, not just the headline amounts.
Amazon Textract also identifies vendor names that are critical for your workflows but may not be explicitly labeled. For example, Amazon Textract can find the vendor name on a receipt even if it’s only indicated within a logo at the top of the page without an explicit key-value pair combination.
Amazon Textract also makes it easy to consolidate input from diverse receipts and invoices. Different documents use different words for the same concept. For example, Amazon Textract maps relationships between field names in different documents, such as customer number and account ID, and outputs standard taxonomy (in this case, INVOICE_RECEIPT_ID), thereby representing data consistently across document types.
Amazon Textract console walkthrough
Before we get started with the API and code samples, let’s review the Amazon Textract console. The following images show examples of both an invoice and a receipt document on the Analyze Expense output tab of the Amazon Textract console.
Amazon Textract automatically detects the vendor name, invoice number, ship to address, and more from the sample invoice and displays them on the Summary Fields tab. It also represents the standard taxonomy of fields in brackets next to the actual value on the document. For example, it identifies “INVOICE #” as the standard field INVOICE_RECEIPT_ID.
Additionally, Amazon Textract detects the items purchased and displays them on the Line Item Fields tab.
The following is a similar example of a receipt. Amazon Textract detects “Whole Foods Market” as VENDOR_NAME even though the receipt doesn’t explicitly mention it as the vendor name.
The Amazon Textract AnalyzeExpense API response
In this section, we explain the AnalyzeExpense API response structure using sample images. The following is a sample receipt and the corresponding AnalyzeExpense response JSON.
The AnalyzeExpense JSON output contains ExpenseDocuments, and each ExpenseDocument contains SummaryFields and LineItemGroups. The ExpenseIndex field uniquely identifies the expense and associates the appropriate SummaryFields and LineItemGroups detected with that expense.
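As a rough illustration of that nesting, the following Python sketch walks a hand-written dictionary shaped like an AnalyzeExpense response. The sample values are written for illustration and are not real API output (only the field names follow the API), and summarize_response is a hypothetical helper name.

```python
# Hand-written sample shaped like an AnalyzeExpense response (not real API output).
sample_response = {
    "ExpenseDocuments": [
        {
            "ExpenseIndex": 1,
            "SummaryFields": [
                {"Type": {"Text": "VENDOR_NAME"},
                 "ValueDetection": {"Text": "New Store X1"}},
                {"Type": {"Text": "OTHER"},
                 "LabelDetection": {"Text": "Order type:"},
                 "ValueDetection": {"Text": "Quick Sale"}},
            ],
            "LineItemGroups": [
                {"LineItemGroupIndex": 1,
                 "LineItems": [
                     {"LineItemExpenseFields": [
                         {"Type": {"Text": "ITEM"},
                          "ValueDetection": {"Text": "Sample item"}},
                         {"Type": {"Text": "PRICE"},
                          "ValueDetection": {"Text": "1.00"}},
                     ]},
                 ]},
            ],
        }
    ]
}


def summarize_response(response):
    """Return one (ExpenseIndex, summary field count, line item count) tuple per expense."""
    summary = []
    for doc in response.get("ExpenseDocuments", []):
        # Line items are nested one level below each LineItemGroup.
        line_items = sum(len(group.get("LineItems", []))
                         for group in doc.get("LineItemGroups", []))
        summary.append((doc.get("ExpenseIndex"),
                        len(doc.get("SummaryFields", [])),
                        line_items))
    return summary


print(summarize_response(sample_response))
```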
The most granular level of data in the AnalyzeExpense response consists of Type, ValueDetection, and LabelDetection (optional). Let’s call this set of data an AnalyzeExpense element. The preceding example illustrates an AnalyzeExpense element that contains Type, ValueDetection, and LabelDetection.
In the preceding example, Amazon Textract detected 16 SummaryField key-value pairs, including VENDOR_NAME: New Store X1 and Order type: Quick Sale. AnalyzeExpense detects this key-value pair and displays it as shown in the preceding example. The individual entities are as follows:
- LabelDetection – The optional key of the key-value pair. In the Order type: Quick Sale example, it’s Order type:. For implied values such as Vendor Name, where the key isn’t explicitly shown in the receipt, LabelDetection isn’t included in the AnalyzeExpense element. In the preceding example, “New Store X1” at the top of the receipt is the vendor name without an explicit key. The AnalyzeExpense element for “New Store X1” has a type of VENDOR_NAME and a ValueDetection of New Store X1, but doesn’t have a LabelDetection.
- Type – This is the normalized type of the key-value pair. Because Order type isn’t a normalized taxonomy value, it’s classified as OTHER. Examples of normalized values are Vendor Name, Receiver Address, and Payment Terms. For a full list of normalized taxonomy values, see the Amazon Textract Developer Guide.
- ValueDetection – The value of the key-value pair. In the example of Order type: Quick Sale, it’s Quick Sale.
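To make the optional LabelDetection concrete, here is a minimal Python sketch that reads one element into a (type, label, value) tuple. The two sample elements are hand-written from the values discussed above rather than copied from a real response, and element_to_triple is a hypothetical helper name.

```python
def element_to_triple(field):
    """Return (type, label, value) for one element.

    The label is None when the element has no LabelDetection,
    which is the case for implied values such as the vendor name.
    """
    field_type = field.get("Type", {}).get("Text")
    label = field.get("LabelDetection", {}).get("Text")  # optional key
    value = field.get("ValueDetection", {}).get("Text")
    return field_type, label, value


# Element with an explicit key on the receipt.
explicit = {
    "Type": {"Text": "OTHER"},
    "LabelDetection": {"Text": "Order type:"},
    "ValueDetection": {"Text": "Quick Sale"},
}

# Element whose key is implied, so LabelDetection is absent.
implied = {
    "Type": {"Text": "VENDOR_NAME"},
    "ValueDetection": {"Text": "New Store X1"},
}

print(element_to_triple(explicit))
print(element_to_triple(implied))
```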
The AnalyzeExpense API also detects ITEM, QUANTITY, and PRICE within line items as normalized fields. If other text is in a line item on the receipt image, such as SKU or a detailed description, it’s included in the JSON as EXPENSE_ROW, as shown in the following example:
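A minimal Python sketch of reading those line-item fields follows. The sample document is hand-written for illustration (the item name, SKU, and amounts are made up), and line_items_to_rows is a hypothetical helper, not part of any AWS library.

```python
# Hand-written sample of one expense document with a single line item.
sample_document = {
    "LineItemGroups": [{
        "LineItemGroupIndex": 1,
        "LineItems": [{
            "LineItemExpenseFields": [
                {"Type": {"Text": "ITEM"}, "ValueDetection": {"Text": "Widget"}},
                {"Type": {"Text": "QUANTITY"}, "ValueDetection": {"Text": "2"}},
                {"Type": {"Text": "PRICE"}, "ValueDetection": {"Text": "4.50"}},
                {"Type": {"Text": "EXPENSE_ROW"},
                 "ValueDetection": {"Text": "Widget SKU-1 2 4.50"}},
            ]
        }]
    }]
}


def line_items_to_rows(expense_document):
    """Flatten every line item into a dict keyed by normalized field type."""
    rows = []
    for group in expense_document.get("LineItemGroups", []):
        for item in group.get("LineItems", []):
            row = {}
            for field in item.get("LineItemExpenseFields", []):
                field_type = field.get("Type", {}).get("Text", "OTHER")
                # If a type repeats within one line item, the last value wins.
                row[field_type] = field.get("ValueDetection", {}).get("Text", "")
            rows.append(row)
    return rows


for row in line_items_to_rows(sample_document):
    print(row)
```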
In addition to the detected content, the AnalyzeExpense API provides information like confidence scores and bounding boxes for detected elements. It gives you control of how you consume extracted content and integrate it into your applications. For example, you can flag any elements that have a confidence score under a certain threshold for manual review.
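The manual-review idea can be sketched in a few lines of Python. This assumes confidence is reported on a 0 to 100 scale under ValueDetection; the threshold of 90 is an arbitrary illustration, and flag_for_review and the sample scores are hypothetical.

```python
def flag_for_review(fields, threshold=90.0):
    """Return the fields whose value confidence falls below the threshold.

    A field with no confidence score is treated as 0 and therefore flagged.
    """
    flagged = []
    for field in fields:
        confidence = field.get("ValueDetection", {}).get("Confidence", 0.0)
        if confidence < threshold:
            flagged.append(field)
    return flagged


# Hand-written sample fields with made-up confidence scores.
sample_fields = [
    {"Type": {"Text": "VENDOR_NAME"},
     "ValueDetection": {"Text": "New Store X1", "Confidence": 99.2}},
    {"Type": {"Text": "OTHER"},
     "ValueDetection": {"Text": "Quick Sale", "Confidence": 61.5}},
]

for field in flag_for_review(sample_fields):
    print("Needs review:", field["Type"]["Text"], field["ValueDetection"]["Text"])
```

In practice, you would tune the threshold per field type, because a low-confidence total is usually more costly than a low-confidence description.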
The input document is either bytes or an Amazon Simple Storage Service (Amazon S3) object. You pass image bytes to an Amazon Textract API operation by using the Bytes property. For example, you use the Bytes property to pass a document loaded from a local file system.
Image bytes passed by using the Bytes property must be base64 encoded. Your code might not need to encode document file bytes if you’re using an AWS SDK to call Amazon Textract API operations. Alternatively, you can pass images stored in an S3 bucket to an Amazon Textract API operation by using the S3Object property. Documents stored in an S3 bucket don’t need to be base64 encoded.
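The two input styles can be sketched with Boto3 as follows. The helper names, the bucket and object names, and the default Region are assumptions for illustration. The Boto3 import sits inside the function so that the two request-building helpers can be tried without AWS credentials.

```python
from pathlib import Path


def document_from_file(path):
    """Build the Document argument from a local file, using the Bytes property."""
    # Raw bytes are passed as-is; the SDK takes care of the encoding.
    return {"Bytes": Path(path).read_bytes()}


def document_from_s3(bucket, key):
    """Build the Document argument for an image stored in Amazon S3."""
    return {"S3Object": {"Bucket": bucket, "Name": key}}


def analyze_expense(document, region_name="us-east-1"):
    """Call AnalyzeExpense; requires Boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the helpers above work without the SDK

    client = boto3.client("textract", region_name=region_name)
    return client.analyze_expense(Document=document)


# Hypothetical bucket and object names.
print(document_from_s3("my-input-bucket", "receipts/receipt-1.png"))
```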
You can call the AnalyzeExpense API using the AWS Command Line Interface (AWS CLI), as shown in the following code. Make sure you have the latest AWS CLI version installed.
Process the response with the Amazon Textract parser library
Apart from working with the JSON output as-is, you can use the Amazon Textract response parser library to parse the JSON returned by the AnalyzeExpense API. The library parses JSON and provides programming language-specific constructs to work with different parts of the document. For more details, refer to the GitHub repo. The Amazon Textract response parser makes it easier to deserialize the JSON response and use it in your application, in the same way that the Amazon Textract PrettyPrinter library makes it easier to print the parsed response in different formats. The following GitHub repository shows examples for parsing the Amazon Textract responses. You can parse SummaryFields and LineItemGroups for every ExpenseDocument in the AnalyzeExpense response JSON using the AnalyzeExpense parser as shown in the following code:
You can further use the serializer and deserializer to validate the response JSON and convert it into the Python object representation, and vice versa.
The following code deserializes the response JSON:
The following code serializes the response JSON:
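The sketch below illustrates the same round trip without the parser library, using only the Python standard library. It is a stand-in for the idea, not the library's API: the SummaryField class and both functions are hypothetical, and the sketch keeps only the type, label, and value of each summary field, dropping everything else in the response.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SummaryField:
    type: str
    value: str
    label: Optional[str] = None  # None when LabelDetection is absent


def deserialize(response_json: str) -> List[SummaryField]:
    """Turn response JSON into Python objects; a missing Type or
    ValueDetection raises KeyError, which acts as a basic validation."""
    response = json.loads(response_json)
    fields = []
    for doc in response.get("ExpenseDocuments", []):
        for raw in doc.get("SummaryFields", []):
            fields.append(SummaryField(
                type=raw["Type"]["Text"],
                value=raw["ValueDetection"]["Text"],
                label=raw.get("LabelDetection", {}).get("Text"),
            ))
    return fields


def serialize(fields: List[SummaryField]) -> str:
    """Turn the objects back into JSON with the same nesting."""
    raw_fields = []
    for field in fields:
        raw = {"Type": {"Text": field.type},
               "ValueDetection": {"Text": field.value}}
        if field.label is not None:
            raw["LabelDetection"] = {"Text": field.label}
        raw_fields.append(raw)
    return json.dumps({"ExpenseDocuments": [{"SummaryFields": raw_fields}]})


fields = [SummaryField("VENDOR_NAME", "New Store X1"),
          SummaryField("OTHER", "Quick Sale", "Order type:")]
print(deserialize(serialize(fields)) == fields)
```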
You can also convert the output to formats like CSV, Presto, TSV, HTML, LaTeX, and more by using the Amazon Textract PrettyPrinter library.
Install the PrettyPrinter library with the following code:
Call the get_string method of textractprettyprinter.t_pretty_print_expense with the table_format set to whichever format you want to output. The following example code outputs both SUMMARY and LINEITEMGROUPS in the fancy grid format:
The following is the PrettyPrinter output for a sample receipt.
The following is another example of detecting structured data from an invoice. AnalyzeExpense detects the various normalized summary fields. It also detected one LineItemGroup with one line item, including its PRICE, as shown in the following PrettyPrinter output.
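Where the PrettyPrinter library isn't available, a similar flat export can be approximated with the Python standard library. The following sketch is not the PrettyPrinter API; summary_fields_to_csv is a hypothetical helper that writes the summary fields of one hand-written expense document as CSV.

```python
import csv
import io


def summary_fields_to_csv(expense_document):
    """Render the summary fields of one expense document as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Type", "Label", "Value"])
    for field in expense_document.get("SummaryFields", []):
        writer.writerow([
            field.get("Type", {}).get("Text", ""),
            field.get("LabelDetection", {}).get("Text", ""),  # empty when implied
            field.get("ValueDetection", {}).get("Text", ""),
        ])
    return buffer.getvalue()


# Hand-written sample built from the receipt values discussed earlier.
sample_expense = {
    "SummaryFields": [
        {"Type": {"Text": "VENDOR_NAME"},
         "ValueDetection": {"Text": "New Store X1"}},
        {"Type": {"Text": "OTHER"},
         "LabelDetection": {"Text": "Order type:"},
         "ValueDetection": {"Text": "Quick Sale"}},
    ]
}

print(summary_fields_to_csv(sample_expense))
```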
Solution architecture overview
The following diagram is a common solution architecture pattern you can use to process documents using Amazon Textract. The solution uses the new AnalyzeExpense API to process receipts and invoices on Amazon S3 and stores the results back in Amazon S3. For more details and implementation code, refer to the GitHub repo.
The solution architecture includes the following steps:
- The input and output S3 buckets store the input expense documents (images) in PNG and JPEG formats and the AnalyzeExpense PrettyPrinter outputs, respectively.
- An event rule based on an event pattern in Amazon EventBridge matches incoming S3 PutObject events in the input S3 bucket containing the raw expense document images.
- The configured EventBridge rule sends the event to an AWS Lambda function for further processing.
- The Lambda function reads the images from the input S3 bucket, calls the AnalyzeExpense API, uses the Amazon Textract response parser to deserialize the JSON response, uses Amazon Textract PrettyPrinter to print the parsed response, and stores the results in the output S3 bucket in different formats.
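The event-handling part of these steps can be sketched as follows. This is not the code from the GitHub repo. It assumes the rule delivers S3 PutObject events recorded by AWS CloudTrail, with the bucket and key under detail.requestParameters; the event shape used by the actual solution may differ. All function names and the sample event are hypothetical, and the AnalyzeExpense call itself is left as a comment.

```python
import posixpath


def extract_s3_object(event):
    """Return (bucket, key) for a PutObject event, or None if it doesn't match."""
    detail = event.get("detail", {})
    if detail.get("eventName") != "PutObject":
        return None
    params = detail.get("requestParameters") or {}
    bucket, key = params.get("bucketName"), params.get("key")
    if not bucket or not key:
        return None
    return bucket, key


def output_key_for(input_key, extension="csv"):
    """Derive the output object key by swapping the file extension."""
    stem, _ = posixpath.splitext(input_key)
    return f"{stem}.{extension}"


def lambda_handler(event, context):
    target = extract_s3_object(event)
    if target is None:
        return {"status": "ignored"}
    bucket, key = target
    # A full function would call AnalyzeExpense on the S3 object here,
    # format the response, and write it to the output bucket.
    return {"status": "accepted", "bucket": bucket, "key": key,
            "output_key": output_key_for(key)}


# Hand-written sample event following the assumed shape.
sample_event = {
    "detail": {
        "eventName": "PutObject",
        "requestParameters": {"bucketName": "input-bucket",
                              "key": "receipts/r1.png"},
    }
}

print(lambda_handler(sample_event, None))
```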
Deploy and use the solution
You can deploy the solution architecture using an AWS CloudFormation template that performs much of the setup work for you.
- Choose Launch Stack to deploy the solution architecture in the US East (N. Virginia) Region.
- Don’t make any changes to stack name or parameters.
- In the Capabilities section, select I acknowledge that AWS CloudFormation might create IAM resources.
- Choose Create stack.
To use the solution, upload the receipt and invoice images to the S3 bucket referred to by SourceBucket in the CloudFormation template. This triggers an event that invokes the Lambda function, which calls the AnalyzeExpense API, parses the response JSON, converts the parsed response into CSV or fancy_grid format, and stores it in another S3 bucket (referred to by OutputBucket in the CloudFormation template).
You can extend the provided Lambda function further based on your requirements and also change the output format to other types like TSV, grid, LaTeX, and many more by setting the appropriate value of output_type when calling the get_string method of textractprettyprinter.t_pretty_print_expense in Amazon Textract PrettyPrinter.
The sample Lambda function deployment package included in this CloudFormation template also includes the Boto3 SDK. If you want to upgrade the Boto3 SDK in the future, you need to either create a new deployment package with the upgraded Boto3 SDK or use the latest Boto3 SDK provided by the Lambda Python runtime.
Clean up resources
To delete the resources that the CloudFormation template created, complete the following steps:
- Delete the input, output, and logging S3 buckets created by the CloudFormation template.
- On the AWS CloudFormation console, select the stack that you created.
- On the Actions menu, choose Delete.
In this post, we provided an overview of the new Amazon Textract AnalyzeExpense API to quickly and easily retrieve structured data from receipts and invoices. We also described how you can parse the AnalyzeExpense response JSON using the Amazon Textract parser library and save the output in different formats using Amazon Textract PrettyPrinter. Finally, we provided a solution architecture and sample code on the GitHub repo for processing invoices and receipts using Amazon S3, EventBridge, and a Lambda function.
For more information, see the Amazon Textract Developer Guide.
About the Authors
Dhawalkumar Patel is a Sr. Startups Machine Learning Solutions Architect at AWS with expertise in Machine Learning and Serverless domains. He has worked with organizations ranging from large enterprises to startups on problems related to distributed computing and artificial intelligence.
Manish Chugh is a Sr. Solutions Architect at AWS based in San Francisco, CA. He has worked with organizations ranging from large enterprises to early-stage startups. He is responsible for helping customers architect scalable, secure, and cost-effective workloads on AWS. In his free time, he enjoys hiking East Bay trails, road biking, and watching (and playing) cricket.