AWS Machine Learning Blog

Announcing specialized support for extracting data from invoices and receipts using Amazon Textract

Receipts and invoices are documents that are critical to small and medium businesses (SMBs), startups, and enterprises for managing their accounts payable processes. These types of documents are difficult to process at scale because they follow no set design rules, yet any individual customer encounters thousands of distinct types of these documents.

In this post, we show how you can use the new Amazon Textract AnalyzeExpense API to extract line-item details in addition to key-value pairs from invoices and receipts, which is a frequent request we hear from customers. Amazon Textract uses machine learning (ML) to understand the context of invoices and receipts, and automatically extracts specific information like vendor name, price, and payment terms. We walk you through processing an invoice or receipt using Amazon Textract and extracting a set of fields and line-item details. While AWS takes care of building, training, and deploying advanced ML models in a highly available and scalable environment, you take advantage of these models with simple-to-use API actions.

We cover the following topics in this post:

  • How Amazon Textract processes invoices and receipts
  • A walkthrough of the Amazon Textract console
  • Anatomy of the Amazon Textract AnalyzeExpense API response
  • How to process the response with the Amazon Textract parser library
  • Sample solution architecture to automate invoice and receipts processing
  • How to deploy and use the solution

Invoice and receipt processing using Amazon Textract

SMBs, startups, and enterprises process paper-based invoices and receipts as part of their accounts payable process to reconcile their goods received and for auditing purposes. Employees who submit expense reports also submit scans or images of the associated receipts. Companies try to standardize electronic invoicing, but some vendors only offer paper invoices, and some countries legally require paper invoices.

The peculiarities of invoices and receipts also make this a difficult problem to solve at scale: invoices and receipts all look different, because each vendor designs its own documents independently. The labels are imperfect and inconsistent. The vendor name is often not explicitly labeled and has to be interpreted based on context. Other important information, such as customer number, customer ID, or account ID, is labeled differently from document to document.

To solve this problem, you can use Amazon Textract to process invoices and receipts at scale. Amazon Textract works with any style of invoice or receipt, with no templates or configuration required, and extracts data that can be hard to capture from those documents, such as contact information, items purchased, and vendor name. That includes the line-item details, not just the headline amounts.

Amazon Textract also identifies vendor names that are critical for your workflows but may not be explicitly labeled. For example, Amazon Textract can find the vendor name on a receipt even if it’s only indicated within a logo at the top of the page without an explicit key-value pair combination.

Amazon Textract also makes it easy to consolidate input from diverse receipts and invoices. Different documents use different words for the same concept. For example, Amazon Textract maps relationships between field names in different documents such as customer no., customer number, and account ID, and outputs standard taxonomy (in this case, INVOICE_RECEIPT_ID), thereby representing data consistently across document types.

Amazon Textract console walkthrough

Before we get started with the API and code samples, let’s review the Amazon Textract console. The following images show examples of both an invoice and a receipt document on the Analyze Expense output tab of the Amazon Textract console.

Amazon Textract automatically detects the vendor name, invoice number, ship to address, and more from the sample invoice and displays them on the Summary Fields tab. It also represents the standard taxonomy of fields in brackets next to the actual value on the document. For example, it identifies “INVOICE #” as the standard field INVOICE_RECEIPT_ID.

Additionally, Amazon Textract detects the items purchased and displays them on the Line Item Fields tab.

The following is a similar example of a receipt. Amazon Textract detects “Whole Foods Market” as VENDOR_NAME even though the receipt doesn’t explicitly mention it as the vendor name.

The Amazon Textract AnalyzeExpense API response

In this section, we explain the AnalyzeExpense API response structure using sample images. The following is a sample receipt and the corresponding AnalyzeExpense response JSON.

Sample store receipt image

AnalyzeExpense JSON response for SummaryFields:

{
    "DocumentMetadata": {
        "Pages": 1
    },
    "ExpenseDocuments": [
        {
            "ExpenseIndex": 1,
            "SummaryFields": [
                {
                    "Type": {
                        "Text": "VENDOR_NAME",
                        "Confidence": 97.0633544921875
                    },
                    "ValueDetection": {
                        "Text": "New Store X1",
                        "Geometry": {
                            …
                        },
                        "Confidence": 96.65239715576172
                    },
                    "PageNumber": 1
                },
                {
                    "Type": {
                        "Text": "OTHER",
                        "Confidence": 81.0
                    },
                    "LabelDetection": {
                        "Text": "Order type:",
                        "Geometry": {
                            …
                        },
                        "Confidence": 80.8591079711914
                    },
                    "ValueDetection": {
                        "Text": "Quick Sale",
                        "Geometry": {
                            …
                        },
                        "Confidence": 80.82302856445312
                    },
                    "PageNumber": 1
                }
…

AnalyzeExpense JSON response for LineItemGroups:

"LineItemGroups": [
                {
                    "LineItemGroupIndex": 1,
                    "LineItems": [
                        {
                            "LineItemExpenseFields": [
                                {
                                    "Type": {
                                        "Text": "ITEM",
                                        "Confidence": 99.95216369628906
                                    },
                                    "ValueDetection": {
                                        "Text": "Red Banana is in\nbusiness ",
                                        "Geometry": {
                                            …
                                        },
                                        "Confidence": 99.81525421142578
                                    },
                                    "PageNumber": 1
                                },
                                {
                                    "Type": {
                                        "Text": "PRICE",
                                        "Confidence": 99.95216369628906
                                    },
                                    "ValueDetection": {
                                        "Text": "$66.96",
                                        "Geometry": {
                                            …
                                        },

The AnalyzeExpense JSON output contains ExpenseDocuments, and each ExpenseDocument contains SummaryFields and LineItemGroups. The ExpenseIndex field uniquely identifies the expense, and associates the appropriate SummaryFields or LineItemGroups detected to that expense.
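The nesting described above can be walked with plain dictionary access. The following is a minimal sketch, assuming `response` is the dictionary returned by AnalyzeExpense (for example, through Boto3). The helper name `summarize_expense_documents` and the trimmed sample data are ours, for illustration only.

```python
def summarize_expense_documents(response):
    """Return one summary dict per ExpenseDocument in an AnalyzeExpense response."""
    summaries = []
    for doc in response.get("ExpenseDocuments", []):
        # Line items live one level deeper, inside each LineItemGroup.
        line_item_count = sum(
            len(group.get("LineItems", []))
            for group in doc.get("LineItemGroups", [])
        )
        summaries.append({
            "expense_index": doc.get("ExpenseIndex"),
            "summary_field_count": len(doc.get("SummaryFields", [])),
            "line_item_count": line_item_count,
        })
    return summaries


# Hand-written stand-in for a real response (hypothetical and heavily trimmed).
sample_response = {
    "DocumentMetadata": {"Pages": 1},
    "ExpenseDocuments": [
        {
            "ExpenseIndex": 1,
            "SummaryFields": [
                {"Type": {"Text": "VENDOR_NAME"},
                 "ValueDetection": {"Text": "New Store X1"}},
            ],
            "LineItemGroups": [
                {"LineItemGroupIndex": 1,
                 "LineItems": [{"LineItemExpenseFields": []}]},
            ],
        }
    ],
}

print(summarize_expense_documents(sample_response))
```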

The most granular level of data in the AnalyzeExpense response consists of Type, ValueDetection, and LabelDetection (optional). Let’s call this set of data an AnalyzeExpense element. The preceding example illustrates an AnalyzeExpense element that contains Type, ValueDetection, and LabelDetection.

In the preceding example, Amazon Textract detected 16 SummaryField key-value pairs, including VENDOR_NAME: New Store X1 and Order type: Quick Sale. AnalyzeExpense returns each key-value pair as an element like the ones shown in the preceding example. The individual entities are as follows:

  • LabelDetection – The optional key of the key-value pair. In the Order type: Quick Sale example, it’s Order type:. For implied values such as Vendor Name, where the key isn’t explicitly shown in the receipt, LabelDetection isn’t included in the AnalyzeExpense element. In the preceding example, “New Store X1” at the top of the receipt is the vendor name without an explicit key. The AnalyzeExpense element for “New Store X1” has a type of VENDOR_NAME and ValueDetection of New Store X1, but doesn’t have a LabelDetection.
  • Type – This is the normalized type of the key-value pair. Because Order type isn’t a normalized taxonomy value, it’s classified as OTHER. Examples of normalized values are Vendor Name, Receiver Address, and Payment Terms. For a full list of normalized taxonomy values, see the Amazon Textract Developer Guide.
  • ValueDetection – The value of the key-value pair. In the example of Order type: Quick Sale, it’s Quick Sale.
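Because LabelDetection is optional, code that reads an AnalyzeExpense element needs to tolerate its absence. The following sketch flattens an element into a small dictionary; the function name `element_to_pair` is hypothetical, and the two sample elements are copied from the response above with Geometry omitted.

```python
def element_to_pair(element):
    """Flatten one AnalyzeExpense element into a small dict.

    LabelDetection is optional, so `label` is None for implied values
    such as a vendor name that has no explicit key on the document.
    """
    label = element.get("LabelDetection")
    return {
        "type": element["Type"]["Text"],
        "label": label["Text"] if label else None,
        "value": element.get("ValueDetection", {}).get("Text", ""),
    }


# Two elements from the sample receipt response (Geometry omitted).
summary_fields = [
    {
        "Type": {"Text": "VENDOR_NAME", "Confidence": 97.0633544921875},
        "ValueDetection": {"Text": "New Store X1", "Confidence": 96.65239715576172},
        "PageNumber": 1,
    },
    {
        "Type": {"Text": "OTHER", "Confidence": 81.0},
        "LabelDetection": {"Text": "Order type:", "Confidence": 80.8591079711914},
        "ValueDetection": {"Text": "Quick Sale", "Confidence": 80.82302856445312},
        "PageNumber": 1,
    },
]

pairs = [element_to_pair(element) for element in summary_fields]
for pair in pairs:
    # Fall back to the normalized type when the document shows no explicit key.
    key = pair["label"] or pair["type"]
    print(f"{key} -> {pair['value']}")
```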

The AnalyzeExpense API also detects ITEM, QUANTITY, and PRICE within line items as normalized fields. If other text is in a line item on the receipt image, such as SKU or a detailed description, it’s included in the JSON as EXPENSE_ROW, as shown in the following example:

{
                                    "Type": {
                                        "Text": "EXPENSE_ROW",
                                        "Confidence": 99.95216369628906
                                    },
                                    "ValueDetection": {
                                        "Text": "Red Banana is in x3 $66.96\nbusiness ",
                                        "Geometry": {
                                          …
                                        },
                                        "Confidence": 98.11214447021484
                                    }
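To turn line items into rows that are easier to work with, you can key each field by its normalized type. The following is a sketch under the assumption that each type appears at most once per line item; the helper `line_items_to_rows` and its `include_expense_row` flag are our own names, and the sample data reuses the values shown above.

```python
def line_items_to_rows(expense_document, include_expense_row=False):
    """Convert the LineItemGroups of one ExpenseDocument into a list of dicts.

    Each row maps a normalized field type (ITEM, PRICE, ...) to its text.
    EXPENSE_ROW holds the raw text of the whole line and is skipped by default.
    """
    rows = []
    for group in expense_document.get("LineItemGroups", []):
        for item in group.get("LineItems", []):
            row = {}
            for field in item.get("LineItemExpenseFields", []):
                field_type = field["Type"]["Text"]
                if field_type == "EXPENSE_ROW" and not include_expense_row:
                    continue
                text = field.get("ValueDetection", {}).get("Text", "")
                row[field_type] = text.strip()
            rows.append(row)
    return rows


# Trimmed stand-in for one ExpenseDocument, using values from the sample receipt.
expense_document = {
    "LineItemGroups": [{
        "LineItemGroupIndex": 1,
        "LineItems": [{
            "LineItemExpenseFields": [
                {"Type": {"Text": "ITEM"},
                 "ValueDetection": {"Text": "Red Banana is in\nbusiness "}},
                {"Type": {"Text": "PRICE"},
                 "ValueDetection": {"Text": "$66.96"}},
                {"Type": {"Text": "EXPENSE_ROW"},
                 "ValueDetection": {"Text": "Red Banana is in x3 $66.96\nbusiness "}},
            ]
        }],
    }]
}

rows = line_items_to_rows(expense_document)
print(rows)
```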

In addition to the detected content, the AnalyzeExpense API provides information like confidence scores and bounding boxes for detected elements. It gives you control of how you consume extracted content and integrate it into your applications. For example, you can flag any elements that have a confidence score under a certain threshold for manual review.
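A confidence check of this kind might look like the following sketch. The threshold of 90 is an arbitrary value chosen for illustration, not a recommendation, and `flag_for_review` is a hypothetical helper. Elements with a missing confidence score are treated as low confidence.

```python
def flag_for_review(elements, threshold=90.0):
    """Return the elements whose Type or ValueDetection confidence is below threshold."""
    flagged = []
    for element in elements:
        confidences = [
            element["Type"].get("Confidence", 0.0),
            element.get("ValueDetection", {}).get("Confidence", 0.0),
        ]
        # The weakest score decides whether a person should look at the element.
        if min(confidences) < threshold:
            flagged.append(element)
    return flagged


# Confidence values taken from the sample receipt response shown earlier.
elements = [
    {"Type": {"Text": "VENDOR_NAME", "Confidence": 97.0633544921875},
     "ValueDetection": {"Text": "New Store X1", "Confidence": 96.65239715576172}},
    {"Type": {"Text": "OTHER", "Confidence": 81.0},
     "ValueDetection": {"Text": "Quick Sale", "Confidence": 80.82302856445312}},
]

needs_review = flag_for_review(elements, threshold=90.0)
for element in needs_review:
    print("Review:", element["Type"]["Text"], "=", element["ValueDetection"]["Text"])
```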

The input document is either bytes or an Amazon Simple Storage Service (Amazon S3) object. You pass image bytes to an Amazon Textract API operation by using the Bytes property. For example, you use the Bytes property to pass a document loaded from a local file system.

Image bytes passed by using the Bytes property must be base64 encoded. Your code might not need to encode document file bytes if you’re using an AWS SDK to call Amazon Textract API operations. Alternatively, you can pass images stored in an S3 bucket to an Amazon Textract API operation by using the S3Object property. Documents stored in an S3 bucket don’t need to be base64 encoded.
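The two input options can be wrapped in one small helper that builds the Document argument. This is a sketch only: the function `build_document` and the `s3://bucket/key` convention for telling the two cases apart are our assumptions, not part of the Amazon Textract API. When the result is passed through Boto3, the SDK is expected to handle base64 encoding of the bytes for you.

```python
from pathlib import Path


def build_document(source):
    """Build the Document argument for AnalyzeExpense.

    `source` is either an "s3://bucket/key" URI (our own convention)
    or the path of a file on the local file system.
    """
    if source.startswith("s3://"):
        bucket, _, key = source[len("s3://"):].partition("/")
        if not bucket or not key:
            raise ValueError(f"Expected s3://bucket/key, got {source!r}")
        return {"S3Object": {"Bucket": bucket, "Name": key}}
    # Local file: pass the raw bytes through the Bytes property.
    return {"Bytes": Path(source).read_bytes()}


# Hypothetical bucket and object names.
print(build_document("s3://my-bucket/receipts/receipt-001.png"))
```

The returned dictionary can then be passed as the `Document` argument of `analyze_expense`.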

You can call the AnalyzeExpense API using the AWS Command Line Interface (AWS CLI), as shown in the following code. Make sure you have the latest AWS CLI version installed.

aws textract analyze-expense --document '{"S3Object": {"Bucket": "<Your Bucket>","Name": "Invoice/Receipts S3 Objects"}}'

Process the response with the Amazon Textract parser library

Apart from working with the JSON output as-is, you can use the Amazon Textract response parser library to parse the JSON returned by the AnalyzeExpense API. The library parses the JSON and provides programming language-specific constructs to work with different parts of the document, which makes it easier to deserialize the response and use it in your application. Similarly, the Amazon Textract PrettyPrinter library lets you print the parsed response in different formats. For more details and examples of parsing Amazon Textract responses, refer to the GitHub repo. You can parse SummaryFields and LineItemGroups for every ExpenseDocument in the AnalyzeExpense response JSON using the AnalyzeExpense parser as shown in the following code:

Install the latest Boto3 Python SDK:
python3 -m pip install boto3 --upgrade

Install the latest version of the Amazon Textract response parser:
python3 -m pip install amazon-textract-response-parser --upgrade

import boto3

client = boto3.client(
    service_name='textract',
    region_name='us-east-1',
    endpoint_url='https://textract.us-east-1.amazonaws.com',
)

# Path to a local receipt or invoice image (placeholder; set this to your own file)
documentName = 'YOUR LOCAL EXPENSE IMAGE FILENAME'

# Read the image as raw bytes
with open(documentName, 'rb') as file:
    img_test = file.read()
    bytes_test = bytearray(img_test)
    print('Image loaded', documentName)

# process using image bytes
response = client.analyze_expense(Document={'Bytes': bytes_test})

You can further use the serializer and deserializer to validate the response JSON and convert it into the Python object representation, and vice versa.

The following code deserializes the response JSON:

import json
from trp.trp2_expense import TAnalyzeExpenseDocument, TAnalyzeExpenseDocumentSchema

# j holds the Textract JSON as a string
t_doc = TAnalyzeExpenseDocumentSchema().load(json.loads(j))

The following code serializes the Python object back into its JSON representation:

from trp.trp2_expense import TAnalyzeExpenseDocument, TAnalyzeExpenseDocumentSchema
t_doc = TAnalyzeExpenseDocumentSchema().dump(t_doc)

You can also convert the output to formats like CSV, Presto, TSV, HTML, LaTeX, and more by using the Amazon Textract PrettyPrinter library.

Install the PrettyPrinter library with the following code:

python3 -m pip install amazon-textract-prettyprinter --upgrade

Call the get_string function of textractprettyprinter.t_pretty_print_expense with output_type set to SUMMARY, LINEITEMGROUPS, or both, and table_format set to whichever format you want to output. The following example code outputs both SUMMARY and LINEITEMGROUPS in the fancy grid format:

import boto3
from textractprettyprinter.t_pretty_print_expense import get_string
from textractprettyprinter.t_pretty_print_expense import Textract_Expense_Pretty_Print, Pretty_Print_Table_Format

# boto3 client for Amazon Textract
textract = boto3.client(service_name='textract')

# Set the S3 bucket name and file name.
# Please set the variables below to your S3 location.
s3_source_bucket_name = "YOUR S3 BUCKET NAME"
s3_request_file_name = "YOUR S3 EXPENSE IMAGE FILENAME"

try:
    # Call the Textract AnalyzeExpense API with the input expense image in Amazon S3
    response = textract.analyze_expense(
        Document={
            'S3Object': {
                'Bucket': s3_source_bucket_name,
                'Name': s3_request_file_name
            }
        })

    # Call the PrettyPrinter get_string function to parse the response and
    # format it as fancy_grid. You can set other formats as well, like csv or latex.
    pretty_printed_string = get_string(
        textract_json=response,
        output_type=[Textract_Expense_Pretty_Print.SUMMARY, Textract_Expense_Pretty_Print.LINEITEMGROUPS],
        table_format=Pretty_Print_Table_Format.fancy_grid)

    # Use the pretty printed string to save the response in storage of your choice.
    # Here we just print it to stdout.
    print(pretty_printed_string)

except Exception as e_raise:
    # Log the error, then re-raise it with the original traceback
    print(e_raise)
    raise

The following is the PrettyPrinter output for a sample receipt.

The following is another example of detecting structured data from an invoice.

AnalyzeExpense detects the various normalized summary fields like PAYMENT_TERMS, INVOICE_RECEIPT_ID, TOTAL, TAX, and RECEIVER_ADDRESS.

It also detected one LineItemGroup with one LineItem having DESCRIPTION, QUANTITY, and PRICE, as shown in the following PrettyPrinter output.

Solution architecture overview

The following diagram is a common solution architecture pattern you can use to process documents using Amazon Textract. The solution uses the new AnalyzeExpense API to process receipts and invoices on Amazon S3 and stores the results back in Amazon S3. For more details and implementation code, refer to the GitHub repo.

The solution architecture includes the following steps:

  1. The input and output S3 buckets store the input expense documents (images) in PNG and JPEG formats and the AnalyzeExpense PrettyPrinter outputs, respectively.
  2. An event rule based on an event pattern in Amazon EventBridge matches incoming S3 PutObject events in the input S3 bucket containing the raw expense document images.
  3. The configured EventBridge rule sends the event to an AWS Lambda function for further processing.
  4. The Lambda function reads the images from the input S3 bucket, calls the AnalyzeExpense API, uses the Amazon Textract response parser to deserialize the JSON response, uses Amazon Textract PrettyPrinter to easily print the parsed response, and stores the results back to the S3 bucket in different formats.
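The steps above can be sketched in a few lines of Python. This is not the code from the GitHub repo: it assumes the EventBridge rule matches S3 PutObject calls recorded by AWS CloudTrail, where the bucket and key appear under `detail.requestParameters`, and it stores the raw response JSON instead of PrettyPrinter output to stay dependency-free. The names `build_event_pattern` and `handle_event` are hypothetical, and the clients are passed in so the logic can be exercised without AWS access.

```python
import json


def build_event_pattern(bucket_name):
    """EventBridge event pattern for PutObject calls on one bucket (CloudTrail-based)."""
    return {
        "source": ["aws.s3"],
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {
            "eventSource": ["s3.amazonaws.com"],
            "eventName": ["PutObject"],
            "requestParameters": {"bucketName": [bucket_name]},
        },
    }


def handle_event(event, textract_client, s3_client, output_bucket):
    """Analyze the uploaded image named in the event and store the result.

    `textract_client` and `s3_client` are expected to behave like the
    Boto3 Textract and S3 clients. Returns the key of the stored object.
    """
    params = event["detail"]["requestParameters"]
    bucket, key = params["bucketName"], params["key"]

    # Call AnalyzeExpense directly on the object in the input bucket.
    response = textract_client.analyze_expense(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )

    # Store the raw JSON next to the original name in the output bucket.
    output_key = f"{key}.json"
    s3_client.put_object(
        Bucket=output_bucket,
        Key=output_key,
        Body=json.dumps(response).encode("utf-8"),
    )
    return output_key
```

In a Lambda function, a handler would create the two Boto3 clients once at module level and call `handle_event(event, textract, s3, output_bucket)` for each invocation.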

Deploy and use the solution

You can deploy the solution architecture using an AWS CloudFormation template that performs much of the setup work for you.

  1. Choose Launch Stack to deploy the solution architecture in the US East (N. Virginia) Region.

  2. Don’t make any changes to stack name or parameters.
  3. In the Capabilities section, select I acknowledge that AWS CloudFormation might create IAM resources.
  4. Choose Create stack.

To use the solution, upload the receipt and invoice images to the S3 bucket referred to by SourceBucket in the CloudFormation template. This triggers an event that invokes the Lambda function, which calls the AnalyzeExpense API, parses the response JSON, converts the parsed response into CSV or fancy_grid format, and stores it in another S3 bucket (referred to by OutputBucket in the CloudFormation template).

You can extend the provided Lambda function further based on your requirements and also change the output format to other types like TSV, grid, LaTeX, and many more by setting the appropriate value of table_format when calling the get_string function of textractprettyprinter.t_pretty_print_expense in Amazon Textract PrettyPrinter.

The sample Lambda function deployment package included in this CloudFormation template also includes the Boto3 SDK. If you want to upgrade the Boto3 SDK in the future, you either need to create a new deployment package with the upgraded Boto3 SDK or use the latest Boto3 SDK provided by the Lambda Python runtime.

Clean up resources

To delete the resources that the CloudFormation template created, complete the following steps:

  1. Delete the input, output, and logging S3 buckets created by the CloudFormation template.
  2. On the AWS CloudFormation console, select the stack that you created.
  3. On the Actions menu, choose Delete.

Summary

In this post, we provided an overview of the new Amazon Textract AnalyzeExpense API to quickly and easily retrieve structured data from receipts and invoices. We also described how you can parse the AnalyzeExpense response JSON using the Amazon Textract parser library and save the output in different formats using Amazon Textract PrettyPrinter. Finally, we provided a solution architecture and sample code on the GitHub repo for processing invoices and receipts using Amazon S3, EventBridge, and a Lambda function.

For more information, see the Amazon Textract Developer Guide.


About the Authors

Dhawalkumar Patel is a Sr. Startups Machine Learning Solutions Architect at AWS with expertise in Machine Learning and Serverless domains. He has worked with organizations ranging from large enterprises to startups on problems related to distributed computing and artificial intelligence.

Raj Copparapu is a Product Manager focused on putting machine learning in the hands of every developer.

Manish Chugh is a Sr. Solutions Architect at AWS based in San Francisco, CA. He has worked with organizations ranging from large enterprises to early-stage startups. He is responsible for helping customers architect scalable, secure, and cost-effective workloads on AWS. In his free time, he enjoys hiking East Bay trails, road biking, and watching (and playing) cricket.