AWS Machine Learning Blog

Merge cells and column headers in Amazon Textract tables

Financial documents such as bank, loan, or mortgage statements are often formatted to be visually appealing and easy to read for the human eye. These same features can also make automated processing challenging at times. For instance, in the following sample statement, merging rows or columns in a table helps reduce information redundancy, but it can become difficult to write the code that identifies the repeating value and assign it to the corresponding elements.

Sample Bank Statement

In April 2022, Amazon Textract introduced a new capability of the table feature that automatically detects merged rows and columns as well as headers. Prior to this enhancement, for a similar document, the table’s output would have contained empty values for the Date column. Customers had to write custom code to detect the beginning of a new row and carry over the appropriate value.

This post walks you through a simple example of how to use the merged cells and headers features.

The new Amazon Textract table response structure

The response schema for the Amazon Textract API is now enhanced with two new structure types, as illustrated in the following diagram:

  • A new block type called MERGED_CELL
  • New cell entity types called COLUMN_HEADER

AnalyzeDocument Response Structure

MERGED_CELL blocks are appended to the JSON document as any other CELL block. The MERGED_CELL blocks have a parent/child relationship with the CELL blocks they combine. This allows you to propagate the same cell value across multiple columns or rows even when not explicitly represented in the document. The MERGED_CELL block itself is then referenced from the TABLE block via a parent/child relationship.

Headers are flagged through a new entity type populated within the corresponding CELL block.

Using the new feature

Let’s try out the new feature on the sample statement presented earlier. The following code snippet calls Amazon Textract to extract tables out of the document, turn the output data into a Pandas DataFrame, and display its content.

We use the following modules in this example:

Let’s initialize the Boto3 session and invoke Amazon Textract with the sample statement as the input document:

session = boto3.Session(profile_name='<your_profile_name>')
documentName = " s3://amazon-textract-public-content/blogs/Textract-MergeCell-Sample-Bank-Statement.pdf"
textract_json = call_textract(input_document=documentName, features = [Textract_Features.TABLES])

Let’s pretty-print the response payload. As you can see, by default the date is not populated across all rows.

print(get_string(textract_json=textract_json, output_type=[Textract_Pretty_Print.TABLES]))
|--------------------|---------|
| Beginning Balance: | $8000.0 |
| Deposits           | $3005.5 |
| Other Subtractions | -1539.5 |
| Checks             |         |
| 0.00               |         |
| Service Fees 0.00  |         |
|-----------|----------|------------------------------|
| Account   | name     | John Doe                     |
| Account   | number   | 00002134001                  |
| Statement | Date:    |                              |
| Date      | February | 1, 2022 to February 28, 2022 |
|----------|-------------------------|---------|---------|--------|----------|
|          |                         |         | Amount  |        |          |
| Date     | Description             | Details | Credits | Debits | Balance  |
| 2/4/2022 | Life Insurance Payments | Credit  |         | 445    | 9500.45  |
|          | Property Management     | Credit  |         | 300    | 9945.45  |
|          | Retail Store 4          | Credit  |         | 65.75  | 10245.45 |
| 2/3/2022 | Electricity Bill        | Credit  |         | 245.45 | 10311.2  |
|          | Water Bill              | Credit  |         | 312.85 | 10556.65 |
|          | Rental Deposit          | Credit  | 3000    |        | 10869.5  |
| 2/2/2022 | Retail Store 3          | Credit  |         | 125    | 7869.5   |
|          | Retail Store 2 Refund   | Debit   | 5.5     |        | 7994.5   |
|          | Retail Store 1          | Credit  |         | 45.5   | 8000     |
| 2/1/2022 | Shoe Store Refund       | Credit  | 33      |        | 8045.5   |
|          | Snack Vending Machine   | Debit   |         | 4      | 8012.5   |

Then, we load the response into a document by using the Amazon Textract response parser module, and reorder the blocks by location:

t_doc = TDocumentSchema().load(textract_json)
ordered_doc = order_blocks_by_geo(t_doc)
trp_doc = Document(TDocumentSchema().dump(ordered_doc))

Now let’s iterate through the tables’ content, and extract the data into a DataFrame:

table_index = 1
dataframes = []
 def combine_headers(top_h, bottom_h):
   bottom_h[3] = top_h[2] + " " + bottom_h[3]
   bottom_h[4] = top_h[2] + " " + bottom_h[4]
 for page in trp_doc.pages:
   for table in page.tables:
     table_data = []
     headers = table.get_header_field_names()         #New Table method to retrieve header column names
     if(len(headers)>0):                   #Let's retain the only table with headers
       print("Statememt headers: "+ repr(headers))
       top_header= headers[0]
       bottom_header = headers[1]
       combine_headers(top_header, bottom_header)     #The statement has two headers. let's combine them
       for r, row in enumerate(table.rows_without_header): #New Table attribute returning rows without headers
         table_data.append([])
         for c, cell in enumerate(row.cells):
           table_data[r].append(cell.mergedText)    #New Cell attribute returning merged cells common values
       if len(table_data)>0:
         df = pd.DataFrame(table_data, columns=bottom_header)

The trp module has been improved to include two new capabilities, as highlighted in the preceding code’s comments:

  • Fetch the header column names
  • Expose merged cells’ repeated value

Displaying the DataFrame produces the following output.

Statement Table Dataframe Display

We can now use multi-level indexing and reproduce the table’s initial structure:

multi = df.set_index(['Date', 'Details'])
display(multi)

Statement Table Dataframe Display 2

For a complete version of the notebook presented in this post, visit the amazon-textract-code-samples GitHub repository.

Conclusion

Amazon Textract already helps you speed up document processing and reduce the number of manual tasks. The table’s new headers and merged cells features help you even further by reducing the need for custom or hard-coded logic. It can also help reduce postprocessing manual corrections.

For more information, please visit the Amazon Textract Response Objects documentation page. And if you are interested in learning more about another recently announced Textract feature, we highly recommend checking out Specify and extract information from documents using the new Queries feature in Amazon Textract.


About the Authors

Narcisse Zekpa is a Solutions Architect based in Boston. He helps customers in the Northeast U.S. accelerate their adoption of the AWS Cloud, by providing architectural guidelines, design innovative, and scalable solutions. When Narcisse is not building, he enjoys spending time with his family, traveling, cooking, and playing basketball.

Martin Schade is a Senior ML Product SA with the Amazon Textract team. He has over 20 years of experience with internet-related technologies, engineering, and architecting solutions. He joined AWS in 2014, first guiding some of the largest AWS customers on the most efficient and scalable use of AWS services, and later focused on AI/ML with a focus on computer vision. Currently, he’s obsessed with extracting information from documents.

Balaji Puli is a Solutions Architect based in Bay Area, CA. Currently helping select Northwest U.S healthcare life sciences customers accelerate their AWS cloud adoption. Balaji enjoys traveling and loves to explore different cuisines.