PDF document pre-processing with Amazon Textract: Visuals detection and removal

Amazon Textract is a fully managed machine learning (ML) service that automatically extracts printed text, handwriting, and other data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables. Amazon Textract can detect text in a variety of documents, including financial reports, medical records, and tax forms.

In many use cases, you need to extract and analyze documents that contain visuals, such as logos, photos, and charts. These visuals often contain embedded text that clutters Amazon Textract output or isn’t required for your downstream process. For example, many real estate evaluation forms contain pictures of houses or charts of historical price trends. When this information isn’t needed in downstream processes, you can remove it before using Amazon Textract to analyze the document. In this post, we illustrate two effective methods to remove these visuals as part of your preprocessing.

Solution overview

For this post, we use a PDF that contains a logo and a chart as an example. We use two different types of processes to convert and detect these visuals, then redact them.

In the first method, we use the Canny edge detector from the OpenCV library to detect the edges of the visuals. In the second method, we write a custom pixel concentration analyzer to detect the location of these visuals.

You can also extract these visuals for further processing, and easily modify the code to fit your use case.

Searchable PDFs are native PDF files usually generated by other applications, such as text processors, virtual PDF printers, and native editors. These types of PDFs retain metadata, text, and image information inside the document. You can easily use libraries like PyMuPDF/fitz to navigate the PDF structure and identify images and text. In this post, we focus on non-searchable or image-based documents.

Option 1: Detecting visuals with OpenCV edge detector

In this approach, we convert the PDF into PNG format, then grayscale the document with the OpenCV-Python library and use the Canny Edge Detector to detect the visual locations. You can follow the detailed steps in the following notebook.

Convert the document to grayscale.
Apply the Canny edge algorithm, then detect contours in the resulting edge map.
Identify the rectangular contours with relevant dimensions.

You can further tune and optimize a few parameters to increase detection accuracy depending on your use case:

Minimum height and width – These parameters define the minimum height and width thresholds for visual detection. They’re expressed as a percentage of the page size.
Padding – When a rectangular contour is detected, we add extra padding around it, which gives some flexibility in the total area of the page to be redacted. This is helpful in cases where the text in the visuals isn’t inside clearly delimited rectangular areas.
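To make the two parameters concrete, here is a small sketch of how they could be applied to candidate bounding boxes. The helper names `pad_box` and `select_visuals`, the `(x, y, w, h)` box layout, and the default values are hypothetical and not taken from the notebook.

```python
def pad_box(box, page_w, page_h, padding):
    """Grow an (x, y, w, h) box by `padding` pixels per side, clipped to the page."""
    x, y, w, h = box
    x0 = max(x - padding, 0)
    y0 = max(y - padding, 0)
    x1 = min(x + w + padding, page_w)
    y1 = min(y + h + padding, page_h)
    return (x0, y0, x1 - x0, y1 - y0)


def select_visuals(boxes, page_w, page_h, min_w_pct=10.0, min_h_pct=10.0, padding=5):
    """Keep boxes at least min_w_pct x min_h_pct of the page, then pad them."""
    min_w = page_w * min_w_pct / 100.0
    min_h = page_h * min_h_pct / 100.0
    return [
        pad_box(box, page_w, page_h, padding)
        for box in boxes
        if box[2] >= min_w and box[3] >= min_h
    ]
```

Raising the minimum percentages makes the detector ignore smaller shapes such as table cells, while a larger padding helps cover captions or labels that sit just outside the detected rectangle.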

Advantages and disadvantages

This approach has the following advantages:

It satisfies most use cases
It’s easy to implement, and quick to get up and running
With well-tuned parameters, it yields good results

However, the approach has the following drawbacks:

For visuals without a bounding box or surrounding edges, the performance may vary depending on the type of visuals
If a block of text is inside large bounding boxes, the whole text block may be considered a visual and get removed using this logic

Option 2: Pixel concentration analysis

We implement our second approach by analyzing the image pixels. Normal text paragraphs retain a concentration signature in their lines. We can measure and analyze the pixel densities to identify areas whose densities aren’t similar to the rest of the document. You can follow the detailed steps in the following notebook.

Convert the document to grayscale.
Convert gray areas to white.
Collapse the pixels horizontally to calculate the concentration of black pixels.
Split the document into horizontal stripes or segments to identify those that aren’t full text (extending across the whole page).
For all horizontal segments that aren’t full text, identify the areas that are text vs. areas that are images. This is done by filtering out sections using minimum and maximum black pixel concentration thresholds.
Remove areas identified as non-full text.
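The first part of this procedure can be sketched with NumPy alone. The function names, the gray threshold of 128, and the convention that a segment is flagged as non-text when its mean concentration falls inside `[min_conc, max_conc]` are assumptions made for this illustration; the notebook may define its thresholds differently.

```python
import numpy as np


def row_concentration(gray, gray_threshold=128):
    """Fraction of black pixels per row after pushing gray shades to white."""
    # Pixels at or above the (assumed) threshold become white, the rest black.
    binary = np.where(gray < gray_threshold, 0, 255).astype(np.uint8)
    return (binary == 0).mean(axis=1)


def horizontal_segments(concentration):
    """Group consecutive rows that contain black pixels into (start, end) spans."""
    segments, start = [], None
    for row, value in enumerate(concentration):
        if value > 0 and start is None:
            start = row
        elif value == 0 and start is not None:
            segments.append((start, row))
            start = None
    if start is not None:
        segments.append((start, len(concentration)))
    return segments


def flag_non_text(concentration, segments, min_conc, max_conc):
    """Return segments whose mean concentration lies in [min_conc, max_conc]."""
    flagged = []
    for start, end in segments:
        mean_conc = concentration[start:end].mean()
        if min_conc <= mean_conc <= max_conc:
            flagged.append((start, end))
    return flagged
```

Each flagged horizontal segment would then be analyzed column-wise in the same way to separate the text areas from the image areas before redaction.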

You can tune the following parameters to optimize the accuracy of identifying non-text areas:

Non-text horizontal segment thresholds – Define the minimum and maximum black pixel concentration thresholds used to detect non-text horizontal segments in the page.
Non-text vertical segment thresholds – Define the minimum and maximum black pixel concentration thresholds used to detect non-text vertical segments in the page.
Window size – Controls how the page is split into horizontal and vertical segments for analysis (X_WINDOW, Y_WINDOW). It’s defined in pixels.
Minimum visual area – Defines the smallest area that can be considered a visual to be removed. It’s defined in pixels.
Gray range threshold – The threshold for shades of gray to be removed.
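As a rough illustration of how the window size, concentration thresholds, and minimum visual area interact, the following sketch scans one binarized horizontal strip in fixed-width windows. Only the name X_WINDOW comes from the post; `MIN_VISUAL_AREA`, the function names, the numeric values, and the rule that a window counts as part of a visual when its density lies inside `[min_conc, max_conc]` are assumptions.

```python
import numpy as np

X_WINDOW = 10          # window width in pixels (assumed value)
MIN_VISUAL_AREA = 400  # smallest area, in pixels, treated as a visual (assumed value)


def window_density(strip, x_window=X_WINDOW):
    """Black-pixel fraction of each x_window-wide slice of a binarized strip."""
    is_black = strip == 0
    starts = range(0, strip.shape[1], x_window)
    return np.array([is_black[:, s:s + x_window].mean() for s in starts])


def visual_spans(strip, min_conc, max_conc,
                 x_window=X_WINDOW, min_area=MIN_VISUAL_AREA):
    """Return (x_start, x_end) spans whose windows fall in [min_conc, max_conc]."""
    density = window_density(strip, x_window)
    hits = (density >= min_conc) & (density <= max_conc)
    spans, start = [], None
    for i, hit in enumerate(hits):
        if hit and start is None:
            start = i
        elif not hit and start is not None:
            spans.append((start * x_window, i * x_window))
            start = None
    if start is not None:
        spans.append((start * x_window, strip.shape[1]))
    # Drop spans that are too small to be considered a visual.
    height = strip.shape[0]
    return [(x0, x1) for x0, x1 in spans if (x1 - x0) * height >= min_area]
```

A smaller window follows the outline of a visual more closely but is more sensitive to noise, whereas a larger minimum area avoids redacting isolated dark specks.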

Advantages and disadvantages

This approach is highly customizable. However, it has the following drawbacks:

Finding the optimum parameters takes longer and requires a deeper understanding of the solution
If the document isn’t perfectly rectified (for example, an image taken by a camera at an angle), this method may fail

Conclusion

In this post, we showed how you can implement two approaches to redact visuals from different documents. Both approaches are easy to implement. You can get high-quality results and customize either method according to your use case.

To learn more about different techniques in Amazon Textract, visit the public AWS Samples GitHub repo.

About the Authors

Yuan Jiang is a Sr Solution Architect with a focus on machine learning. He’s a member of the Amazon Computer Vision Hero program and the Amazon Machine Learning Technical Field Community.

Victor Rojo is a Sr Partner Solution Architect with a focus on Conversational AI. He’s also a member of the Amazon Computer Vision Hero program.

Luis Pineda is a Sr Partner Management Solution Architect. He’s also a member of the Amazon Computer Vision Hero program.

Miguel Romero Calvo is a Data Scientist from the AWS Machine Learning Solution Lab.

AWS Machine Learning Blog
