Automate Labeling for Intelligent Document Processing with Cognizant and Amazon SageMaker Ground Truth

By Phil Basford, CTO for Data & AI – Cognizant
By Chetan Makvana, Sr. Solutions Architect – AWS


Customers across multiple industries use intelligent document processing (IDP) to automate information extraction from millions of documents with diverse formats. IDP makes data retrieval faster and more accurate, informs business decisions, and reduces overall costs compared to manual data entry.

The key to an accurate IDP system is a large, high-quality labeled training dataset. This enables the machine learning (ML) model to recognize patterns and extract information accurately.

Manually labeling thousands of documents is not always feasible: human labeling at that scale is expensive, slow, and difficult to manage. This poses a challenge when building real-world document processing applications, since accuracy depends on adequate training data.

In this post, we explain how Cognizant’s IDP solution uses Amazon Web Services (AWS) architecture to help customers automate, scale, and manage labeling for IDP.

Cognizant is an AWS Premier Tier Services Partner and Managed Service Provider (MSP) that transforms customers’ business, operating, and technology models for the digital era by helping organizations envision, build, and run more innovative and efficient businesses.

Context and Challenges

Labeling is a key process in intelligent document processing systems. It categorizes and annotates documents to enable machine learning models to extract information accurately.

IDP systems use ML algorithms that require labeled training data to learn how to identify document types, layouts, and components. Precise labeling enables models to effectively identify nuances and achieve better automated data extraction with fewer errors. Labeling also allows customization of IDP systems to meet an organization’s specific document formats and data types.

However, labeling of documents poses several challenges:

Manual effort: Large volumes of labeled data are crucial for accurate ML models. However, manually labeling many documents is expensive and time-consuming. Subject matter experts (SMEs) need to carefully read and annotate the data to identify relevant information, and this process does not scale.
Domain expertise: Large language models (LLMs) and transfer learning have reduced labeling requirements, but accurate labeling still requires domain expertise to understand specialized terms and improve model performance.
Integration of training data and inference pipeline: To make sure models perform well on real-time predictions, it’s crucial to connect the training data with the inference pipeline. This ensures models learn from data similar to what they see at prediction time.
Complex information extraction: Extracting complex information from documents often requires linking terms and entities together.

Solution Overview

Cognizant’s IDP solution comprises the following primary stages of document labeling:

The process begins by uploading documents (PDF, JPG, TIFF) to an Amazon Simple Storage Service (Amazon S3) bucket.
This triggers an AWS Lambda function, which starts the Job Management process using an AWS Step Functions workflow.
It then creates a labeling job using a custom user interface (UI) within Amazon SageMaker Ground Truth to let the SMEs draw and label the documents.
Finally, upon completion of the labeling process, the solution converts the labeled data into the various formats required by different models for training.

Figure 1 – Solution architecture.
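The first two steps can be sketched as a Lambda handler that reads the Amazon S3 event and starts the Step Functions workflow. This is a minimal sketch, not the solution's actual code: the `STATE_MACHINE_ARN` environment variable, the `files` payload shape, and the list of accepted extensions are assumptions made for illustration. The Step Functions client is injectable so the logic can be exercised without AWS access.

```python
import json
import os
from urllib.parse import unquote_plus

# Assumed set of accepted extensions for PDF, JPG, and TIFF uploads.
SUPPORTED_EXTENSIONS = (".pdf", ".jpg", ".jpeg", ".tif", ".tiff")


def extract_documents(event):
    """Return (bucket, key) pairs for supported documents in an S3 event."""
    documents = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 events are URL-encoded (spaces arrive as '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.lower().endswith(SUPPORTED_EXTENSIONS):
            documents.append((bucket, key))
    return documents


def handler(event, context, sfn_client=None):
    """Start one workflow execution for the documents in this event."""
    documents = extract_documents(event)
    if not documents:
        return {"started": 0}
    if sfn_client is None:
        import boto3  # imported lazily so the sketch can be tested offline

        sfn_client = boto3.client("stepfunctions")
    # Hypothetical input shape for the Job Management workflow.
    payload = {"files": [{"bucket": b, "key": k} for b, k in documents]}
    sfn_client.start_execution(
        stateMachineArn=os.environ["STATE_MACHINE_ARN"],
        input=json.dumps(payload),
    )
    return {"started": len(documents)}
```

Filtering by extension in the handler keeps unrelated uploads, such as the generated images and CSV files, from starting new executions if they land in the same bucket.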

Key Components

Job Management

The Job Management workflow in AWS Step Functions has the following steps:

Figure 2 – Job Management workflow.

Create file list: This step creates an array of file names from the event and the files in Amazon S3.
Map each document: This step takes the file names as input and batches them. The Map state then processes each item by converting PDFs to JPG, extracting text and locations from images using Amazon Textract, converting the verbose Textract output into a data frame, and storing it in a CSV file.
Generate manifests: This step takes the list of files created in the first step and matches the entries to the images and extracted text for each page. It generates a job manifest like the one below:

{"source": "s3://examplebucket/batch_1/images/doc1_001.png",
"original_file_path": "s3://examplebucket/batch_1/doc1.pdf",
"textract-ref": "s3://examplebucket/batch_1/textract/dataframes/doc1_001.csv", "page-num": 1}

{"source": "s3://examplebucket/batch_1/images/doc1_002.png",
"original_file_path": "s3://examplebucket/batch_1/doc1.pdf",
"textract-ref": "s3://examplebucket/batch_1/textract/dataframes/doc1_002.csv", "page-num": 2} 

…. 

{"source": "s3://examplebucket/batch_1/images/doc1_010.png",
"original_file_path": "s3://examplebucket/batch_1/doc1.pdf",
"textract-ref": "s3://examplebucket/batch_1/textract/dataframes/doc1_010.csv", "page-num": 10}

Create labeling job: This step creates a new job per manifest file in Amazon SageMaker Ground Truth.
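The mapping and manifest steps above can be sketched with the Python standard library. The sketch below assumes a Textract response that has already been fetched, keeps only `WORD` blocks, and follows the path layout shown in the example manifest. The CSV column names and both function names are hypothetical choices made for illustration, not the solution's actual schema.

```python
import csv
import io
import json

# Hypothetical column layout for the simplified Textract output.
CSV_COLUMNS = ["text", "confidence", "left", "top", "width", "height"]


def textract_words_to_csv(response):
    """Flatten the WORD blocks of a Textract response into CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=CSV_COLUMNS, lineterminator="\n")
    writer.writeheader()
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "WORD":
            continue
        # Textract reports boxes as ratios of the page width and height.
        box = block["Geometry"]["BoundingBox"]
        writer.writerow({
            "text": block["Text"],
            "confidence": block["Confidence"],
            "left": box["Left"],
            "top": box["Top"],
            "width": box["Width"],
            "height": box["Height"],
        })
    return buffer.getvalue()


def manifest_lines(bucket, batch, doc_name, page_count):
    """Build one JSON Lines manifest entry per page of a document."""
    lines = []
    for page in range(1, page_count + 1):
        stem = f"{doc_name}_{page:03d}"
        entry = {
            "source": f"s3://{bucket}/{batch}/images/{stem}.png",
            "original_file_path": f"s3://{bucket}/{batch}/{doc_name}.pdf",
            "textract-ref": f"s3://{bucket}/{batch}/textract/dataframes/{stem}.csv",
            "page-num": page,
        }
        lines.append(json.dumps(entry))
    return "\n".join(lines)
```

Each manifest entry is serialized onto a single line, so the file can be read one JSON object per line.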

Amazon SageMaker Ground Truth and Custom UI

The custom UI within Amazon SageMaker Ground Truth lets the SMEs draw and label the documents. It renders the image for the page stored at the source location in the manifest file, and allows the SMEs to draw bounding boxes before labeling each.

Figure 3 – Document labeling.

The solution uses Crowd HTML Elements for the custom UI. Here’s an example of the template used for this solution:

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script> 
<crowd-form> 
<crowd-bounding-box 
name="boundingBox" 
src="{{ task.input.taskObject | grant_read_access }}" 
header="Please draw bounding boxes around the entities" 
labels="{{ task.input.labels | to_json | escape }}" 
> 

<full-instructions header="Bounding box instructions"> 
<ol> 
<li>Inspect the page and determine the words that should be labelled.</li> 
<li>Outline each instance of the specified label 
<ol> 
<li>For the sub elements (e.g Peril / Location etc.) label each word / words</li> 
<li> 
For the outer elements (e.g Sublimit / Deductible) label the full sentence(s) wrapping all the 
sub elements within. 
</li> 
</ol> 
</li> 
<li>Once complete, click submit.</li> 
</ol> 
<p> 
An example labelled document is below showing (1) Three sub entities for peril, location and amount and 
(2) an outer entity that wraps the three sub-entities: 

<img src="https://example.com/assets/images/groundtruth/Instructions_Sublimits_Deductibles.png" 
alt="Sublimits / Deductibles Example" width="800px"/>. 
</p> 

</full-instructions> 

<short-instructions> 
<h3>Overview</h3> 
<ol> 
<li>Inspect the page and determine the words that should be labelled.</li> 
<li>Outline each instance of the specified label 
<ol> 
<li>For the sub elements (e.g Peril / Location etc.) label each word / words</li> 
<li> 
For the outer elements (e.g Sublimit / Deductible) label the full sentence(s) wrapping all the 
sub elements within. 
</li> 
</ol> 
</li> 
<li>Once complete, click submit.</li> 

</ol> 

</short-instructions> 

</crowd-bounding-box> 
</crowd-form>
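The template above reads `task.input.taskObject` and `task.input.labels`. In a Ground Truth custom labeling workflow, a pre-annotation Lambda function supplies those values for each manifest line. The following is a minimal sketch of such a function, assuming the manifest's `source` key holds the page image and using a hypothetical label list drawn from the entity names in the instructions.

```python
# Hypothetical label set, based on the entities named in the instructions.
LABELS = ["Peril", "Location", "Amount", "Sublimit", "Deductible"]


def lambda_handler(event, context):
    """Map one manifest line to the task input that the template expects."""
    # Ground Truth passes the parsed manifest line as "dataObject".
    data_object = event["dataObject"]
    return {
        "taskInput": {
            "taskObject": data_object["source"],
            "labels": LABELS,
        },
        "isHumanAnnotationRequired": "true",
    }
```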

Upon job completion, the output is stored in Amazon S3 for subsequent processing.

Post-Processing

The post-processing stage starts when an Amazon SageMaker Ground Truth job completes. The job sends an event to Amazon EventBridge, which triggers a Lambda function. The function takes the manifest for each job and merges the labeling output with the simplified CSV output from Amazon Textract.
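One way to picture the merge is to assign each Textract word to every labeled box that contains the word's center. The sketch below is an illustration under stated assumptions rather than the solution's implementation: it assumes the CSV has `text`, `left`, `top`, `width`, and `height` columns expressed as page ratios, and that the annotation is the `crowd-bounding-box` result with pixel-based `boundingBoxes` and `inputImageProperties`. The function name `label_words` is hypothetical.

```python
import csv
import io


def label_words(words_csv, annotation):
    """Attach to each word the labels of all boxes that contain its center."""
    image = annotation["inputImageProperties"]
    labeled = []
    for row in csv.DictReader(io.StringIO(words_csv)):
        # Convert the word center from page ratios to pixels.
        cx = (float(row["left"]) + float(row["width"]) / 2) * image["width"]
        cy = (float(row["top"]) + float(row["height"]) / 2) * image["height"]
        labels = [
            box["label"]
            for box in annotation["boundingBoxes"]
            if box["left"] <= cx <= box["left"] + box["width"]
            and box["top"] <= cy <= box["top"] + box["height"]
        ]
        labeled.append({"text": row["text"], "labels": labels})
    return labeled
```

A word can carry more than one label, which matches the labeling instructions: an outer entity such as a sublimit wraps the sub-entities inside it.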

The merged output can then be converted into several data formats to meet different model requirements:

Image bounding boxes for object detection datasets like COCO.
Text for foundation models.
Text and bounding boxes for models that use both.
Templated prompt for generative AI LLMs.
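Two of these conversions can be sketched briefly. The first function builds a COCO-style dictionary for one labeled page from a bounding-box annotation. The second fills a prompt template. The function names, the prompt wording, and the assumed annotation shape (pixel-based `boundingBoxes` plus `inputImageProperties`) are illustrative assumptions, not the formats the solution actually emits.

```python
def to_coco(image_id, file_name, annotation, categories):
    """Build a COCO-style dictionary for one labeled page."""
    category_ids = {name: index + 1 for index, name in enumerate(categories)}
    props = annotation["inputImageProperties"]
    return {
        "images": [{
            "id": image_id,
            "file_name": file_name,
            "width": props["width"],
            "height": props["height"],
        }],
        "categories": [
            {"id": cid, "name": name} for name, cid in category_ids.items()
        ],
        "annotations": [
            {
                "id": index + 1,
                "image_id": image_id,
                "category_id": category_ids[box["label"]],
                # COCO boxes are [x, y, width, height] in pixels.
                "bbox": [box["left"], box["top"], box["width"], box["height"]],
                "area": box["width"] * box["height"],
                "iscrowd": 0,
            }
            for index, box in enumerate(annotation["boundingBoxes"])
        ],
    }


# Hypothetical prompt wording for an extraction task.
PROMPT_TEMPLATE = (
    "Extract the {label} from the following text.\n\n"
    "Text: {text}\n\n"
    "Answer: {answer}"
)


def to_prompt(label, text, answer):
    """Fill the prompt template with one labeled example."""
    return PROMPT_TEMPLATE.format(label=label, text=text, answer=answer)
```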

Considerations

The following are some lessons learned while building this solution:

Provide clear instructions with examples to SMEs. Assign two SMEs per job: one to label and one to verify.
Ensure manifests cover all document pages and labeling in the custom UI is graphical. This minimizes SME work and helps visualize page relationships and section spanning.
Use a clear Amazon S3 prefix structure to organize files, images, outputs, and manifests.
Automate the process of merging all the documents into one training set for the custom named entity recognition (NER) model in Amazon Comprehend. Previously, combining manifests took a data scientist 2-3 hours per run. With this solution, the process takes under five minutes, significantly reducing both errors and turnaround time.
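The manifest merge in the last consideration can be approximated as a concatenation of JSON Lines files with duplicates removed. This is a minimal sketch that assumes each entry carries a `source` key, as in the example manifest. The function name is hypothetical, and the actual solution may merge on different fields.

```python
import json


def merge_manifests(manifest_texts, key="source"):
    """Combine JSON Lines manifests, keeping the first entry per key value."""
    seen = set()
    merged = []
    for text in manifest_texts:
        for line in text.splitlines():
            if not line.strip():
                continue  # skip blank lines between entries
            entry = json.loads(line)
            if entry[key] in seen:
                continue
            seen.add(entry[key])
            merged.append(json.dumps(entry))
    return "\n".join(merged)
```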

Impact

The ability to apply transfer learning or fine-tuning to foundation models is pivotal to improving machine learning solutions built on them, because it makes the models domain-specific and more accurate. Employing this custom model in Amazon Comprehend produced substantial gains in accuracy and F1 scores.

Figure 4 – Accuracy and F1 scores.

Conclusion

Intelligent document processing (IDP) extracts information from documents automatically. However, developing accurate systems requires large volumes of high-quality labeled data for training. Manually labeling so many documents is impractical and expensive, but the Cognizant IDP solution automates document labeling at scale to overcome this challenge.

The rise of foundation models in generative AI allows for handling complex data. Fine-tuning and alignment are still needed for accurate outputs, and Cognizant uses this solution to generate prompt data for reinforcement learning from human feedback (RLHF) and alignment models, allowing adjustments to the outputs of foundation models.

If you’d like to know more about this solution or Cognizant’s capabilities in AI/ML, please contact Cognizant.



AWS Partner Network (APN) Blog