Automate Labeling for Intelligent Document Processing with Inawisdom and Amazon SageMaker Ground Truth
By Phil Basford, CTO – Inawisdom
By Chetan Makvana, Sr. Solutions Architect – AWS
Customers across multiple industries use intelligent document processing (IDP) to automate information extraction from millions of documents with diverse formats. IDP makes data retrieval faster and more accurate, informs business decisions, and reduces overall costs compared with manual data entry.
The key to an accurate IDP system is a large, high-quality labeled training dataset. This enables the machine learning (ML) model to recognize patterns and extract information accurately.
Manually labeling thousands of documents is not always feasible: relying on human labelers is expensive, slow, and difficult to manage. This poses a challenge when building real-world document processing applications, since accuracy depends on adequate training data.
In this post, we explain how Inawisdom’s IDP solution uses Amazon Web Services (AWS) architecture to help customers automate, scale, and manage labeling for IDP.
Inawisdom – a Cognizant Company is an AWS Premier Tier Services Partner with Competencies in Machine Learning, Data and Analytics, DevOps, and Financial Services. Inawisdom helps customers accelerate the adoption of advanced analytics, artificial intelligence (AI), and machine learning on AWS. Inawisdom was acquired by Cognizant in 2020 and is now part of Cognizant’s UK&I Consulting business.
Context and Challenges
Labeling is a key process in intelligent document processing systems. It categorizes and annotates documents to enable machine learning models to extract information accurately.
IDP systems use ML algorithms that require labeled training data to learn how to identify document types, layouts, and components. Precise labeling enables models to effectively identify nuances and achieve better automated data extraction with fewer errors. Labeling also allows customization of IDP systems to meet an organization’s specific document formats and data types.
However, labeling of documents poses several challenges:
- Manual effort: Large volumes of labeled data are crucial for accurate ML models. However, manually labeling many documents is expensive and time-consuming. Subject matter experts (SMEs) need to carefully read and annotate the data to identify relevant information, and this process does not scale.
- Domain expertise: Large language models (LLMs) and transfer learning have reduced labeling requirements, but accurate labeling still requires domain expertise to understand specialized terms and improve model performance.
- Integration of training data and inference pipeline: To make sure models perform well in real-time predictions, it’s crucial to connect the training data with the inference pipeline. This ensures models learn from data similar to the data they will later predict on.
- Complex information extraction: Extracting complex information from documents often requires linking terms and entities together.
Inawisdom’s IDP solution comprises the following primary stages of document labeling:
- The process begins by uploading documents (PDF, JPG, TIFF) to an Amazon Simple Storage Service (Amazon S3) bucket.
- This triggers an AWS Lambda function, which starts the Job Management process using an AWS Step Functions workflow.
- It then creates a labeling job using a custom user interface (UI) within Amazon SageMaker Ground Truth to let the SMEs draw and label the documents.
- Finally, upon completion of the labeling process, the solution converts the labeled data into the various formats required by different models for training.
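The first two stages above (an upload to Amazon S3 that triggers a Lambda function, which starts the Step Functions workflow) can be sketched roughly as follows. This is a minimal illustration, not Inawisdom's code: the `STATE_MACHINE_ARN` environment variable, the `documents` input shape, and the helper names are assumptions made for the example.

```python
import json
import os
from urllib.parse import unquote_plus

# Document types named in the post (PDF, JPG, TIFF); the exact suffix list is an assumption.
SUPPORTED_EXTENSIONS = (".pdf", ".jpg", ".jpeg", ".tif", ".tiff")


def extract_documents(event):
    """Return bucket/key pairs for supported documents in an S3 event notification."""
    documents = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 event notifications URL-encode object keys (a space arrives as '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.lower().endswith(SUPPORTED_EXTENSIONS):
            documents.append({"bucket": bucket, "key": key})
    return documents


def start_job_management(event, sfn_client, state_machine_arn):
    """Start one workflow execution for the uploaded documents, if there are any."""
    documents = extract_documents(event)
    if not documents:
        return None
    response = sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        # Hypothetical input shape; the real workflow input is not shown in the post.
        input=json.dumps({"documents": documents}),
    )
    return response["executionArn"]


def lambda_handler(event, context):
    # boto3 is imported lazily so the helpers above can be exercised without AWS access.
    import boto3

    return start_job_management(
        event, boto3.client("stepfunctions"), os.environ["STATE_MACHINE_ARN"]
    )
```

Passing the Step Functions client in as a parameter keeps the event-parsing logic testable with a stub client.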
Figure 1 – Solution architecture.
The Job Management workflow in AWS Step Functions has the following steps:
- Create file list: This step creates an array of file names from the event and the files in Amazon S3.
- Map each document: This step takes the file names as input and batches them. The Map state then processes each item by converting PDFs to JPG images, extracting text and locations from the images using Amazon Textract, converting the verbose Textract output into a data frame, and storing it in a CSV file.
- Generate manifests: This step takes the list of files created in the first step and matches the entries to the images and extracted text for each page. It then generates the job manifest.
- Create labeling job: This step creates a new job per manifest file in Amazon SageMaker Ground Truth.
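As a rough sketch of the "Map each document" and "Generate manifests" steps, the code below flattens the `WORD` blocks of an Amazon Textract response into CSV rows and writes a JSON Lines manifest. The `source-ref` key is the standard Ground Truth field for the object to label; every other manifest key (`text-csv-uri`, `document`, `page`), the CSV column names, and the function names are assumptions for illustration, since the actual manifest is not reproduced here.

```python
import csv
import io
import json

# Hypothetical column layout for the simplified Textract output.
CSV_FIELDS = ["page", "text", "confidence", "left", "top", "width", "height"]


def textract_words_to_rows(textract_response, page_number):
    """Keep only WORD blocks and flatten text, confidence, and bounding box."""
    rows = []
    for block in textract_response.get("Blocks", []):
        if block.get("BlockType") != "WORD":
            continue
        # Textract reports bounding boxes as ratios of the page width and height.
        box = block["Geometry"]["BoundingBox"]
        rows.append({
            "page": page_number,
            "text": block["Text"],
            "confidence": block["Confidence"],
            "left": box["Left"],
            "top": box["Top"],
            "width": box["Width"],
            "height": box["Height"],
        })
    return rows


def rows_to_csv(rows):
    """Serialize the flattened rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=CSV_FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def build_manifest(pages):
    """Build a JSON Lines manifest with one line per page image."""
    lines = []
    for page in pages:
        lines.append(json.dumps({
            "source-ref": page["image_uri"],    # page image shown to the SME
            "text-csv-uri": page["csv_uri"],    # hypothetical key: extracted text for the page
            "document": page["document"],       # hypothetical key: originating file
            "page": page["page"],               # hypothetical key: page number
        }))
    return "\n".join(lines) + "\n"
```

Keeping the page image and its extracted text on the same manifest line is one way to let post-processing pair the two again without a separate lookup.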
Amazon SageMaker Ground Truth and Custom UI
The custom UI within Amazon SageMaker Ground Truth lets the SMEs draw and label the documents. It renders the page image stored at the source location in the manifest file, and lets the SMEs draw bounding boxes and label each one.
The solution uses Crowd HTML Elements for the custom UI.
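A bounding-box template built from Crowd HTML Elements might look like the sketch below, rendered here from Python so the label list can be supplied per job. It is an illustration rather than the template used in this solution: the label names, the instruction text, the `__LABELS__` placeholder, and the `task.input.taskObject` variable (which assumes a pre-annotation Lambda function that exposes the page image URI under that name) are all assumptions.

```python
import html
import json

# Liquid expressions in double braces are left untouched for Ground Truth to fill in.
TEMPLATE = """<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
  <crowd-bounding-box
    name="boundingBox"
    src="{{ task.input.taskObject | grant_read_access }}"
    header="Draw a box around each field and choose its label"
    labels="__LABELS__"
  >
    <full-instructions header="Labeling instructions">
      Draw a tight box around every value that matches one of the labels.
    </full-instructions>
    <short-instructions>
      Draw a box around each field, then pick its label.
    </short-instructions>
  </crowd-bounding-box>
</crowd-form>
"""


def render_template(labels):
    """Insert the label list as an HTML-escaped JSON array attribute."""
    labels_attr = html.escape(json.dumps(labels), quote=True)
    return TEMPLATE.replace("__LABELS__", labels_attr)


if __name__ == "__main__":
    # Hypothetical labels for an invoice-style document.
    print(render_template(["invoice_number", "invoice_date", "total"]))
```

Plain string replacement is used instead of `str.format` or f-strings because the Liquid placeholders already rely on curly braces.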
Upon job completion, the output is stored in Amazon S3 for subsequent processing.
The post-processing stage starts when an Amazon SageMaker Ground Truth job completes. The job sends an event to Amazon EventBridge, which triggers a Lambda function. The function takes the manifest for each job and merges the labeling output with the simplified CSV output from Amazon Textract.
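One plausible way to perform such a merge is to assign each extracted word to the labeled box that contains the word's center. The sketch below assumes labeled boxes are in pixels (`left`, `top`, `width`, `height`, `label`) and word coordinates are page-size ratios as Amazon Textract reports them. The containment rule, the reading-order heuristic, and all names are assumptions, not the solution's actual logic.

```python
def assign_words_to_boxes(boxes, words, image_width, image_height):
    """Attach to each labeled box the text of the words whose center lies inside it."""
    merged = []
    for box in boxes:
        matched = []
        for word in words:
            # Convert the word center from page ratios to pixel coordinates.
            center_x = (word["left"] + word["width"] / 2) * image_width
            center_y = (word["top"] + word["height"] / 2) * image_height
            inside_x = box["left"] <= center_x <= box["left"] + box["width"]
            inside_y = box["top"] <= center_y <= box["top"] + box["height"]
            if inside_x and inside_y:
                matched.append(word)
        # Approximate reading order: group by rounded vertical position, then left to right.
        matched.sort(key=lambda w: (round(w["top"], 2), w["left"]))
        merged.append({
            "label": box["label"],
            "text": " ".join(w["text"] for w in matched),
            "box": box,
        })
    return merged


if __name__ == "__main__":
    example_boxes = [{"label": "invoice_number", "left": 180, "top": 90, "width": 200, "height": 50}]
    example_words = [
        {"text": "123", "left": 0.30, "top": 0.10, "width": 0.05, "height": 0.02},
        {"text": "INV", "left": 0.20, "top": 0.10, "width": 0.08, "height": 0.02},
    ]
    print(assign_words_to_boxes(example_boxes, example_words, 1000, 1000))
```

Using the word center rather than full containment tolerates boxes drawn slightly too tight, at the cost of occasionally picking up a neighboring word.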
The merged output can then be converted into the following data formats for different model requirements:
- Image bounding boxes for object detection datasets such as COCO.
- Text for foundation models.
- Text and bounding boxes for models.
- Templated prompts for generative AI LLMs.
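To illustrate the first of these conversions, the sketch below turns labeled pixel boxes into a COCO-style object detection dictionary (`images`, `annotations`, `categories`, with `bbox` as `[x, y, width, height]`). The input page structure and the function name are assumptions for the example.

```python
def to_coco(pages, label_names):
    """Convert labeled pages into a COCO-style object detection dataset."""
    # COCO category ids are arbitrary integers; here they start at 1.
    categories = [{"id": i + 1, "name": name} for i, name in enumerate(label_names)]
    category_ids = {c["name"]: c["id"] for c in categories}

    images, annotations = [], []
    for image_id, page in enumerate(pages, start=1):
        images.append({
            "id": image_id,
            "file_name": page["file_name"],
            "width": page["width"],
            "height": page["height"],
        })
        for box in page["boxes"]:
            annotations.append({
                "id": len(annotations) + 1,
                "image_id": image_id,
                "category_id": category_ids[box["label"]],
                # COCO boxes are [top-left x, top-left y, width, height] in pixels.
                "bbox": [box["left"], box["top"], box["width"], box["height"]],
                "area": box["width"] * box["height"],
                "iscrowd": 0,
            })
    return {"images": images, "annotations": annotations, "categories": categories}
```

Because Ground Truth bounding boxes and COCO both use top-left pixel coordinates plus width and height, no coordinate transformation is needed in this direction.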
Following are some lessons learned while building this solution:
- Provide clear instructions with examples to SMEs. Assign two SMEs per job: one to label and one to verify.
- Ensure manifests cover all document pages and labeling in the custom UI is graphical. This minimizes SME work and helps visualize page relationships and section spanning.
- Use a clear Amazon S3 prefix structure to organize files, images, outputs, and manifests.
- Automate the process of merging all the documents into one training set for the custom Named Entity Recognition (NER) model in Amazon Comprehend. Previously, combining manifests took a data scientist 2-3 hours per run. With this solution, the process takes under five minutes, significantly reducing both errors and time to completion.
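The manifest-merging step in the last point could be automated along the following lines. This is a minimal sketch that concatenates JSON Lines manifests and skips repeated entries. Deduplicating on `source-ref` and keeping the first occurrence are assumptions, and the real training set format for Amazon Comprehend may need additional fields.

```python
import json


def merge_manifests(manifest_texts, key="source-ref"):
    """Merge several JSON Lines manifests into one, keeping the first entry per key."""
    seen = set()
    merged = []
    for text in manifest_texts:
        for line in text.splitlines():
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between entries
            entry = json.loads(line)
            if entry[key] in seen:
                continue  # assumption: a repeated source object is a duplicate
            seen.add(entry[key])
            merged.append(json.dumps(entry))
    return ("\n".join(merged) + "\n") if merged else ""


if __name__ == "__main__":
    first = '{"source-ref": "s3://bucket/a.jpg"}\n{"source-ref": "s3://bucket/b.jpg"}\n'
    second = '{"source-ref": "s3://bucket/b.jpg"}\n{"source-ref": "s3://bucket/c.jpg"}\n'
    print(merge_manifests([first, second]), end="")
```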
The ability to apply transfer learning or fine-tuning to foundation models is pivotal to improving machine learning solutions built on them. It allows the models to become domain-specific and more accurate. Using this custom model in Amazon Comprehend produced substantial gains in accuracy and F1 scores.
Intelligent document processing (IDP) extracts information from documents automatically. However, developing accurate systems requires large volumes of high-quality labeled data for training. Manually labeling so many documents is impractical and expensive, but the Inawisdom IDP solution automates document labeling at scale to overcome this challenge.
The rise of foundation models in generative AI makes it possible to handle complex data. Fine-tuning and alignment are still needed for accurate outputs. Inawisdom uses this solution to generate prompt data for reinforcement learning from human feedback (RLHF) and alignment models, which makes it possible to adjust the outputs of foundation models.
If you’d like to know more about this solution or Inawisdom’s capabilities in AI/ML, please contact Inawisdom.
Inawisdom – AWS Partner Spotlight
Inawisdom – a Cognizant Company is an AWS Premier Tier Services Partner that accelerates adoption of advanced analytics, artificial intelligence, and machine learning by providing a full stack of AWS cloud and data services.