AWS Partner Network (APN) Blog

Leveraging Data Reply DocExMachina for Customizing Intelligent Document Processing with AWS

By Stefania Massetti, Data Scientist – Data Reply
By Fabrizio Cavallaro, Big Data Engineer – Data Reply
By Salman Taherian, AI/ML Specialist, Partner Solutions Architect – AWS


Intelligent document processing (IDP) is a branch of the intelligent process automation (IPA) of business tasks, and specifically involves extracting value from documents in an automated manner. Adopting IPA can speed up business procedures, but the drawback is that managing different types of documents is challenging, as each of them may require specific processing.

In this post, we’ll present how customers can leverage Data Reply’s reference end-to-end architecture to overcome this pain point. The solution is based on Amazon Textract and Amazon Comprehend, and we’ll describe how they are used for one class of documents. We’ll show how the same rationale can be easily adapted to others, making this pipeline document-agnostic and customizable on Amazon Web Services (AWS).

Data Reply is part of the Reply Group, an AWS Premier Tier Services Partner with 13 AWS Competencies and 16 service specializations. Data Reply is focused on helping clients to deliver business value and differentiation through advanced analytics and artificial intelligence (AI) and machine learning (ML) on AWS.

Solution Overview

The following diagram shows the architecture of Data Reply’s IDP pipeline, called DocExMachina (DEM):

  • Data extraction (section A)
  • Document classification (section B)
  • Extract, transform, load (ETL) modelling (section C)
  • Query (section D)


Figure 1 – DEM reference architecture.

Next, we describe the technical details of these steps to see how they connect with one another. Two key ML services are used:

  • Amazon Textract is an intelligent data extraction service capable of extracting text, forms, and tables from documents in many formats (PNG, JPEG, TIFF, PDF).
  • Amazon Comprehend provides natural language processing (NLP) capabilities such as detecting entities, extracting sentiment from text, and classifying documents based on their content with user-defined labels.


The following solution can extract value from payrolls and receipts. Since an Amazon Comprehend custom classifier works with a single language, Data Reply trained it on Italian documents in multi-class mode, with one document per line in .CSV format.
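As a concrete illustration, a multi-class training file for an Amazon Comprehend custom classifier places one document per line, with the label in the first column. A minimal sketch (the sample texts and file name are hypothetical):

```python
import csv

# Hypothetical training samples; the real pipeline uses full Italian
# payroll and receipt texts.
samples = [
    ("PAYROLL", "Cedolino paga - retribuzione mensile del dipendente"),
    ("RECEIPT", "Scontrino fiscale - totale complessivo EUR 12,50"),
]

# Multi-class mode expects CSV rows of the form: LABEL,document text
with open("training.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for label, text in samples:
        writer.writerow([label, text])
```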

The team wants to underline that the same solution applies to other types of structured documents based on customer needs, simply by creating a new custom document classifier and adjusting the ETL phase to extract the desired information. Moreover, the pipeline is modular: if your use case involves just one type of document and you don’t need classification, you can remove that step and the pipeline will still work.

One more clarification before moving on to the description of DEM: when we talk about structured documents, we are referring to documents whose relevant information is formatted in key-value pairs.

Data Extraction Phase

We start by uploading documents (payrolls or receipts) into the rawData bucket in Amazon Simple Storage Service (Amazon S3). For each document ingested, the AWS Lambda function documentInitializer is triggered to perform the following operations on the file:

  • Creating a folder partitioned by date (year=xxxx/month=xx/day=xx) inside rawData and moving the file there.
  • Entering the file name, status INGESTED, and batchId TO_ASSIGN into the Amazon DynamoDB status table. This table is updated throughout the pipeline to help orchestrate the processing steps each document goes through.
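These two operations can be sketched with pure helper functions; the actual S3 copy and DynamoDB put_item calls (made with boto3 inside documentInitializer) are indicated in comments, and the field names are assumptions based on the description above:

```python
import datetime

def partitioned_key(file_name: str, day: datetime.date) -> str:
    """Build the date-partitioned S3 key the file is moved to inside rawData."""
    return f"year={day.year}/month={day.month:02d}/day={day.day:02d}/{file_name}"

def initial_status_item(file_name: str) -> dict:
    """Item inserted into the DynamoDB status table for a newly ingested file."""
    return {"fileName": file_name, "status": "INGESTED", "batchId": "TO_ASSIGN"}

# Inside the Lambda handler one would then call, via boto3:
#   s3.copy_object(...) / s3.delete_object(...)  to move the file, and
#   status_table.put_item(Item=initial_status_item(file_name))
```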

This INSERT operation into the status table generates a DynamoDB Streams event that triggers the Lambda function extractionInitializer, which receives the file names and submits an asynchronous request to Amazon Textract using the StartDocumentAnalysis API.
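The request extractionInitializer submits might look like the following sketch; the bucket name, ARNs, and event shape are placeholders, and boto3 is imported lazily so the request builder stays testable offline:

```python
def build_analysis_request(bucket: str, key: str, topic_arn: str, role_arn: str) -> dict:
    """Keyword arguments for Textract's asynchronous StartDocumentAnalysis API."""
    return {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": ["FORMS", "TABLES"],  # the reason StartDocumentAnalysis is used
        "NotificationChannel": {"SNSTopicArn": topic_arn, "RoleArn": role_arn},
    }

def handler(event, context):
    import boto3  # deferred so the builder above has no AWS dependency
    textract = boto3.client("textract")
    for record in event["Records"]:  # DynamoDB Streams records
        if record["eventName"] != "INSERT":
            continue
        file_name = record["dynamodb"]["NewImage"]["fileName"]["S"]
        textract.start_document_analysis(
            **build_analysis_request("rawData", file_name,
                                     "arn:aws:sns:<region>:<account>:textract-done",  # placeholder
                                     "arn:aws:iam::<account>:role/textract-sns"))     # placeholder
```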

At the end of the analysis, Textract sends a notification via Amazon Simple Notification Service (Amazon SNS) to an Amazon Simple Queue Service (Amazon SQS) queue, passing the processed document’s jobId to the Lambda function extractorTexts. The jobId is necessary to retrieve the Textract response (a JSON file) using the GetDocumentAnalysis API.

From this JSON file, the Lambda function saves in the bucket stageData three CSV files:

  • year=xxxx/month=xx/day=xx/filename=xx/raw/raw.csv – raw text extracted by Textract
  • year=xxxx/month=xx/day=xx/filename=xx/forms/forms.csv – (key,value) pairs extracted by Textract from forms
  • year=xxxx/month=xx/day=xx/filename=xx/tables/tables.csv – table entries extracted by Textract from tables

Finally, extractorTexts updates the status of each file in the DynamoDB status table to TEXT_EXTRACTED.
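A sketch of how extractorTexts might route Textract blocks toward the three CSV outputs. Resolving KEY_VALUE_SET and CELL relationships into (key, value) pairs and table entries takes more bookkeeping than shown here; this only illustrates the routing by BlockType:

```python
def split_blocks(blocks: list) -> dict:
    """Route GetDocumentAnalysis blocks toward the raw/forms/tables outputs."""
    routed = {"raw": [], "forms": [], "tables": []}
    for block in blocks:
        btype = block.get("BlockType")
        if btype == "LINE":
            routed["raw"].append(block.get("Text", ""))
        elif btype == "KEY_VALUE_SET":
            routed["forms"].append(block)   # later resolved into (key, value) pairs
        elif btype in ("TABLE", "CELL"):
            routed["tables"].append(block)  # later resolved into table entries
    return routed
```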

Note that calling the StartDocumentAnalysis API instead of the simpler StartDocumentTextDetection was justified in this case, since Data Reply’s documents contained forms and tables on every page. To avoid unnecessary charges, one might implement an algorithm that adopts this API only when useful and sticks to raw text detection for the other pages.
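Such an algorithm could be as simple as a per-document decision based on whether forms or tables are expected; the metadata flags below are an assumption about what such a pre-check might produce:

```python
def choose_textract_api(pages: list) -> str:
    """Pick the cheaper text-detection API when no page needs forms/tables analysis.

    pages: list of dicts such as {"has_forms": bool, "has_tables": bool},
    e.g. derived from known document-type metadata or a cheap pre-check.
    """
    if any(p.get("has_forms") or p.get("has_tables") for p in pages):
        return "StartDocumentAnalysis"
    return "StartDocumentTextDetection"
```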

Document Classification Phase

So far, we have treated each file individually, but here we switch to batch mode. We do so because, unlike asynchronous Amazon Textract operations, which are faster, Amazon Comprehend custom classifier asynchronous jobs take about five minutes to be up and running.

Using batches, we try to make the most of this timing by completing as many documents as possible at once. To achieve this goal:

  • Every minute, the Amazon EventBridge rule RetrieveCallForBatches triggers the handlingDocument Lambda function, which queries the status table for documents in status TEXT_EXTRACTED with batchId TO_ASSIGN.
  • A folder with the current timestamp is generated and the raw.csv file of each document is copied within this folder.
  • The value of the batchId column of these documents is updated with the timestamp just computed, to group together all the documents that are ready to be classified.
  • The same Lambda function starts the AWS Step Functions execution, passing as input the Amazon S3 URI of the newly created folder containing the raw.csv files.
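The batching logic above can be sketched as a pure function over the status-table items (field names follow the description; the timestamp format and folder layout are assumptions):

```python
import datetime

def assign_batch(items: list, now: datetime.datetime) -> tuple:
    """Group all ready documents under one timestamp-based batchId."""
    batch_id = now.strftime("%Y%m%d%H%M%S")
    ready = [i for i in items
             if i["status"] == "TEXT_EXTRACTED" and i["batchId"] == "TO_ASSIGN"]
    for item in ready:
        item["batchId"] = batch_id  # mirrored by an UpdateItem call in DynamoDB
    folder_uri = f"s3://stageData/batches/{batch_id}/"  # destination of the raw.csv copies
    return batch_id, ready, folder_uri
```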

Inside the Step Functions state machine:

  • The StartClassificationDocument Lambda function receives the folder location and launches the classification using the StartDocumentClassificationJob API.
  • A check loop waits until the classification job is over, polling the DescribeDocumentClassificationJob API.
  • Once classification is complete for a group of documents, the GetClassificationDocument Lambda function receives the S3 URI of the output location and reads the classification scores and associated labels from the output.tar.gz file saved in stageData.
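The classification calls inside the state machine can be sketched as follows; the classifier and role ARNs are placeholders, and boto3 is imported lazily so the terminal-state check stays testable offline:

```python
TERMINAL_STATES = {"COMPLETED", "FAILED", "STOPPED"}

def is_job_finished(job_status: str) -> bool:
    """True when DescribeDocumentClassificationJob reports a terminal state."""
    return job_status in TERMINAL_STATES

def start_classification(batch_s3_uri: str, output_s3_uri: str) -> str:
    import boto3  # deferred import keeps the helper above AWS-free
    comprehend = boto3.client("comprehend")
    response = comprehend.start_document_classification_job(
        DocumentClassifierArn="arn:aws:comprehend:<region>:<account>:document-classifier/dem",  # placeholder
        InputDataConfig={"S3Uri": batch_s3_uri, "InputFormat": "ONE_DOC_PER_LINE"},
        OutputDataConfig={"S3Uri": output_s3_uri},
        DataAccessRoleArn="arn:aws:iam::<account>:role/comprehend-access",  # placeholder
    )
    return response["JobId"]
```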

Documents whose classification confidence score is below a certain threshold are marked as TO_REVIEW in the status table, and a notification email is sent via SNS to the user to allow manual intervention on the document category.

Initially, the threshold could be the average score_prediction of test files that were classified correctly during training, minus a delta value, or alternatively the average score_prediction of test files that were classified incorrectly. It can then be adjusted according to the error rate of the classifier once it has processed a certain number of records.
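The two initial threshold heuristics can be sketched as below; the delta value is illustrative, and returning the more conservative (higher) of the two candidates is a design choice of this sketch, not part of the original description:

```python
def initial_threshold(correct_scores, incorrect_scores, delta=0.05):
    """Starting confidence threshold for routing documents to TO_REVIEW.

    Candidate 1: mean score of correctly classified test files minus a delta.
    Candidate 2: mean score of incorrectly classified test files.
    """
    mean_correct = sum(correct_scores) / len(correct_scores)
    candidate = mean_correct - delta
    if incorrect_scores:
        mean_incorrect = sum(incorrect_scores) / len(incorrect_scores)
        candidate = max(candidate, mean_incorrect)  # conservative choice
    return candidate
```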

For the other documents, we update the status to CLASSIFIED, add the label (RECEIPT or PAYROLL) in the status table, and proceed to the modelling phase.


Figure 2 – Status table after modeling.

ETL Modelling Phase

The document classification described above is used to route each file to its specific ETL modelling; in this case, one ETL for payrolls and another for receipts, since we create different data models based on the relevant information we can extract from each type of document.

For each data model, we have two ETL stages executed using AWS Glue Python shell jobs. In the first stage, only form and table information is considered, loading it from the stageData bucket (the forms.csv and tables.csv files mentioned earlier).

In this phase, we look for the data we defined as relevant to extract (a preliminary step specific to each business use case) and save a temporary Parquet file in stageData. However, we might encounter:

  • Missing data: We did not find the key we expected (for payrolls, the employee name key may be missing from the forms.csv file because it has a different name in that particular document), so its value is also unavailable.
  • Wrong data: Amazon Textract may have correctly extracted the key in the OCR phase, so it appears in the output JSON, but associated it with the wrong value due to particular geometries in the file, as there is no standard payroll/receipt format.

To overcome these problems, a second ETL job searches for the correct key by computing the Levenshtein distance between the extracted keys and a user-defined set of standard keys. It then exploits the geometric information Textract provides on the position of lines and words in the document to associate the most probable value (for example, if we identify the position of a key on line 1, its associated value is probably on the next line).
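A minimal sketch of the key-matching step (the standard keys and distance cutoff are illustrative):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def match_standard_key(extracted: str, standard_keys: list, max_distance: int = 3):
    """Map a noisy extracted key to the closest user-defined standard key."""
    best = min(standard_keys, key=lambda k: levenshtein(extracted.lower(), k.lower()))
    if levenshtein(extracted.lower(), best.lower()) <= max_distance:
        return best
    return None  # no sufficiently close standard key: treat as missing data
```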

Combined with filters based on the type of data expected (an employee name is expected to be a word and not a number, for example), two suitable proposals for a given field are derived. These proposals are saved in a second Parquet file within the businessData bucket, and the temporary Parquet file is copied from stageData to this final location.

The file status is finally updated to MODELED in the status table and the Step Functions execution ends.

Query Phase

An AWS Glue crawler runs once a day and creates partitions by scanning the businessData bucket’s parent directory containing the Parquet files, divided per document class.

The data is automatically registered in the AWS Glue Data Catalog and can be queried in Amazon Athena. Users can inspect the results of the phase-one modelling and, in case of missing data, join it with the phase-two proposals table to fill in blank values.
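Conceptually, the join performed in Athena fills blanks in the phase-one model with the phase-two proposals. A minimal Python sketch of that reconstruction, with hypothetical column names:

```python
def fill_missing(model_rows: list, proposal_rows: list) -> list:
    """Fill None values in phase-one rows using the phase-two proposals per file."""
    proposals = {row["fileName"]: row for row in proposal_rows}
    filled = []
    for row in model_rows:
        proposal = proposals.get(row["fileName"], {})
        filled.append({
            field: (proposal.get(field) if value is None else value)
            for field, value in row.items()
        })
    return filled
```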


Figure 3 – Sample data reconstruction query.



Figure 4 – Sample output from the statement in Figure 3.



Figure 5 – Screenshot of the original document for data verification.


Conclusion

In this post, we demonstrated how Data Reply’s intelligent document processing pipeline, DocExMachina (DEM), can be customized to meet specific user needs.

The data extraction phase is a robust process that ingests data into Amazon S3 and utilizes AWS services including Amazon Textract to extract relevant information from documents. The document classification phase takes advantage of Amazon Comprehend custom classifier asynchronous jobs to assign a label to files based on their text, in order to send them to the correct ETL pipeline for data modelling.

The DEM pipeline can extract data from structured documents to be accessed through Amazon Athena. Further integrations to automate the entire business process include:

  • Amazon Augmented AI (Amazon A2I) to enable a human-in-the-loop workflow for Amazon Textract: it can monitor the results of document analysis and check them against user-defined activation conditions, such as the Amazon Textract inference confidence score, starting a human review loop if needed.
  • Amazon Comprehend model retraining using flywheels, which orchestrate MLOps phases to keep the document classifier up to date with new data.

To see the IDP solution in action, contact Data Reply for a quick chat and demo. You can also learn more about Data Reply in AWS Marketplace.

