AWS Machine Learning Blog
Moderate, classify, and process documents using Amazon Rekognition and Amazon Textract
Many companies are overwhelmed by the sheer volume of documents they must process, organize, and classify to serve their customers better, such as loan applications, tax filings, and bills. These documents commonly arrive as images, are often multi-page, and can be of low quality. To be more competitive and cost-efficient, and to stay secure and compliant at the same time, these companies must evolve their document processing capabilities to reduce processing times and improve classification accuracy in an automated and scalable way. These companies face the following challenges in processing documents:
- Performing moderation on the documents to detect inappropriate, unwanted, or offensive content
- Manual document classification, which is adopted by smaller companies, is time-consuming, error-prone, and expensive
- OCR techniques with rules-based systems aren’t intelligent enough and can’t adapt to changes in document format
- Companies that adopt machine learning (ML) approaches often don’t have resources to scale their model to handle spikes in incoming document volume
This post tackles these challenges and provides an architecture that efficiently solves these problems. We show how you can use Amazon Rekognition and Amazon Textract to optimize and reduce human efforts in processing documents. Amazon Rekognition detects moderation labels in your documents, and Amazon Rekognition Custom Labels classifies them. Amazon Textract extracts text from your documents.
In this post, we cover building two ML pipelines (training and inference) to process documents without the need for any manual effort or custom code. The high-level steps in the inference pipeline include:
- Perform moderation on uploaded documents using Amazon Rekognition.
- Classify documents into different categories such as W-2s, invoices, bank statements, and pay stubs using Rekognition Custom Labels.
- Extract text from documents such as printed text, handwriting, forms, and tables using Amazon Textract.
Solution overview
This solution uses the following AI services, serverless technologies, and managed services to implement a scalable and cost-effective architecture:
- Amazon DynamoDB – A key-value and document database that delivers single-digit millisecond performance at any scale.
- Amazon EventBridge – A serverless event bus to build event-driven applications at scale using events generated from your applications, integrated software as a service (SaaS) applications, and AWS services.
- AWS Lambda – A serverless compute service that lets you run code in response to triggers such as changes in data, shifts in system state, or user actions.
- Amazon Rekognition – Uses ML to identify objects, people, text, scenes, and activities in images and videos, as well as detect any inappropriate content.
- Amazon Rekognition Custom Labels – Uses AutoML for computer vision and transfer learning to help you train custom models to identify the objects and scenes in images that are specific to your business needs.
- Amazon Simple Storage Service (Amazon S3) – Serves as an object store for your documents and allows for central management with fine-tuned access controls.
- AWS Step Functions – A serverless function orchestrator that makes it easy to sequence Lambda functions and multiple services into business-critical applications.
- Amazon Textract – Uses ML to extract text and data from scanned documents in PDF, JPEG, or PNG formats.
The following diagram illustrates the architecture of the inference pipeline.
Our workflow includes the following steps:
- User uploads documents into the input S3 bucket.
- The upload triggers an Amazon S3 Event Notification that delivers real-time events directly to EventBridge. The Amazon S3 events that match the `object created` filter defined for an EventBridge rule start the Step Functions workflow.
- The Step Functions workflow triggers a series of Lambda functions, which perform the following tasks (a condensed sketch of their core API calls follows this list):
- The first function performs preprocessing tasks and makes API calls to Amazon Rekognition:
- If the incoming documents are in an image format (such as JPG or PNG), the function calls the Amazon Rekognition API and provides the documents as S3 objects. However, if a document is in PDF format, the function streams the image bytes when calling the Amazon Rekognition API.
- If a document contains multiple pages, the function splits the document into individual pages and saves them in an intermediate folder in the output S3 bucket before processing them individually.
- When the preprocessing tasks are complete, the function makes an API call to Amazon Rekognition to detect inappropriate, unwanted, or offensive content, and makes another API call to the trained Rekognition Custom Labels model to classify documents.
- The second function makes an API call to Amazon Textract to initiate a job for extracting text from the input document and storing it in the output S3 bucket.
- The third function stores document metadata such as the moderation labels, document classification, classification confidence, Amazon Textract job ID, and file path in a DynamoDB table.
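The following is a condensed, minimal sketch of the core API calls these three functions make, assuming a single-page image that is already in the input bucket. The bucket, table, and attribute names are illustrative assumptions, not the names the stacks create.

```python
# Condensed sketch of the three Lambda functions' core API calls.
# All resource and attribute names below are illustrative assumptions.
import boto3
from decimal import Decimal

rekognition = boto3.client("rekognition")
textract = boto3.client("textract")
table = boto3.resource("dynamodb").Table("document-processing-table")

def process_document(bucket, key, model_arn):
    image = {"S3Object": {"Bucket": bucket, "Name": key}}

    # First function: detect inappropriate, unwanted, or offensive content,
    # then classify the document with the trained Custom Labels model
    moderation = rekognition.detect_moderation_labels(Image=image)
    moderation_labels = [m["Name"] for m in moderation["ModerationLabels"]]

    classification = rekognition.detect_custom_labels(
        ProjectVersionArn=model_arn, Image=image, MinConfidence=50
    )
    labels = classification["CustomLabels"]
    top = max(labels, key=lambda c: c["Confidence"]) if labels else None

    # Second function: start an asynchronous Textract job to extract text
    job = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )

    # Third function: persist the document metadata
    table.put_item(Item={
        "FilePath": f"s3://{bucket}/{key}",  # assumed partition key
        "ModerationLabels": moderation_labels,
        "DocumentClassification": top["Name"] if top else "UNKNOWN",
        "ClassificationConfidence": Decimal(str(top["Confidence"])) if top else Decimal(0),
        "TextractJobId": job["JobId"],
    })
```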
You can adjust the workflow to suit your requirements. For example, you can add natural language processing (NLP) capabilities to this workflow using Amazon Comprehend to gain insights into the extracted text.
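As a hypothetical illustration of that extension, the following sketch runs entity detection on text already extracted by Textract; the helper name and the truncation are assumptions, not part of the deployed solution (check Comprehend’s current input size limits for production use).

```python
# Hypothetical Comprehend extension: detect entities in extracted text.
import boto3

comprehend = boto3.client("comprehend")

def analyze_text(extracted_text):
    # Truncated for simplicity; synchronous Comprehend APIs cap input size
    entities = comprehend.detect_entities(
        Text=extracted_text[:5000], LanguageCode="en"
    )
    return [(e["Text"], e["Type"]) for e in entities["Entities"]]
```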
Training pipeline
Before we deploy this architecture, we train a custom model to classify documents into different categories using Rekognition Custom Labels. In the training pipeline, we label the documents using Amazon SageMaker Ground Truth. We then use the labeled documents to train a model with Rekognition Custom Labels. In this example, we use an Amazon SageMaker notebook to perform these steps, but you can also annotate images using the Rekognition Custom Labels console. For instructions, refer to Labeling images.
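The notebook drives these steps, but if you want to script training directly, the following is a minimal boto3 sketch under assumed bucket and manifest names; Rekognition Custom Labels trains asynchronously from Ground Truth-style manifest files.

```python
# Minimal sketch: create a Custom Labels project and train a model version
# from Ground Truth-style manifests. Bucket and file names are assumptions.
import boto3

rekognition = boto3.client("rekognition")

project = rekognition.create_project(ProjectName="document-classification")

version = rekognition.create_project_version(
    ProjectArn=project["ProjectArn"],
    VersionName="v1",
    OutputConfig={"S3Bucket": "my-training-bucket", "S3KeyPrefix": "training-output/"},
    TrainingData={"Assets": [{"GroundTruthManifest": {"S3Object": {
        "Bucket": "my-training-bucket", "Name": "train.manifest"}}}]},
    TestingData={"Assets": [{"GroundTruthManifest": {"S3Object": {
        "Bucket": "my-training-bucket", "Name": "test.manifest"}}}]},
)

# Training runs asynchronously; poll describe_project_versions until the
# status is TRAINING_COMPLETED, then record this ARN for the inference stack.
print(version["ProjectVersionArn"])
```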
Dataset
To train the model, we use public datasets containing W-2s and invoices. You can use another dataset relevant to your industry.
The following table summarizes the dataset splits between training and testing.
| Class | Training set | Test set |
| --- | --- | --- |
| Invoices | 352 | 75 |
| W-2s | 86 | 16 |
| Total | 438 | 91 |
Deploy the training pipeline with AWS CloudFormation
You deploy an AWS CloudFormation template to provision the necessary AWS Identity and Access Management (IAM) roles and components of the training pipeline, including a SageMaker notebook instance.
- Launch the following CloudFormation template in the US East (N. Virginia) Region:
- For Stack name, enter a name, such as `document-processing-training-pipeline`.
- Choose Next.
- In the Capabilities and transforms section, select the check box to acknowledge that AWS CloudFormation might create IAM resources.
- Choose Create stack.
The stack details page should show the status of the stack as `CREATE_IN_PROGRESS`. It can take up to 5 minutes for the status to change to `CREATE_COMPLETE`. When it’s complete, you can view the outputs on the Outputs tab. You can also check the status programmatically, as in the following sketch.
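A boto3 waiter can block until creation completes; the stack name here is the one suggested earlier.

```python
# Optional: block until the stack finishes creating instead of polling the
# console; use whatever stack name you entered.
import boto3

cloudformation = boto3.client("cloudformation", region_name="us-east-1")
cloudformation.get_waiter("stack_create_complete").wait(
    StackName="document-processing-training-pipeline"
)
```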
- After the stack is launched successfully, open the SageMaker console and choose Notebook instances in the navigation pane.
- Look for an instance with the `DocProcessingNotebookInstance-` prefix and wait until its status is InService.
- Under Actions, choose Open Jupyter.
Run the example notebook
To run your notebook, complete the following steps:
- Choose the `Rekognition_Custom_Labels` example notebook.
- Choose Run to run the cells in the example notebook in order.
The notebook demonstrates the entire lifecycle of preparing training and test images, labeling them, creating manifest files, training a model, and running the trained model with Rekognition Custom Labels. Alternatively, you can train and run the model using the Rekognition Custom Labels console. For instructions, refer to Training a model (Console).
The notebook is self-explanatory; you can follow the steps to complete training the model.
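For reference, the manifest files the notebook creates follow the SageMaker Ground Truth image classification format, with one JSON line per labeled image. The following is a minimal sketch of writing such a line; the `doc-class` attribute name, job name, and S3 paths are illustrative.

```python
# Write one Ground Truth-style classification manifest line per image.
# The label attribute name ("doc-class") and all paths are illustrative.
import json

def manifest_line(s3_uri, class_name, class_id):
    return json.dumps({
        "source-ref": s3_uri,
        "doc-class": class_id,
        "doc-class-metadata": {
            "confidence": 1,
            "class-name": class_name,
            "human-annotated": "yes",
            "type": "groundtruth/image-classification",
            "job-name": "labeling-job/doc-classification",
            "creation-date": "2022-01-01T00:00:00",
        },
    })

with open("train.manifest", "w") as f:
    f.write(manifest_line("s3://my-training-bucket/w2s/img_001.png", "W-2s", 0) + "\n")
    f.write(manifest_line("s3://my-training-bucket/invoices/img_002.png", "Invoices", 1) + "\n")
```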
- Make a note of the `ProjectVersionArn` to provide to the inference pipeline in a later step.
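A Custom Labels model must be started before it can serve predictions, and it incurs charges per inference unit while running. As a quick sanity check from the notebook, you can start the model and classify a test image; the ARNs, bucket, and key below are placeholders.

```python
# Start the trained model and classify one test image as a sanity check.
# The ARNs, bucket, and key are placeholders.
import boto3

rekognition = boto3.client("rekognition")
project_arn = "arn:aws:rekognition:us-east-1:111122223333:project/doc-classification/1234567890123"
model_arn = "arn:aws:rekognition:us-east-1:111122223333:project/doc-classification/version/v1/1234567890123"

# Starting is asynchronous; wait until the model reaches the RUNNING state
rekognition.start_project_version(ProjectVersionArn=model_arn, MinInferenceUnits=1)
rekognition.get_waiter("project_version_running").wait(
    ProjectArn=project_arn, VersionNames=["v1"]
)

result = rekognition.detect_custom_labels(
    ProjectVersionArn=model_arn,
    Image={"S3Object": {"Bucket": "my-training-bucket", "Name": "test/invoice-001.png"}},
    MinConfidence=50,
)
print(result["CustomLabels"])
```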
For SageMaker notebook instances, you’re charged for the instance type you choose, based on the duration of use. When you’re finished training the model, stop the notebook instance to avoid the cost of idle resources. You can do this from the SageMaker console or programmatically, as in the following sketch.
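A minimal sketch of stopping the instance with boto3; the instance name is the generated one with the `DocProcessingNotebookInstance-` prefix.

```python
# Stop the notebook instance to avoid idle charges; the name is a placeholder.
import boto3

sagemaker = boto3.client("sagemaker")
sagemaker.stop_notebook_instance(
    NotebookInstanceName="DocProcessingNotebookInstance-XXXXXXXX"
)
```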
Deploy the inference pipeline with AWS CloudFormation
To deploy the inference pipeline, complete the following steps:
- Launch the following CloudFormation template in the US East (N. Virginia) Region:
- For Stack name, enter a name, such as `document-processing-inference-pipeline`.
- For DynamoDBTableName, enter a unique DynamoDB table name; for example, `document-processing-table`.
- For InputBucketName, enter a unique name for the S3 bucket the stack creates; for example, `document-processing-input-bucket`.
The input documents are uploaded to this bucket before they’re processed. Use only lowercase characters and no spaces when you create the name of the input bucket. Furthermore, this operation creates a new S3 bucket, so don’t use the name of an existing bucket. For more information, see Rules for Bucket Naming.
- For OutputBucketName, enter a unique name for your output bucket; for example, `document-processing-output-bucket`.
This bucket stores the output documents after they’re processed. It also stores pages of multi-page PDF input documents after they’re split by Lambda function. Follow the same naming rules as your input bucket.
- For RekognitionCustomLabelModelARN, enter the `ProjectVersionArn` value you noted from the Jupyter notebook.
- Choose Next.
- On the Configure stack options page, set any additional parameters for the stack, including tags.
- Choose Next.
- In the Capabilities and transforms section, select the check box to acknowledge that AWS CloudFormation might create IAM resources.
- Choose Create stack.
The stack details page should show the status of the stack as `CREATE_IN_PROGRESS`. It can take up to 5 minutes for the status to change to `CREATE_COMPLETE`. When it’s complete, you can view the outputs on the Outputs tab.
Process a document through the pipeline
We’ve deployed both training and inference pipelines, and are now ready to use the solution and process a document.
- On the Amazon S3 console, open the input bucket.
- Upload a sample document to the input bucket.
This starts the workflow. The process populates the DynamoDB table with the document classification and moderation labels. The output from Amazon Textract is delivered to the output S3 bucket in the `TextractOutput` folder. You can also exercise the pipeline programmatically, as in the following sketch.
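The following sketch uploads a document and then reads back the stored metadata. The bucket and table names are the ones you chose during deployment, and the item attribute names are assumptions.

```python
# Upload a document to start the workflow, then inspect the stored metadata.
# Attribute names in the items are assumptions, not guaranteed by the stack.
import boto3

boto3.client("s3").upload_file(
    "sample-w2.png", "document-processing-input-bucket", "sample-w2.png"
)

# After the workflow finishes, read back what the pipeline recorded
table = boto3.resource("dynamodb").Table("document-processing-table")
for item in table.scan()["Items"]:
    print(item.get("DocumentClassification"), item.get("ModerationLabels"))
```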
We submitted a few different sample documents to the workflow and received the following information populated in the DynamoDB table.
If you don’t see items in the DynamoDB table or documents uploaded in the output S3 bucket, check the Amazon CloudWatch Logs for the corresponding Lambda function and look for potential errors that caused the failure.
Clean up
Complete the following steps to clean up resources deployed for this solution:
- On the CloudFormation console, choose Stacks.
- Select the stacks deployed for this solution.
- Choose Delete.
These steps don’t delete the S3 buckets, the DynamoDB table, or the trained Rekognition Custom Labels model. You continue to incur charges if they’re not deleted. If you no longer need these resources, delete them directly via their respective service consoles. Refer to the Amazon Rekognition Custom Labels guide for more information about deleting a model; a programmatic cleanup sketch follows.
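A minimal sketch of stopping and deleting the trained model with boto3; the ARN is a placeholder.

```python
# Stop, then delete, the trained Custom Labels model to stop charges.
import boto3

rekognition = boto3.client("rekognition")
model_arn = "arn:aws:rekognition:us-east-1:111122223333:project/doc-classification/version/v1/1234567890123"

# The model must reach the STOPPED state before its version can be deleted
rekognition.stop_project_version(ProjectVersionArn=model_arn)
rekognition.delete_project_version(ProjectVersionArn=model_arn)
# delete_project(ProjectArn=...) removes the project once all versions are gone
```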
Conclusion
In this post, we presented a scalable, secure, and automated approach to moderate, classify, and process documents. Companies across multiple industries can use this solution to improve their business and serve their customers better. It allows for faster document processing and higher accuracy, and reduces the complexity of data extraction. It also provides better security and compliance with personal data legislation by reducing the human workforce involved in processing incoming documents.
For more information, see the Amazon Rekognition Custom Labels guide, the Amazon Rekognition developer guide, and the Amazon Textract developer guide. If you’re new to Amazon Rekognition Custom Labels, try it out using our Free Tier, which lasts 3 months and includes 10 free training hours per month and 4 free inference hours per month. The Amazon Rekognition free tier includes processing 5,000 images per month for 12 months. The Amazon Textract free tier also lasts 3 months and includes 1,000 pages per month for the Detect Document Text API.
About the Authors
Jay Rao is a Principal Solutions Architect at AWS. He enjoys providing technical and strategic guidance to customers and helping them design and implement solutions on AWS.
Uchenna Egbe is an Associate Solutions Architect at AWS. He spends his free time researching herbs, teas, superfoods, and how to incorporate them into his daily diet.