This Guidance provides best practices for building and deploying an intelligent document processing (IDP) architecture that scales with workload demands. The provided code automates the creation of machine learning (ML) resources, which reduces the developer friction of standing up a high-quality IDP environment, a time-consuming and error-prone task. This shortens the time needed to deliver a proof of concept for IDP workflows and helps ensure adherence to architectural best practices.
Please note: [Disclaimer]
Architecture Diagram
Step 1
Document processing workflows are developed using the AWS Cloud Development Kit (AWS CDK).
Step 2
AWS CDK generates a stack in AWS CloudFormation that deploys templates for the resources required to execute the workflows.
Step 3
The CloudFormation template creates an Amazon Simple Storage Service (Amazon S3) bucket where documents are uploaded for processing.
Step 4
Document uploads trigger an AWS Step Functions workflow to orchestrate document processing functions. Depending on your specific use case, you can choose one or more workflows available in the sample code.
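The upload-to-workflow handoff in this step can be sketched as a small handler that reads the S3 event notification and starts one execution per uploaded object. This is a minimal sketch, not the sample code's implementation: the function names, the `s3Path` input field, and the `idp-` execution name prefix are hypothetical. The `sfn_client` parameter is assumed to be a boto3 Step Functions client (or any object with a compatible `start_execution` method).

```python
import json
import urllib.parse
import uuid


def build_execution_inputs(event):
    """Turn an S3 event notification into one workflow input per uploaded object."""
    inputs = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        bucket = s3_info["bucket"]["name"]
        # S3 URL-encodes object keys in event notifications (spaces arrive as "+").
        key = urllib.parse.unquote_plus(s3_info["object"]["key"])
        # "s3Path" is a hypothetical field name chosen for this sketch.
        inputs.append({"s3Path": f"s3://{bucket}/{key}"})
    return inputs


def start_workflows(event, sfn_client, state_machine_arn):
    """Start one execution per uploaded document and return the execution names."""
    names = []
    for payload in build_execution_inputs(event):
        # Execution names must be unique per state machine; a UUID keeps them so.
        name = f"idp-{uuid.uuid4()}"
        sfn_client.start_execution(
            stateMachineArn=state_machine_arn,
            name=name,
            input=json.dumps(payload),
        )
        names.append(name)
    return names
```

Passing the client in as a parameter keeps the parsing logic testable without AWS credentials.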
Step 5
The Step Functions workflow begins with Amazon Textract extracting the text from the document.
Step 6
An AWS Lambda function processes the output of Amazon Textract and generates a CSV file and key/value pairs of extracted text.
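As a rough illustration of this post-processing step, the sketch below walks the `Blocks` list of an Amazon Textract forms response, pairs each KEY block with its VALUE block, and writes the pairs to CSV. It assumes the response was produced with the FORMS feature enabled and handles only WORD children, so selection elements such as checkboxes are ignored. The function names and the `key,value` column layout are hypothetical and may differ from what the sample code's Lambda function produces.

```python
import csv
import io


def _child_text(block, blocks_by_id):
    """Join the text of the WORD blocks referenced by a block's CHILD relationships."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def extract_key_values(response):
    """Return (key, value) text pairs from a Textract forms response."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    pairs = []
    for block in response["Blocks"]:
        is_key = (
            block["BlockType"] == "KEY_VALUE_SET"
            and "KEY" in block.get("EntityTypes", [])
        )
        if not is_key:
            continue
        key_text = _child_text(block, blocks_by_id)
        value_text = ""
        # A KEY block points at its VALUE block through a VALUE relationship.
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                value_text = " ".join(
                    _child_text(blocks_by_id[vid], blocks_by_id) for vid in rel["Ids"]
                )
        pairs.append((key_text, value_text))
    return pairs


def to_csv(pairs):
    """Serialize key/value pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["key", "value"])
    writer.writerows(pairs)
    return buffer.getvalue()
```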
Step 7
Amazon Comprehend uses the extracted text to classify documents by type. The sample code includes custom classifiers.
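One way a workflow might act on the classification result is to take the highest-scoring class and fall back to a default when the score is low. The sketch below assumes a response in the shape returned by the Amazon Comprehend `ClassifyDocument` API in multi-class mode (a `Classes` list of `Name` and `Score` entries). The function name, the `0.5` threshold, and the `UNKNOWN` fallback label are hypothetical choices for illustration.

```python
def pick_document_type(classify_response, min_score=0.5, fallback="UNKNOWN"):
    """Return the highest-scoring class name, or a fallback if confidence is low."""
    classes = classify_response.get("Classes", [])
    if not classes:
        return fallback
    best = max(classes, key=lambda c: c["Score"])
    # Route low-confidence documents to the fallback, for example for manual review.
    return best["Name"] if best["Score"] >= min_score else fallback
```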
Step 8
Depending on the deployed workflow, Amazon DynamoDB or Amazon Aurora persists the data.
Well-Architected Pillars
The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
Operational Excellence
This Guidance comes with a Git repository that contains all the artifacts required to deploy the architecture.
Security
Amazon S3 encrypts your data by default using Amazon S3-managed encryption keys. You may also use AWS Key Management Service (AWS KMS), a managed service that allows you to use your own cryptographic keys to protect your data. Data shared between services in your account never leaves your account. You can use the Amazon Comprehend console or APIs to detect personally identifiable information (PII) in English text documents. With PII detection, you have the choice of locating the PII entities or redacting the PII entities in the text.
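To illustrate the redaction option, the sketch below masks PII spans in a string using entity offsets in the shape returned by the Amazon Comprehend `DetectPiiEntities` API (`Type`, `Score`, `BeginOffset`, `EndOffset`). The function name, the bracketed replacement format, and the `0.8` score threshold are assumptions made for this example. In practice the entities would come from a call such as `detect_pii_entities(Text=..., LanguageCode="en")` on a boto3 Comprehend client.

```python
def redact_pii(text, entities, min_score=0.8):
    """Replace each detected PII span with its entity type in square brackets."""
    redacted = text
    # Work from the end of the string so earlier offsets stay valid.
    for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        if entity["Score"] < min_score:
            continue
        begin, end = entity["BeginOffset"], entity["EndOffset"]
        redacted = redacted[:begin] + f"[{entity['Type']}]" + redacted[end:]
    return redacted
```

Locating the entities instead of redacting them would use the same offsets without modifying the text.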
Reliability
This Guidance recommends and provides a separate AWS CDK component for each Lambda function, so that each function can be used as a microservice. The serverless, event-driven architecture, together with retries and exponential backoff, makes this architecture scalable. The Lambda functions included in the sample code have logging enabled, with the log level set to "DEBUG" by default. You can view these logs in Amazon CloudWatch, where you can also monitor and set alarms for specific log events.
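The retry and exponential backoff behavior can be made concrete with a small calculation. The sketch below takes a retrier in the shape of an Amazon States Language `Retry` entry and computes the wait before each retry attempt, using the documented defaults (`IntervalSeconds` 1, `MaxAttempts` 3, `BackoffRate` 2.0) when a field is absent. The specific error names and numbers in `EXAMPLE_RETRIER` are illustrative assumptions, not the values used by the sample code, and the helper name is hypothetical.

```python
# Illustrative retrier in the shape of an Amazon States Language "Retry" entry.
# The error names and numbers are assumptions, not the sample code's settings.
EXAMPLE_RETRIER = {
    "ErrorEquals": ["Lambda.TooManyRequestsException", "Lambda.ServiceException"],
    "IntervalSeconds": 2,
    "MaxAttempts": 4,
    "BackoffRate": 2.0,
}


def retry_delays(retrier):
    """Return the wait in seconds before each retry attempt, in order."""
    interval = retrier.get("IntervalSeconds", 1)
    attempts = retrier.get("MaxAttempts", 3)
    rate = retrier.get("BackoffRate", 2.0)
    # Each retry waits BackoffRate times longer than the previous one.
    return [interval * rate ** n for n in range(attempts)]
```

With the example values, the waits double on each attempt, which spreads retries out when a downstream service is throttling requests.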
Performance Efficiency
The Guidance deploys a serverless event-driven architecture that scales according to traffic patterns.
Cost Optimization
This Guidance and the associated workshop use AWS Cloud9 to create instances on which to install Docker and deploy the AWS CDK stacks. We recommend using the cost-saving setting that causes the environment to hibernate automatically after 30 minutes of inactivity. The Step Functions workflow is initiated only when a document is uploaded to a particular Amazon S3 location. The workshop contains an estimate of the total cost of execution and has a clean-up section for destroying the deployed stack.
Sustainability
This Guidance allows you to maximize your utilization and right-size your implementation by using Step Functions workflows, which run only while your documents are being processed. As a result, you use resources only when needed and reduce the energy consumption of the underlying infrastructure. By using managed services like Amazon Textract and Amazon Comprehend, you can operate at scale and share the underlying resources, which further maximizes resource usage.
Implementation Resources
The sample code is a starting point. It is industry validated, prescriptive but not definitive, and a peek under the hood to help you begin.
Related Content
Use machine learning to automate and process documents at scale
Disclaimer
The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.
References to third-party services or organizations in this Guidance do not imply an endorsement, sponsorship, or affiliation between Amazon or AWS and the third party. Guidance from AWS is a technical starting point, and you can customize your integration with third-party services when you deploy the architecture.