AWS Machine Learning Blog
Build well-architected IDP solutions with a custom lens – Part 6: Sustainability
An intelligent document processing (IDP) project typically combines optical character recognition (OCR) and natural language processing (NLP) to automatically read and understand documents. Customers across all industries run IDP workloads on AWS to deliver business value by automating use cases such as KYC forms, tax documents, invoices, insurance claims, delivery reports, inventory reports, and more. IDP workflows on AWS can help you extract business insights from your documents, reduce manual effort, and process documents faster and with higher accuracy.
Building a production-ready IDP solution in the cloud requires a series of trade-offs between cost, availability, processing speed, and sustainability. This post provides guidance and best practices on how to improve the sustainability of your IDP workflow using Amazon Textract, Amazon Comprehend, and the IDP Well-Architected Custom Lens.
The AWS Well-Architected Framework helps you understand the benefits and risks of decisions made while building workloads on AWS. The AWS Well-Architected Custom Lenses complement the Well-Architected Framework with more industry-, domain-, or workflow-specific content. By using the Well-Architected Framework and the IDP Well-Architected Custom Lens, you will learn about operational and architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable workloads in the cloud.
The IDP Well-Architected Custom Lens provides you with guidance on how to address common challenges in IDP workflows that we see in the field. By answering a series of questions in the Well-Architected Tool, you will be able to identify the potential risks and address them by following the improvement plan.
This post focuses on the Sustainability pillar of the IDP custom lens. The Sustainability pillar focuses on designing and implementing the solution to minimize the environmental impact of your workload and minimize waste by adhering to the following design principles: understand your impact, maximize resource utilization and use managed services, and anticipate change and prepare for improvements. These principles help you stay focused as you dive into the focus areas: achieving business results with sustainability in mind, effectively managing your data and its lifecycle, and being ready for and driving continuous improvement.
Design principles
The Sustainability pillar focuses on designing and implementing the solution through the following design principles:
- Understand your impact – Measure the sustainability impact of your IDP workload and model the future impact of your workload. Include all sources of impact, including the impact of customer use of your products. This also includes the impact of IDP that enables digitization and allows your company or customers to complete paperless processes. Establish key performance indicators (KPIs) for your IDP workload to evaluate ways to improve productivity and efficiency while reducing environmental impact.
- Maximize resource utilization and use managed services – Minimize idle resources, processing, and storage to reduce the total energy required to run your IDP workload. AWS operates at scale, so sharing services across a broad customer base helps maximize resource utilization, which maximizes energy efficiency and reduces the amount of infrastructure needed to support IDP workloads. With AWS managed services, you can minimize the impact of your IDP workload on compute, networking, and storage.
- Anticipate change and prepare for improvements – Anticipate change and support the upstream improvements your partners and suppliers make to help you reduce the impact of your IDP workloads. Continuously monitor and evaluate new, more efficient hardware and software offerings. Design for flexibility to lower barriers for introducing changes and allow for the rapid adoption of new efficient technologies.
Focus areas
The design principles and best practices of the Sustainability pillar are based on insights gathered from our customers and our IDP technical specialist communities. You can use them as guidance to support your design decisions and align your IDP solution with your business and sustainability requirements.
The following are the focus areas for sustainability of IDP solutions in the cloud: achieve business results with sustainability in mind, effectively manage your data and its lifecycle, and be ready for and drive continuous improvement.
Achieve business results with sustainability in mind
To determine the best Regions for your business needs and sustainability goals, we recommend the following steps:
- Evaluate and shortlist potential Regions – Start by shortlisting potential Regions for your workload based on your business requirements, including compliance, cost, and latency. Newer services and features are deployed to Regions gradually. Refer to List of AWS Services Available by Region to check which Regions have the services and features you need to run your IDP workload.
- Choose a Region powered by 100% renewable energy – From your shortlist, identify Regions close to Amazon’s renewable energy projects and Regions where, in 2022, the electricity consumed was attributable to 100% renewable energy. Based on the Greenhouse Gas (GHG) Protocol, there are two methods for tracking emissions from electricity production: market-based and location-based. Companies can choose one of these methods based on their sustainability policies to track and compare their emissions from year to year. Amazon uses the market-based model to report our emissions. To reduce your carbon footprint, select a Region where, in 2022, the electricity consumed was attributable to 100% renewable energy.
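The two steps above can be sketched as a small filter. This is only an illustration: the function name, the placeholder Region names, and the input data are hypothetical, and you would supply the real service availability and renewable energy information from the AWS sources mentioned above.

```python
from typing import Dict, Iterable, List, Set


def shortlist_regions(
    candidate_regions: Iterable[str],
    required_services: Set[str],
    services_by_region: Dict[str, Set[str]],
    renewable_regions: Set[str],
) -> List[str]:
    """Keep candidates that offer every required service.

    Regions listed in renewable_regions are returned first; otherwise the
    original order of candidate_regions is preserved (sorted() is stable).
    """
    eligible = [
        region
        for region in candidate_regions
        if required_services <= services_by_region.get(region, set())
    ]
    return sorted(eligible, key=lambda region: region not in renewable_regions)


if __name__ == "__main__":
    # Placeholder data for illustration only, not real Region facts.
    availability = {
        "region-a": {"textract", "comprehend"},
        "region-b": {"textract"},
        "region-c": {"textract", "comprehend"},
    }
    print(
        shortlist_regions(
            ["region-a", "region-b", "region-c"],
            {"textract", "comprehend"},
            availability,
            renewable_regions={"region-c"},
        )
    )
```

Candidates are assumed to have already passed your compliance, cost, and latency checks before being given to this function.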
Effectively manage your data and its lifecycle
Data plays a key role throughout your IDP solution. Starting with the initial data ingestion, data is pushed through various stages of processing, and finally returned as output to end-users. It’s important to understand how data management choices will affect the overall IDP solution and its sustainability. Storing and accessing data efficiently, in addition to reducing idle storage resources, results in a more efficient and sustainable architecture. When considering different storage mechanisms, remember that you’re making trade-offs between resource efficiency, access latency, and reliability. This means you’ll need to select your management pattern accordingly. In this section, we discuss some best practices for data management.
Create and ingest only relevant data
To optimize your storage footprint for sustainability, evaluate what data is needed to meet your business objectives and create and ingest only relevant data along your IDP workflow.
Store only relevant data
When designing your IDP workflow, consider for each step in your workflow which intermediate data outputs need to be stored. In most IDP workflows, it’s not necessary to store the data used or created in each intermediate step because it can be easily reproduced. To improve sustainability, only store data that is not easily reproducible. If you need to store intermediate results, consider whether they qualify for a lifecycle rule that archives and deletes them more quickly than data with stricter retention requirements.
Preserve data across computing environments such as development and staging. Implement mechanisms to enforce a data lifecycle management process, including archiving and deletion, and continuously identify and delete unused data.
To optimize your data ingest and storage, consider the optimal data resolution that satisfies the use case. Amazon Textract requires at least 150 DPI. If your document isn’t in a supported Amazon Textract format (PDF, TIFF, JPEG, and PNG) and you need to convert it, experiment to find the optimal resolution for best results rather than choosing the maximum resolution.
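As a rough aid for that experiment, the following sketch relates pixel dimensions to DPI for a page of known physical size. The helper names and the US Letter page size in the usage lines are assumptions for illustration; the 150 DPI threshold is the figure stated above.

```python
from typing import Tuple

MIN_DPI = 150  # minimum resolution stated in the text above


def effective_dpi(width_px: int, height_px: int, width_in: float, height_in: float) -> float:
    """Return the lower of the horizontal and vertical resolution in dots per inch."""
    if width_in <= 0 or height_in <= 0:
        raise ValueError("page dimensions must be positive")
    return min(width_px / width_in, height_px / height_in)


def meets_minimum_dpi(width_px: int, height_px: int, width_in: float, height_in: float) -> bool:
    """Check a rendered page against the minimum DPI."""
    return effective_dpi(width_px, height_px, width_in, height_in) >= MIN_DPI


def pixels_for_dpi(width_in: float, height_in: float, dpi: int) -> Tuple[int, int]:
    """Pixel dimensions needed to render a page of the given size at a target DPI."""
    return round(width_in * dpi), round(height_in * dpi)


if __name__ == "__main__":
    # Compare a few candidate resolutions for an 8.5 x 11 inch page.
    for dpi in (150, 200, 300):
        print(dpi, pixels_for_dpi(8.5, 11, dpi))
```

Pixel count grows with the square of the DPI, so a page rendered at 300 DPI has four times as many pixels as the same page at 150 DPI. This is why it pays to pick the lowest resolution that still gives acceptable extraction results.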
Use the right technology to store data
For IDP workflows, most of the data is likely to be documents. Amazon Simple Storage Service (Amazon S3) is an object storage service built to store and retrieve any amount of data from anywhere, making it well suited for IDP workflows. Using different Amazon S3 storage tiers is a key component of optimizing storage for sustainability.
When considering different storage mechanisms, remember that you’re making trade-offs between resource efficiency, access latency, and reliability. That means you’ll need to select your management pattern accordingly. By storing less volatile data on technologies designed for efficient long-term storage, you can optimize your storage footprint. For archiving data or storing data that changes slowly, Amazon S3 Glacier and Amazon S3 Glacier Deep Archive are available. Depending on your data classification and workflow, you can choose Amazon S3 One Zone-IA, which reduces power and server capacity by storing data within a single Availability Zone.
Actively manage your data lifecycle according to your sustainability goals
Managing your data lifecycle means optimizing your storage footprint. For IDP workflows, first identify your data retention requirements. Based on your retention requirements, create Amazon S3 Lifecycle configurations that automatically transfer objects to a different storage class based on your predefined rules. For data with no retention requirements and unknown or changing access patterns, use Amazon S3 Intelligent-Tiering to monitor access patterns and automatically move objects between tiers.
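A minimal sketch of such a lifecycle configuration, built as a plain Python dictionary in the shape that the S3 PutBucketLifecycleConfiguration API accepts. The prefixes, rule IDs, and day counts are hypothetical and should come from your own retention requirements.

```python
from typing import Any, Dict


def build_lifecycle_rule(
    rule_id: str,
    prefix: str,
    archive_after_days: int,
    expire_after_days: int,
    storage_class: str = "GLACIER",
) -> Dict[str, Any]:
    """Build one rule that archives objects under a prefix, then deletes them."""
    if expire_after_days <= archive_after_days:
        raise ValueError("objects must expire after they are archived")
    return {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [{"Days": archive_after_days, "StorageClass": storage_class}],
        "Expiration": {"Days": expire_after_days},
    }


def build_lifecycle_configuration(*rules: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap rules in the top-level structure expected by the API."""
    return {"Rules": list(rules)}


if __name__ == "__main__":
    # Hypothetical layout: intermediate outputs expire sooner than final results.
    config = build_lifecycle_configuration(
        build_lifecycle_rule("intermediate-short-lived", "intermediate/", 7, 30),
        build_lifecycle_rule("results-archive", "results/", 90, 365, "DEEP_ARCHIVE"),
    )
    print(config)
```

With boto3, the resulting dictionary could be applied using `s3_client.put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=config)`. Note that this call replaces any lifecycle configuration already present on the bucket.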
Continuously optimize your storage footprint by using the right tools
Over time, the data usage and access pattern in your IDP workflow may change. Tools like Amazon S3 Storage Lens deliver visibility into storage usage and activity trends, and even make recommendations for improvements. You can use this information to further lower the environmental impact of storing data.
Enable data and compute proximity
As you make your IDP workflow available to more customers, the amount of data traveling over the network will increase. Similarly, the larger the size of the data and the greater the distance a packet must travel, the more resources are required to transmit it.
Reducing the amount of data sent over the network and optimizing the path a packet takes will result in more efficient data transfer. Setting up data storage close to data processing helps optimize sustainability at the network layer. Ensure that the Region used to store the data is the same Region where you have deployed your IDP workflow. This approach helps minimize the time and cost of transferring data to the computing environment.
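One way to verify this is to compare the bucket's Region with the Region of your workflow. The sketch below assumes you obtain the bucket's `LocationConstraint` from the S3 GetBucketLocation API, which returns an empty value for buckets in us-east-1 and may return the legacy value `EU` for some older buckets in eu-west-1. The function names are hypothetical.

```python
from typing import Optional


def normalize_bucket_region(location_constraint: Optional[str]) -> str:
    """Map a GetBucketLocation LocationConstraint value to a Region name."""
    if not location_constraint:
        # Buckets in us-east-1 report a null LocationConstraint.
        return "us-east-1"
    if location_constraint == "EU":
        # Legacy value for eu-west-1.
        return "eu-west-1"
    return location_constraint


def is_colocated(location_constraint: Optional[str], workflow_region: str) -> bool:
    """True when the bucket and the IDP workflow run in the same Region."""
    return normalize_bucket_region(location_constraint) == workflow_region


if __name__ == "__main__":
    print(is_colocated(None, "us-east-1"))
    print(is_colocated("eu-central-1", "us-east-1"))
```

With boto3, the input value could come from `s3_client.get_bucket_location(Bucket=...)["LocationConstraint"]`.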
Be ready for and drive continuous improvement
Improving sustainability for your IDP workflow is a continuous process that requires flexible architectures and automation to support smaller, frequent improvements. When your architecture is loosely coupled and uses serverless and managed services, you can enable new features without difficulty and replace components to improve sustainability and gain performance efficiencies. In this section, we share some best practices.
Improve safely and continuously through automation
Using automation to deploy all changes reduces the potential for human error and enables you to test before making production changes to ensure your plans are complete. Automate your software delivery process using continuous integration and continuous delivery (CI/CD) pipelines to test and deploy potential improvements to reduce effort and limit errors caused by manual processes. Define changes using infrastructure as code (IaC): all configurations should be defined declaratively and stored in a source control system like AWS CodeCommit, just like application code. Infrastructure provisioning, orchestration, and deployment should also support IaC.
Use serverless services for workflow orchestration
IDP workflows are typically characterized by high peaks and periods of inactivity (such as outside of business hours), and are mostly driven by events (for example, when a new document is uploaded). This makes them a good fit for serverless solutions. AWS serverless services can help you build a scalable solution for IDP workflows quickly and sustainably. Services such as AWS Lambda, AWS Step Functions, and Amazon EventBridge help orchestrate your workflow driven by events and minimize idle resources to improve sustainability.
Use an event-driven architecture
Using AWS serverless services to implement an event-driven approach will allow you to build scalable, fault-tolerant IDP workflows and minimize idle resources.
For example, you can configure Amazon S3 to start a new workflow when a new document is uploaded. Amazon S3 can trigger EventBridge or call a Lambda function to start an Amazon Textract detection job. You can use Amazon Simple Notification Service (Amazon SNS) topics for event fanout or to send job completion messages. You can use Amazon Simple Queue Service (Amazon SQS) for reliable and durable communication between microservices, such as invoking a Lambda function to read Amazon Textract output and then calling a custom Amazon Comprehend classifier to classify a document.
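The first hop of that flow could look like the following sketch: a Lambda function that receives an S3 event notification and starts an asynchronous Amazon Textract text detection job for each uploaded document. The environment variable names are assumptions, and the Textract client is passed in as a parameter so the core logic can be exercised without AWS access.

```python
import os
import urllib.parse
from typing import Any, Dict, List, Optional


def start_textract_jobs(
    event: Dict[str, Any],
    textract_client: Any,
    sns_topic_arn: Optional[str] = None,
    role_arn: Optional[str] = None,
) -> List[str]:
    """Start one Textract text detection job per S3 record and return the job IDs."""
    job_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        request: Dict[str, Any] = {
            "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}}
        }
        if sns_topic_arn and role_arn:
            # Ask Textract to publish the job completion message to Amazon SNS.
            request["NotificationChannel"] = {
                "SNSTopicArn": sns_topic_arn,
                "RoleArn": role_arn,
            }
        response = textract_client.start_document_text_detection(**request)
        job_ids.append(response["JobId"])
    return job_ids


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Lambda entry point; boto3 is imported here so the helper stays testable."""
    import boto3

    job_ids = start_textract_jobs(
        event,
        boto3.client("textract"),
        # Hypothetical environment variable names.
        sns_topic_arn=os.environ.get("TEXTRACT_SNS_TOPIC_ARN"),
        role_arn=os.environ.get("TEXTRACT_SNS_ROLE_ARN"),
    )
    return {"jobIds": job_ids}
```

A downstream function subscribed to the SNS topic, directly or through an SQS queue, could then read the Textract output and call the classifier, as described above.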
Use managed services like Amazon Textract and Amazon Comprehend
You can perform IDP using a self-hosted custom model or managed services such as Amazon Textract and Amazon Comprehend. By using managed services instead of your custom model, you can reduce the effort required to develop, train, and retrain your custom model. Managed services use shared resources, reducing the energy required to build and maintain an IDP solution and improving sustainability.
Review AWS blog posts to stay informed about feature updates
There are several blog posts and resources available to help you stay on top of AWS announcements and learn about new features that may improve your IDP workload.
AWS re:Post is a community-driven Q&A service designed to help AWS customers remove technical roadblocks, accelerate innovation, and enhance operations. AWS re:Post has over 40 topics, including a community dedicated to AWS Well-Architected. AWS also has service-specific blogs to help you stay up to date on Amazon Textract and Amazon Comprehend.
Conclusion
In this post, we shared design principles, focus areas, and best practices for optimizing sustainability in your IDP workflow. To learn more about sustainability in the cloud, refer to the following series on Optimizing your AWS Infrastructure for Sustainability, Part I: Compute, Part II: Storage, and Part III: Networking.
To learn more about the IDP Well-Architected Custom Lens, explore the other posts in this series.
AWS is committed to the IDP Well-Architected Lens as a living tool. As IDP solutions and related AWS AI services evolve, and as new AWS services become available, we will update the IDP Well-Architected Lens accordingly.
To get started with IDP on AWS, refer to Guidance for Intelligent Document Processing on AWS to design and build your IDP application. For a deeper dive into end-to-end solutions that cover data ingestion, classification, extraction, enrichment, verification and validation, and consumption, refer to Intelligent document processing with AWS AI services: Part 1 and Part 2. Additionally, Intelligent document processing with Amazon Textract, Amazon Bedrock, and LangChain covers how to extend a new or existing IDP architecture with large language models (LLMs). You’ll learn how you can integrate Amazon Textract with LangChain as a document loader, use Amazon Bedrock to extract data from documents, and use generative AI capabilities within the various IDP phases.
If you require additional expert guidance, contact your AWS account team to engage an IDP Specialist Solutions Architect.
About the Author
Christian Denich is a Global Customer Solutions Manager at AWS. He is passionate about automotive, AI/ML, and developer productivity. He supports some of the world’s largest automotive brands on their cloud journey, encompassing cloud and business strategy as well as technology. Before joining AWS, Christian worked at BMW Group in both hardware and software development on various projects, including connected navigation.