AWS Public Sector Blog
How federal agencies can optimize document processing using advanced AI with human oversight
Federal agencies typically collect, manage, use, and distribute a wide array of documents. Storing and distributing these documents is often complicated: they range from structured forms to free-flowing text containing personally identifiable information (PII) that needs careful redaction. PII that appears in images, handwritten notes, and signatures adds further complexity to the tasks that identify PII and other sensitive information.
Because federal agencies cover a wide breadth of domains, it is challenging to develop a one-size-fits-all approach for document processing.
In this post, we explore an example of how a federal agency can use Amazon Web Services (AWS) to design and deploy a solution that addresses this document processing challenge.
The challenge of storing and distributing federal agency documents
With a high volume and variety of documents passing through their workflows, federal agencies typically face multiple processing challenges, including:
- Workflow diversity – Each document type has a different process.
- Volume and throughput – The queue of incoming documents varies by organization, and differences in document size, timing, regulation, and industry standards add complexity to data processing.
- Selective redaction – Agencies have specific needs and standards for redacting documents, complicating the processing workflow.
- Cost efficiency and sustainability – The compute and human resource costs to manually process documents are a huge constraint.
- Adaptability – Technology is developing quickly in this space. Agencies need a flexible and adaptable approach for training, tuning, and deploying models to accurately process documents.
How federal agencies can solve the processing challenge
A federal agency in the transportation sector partnered with Halvik, Accenture Federal Services, and AWS for federal civilian agencies to design and deploy a solution that addresses this document processing challenge.
To tackle the industry-specific challenges associated with document processing, this solution employs a human-in-the-loop approach. It uses advanced and adaptable artificial intelligence (AI) to identify and redact sensitive content and to learn from human review. Versatile, efficient, and secure document processing reduces the burden on federal agencies.
This multistage document processing solution addresses the complexities of handling different document types by allowing documents to be classified as one of the following subtypes:
- Standard forms – Automated processing for documents with fixed structure and fields.
- Mixed format – Advanced parsing for documents with both fixed fields and more unstructured data.
- Handwritten – Specialized optical character recognition (OCR) technology for identifying and redacting PII in handwritten text.
- Comprehensive – A flexible approach for documents that combine standard form, mixed format, and handwritten elements, such as signed receipts and reports.
Designing modules to process sensitive data
The following modules enable agencies to accurately and securely process sensitive data across different document layouts:
Ingest and extract
Bring the input data files into the cloud storage solution. You can use AI-enabled services such as Amazon Textract to extract information into a plain text format. Use an intelligence service such as Amazon Comprehend or a custom-trained model to extract named entities along with metadata from the raw text. The output is stored in a standard file format such as JSON and includes the text content, confidence scores, and the file location, all of which are fed into the next stage.
Some of the data produced during this process can be helpful in other use cases, such as to help summarize, search, and update Q&A documents, so it gets stored along with the original document.
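As a rough illustration of the JSON output described above, the following Python sketch bundles extracted text, entity confidence scores, and the file location into one record. The field names, the sample text, and the Amazon S3 path are hypothetical. The entity dictionaries are modeled on the Type, Score, BeginOffset, and EndOffset fields that Amazon Comprehend returns for PII detection, and are written by hand here rather than fetched from the service.

```python
import json


def build_extraction_record(file_location, text, entities):
    """Bundle text content, entity confidence scores, and file location."""
    return {
        "file_location": file_location,
        "text": text,
        "entities": [
            {
                "type": entity["Type"],
                "score": entity["Score"],
                "begin_offset": entity["BeginOffset"],
                "end_offset": entity["EndOffset"],
            }
            for entity in entities
        ],
    }


# Hand-written sample data standing in for real service output.
sample_text = "Driver Jane Doe, SSN 123-45-6789, reported the incident."
sample_entities = [
    {"Type": "NAME", "Score": 0.97, "BeginOffset": 7, "EndOffset": 15},
    {"Type": "SSN", "Score": 0.99, "BeginOffset": 21, "EndOffset": 32},
]

record = build_extraction_record(
    "s3://example-bucket/intake/report-001.pdf",  # hypothetical location
    sample_text,
    sample_entities,
)
record_json = json.dumps(record, indent=2)  # serialized for the next stage
```

Keeping character offsets rather than copies of the matched values lets later stages locate each entity without duplicating the sensitive text in the metadata.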
Identify and redact
Customized AWS Lambda functions are triggered to apply rule-based redaction. For example, if any text is identified with a confidence score of more than 90 percent and falls within a certain PII category—such as Social Security number or first name—it can be automatically redacted.
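The rule described above could look something like the following Python sketch. It is an illustration only, not the agency's actual Lambda code: the category names, the masking character, and the entity layout (modeled on Amazon Comprehend PII output) are assumptions. Entities that do not meet the rule are returned separately so they can be passed to human review.

```python
AUTO_REDACT_TYPES = {"SSN", "NAME"}  # assumed categories for this sketch
CONFIDENCE_THRESHOLD = 0.90          # "more than 90 percent"


def should_auto_redact(entity):
    """Apply the rule: allowed category and confidence above the threshold."""
    return (
        entity["Type"] in AUTO_REDACT_TYPES
        and entity["Score"] > CONFIDENCE_THRESHOLD
    )


def redact(text, entities, mask_char="*"):
    """Return (redacted_text, entities_left_for_review).

    Each mask has the same length as the span it replaces, so the
    offsets of the remaining entities stay valid.
    """
    needs_review = []
    for entity in entities:
        if should_auto_redact(entity):
            begin, end = entity["BeginOffset"], entity["EndOffset"]
            text = text[:begin] + mask_char * (end - begin) + text[end:]
        else:
            needs_review.append(entity)
    return text, needs_review


# Hand-written sample input; the last entity is below the threshold.
sample = "Jane Doe (SSN 123-45-6789) called Sam."
found = [
    {"Type": "NAME", "Score": 0.97, "BeginOffset": 0, "EndOffset": 8},
    {"Type": "SSN", "Score": 0.99, "BeginOffset": 14, "EndOffset": 25},
    {"Type": "NAME", "Score": 0.62, "BeginOffset": 34, "EndOffset": 37},
]
redacted_text, review_queue = redact(sample, found)
```

In this sample, the two high-confidence entities are masked automatically, while the low-confidence name is left visible and queued for a reviewer.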
Review and validate
Human oversight and verification are critical throughout this process. In a human-in-the-loop approach, a human expert reviews and validates the areas that are highlighted for redaction. You can modify this stage to include click and redact, un-redact, white-fonting (such as removing hidden keywords), and more.
This stage maintains processing accuracy and compliance with agency-specific requirements. As the system is used and the model continues to learn from human feedback, the level of effort required during this stage should decrease.
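One way to picture the review stage is as a merge of the model's proposed redactions with the reviewer's decisions, with every correction recorded as feedback for later training. The Python sketch below assumes a simple data model (the `Span` class and the `false_positive` and `missed` labels are hypothetical) and is not drawn from the deployed solution.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A character range proposed or confirmed for redaction."""
    begin: int
    end: int
    pii_type: str


def apply_review(proposed, rejected, added):
    """Merge reviewer decisions into the final redaction list.

    proposed: spans highlighted by the model
    rejected: proposed spans the reviewer un-redacted
    added:    spans the reviewer marked that the model missed
    Returns (final_spans, feedback_records).
    """
    rejected_set = set(rejected)
    final = [s for s in proposed if s not in rejected_set] + list(added)
    final.sort(key=lambda s: s.begin)
    feedback = [
        {"span": s, "label": "false_positive"}
        for s in proposed
        if s in rejected_set
    ] + [{"span": s, "label": "missed"} for s in added]
    return final, feedback


# Sample review: one proposal rejected, one missed span added.
model_spans = [Span(0, 8, "NAME"), Span(34, 37, "NAME")]
final_spans, feedback = apply_review(
    proposed=model_spans,
    rejected=[Span(34, 37, "NAME")],
    added=[Span(14, 25, "SSN")],
)
```

The feedback records are what a retraining job could consume, which is how the level of reviewer effort might decrease over time.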
Establish fault tolerance and continuous learning
The solution uses Amazon Simple Notification Service (Amazon SNS) topic notifications so that each document is processed at least once, which makes the system fault tolerant.
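At-least-once processing means the same notification can occasionally arrive more than once, so the handler that receives it is usually written to be idempotent. The Python sketch below shows one possible shape for such a handler. The `Records`, `Sns`, and `Message` keys follow the event layout that Lambda receives from Amazon SNS; the `document_key` field, the `process_document` helper, and the in-memory set (a stand-in for a durable store) are assumptions made for illustration.

```python
import json

# Stand-in for a durable store of finished work. A real deployment would
# need persistent storage, because an in-memory set does not survive restarts.
processed_keys = set()


def process_document(key):
    """Hypothetical placeholder for the extract-and-redact pipeline."""
    return f"redacted/{key}"


def handle_sns_event(event):
    """Process each notified document once, skipping duplicate deliveries."""
    results = []
    for record in event["Records"]:
        message = json.loads(record["Sns"]["Message"])
        key = message["document_key"]  # hypothetical message field
        if key in processed_keys:
            continue  # duplicate delivery, already handled
        results.append(process_document(key))
        processed_keys.add(key)  # mark done only after success
    return results


# The same notification delivered twice in one event, then redelivered.
notification = {"Sns": {"Message": json.dumps({"document_key": "intake/report-001.pdf"})}}
sample_event = {"Records": [notification, notification]}
first_run = handle_sns_event(sample_event)
second_run = handle_sns_event(sample_event)
```

Marking a key as processed only after the work succeeds means a failed attempt is retried on redelivery rather than silently skipped.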
Automating the manual review and redaction process is expected to significantly reduce the burden on the responsible division. Deployed as an enterprise service, the solution identifies and redacts information during the data ingestion process and imports details into the environment for broader data products and AI and machine learning (ML) use cases.
How AWS enhances the system
As AWS works to make this system available as a service, we collect metrics on usage and efficiencies achieved. We continue to enhance this system by incorporating and fine-tuning foundation models (FMs) for specific scenarios and exploring other use cases for data-driven decision-making, including:
- Human-enabled automated redaction – AI/ML models are trained to automatically identify and redact sensitive information from documents, such as PII, confidential data, or classified content. This significantly streamlines the redaction process and reduces manual effort.
- Customizable models – Models can be tailored to the specific redaction needs of public sector organizations, considering their unique data types, policies, and regulatory requirements. This allows for a more accurate and context-aware redaction process.
- Continual learning and feedback loop – Models are continually trained and improved as more data becomes available and human experts provide feedback on the model’s performance. This feedback loop keeps the redaction process accurate and up-to-date with evolving data patterns and regulations.
- Scalability – This solution can handle large volumes of documents efficiently, making it suitable for organizations with high document throughput requirements. As the confidence score of the solution improves, it can process low-risk documents automatically while requiring that a human verify high-risk or sensitive documents.
- Security and privacy – You can implement additional security measures, such as encryption, access controls, and auditing, to protect sensitive data during the redaction process.
- Integration – You can deploy the solution as a service and integrate it with existing document management systems and workflows, so it can be adopted with minimal disruption to existing processes.
- Savings – Because the models save time, organizations can redirect their workforce to other domain-specific responsibilities.
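The scalability point above, where low-risk documents are processed automatically and high-risk documents go to a human, can be sketched as a small routing function. The Python example below is illustrative only: the high-risk categories, the threshold, and the choice to treat a document with no detected entities as low risk are all assumptions that an agency would set according to its own policy.

```python
HIGH_RISK_TYPES = {"SSN"}  # assumed: categories that always need a human
AUTO_THRESHOLD = 0.95      # assumed: minimum score for automatic handling


def route_document(entities):
    """Return "auto" or "human_review" for a document's detected entities."""
    for entity in entities:
        if entity["Type"] in HIGH_RISK_TYPES:
            return "human_review"  # sensitive category
        if entity["Score"] < AUTO_THRESHOLD:
            return "human_review"  # model is not confident enough
    return "auto"


low_risk = route_document([{"Type": "NAME", "Score": 0.99}])
sensitive = route_document([{"Type": "SSN", "Score": 0.99}])
uncertain = route_document([{"Type": "NAME", "Score": 0.70}])
```

Raising or lowering the threshold shifts the balance between reviewer workload and the risk of an unreviewed miss.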
By combining advanced AI with human-in-the-loop verification, agencies can achieve greater efficiency, accuracy, and scalability while maintaining high levels of security, compliance, and human oversight. Refer to Figure 1 to explore the notional document architecture that keeps a human in the loop.
As shown in Figure 1, the document redaction process begins when a user uploads a sensitive document to an Amazon Simple Storage Service (Amazon S3) bucket. The document is then prepared for redaction using Amazon Textract or Amazon Comprehend and Lambda. Sensitive text is identified, tagged, and sent to a human reviewer, who checks the identified text and makes any edits. Lambda then performs the final redaction and returns the redacted document within the environment.
Conclusion
Both Halvik and Accenture Federal have worked with AWS for more than a decade to help organizations derive value from their applications and data. The collaboration between the companies helped federal organizations accelerate their digital advancement and achieve greater business value and impact.
If you have further questions, reach out to our authors and explore how AWS helps US federal civilian agencies meet mission-critical objectives.
Contributing Authors: Pavan Devulapalli, Sanjay Midha, Joyce Padela, Gopal Rajanala