Intelligent document processing with Amazon Textract, Amazon Bedrock, and LangChain
In today’s information age, the vast volumes of data housed in countless documents present both a challenge and an opportunity for businesses. Traditional document processing methods often fall short in efficiency and accuracy, leaving room for innovation, cost-efficiency, and optimizations. Document processing has witnessed significant advancements with the advent of Intelligent Document Processing (IDP). With IDP, businesses can transform unstructured data from various document types into structured, actionable insights, dramatically enhancing efficiency and reducing manual efforts. However, the potential doesn’t end there. By integrating generative artificial intelligence (AI) into the process, we can further enhance IDP capabilities. Generative AI not only introduces enhanced capabilities in document processing, it also introduces a dynamic adaptability to changing data patterns. This post takes you through the synergy of IDP and generative AI, unveiling how they represent the next frontier in document processing.
We discuss IDP in detail in our series Intelligent document processing with AWS AI services (Part 1 and Part 2). In this post, we discuss how to extend a new or existing IDP architecture with large language models (LLMs). More specifically, we discuss how we can integrate Amazon Textract with LangChain as a document loader and Amazon Bedrock to extract data from documents and use generative AI capabilities within the various IDP phases.
Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from scanned documents. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) through easy-to-use APIs.
The following diagram is a high-level reference architecture that explains how you can further enhance an IDP workflow with foundation models. You can use LLMs in one or all phases of IDP depending on the use case and desired outcome.
- Document classification – In addition to using Amazon Comprehend, you can use an LLM to classify documents using few-shot prompting. Few-shot prompting involves giving the language model a few examples of different classes and a list of all possible classes, and then asking the model to classify a given piece of text from a document using one of the classes.
- Summarization – You can also use LLMs to summarize larger documents to provide precise summaries within the extraction phase of IDP. For example, a financial analysis system may involve analyzing hundreds of pages of earnings documents of a company. You can use a language model to summarize the key aspects of the earnings, enabling analysts to make business decisions.
- Standardization and in-context Q&A – In addition to extracting exact information out of documents using the Amazon Textract Analyze Document functionality, you can use LLMs to derive information that is not explicitly stated in a document. For example, a patient discharge summary may have the patient’s hospital admit date and discharge date but may not explicitly specify the total number of days the patient was in the hospital. You can use an LLM to deduce the total number of days the patient was admitted, given the two dates extracted by Amazon Textract. This value can then be assigned a well-known alias in a key-value format, also known as a normalized key, which makes consumption and post-processing more straightforward.
- Templating and normalizations – An IDP pipeline often generates output that must conform to a specific deterministic schema. This is so that the output generated using the IDP workflow can be consumed into a downstream system, for example a relational database. The benefit of defining a deterministic schema is also to achieve key normalization so that we have a known set of keys to process in our postprocessing logic. For example, we may want to define “DOB” as a normalized key for “date of birth,” “birth date,” “birthday date,” “date born,” and so on, because documents may come with any variation of these. We use LLMs to perform such templating and normalized key-value extractions on any document.
- Spellcheck and corrections – Although Amazon Textract can extract the exact values from scanned documents (printed or handwritten), you can use a language model to identify misspellings and grammatical errors in the extracted data. This is important in situations where the data may be extracted from poor quality or handwritten documents and used for generating marketing materials, flash reports, and so on. In addition to having a human manually review low-score extractions from Amazon Textract, you can use an LLM to augment the review process by providing correction recommendations to the human reviewer, thereby speeding up the review process.
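The templating and normalization idea from the list above can be sketched in plain Python. This is only an illustration under stated assumptions: the key names in SCHEMA, the prompt wording, and the sample values are hypothetical and are not taken from the original post. The sketch builds a prompt that asks for a fixed set of normalized keys, forces the model's JSON answer into that schema, and cross-checks a derived value (days admitted) against the two extracted dates.

```python
import json
from datetime import date

# Hypothetical normalized keys for a discharge summary (illustrative only).
SCHEMA = ("PATIENT_NAME", "DOB", "ADMIT_DATE", "DISCHARGE_DATE", "DAYS_ADMITTED")


def build_extraction_prompt(doc_text, keys=SCHEMA):
    """Ask the model for a JSON object that contains exactly the normalized keys."""
    return (
        "Extract the following keys from the document and answer with a JSON "
        "object only. Use null for any key you cannot find.\n"
        f"<keys>{', '.join(keys)}</keys>\n"
        f"<document>{doc_text}</document>"
    )


def parse_extraction(raw, keys=SCHEMA):
    """Force a model answer into the schema: drop unknown keys, fill gaps with None."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object")
    return {key: data.get(key) for key in keys}


def days_between(admit_iso, discharge_iso):
    """Deterministic check for a value the model was asked to derive."""
    return (date.fromisoformat(discharge_iso) - date.fromisoformat(admit_iso)).days


# Made-up model answer, used only to exercise the helpers.
raw_answer = (
    '{"DOB": "1980-02-14", "ADMIT_DATE": "2023-07-09", '
    '"DISCHARGE_DATE": "2023-07-13", "DAYS_ADMITTED": 4, "EXTRA": "ignored"}'
)
record = parse_extraction(raw_answer)
print(record)
```

Because the parsed record always carries the same keys in the same order, a downstream consumer such as a relational database loader can rely on a fixed shape even when the model omits a key or returns extra ones.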
In the following sections, we dive deep into how Amazon Textract is integrated into generative AI workflows using LangChain to process documents for each of these specific tasks. The code blocks provided here have been trimmed down for brevity. Refer to our GitHub repository for detailed Python notebooks and a step-by-step walkthrough.
Amazon Textract LangChain document loader
Text extraction from documents is a crucial aspect when it comes to processing documents with LLMs. You can use Amazon Textract to extract unstructured raw text from documents and preserve the original semi-structured or structured objects like key-value pairs and tables present in the document. Document packages like healthcare and insurance claims or mortgages consist of complex forms that contain a lot of information across structured, semi-structured, and unstructured formats. Document extraction is an important step here because LLMs rely on this rich content to generate accurate and relevant responses; when the extracted text is incomplete or poorly structured, the quality of the LLM’s output suffers.
LangChain is a powerful open-source framework for integrating with LLMs. LLMs in general are versatile but may struggle with domain-specific tasks where deeper context and nuanced responses are needed. LangChain empowers developers in such scenarios to build agents that can break down complex tasks into smaller sub-tasks. The sub-tasks can then introduce context and memory into LLMs by connecting and chaining LLM prompts.
LangChain offers document loaders that can load and transform data from documents. You can use them to structure documents into preferred formats that can be processed by LLMs. The AmazonTextractPDFLoader is a service loader type of document loader that provides a quick way to automate document processing by using Amazon Textract in combination with LangChain. For more details on AmazonTextractPDFLoader, refer to the LangChain documentation. To use the Amazon Textract document loader, you start by importing it from the LangChain library:
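A minimal sketch of the import and a first load call. It assumes the langchain-community, boto3, and amazon-textract-caller packages are installed and that AWS credentials are configured; older LangChain releases expose the same class under langchain.document_loaders. The sample path and the is_supported helper are hypothetical and added only for illustration.

```python
from pathlib import Path

# File types Amazon Textract accepts (PDF, PNG, JPEG, TIFF).
SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def is_supported(file_path: str) -> bool:
    """Hypothetical pre-check: reject file types Textract cannot read."""
    return Path(file_path).suffix.lower() in SUPPORTED_SUFFIXES


def load_local_document(file_path: str):
    """Run a local file through Amazon Textract and return LangChain Documents."""
    if not is_supported(file_path):
        raise ValueError(f"Unsupported file type: {file_path}")
    # Imported inside the function so the helper above works without LangChain.
    from langchain_community.document_loaders import AmazonTextractPDFLoader

    loader = AmazonTextractPDFLoader(file_path)
    return loader.load()  # one Document per page


# Example call (needs AWS credentials; the path is a placeholder):
# documents = load_local_document("./samples/document.png")
```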
You can also store documents in Amazon S3 and refer to them using the s3:// URL pattern, as explained in Accessing a bucket using S3://, and pass this S3 path to the Amazon Textract PDF loader:
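A sketch of the S3 variant, under the same package and credential assumptions as any Textract call. The split_s3_uri helper and the function names are hypothetical; the loader is given an explicit Textract client because Textract generally needs to run in the same Region as the bucket it reads from.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key); raise on anything else."""
    parsed = urlparse(uri)
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"Not a valid S3 URI: {uri}")
    return parsed.netloc, key


def load_from_s3(uri: str, region_name: str):
    """Load a document stored in Amazon S3 through Amazon Textract."""
    split_s3_uri(uri)  # fail early on a malformed URI
    # Imported lazily so the URI helper works without boto3 or LangChain.
    import boto3
    from langchain_community.document_loaders import AmazonTextractPDFLoader

    # Create the Textract client in the Region that hosts the bucket.
    textract_client = boto3.client("textract", region_name=region_name)
    loader = AmazonTextractPDFLoader(uri, client=textract_client)
    return loader.load()


# Example call (bucket, key, and Region are placeholders):
# documents = load_from_s3("s3://my-bucket/docs/report.pdf", "us-east-1")
```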
A multi-page document contains multiple pages of text, which you can access through the documents object, a list of pages. The following code loops through the pages in the documents object and prints the text of each page.
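A minimal sketch of such a loop, assuming each page is a LangChain Document whose text is exposed through the page_content attribute. The stand-in pages below are placeholders so the example runs without calling Amazon Textract.

```python
from types import SimpleNamespace


def page_texts(documents):
    """Return the text of every page, in order."""
    return [page.page_content for page in documents]


def print_pages(documents):
    """Print each page's text under a numbered header."""
    for number, text in enumerate(page_texts(documents), start=1):
        print(f"--- Page {number} ---")
        print(text)


# Stand-in pages; in practice `documents` would be the list returned by the
# loader's load() method.
documents = [
    SimpleNamespace(page_content="Text of the first page"),
    SimpleNamespace(page_content="Text of the second page"),
]
print_pages(documents)
```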
Amazon Comprehend and LLMs can both be used for document classification. Amazon Comprehend is a natural language processing (NLP) service that uses ML to extract insights from text. Amazon Comprehend also supports custom classification model training with layout awareness on documents in formats such as PDF, Word, and images. For more information about using the Amazon Comprehend document classifier, refer to Amazon Comprehend document classifier adds layout support for higher accuracy.
When paired with LLMs, document classification becomes a powerful approach for managing large volumes of documents. LLMs are helpful in document classification because they can analyze the text, patterns, and contextual elements in the document using natural language understanding. You can also fine-tune them for specific document classes. When a new document type introduced in the IDP pipeline needs classification, the LLM can process text and categorize the document given a set of classes. The following sample code uses the LangChain document loader powered by Amazon Textract to extract the text from the document and then classifies the document based on that text. We use the Anthropic Claude v2 model via Amazon Bedrock to perform the classification.
In the following example, we first extract text from a patient discharge report and use an LLM to classify it given a list of three different document types, which includes DISCHARGE_SUMMARY and PRESCRIPTION. The following screenshot shows our report.
We use the following code:
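A minimal sketch of such a classification step, not a definitive implementation. It assumes the prompt wording shown here, an LLM passed in as a plain callable, and the Bedrock class from langchain-community (newer releases move it to the langchain-aws package as BedrockLLM). The third class name is a placeholder, and the function names are hypothetical.

```python
from typing import Callable, Sequence

# DISCHARGE_SUMMARY and PRESCRIPTION come from the text; the third name is a
# placeholder for illustration.
CLASSES = ["DISCHARGE_SUMMARY", "PRESCRIPTION", "OTHER_DOCUMENT"]

CLASSIFY_TEMPLATE = """Given a list of classes, classify the document into one of these classes. Skip any preamble text and just give the class name.

<classes>{classes}</classes>
<document>{doc_text}</document>
<classification>"""


def build_classification_prompt(doc_text: str, classes: Sequence[str]) -> str:
    """Fill the template with the candidate classes and the extracted text."""
    return CLASSIFY_TEMPLATE.format(classes=", ".join(classes), doc_text=doc_text)


def classify_document(doc_text: str, classes: Sequence[str],
                      llm: Callable[[str], str]) -> str:
    """Ask the model for a class and accept only a known class name."""
    answer = llm(build_classification_prompt(doc_text, classes))
    for name in classes:
        if name in answer:
            return name
    return "UNKNOWN"


def make_bedrock_llm(model_id: str = "anthropic.claude-v2"):
    """Return a prompt-to-text callable backed by Amazon Bedrock."""
    # Imported lazily so the helpers above work without AWS dependencies.
    import boto3
    from langchain_community.llms import Bedrock

    llm = Bedrock(client=boto3.client("bedrock-runtime"), model_id=model_id)
    return llm.invoke
```

With a live model, calling classify_document(documents[0].page_content, CLASSES, make_bedrock_llm()) on the text returned by the Amazon Textract loader would return one of the class names, or UNKNOWN if the answer matches none of them. Passing the model in as a callable also makes the step easy to test with a stub.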
The code produces the following output:
The provided document is a DISCHARGE_SUMMARY
Summarization involves condensing a given text or document into a shorter version while retaining its key information. This technique is beneficial for efficient information retrieval, which enables users to quickly grasp the key points of a document without reading the entire content. Although Amazon Textract doesn’t directly perform text summarization, it provides the foundational capability of extracting the entire text from documents. This extracted text serves as the input to our LLM for performing text summarization tasks.
Using the same sample discharge report, we use AmazonTextractPDFLoader to extract text from this document. As before, we use the Claude v2 model via Amazon Bedrock and initialize it with a prompt that contains the instructions on what to do with the text (in this case, summarization). Finally, we run the LLM chain by passing in the extracted text from the document loader. This runs an inference action on the LLM with the prompt that consists of the instructions to summarize, and the document’s text marked by Document. See the following code:
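A minimal sketch of the summarization flow under stated assumptions: the prompt wording and the document tags around the text are illustrative, the function names are hypothetical, and the chain is built with LangChain's pipe composition, which may differ from the chain class used in the original listing. It also assumes the langchain-community and langchain-core packages and configured AWS credentials.

```python
SUMMARY_TEMPLATE = """Given a full document, give me a concise summary. Skip any preamble text and just give the summary.

<document>{doc_text}</document>
<summary>"""


def join_pages(documents) -> str:
    """Concatenate the text of all pages returned by the document loader."""
    return "\n\n".join(page.page_content for page in documents)


def summarize(documents, chain) -> str:
    """Run the chain on the full document text and return the trimmed summary."""
    return chain.invoke({"doc_text": join_pages(documents)}).strip()


def make_summary_chain(model_id: str = "anthropic.claude-v2"):
    """Build a prompt-plus-model chain backed by Amazon Bedrock."""
    # Imported lazily so the helpers above work without AWS dependencies.
    import boto3
    from langchain_community.llms import Bedrock
    from langchain_core.prompts import PromptTemplate

    prompt = PromptTemplate.from_template(SUMMARY_TEMPLATE)
    llm = Bedrock(client=boto3.client("bedrock-runtime"), model_id=model_id)
    return prompt | llm  # prompt formatting piped into the model call


# Example call (needs AWS credentials and the loader output):
# summary = summarize(documents, make_summary_chain())
```

Very long documents can exceed the model's context window, so a production pipeline would likely need to split or truncate the text before this step.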