AWS Machine Learning Blog

Use a generative AI foundation model for summarization and question answering using your own data

Large language models (LLMs) can be used to analyze complex documents and provide summaries and answers to questions. The post Domain-adaptation Fine-tuning of Foundation Models in Amazon SageMaker JumpStart on Financial data describes how to fine-tune an LLM using your own dataset. Once you have a solid LLM, you’ll want to expose that LLM to business users to process new documents, which could be hundreds of pages long. In this post, we demonstrate how to construct a real-time user interface to let business users process a PDF document of arbitrary length. Once the file is processed, you can summarize the document or ask questions about the content. The sample solution described in this post is available on GitHub.

Working with financial documents

Financial statements like quarterly earnings reports and annual reports to shareholders are often tens or hundreds of pages long. These documents contain a lot of boilerplate language like disclaimers and legal language. If you want to extract the key data points from one of these documents, you need both time and some familiarity with the boilerplate language so you can identify the interesting facts. And of course, you can’t ask an LLM questions about a document it has never seen.

LLMs used for summarization have a limit on the number of tokens (sub-word units of text) that can be passed into the model, and with some exceptions, this limit is typically no more than a few thousand tokens. That normally rules out summarizing longer documents in a single call.

Our solution handles documents that exceed an LLM’s maximum token sequence length and makes those documents available to the LLM for question answering.

Solution overview

Our design has three important pieces:

  • It has an interactive web application for business users to upload and process PDFs
  • It uses the langchain library to split a large PDF into more manageable chunks
  • It uses the retrieval augmented generation technique to let users ask questions about new data that the LLM hasn’t seen before

As shown in the following diagram, we use a front end implemented with React JavaScript hosted in an Amazon Simple Storage Service (Amazon S3) bucket fronted by Amazon CloudFront. The front-end application lets users upload PDF documents to Amazon S3. After the upload is complete, you can trigger a text extraction job powered by Amazon Textract. As part of the post-processing, an AWS Lambda function inserts special markers into the text indicating page boundaries. When that job is done, you can invoke an API that summarizes the text or answers questions about it.
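
The page-marker post-processing step can be sketched as follows. This is a minimal illustration, not the sample's actual Lambda code: the function name `add_markers`, the `pages_per_chunk` parameter, and the assumption that the extracted text arrives as one string per page are all hypothetical. The `<PAGE>` and `<CHUNK>` marker strings are the ones used as separators in the summarization code later in this post.

```python
from typing import List

PAGE_MARKER = "<PAGE>"
CHUNK_MARKER = "<CHUNK>"


def add_markers(pages: List[str], pages_per_chunk: int = 2) -> str:
    """Join per-page text, marking page and multi-page chunk boundaries."""
    if pages_per_chunk < 1:
        raise ValueError("pages_per_chunk must be at least 1")
    parts = []
    for i, page_text in enumerate(pages):
        if i > 0:
            # Every pages_per_chunk pages, a chunk marker takes the place
            # of the ordinary page marker.
            at_chunk_boundary = i % pages_per_chunk == 0
            parts.append(CHUNK_MARKER if at_chunk_boundary else PAGE_MARKER)
        parts.append(page_text)
    return "".join(parts)


marked = add_markers(["page one", "page two", "page three"], pages_per_chunk=2)
print(marked)
```

Keeping the markers as plain strings inside the text means the downstream splitter needs no knowledge of Amazon Textract's output format.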

Because some of these steps may take some time, the architecture uses a decoupled asynchronous approach. For example, the call to summarize a document invokes a Lambda function that posts a message to an Amazon Simple Queue Service (Amazon SQS) queue. Another Lambda function picks up that message and starts an Amazon Elastic Container Service (Amazon ECS) AWS Fargate task. The Fargate task calls the Amazon SageMaker inference endpoint. We use a Fargate task here because summarizing a very long PDF may take more time and memory than a Lambda function has available. When the summarization is done, the front-end application can pick up the results from an Amazon DynamoDB table.
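
The first hop of that asynchronous flow, where a Lambda function posts a message and returns without waiting for the summary, might look like the sketch below. The message fields (`job_id`, `bucket`, `key`), the function names, and the queue URL are assumptions made for illustration. The only AWS call used is `send_message` with `QueueUrl` and `MessageBody`, as provided by the boto3 SQS client, and an in-memory stand-in replaces the real client so the example runs without credentials.

```python
import json
from typing import Any, Dict, List, Tuple


def build_job_message(job_id: str, bucket: str, key: str) -> str:
    """Serialize a summarization request as a JSON message body."""
    return json.dumps({"job_id": job_id, "bucket": bucket, "key": key})


def enqueue_job(sqs_client: Any, queue_url: str, body: str) -> Dict[str, Any]:
    """Post the request to the queue and return immediately.

    sqs_client is expected to behave like boto3.client("sqs").
    """
    return sqs_client.send_message(QueueUrl=queue_url, MessageBody=body)


class FakeSqs:
    """In-memory stand-in for the SQS client (hypothetical, for testing)."""

    def __init__(self) -> None:
        self.messages: List[Tuple[str, str]] = []

    def send_message(self, QueueUrl: str, MessageBody: str) -> Dict[str, Any]:
        self.messages.append((QueueUrl, MessageBody))
        return {"MessageId": str(len(self.messages))}


fake = FakeSqs()
body = build_job_message("job-1", "example-bucket", "uploads/report.pdf")
result = enqueue_job(fake, "https://example.invalid/queue", body)
print(result["MessageId"], json.loads(fake.messages[0][1])["key"])
```

Because the caller only enqueues work, the API response stays fast no matter how long the Fargate task takes; the front end later looks up the result by its job identifier.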

For summarization, we use AI21’s Summarize model, one of the foundation models available through Amazon SageMaker JumpStart. This model handles documents of up to 10,000 words (approximately 40 pages), so we use langchain’s text splitter to make sure that each summarization call to the LLM stays within that limit. For text generation, we use Cohere’s Medium model, and we use GPT-J for embeddings, both via JumpStart.

Summarization processing

When handling larger documents, we need to define how to split the document into smaller pieces. When we get the text extraction results back from Amazon Textract, we insert markers for larger chunks of text (a configurable number of pages), individual pages, and line breaks. Langchain will split based on those markers and assemble smaller documents that are under the token limit. See the following code:

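from langchain.text_splitter import RecursiveCharacterTextSplitter

# Try the coarsest marker first, then fall back to finer ones
text_splitter = RecursiveCharacterTextSplitter(
    separators=["<CHUNK>", "<PAGE>", "\n"],
    chunk_size=int(chunk_size),
    chunk_overlap=int(chunk_overlap))

# Read the marked-up text produced by the extraction step
with open(local_path) as f:
    doc = f.read()
texts = text_splitter.split_text(doc)
print(f"Number of splits: {len(texts)}")

llm = SageMakerLLM(endpoint_name=endpoint_name)

# Summarize each split separately, then join the partial summaries
responses = []
for t in texts:
    r = llm(t)
    responses.append(r)
summary = "\n".join(responses)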
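
To make the splitting behavior more concrete, here is a simplified, pure-Python stand-in for marker-based recursive splitting. It is not langchain's implementation: it ignores chunk overlap, measures size in characters, and the function name `split_recursive` is hypothetical. It tries the coarsest separator first, merges neighboring pieces while they fit, and recurses with finer separators only for pieces that are still too long.

```python
from typing import List


def split_recursive(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split text into chunks of at most chunk_size characters where possible."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: emit the piece even though it is too long.
        return [text]
    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    buffer = ""
    for part in text.split(sep):
        if not part.strip():
            continue
        if len(part) > chunk_size:
            # Flush what we have, then split the oversized part more finely.
            if buffer:
                chunks.append(buffer)
                buffer = ""
            chunks.extend(split_recursive(part, finer, chunk_size))
            continue
        candidate = buffer + sep + part if buffer else part
        if len(candidate) <= chunk_size:
            buffer = candidate
        else:
            chunks.append(buffer)
            buffer = part
    if buffer:
        chunks.append(buffer)
    return chunks


text = "aaaa<PAGE>bbbb<CHUNK>cccccccccc\ndd"
separators = ["<CHUNK>", "<PAGE>", "\n"]
print(split_recursive(text, separators, chunk_size=15))
print(split_recursive(text, separators, chunk_size=12))
```

With the larger chunk size, the text breaks only at the `<CHUNK>` marker; with the smaller one, the splitter has to fall back to page and line boundaries.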

The LLM in the summarization chain is a thin wrapper around our SageMaker endpoint:

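import ai21
from typing import List, Optional
from langchain.llms.base import LLM

class SageMakerLLM(LLM):
    endpoint_name: str

    @property
    def _llm_type(self) -> str:
        return "summarize"

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        response = ai21.Summarize.execute(
            # arguments truncated in the source
        )
        return response.summary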

Question answering

In the retrieval augmented generation method, we first split the document into smaller segments. We create embeddings for each segment and store them in the open-source Chroma vector database via langchain’s interface. We save the database in an Amazon Elastic File System (Amazon EFS) file system for later use. See the following code:

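documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500,
                                               chunk_overlap=0)
texts = text_splitter.split_documents(documents)
print(f"Number of splits: {len(texts)}")

# Embed each chunk and store the vectors in Chroma
embeddings = SMEndpointEmbeddings(
    # constructor arguments truncated in the source
)
vectordb = Chroma.from_documents(texts, embeddings,
    # remaining arguments truncated in the source
)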

When the embeddings are ready, the user can ask a question. We search the vector database for the text chunks that most closely match the question:

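embeddings = SMEndpointEmbeddings(
    # constructor arguments truncated in the source
)
# Reopen the vector database that was persisted earlier
vectordb = Chroma(persist_directory=persist_directory,
    # remaining arguments truncated in the source
)
docs = vectordb.similarity_search_with_score(question)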
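
Conceptually, the similarity search compares the embedding of the question with the stored embedding of each chunk and returns the closest ones. The sketch below illustrates that idea with brute-force cosine distance over toy three-dimensional vectors. It is not how Chroma is implemented, which uses its own index and distance metric, and the function names and example vectors are made up for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def nearest_chunks(
    query_vec: Sequence[float],
    store: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Score every stored chunk against the query and keep the k closest."""
    scored = [(text, cosine_distance(query_vec, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy "embeddings"; a real embedding model returns much longer vectors.
store = [
    ("Revenue grew 12% year over year.", [0.9, 0.1, 0.0]),
    ("This report contains forward-looking statements.", [0.0, 0.2, 0.9]),
    ("Operating expenses were flat.", [0.6, 0.7, 0.1]),
]
query = [1.0, 0.0, 0.0]
results = nearest_chunks(query, store, k=2)
for chunk_text, distance in results:
    print(f"{distance:.3f}  {chunk_text}")
```

A brute-force scan like this is fine for a single document; a vector database earns its keep once the number of chunks grows large.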

We take the closest matching chunk and use it as context for the text generation model to answer the question:

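cohere_client = Client(endpoint_name=endpoint_qa)
# high_score_idx is the index of the closest matching chunk;
# its computation is not shown in this excerpt
context = docs[high_score_idx][0].page_content.replace("\n", "")
qa_prompt = f'Context={context}\nQuestion={question}\nAnswer='
response = cohere_client.generate(prompt=qa_prompt,
    # remaining arguments truncated in the source
)
answer = response.generations[0].text.strip().replace('\n', '')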
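
The step of choosing the best chunk and assembling the prompt can be sketched on its own. The helper names below are hypothetical, and the scored chunks are plain `(text, score)` tuples rather than langchain documents. Whether a lower or higher score means a closer match depends on the vector store. Our understanding is that langchain's Chroma wrapper returns a distance, where lower is closer, so the sketch defaults to picking the minimum; treat that default as an assumption to verify against your own setup.

```python
from typing import List, Tuple


def pick_context(scored_chunks: List[Tuple[str, float]],
                 lower_is_better: bool = True) -> str:
    """Return the text of the best-scoring chunk as a single line."""
    if not scored_chunks:
        raise ValueError("no chunks were retrieved")
    choose = min if lower_is_better else max
    text, _ = choose(scored_chunks, key=lambda pair: pair[1])
    # Collapse line breaks into spaces. The snippet in the post removes
    # newlines outright; this sketch keeps words separated instead.
    return " ".join(text.split())


def build_qa_prompt(context: str, question: str) -> str:
    """Assemble the Context/Question/Answer prompt format used in the post."""
    return f"Context={context}\nQuestion={question}\nAnswer="


scored = [
    ("Revenue grew\n12% year over year.", 0.21),
    ("This report contains forward-looking statements.", 0.87),
]
context = pick_context(scored)
prompt = build_qa_prompt(context, "How much did revenue grow?")
print(prompt)
```

Ending the prompt with `Answer=` nudges the text generation model to continue with the answer itself rather than restating the question.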

User experience

Although LLMs represent advanced data science, most of the use cases for LLMs ultimately involve interaction with non-technical users. Our example web application handles an interactive use case where business users can upload and process a new PDF document.

The following diagram shows the user interface. A user starts by uploading a PDF. After the document is stored in Amazon S3, the user is able to start the text extraction job. When that’s complete, the user can invoke the summarization task or ask questions. The user interface exposes some advanced options like the chunk size and chunk overlap, which would be useful for advanced users who are testing the application on new documents.

User interface

Next steps

LLMs provide significant new information retrieval capabilities. Business users need convenient access to those capabilities. There are two directions for future work to consider:

  • Take advantage of the powerful LLMs already available in JumpStart foundation models. With just a few lines of code, our sample application could deploy and make use of advanced LLMs from AI21 and Cohere for text summarization and generation.
  • Make these capabilities accessible to non-technical users. A prerequisite to processing PDF documents is extracting text from the document, and summarization jobs may take several minutes to run. That calls for a simple user interface with asynchronous backend processing capabilities, which is easy to design using cloud-native services like Lambda and Fargate.

We also note that a PDF document is semi-structured information. Important cues like section headings are difficult to identify programmatically, because they rely on font sizes and other visual indicators. Identifying the underlying structure of information helps the LLM process the data more accurately, at least until LLMs can handle input of unbounded length.


Conclusion

In this post, we showed how to build an interactive web application that lets business users upload and process PDF documents for summarization and question answering. We saw how to take advantage of JumpStart foundation models to access advanced LLMs, and how to use text splitting and retrieval augmented generation techniques to process longer documents and make them available as information to the LLM.

There is little reason not to make these powerful capabilities available to your users now. We encourage you to start using the JumpStart foundation models today.

About the author

Randy DeFauw is a Senior Principal Solutions Architect at AWS. He holds an MSEE from the University of Michigan, where he worked on computer vision for autonomous vehicles. He also holds an MBA from Colorado State University. Randy has held a variety of positions in the technology space, ranging from software engineering to product management. He entered the Big Data space in 2013 and continues to explore that area. He is actively working on projects in the ML space and has presented at numerous conferences including Strata and GlueCon.