AWS Public Sector Blog
Implementing a unified search platform for educational content on Amazon OpenSearch Service
This post discusses how Amazon Web Services (AWS) can help you set up a unified search platform for educational content using Amazon OpenSearch Service. Amazon OpenSearch Service is a managed service that makes it easy to deploy, operate, and scale OpenSearch clusters in the AWS Cloud.
With digital learning becoming increasingly popular, managing a vast amount of educational data can be challenging. For students, accessing content in the format that best suits their needs makes faster and more effective learning possible. A unified search platform enables users to search across multiple indexes at once, providing a centralized view of all available resources. Using Amazon OpenSearch Service, education customers can build robust, scalable, and efficient search solutions for their students. By using the OpenSearch engine for querying and indexing, you can provide access to various education resources, such as videos, transcripts, research papers, and more.
Data ingestion and storage
Educational content comes in various formats, including videos, audio, PDFs, HTML pages, and more. To efficiently manage these resources, you can use Amazon Simple Storage Service (Amazon S3). Amazon S3 is an object storage service built to store and retrieve any amount of data from anywhere. It offers industry-leading durability, availability, performance, security, and virtually unlimited scalability at very low costs, making it an ideal choice for storing large volumes of educational materials.
To maintain a structured approach to different resources, you can organize file types into separate folders within an S3 bucket. This structure makes it straightforward to trigger a tailored AWS Lambda function for each format type: configure an S3 trigger for each Lambda function that filters on the object key prefix. This approach streamlines your workflow and applies the appropriate tailored processing to each content type.
Solution overview
To create a unified search platform for educational content, follow these high-level steps:
1. Use an S3 bucket as the ingestion point. Each file format is uploaded to its own folder inside the bucket.
2. This upload into the S3 bucket triggers a tailored Lambda function based on the content type:
a. If the file is a video or audio file, the Lambda function will trigger Amazon Transcribe to create a transcription and then call Amazon Bedrock to summarize the transcription.
b. If the file is an HTML page, extract the text from the HTML page using the Beautiful Soup library and then summarize the content by calling the Amazon Bedrock function.
c. If the file is a slide presentation, extract the text from the slide and presenter notes using the python-pptx library and then summarize the content by calling the Amazon Bedrock function.
d. If the file is a PDF file, the Lambda function will trigger Amazon Textract to extract the text and then call Amazon Bedrock to summarize.
3. After all the files have been processed, the resulting summary is saved into Amazon DynamoDB.
4. When that data arrives in DynamoDB, use the DynamoDB Streams feature to trigger a Lambda function.
5. The Lambda function structures and indexes the data.
6. This indexed data is then saved into Amazon OpenSearch Service domain and made available to search.
The following diagram shows this workflow.
Data processing and indexing
To make the stored content searchable and more accessible, it must undergo processing and indexing. Lambda functions play a crucial role, extracting metadata and content from the files stored in Amazon S3. These functions can be programmed to handle various tasks, such as extracting text from PDFs, transcribing audio and video files, and summarizing text using foundation models (FMs).
Once the data is extracted and processed, it is saved to a database. A second Lambda function then structures this data and sends it to OpenSearch for indexing. This process creates a powerful search capability across all the educational content so that users can quickly find relevant materials based on their queries.
Data processing
For the data processing, Lambda functions are automatically triggered when new files are uploaded to Amazon S3. For each file, you first need to extract the text or transcription, summarize it, and save this summary into a database. This information is indexed later. The method for extracting text varies depending on the original file format. Each file type requires a tailored processing function. The correct processing function can be triggered based on the folder structure in the S3 bucket using a trigger based on prefixes or the file extension with a trigger based on the file suffix.
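As an illustrative sketch, this routing logic can be expressed as a small dispatch function. The folder names, extensions, and handler labels below are assumptions for this example, not fixed by the services involved.

```python
# Sketch: route an uploaded S3 object key to the right processing step
# based on its folder (prefix) or file extension (suffix). Folder names
# and handler labels are illustrative assumptions.

def route_object(key: str) -> str:
    prefix = key.split("/", 1)[0].lower() if "/" in key else ""
    suffix = key.rsplit(".", 1)[-1].lower() if "." in key else ""

    if prefix in ("videos", "audio") or suffix in ("mp4", "mp3", "wav"):
        return "transcribe"        # Amazon Transcribe path
    if prefix == "pdfs" or suffix == "pdf":
        return "textract"          # Amazon Textract path
    if prefix == "slides" or suffix == "pptx":
        return "pptx"              # python-pptx path
    if prefix == "html" or suffix in ("html", "htm"):
        return "beautifulsoup"     # Beautiful Soup path
    return "unsupported"
```

In practice, each S3 event notification can target a dedicated Lambda function instead, with the prefix or suffix filter applied on the trigger itself rather than in code.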
1. Extract text or transcriptions
Efficient processing is crucial for making video and audio content on an educational platform searchable and accessible. Amazon Transcribe, a fully managed automatic speech recognition (ASR) service, can transcribe these audio and video files. The Amazon Transcribe StartTranscriptionJob API can be invoked automatically from Lambda functions to initiate transcription jobs for each uploaded file. Full sample code for the AWS SDKs can be found at Transcribing with the AWS SDKs.
Amazon Transcribe can identify dominant spoken languages and handle custom vocabularies for domain-specific terminology. It also offers speaker diarization for multi-speaker content and automatic content redaction for sensitive information. Throughout the transcription process, the Lambda function can continuously monitor the job status, ensuring complete transcription of the content. Once completed, results are retrieved from the specified output location in Amazon S3, typically in JSON format.
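A minimal sketch of kicking off a transcription job from Lambda follows. The bucket names are illustrative, and the parameter builder is separated from the boto3 call so the job configuration can be inspected without AWS credentials.

```python
# Sketch: build StartTranscriptionJob parameters for a file uploaded to
# Amazon S3. Bucket names are illustrative; the builder is kept separate
# from the AWS call so it can be exercised without credentials.

def build_transcription_job(bucket: str, key: str, output_bucket: str) -> dict:
    job_name = key.replace("/", "-").rsplit(".", 1)[0]
    media_format = key.rsplit(".", 1)[-1].lower()
    return {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": f"s3://{bucket}/{key}"},
        "MediaFormat": media_format,        # e.g. mp3, mp4, wav
        "IdentifyLanguage": True,           # let Transcribe detect the language
        "OutputBucketName": output_bucket,  # the JSON result lands here
    }

def start_transcription(bucket: str, key: str, output_bucket: str) -> None:
    import boto3  # local import: only needed when actually calling AWS
    transcribe = boto3.client("transcribe")
    transcribe.start_transcription_job(
        **build_transcription_job(bucket, key, output_bucket)
    )
```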
When dealing with books, PDFs, or syllabi, you can use a Lambda function that uses Amazon Textract to extract text. Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, layout elements, and data from scanned documents. It goes beyond optical character recognition (OCR) to identify and extract specific data, such as key-value pairs, tables, and forms, preserving the content’s logical structure.
The Lambda function interacts with the Amazon Textract API, initiating the extraction process using the StartDocumentTextDetection API. Once Amazon Textract completes its analysis, you can request the extracted text and document structure information using the GetDocumentTextDetection API. This extracted data can then be further processed. Sample code for the SDK can be found at Amazon Textract examples using SDK for Python (Boto3).
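To make the parsing step concrete, here is a sketch that assembles plain text from a GetDocumentTextDetection-style response. The sample response is a hand-written illustration of the real output shape, which also carries geometry, confidence scores, and pagination tokens.

```python
# Sketch: assemble plain text from a GetDocumentTextDetection response by
# keeping the LINE blocks (WORD blocks duplicate their content).

def textract_lines_to_text(response: dict) -> str:
    return "\n".join(
        block["Text"]
        for block in response.get("Blocks", [])
        if block["BlockType"] == "LINE"
    )

# Illustrative response fragment, hand-written for this example:
sample = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Course Syllabus"},
        {"BlockType": "WORD", "Text": "Course"},
        {"BlockType": "LINE", "Text": "Week 1: Introduction"},
    ]
}
print(textract_lines_to_text(sample))
```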
For handling slide presentations in the content processing system, you can create a custom solution using AWS Lambda with the python-pptx library. python-pptx is a Python library for creating, reading, and updating PowerPoint (.pptx) files, and it can extract information from presentation files, capturing both visible slide content and the valuable context hidden in presenter notes.
You can containerize your Lambda function with the python-pptx package and use this container image. The processing Lambda function is automatically triggered when a presentation file is uploaded to a designated folder in the S3 bucket.
For processing web-based content (HTML) in the educational platform, you need to combine AWS Lambda with the Beautiful Soup library. Beautiful Soup is a library that simplifies scraping information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree. You can efficiently parse and extract relevant text from HTML pages so you can capture the essence of web-based educational materials while filtering out extraneous elements.
You can add the Beautiful Soup library as a layer to the Lambda function or zip the library with your code. The processing Lambda function is automatically triggered when an HTML file is uploaded to a designated folder in the S3 bucket.
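A minimal sketch of the HTML extraction step with Beautiful Soup follows. The sample page is illustrative, and dropping script and style elements is one common filtering choice, not the only one.

```python
# Sketch: extract readable text from an HTML page with Beautiful Soup,
# dropping non-content elements and normalizing whitespace.
from bs4 import BeautifulSoup

def extract_html_text(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()  # remove code and styling, keep visible text
    return " ".join(soup.get_text(separator=" ").split())

# Illustrative sample page:
html = """
<html><head><style>body { color: red; }</style></head>
<body><h1>Photosynthesis</h1><p>Plants convert light into energy.</p>
<script>trackPageView();</script></body></html>
"""
print(extract_html_text(html))
```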
2. Summarization with Amazon Bedrock
After obtaining the full transcript, use Amazon Bedrock to generate a summary capturing the essence of the content. Amazon Bedrock is a fully managed service offering a choice of high-performing FMs through a single API. To summarize the text, first craft a prompt that instructs the model to summarize the transcript. Then invoke the FM in Amazon Bedrock and extract the summarized content from the model’s response.
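A hedged sketch of this step follows: the prompt wording and model ID are illustrative assumptions, and the Converse API call requires AWS credentials and model access, so the prompt builder is kept separate.

```python
# Sketch: summarize a transcript with Amazon Bedrock. The model ID is an
# illustrative assumption; swap in any model you have access to.

def build_summary_prompt(transcript: str) -> str:
    return (
        "Summarize the following lecture transcript in a short paragraph "
        "suitable for search results:\n\n" + transcript
    )

def summarize(transcript: str,
              model_id: str = "anthropic.claude-3-haiku-20240307-v1:0") -> str:
    import boto3  # local import: only needed when actually calling AWS
    bedrock = boto3.client("bedrock-runtime")
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user",
                   "content": [{"text": build_summary_prompt(transcript)}]}],
    )
    return response["output"]["message"]["content"][0]["text"]
```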
3. Storage in Amazon DynamoDB
The final step in the workflow is storing the processed information in Amazon DynamoDB. DynamoDB is a serverless, NoSQL database service that you can use to develop modern applications at any scale. Use the PutItem API call to save the key-value pairs. This call takes as input a map of attribute names and values, one entry for each attribute. This map could look something like the following code.
Item={
    'ID': path,         # Path in S3 of the original file
    'summary': summary, # Generated summary
    'title': title,     # Title for search results
    'type': <type>      # Content type (video, audio, pdf, etc.)
}
The flexible schema and low-latency access of DynamoDB make it ideal for storing and retrieving processed data efficiently.
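As a sketch, the write path might look like the following; the table name is an illustrative assumption, and the item builder mirrors the attribute map shown above.

```python
# Sketch: persist the processed summary with PutItem. The table name is an
# illustrative assumption; the item builder mirrors the map shown above.

def build_item(path: str, summary: str, title: str, content_type: str) -> dict:
    return {
        "ID": path,            # Path in S3 of the original file
        "summary": summary,    # Generated summary
        "title": title,        # Title for search results
        "type": content_type,  # Content type (video, audio, pdf, etc.)
    }

def save_item(item: dict, table_name: str = "educational-content") -> None:
    import boto3  # local import: only needed when actually calling AWS
    table = boto3.resource("dynamodb").Table(table_name)
    table.put_item(Item=item)
```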
Data indexing
After being processed, the data needs to be stored in the OpenSearch cluster. OpenSearch organizes data for fast retrieval through indexing, the same method other search engines use.
In OpenSearch, data is stored in documents, which are JSON objects containing fields and values. These documents are organized in indexes. An index is a collection of documents that follow the same structure, or mappings. Mappings specify the fields for a given document and the field types (such as text, integer, or date). Mappings are defined at index creation. You can adapt the mappings to fit your needs and create different indexes for various data. If you don’t specify any mappings, OpenSearch uses default settings, known as dynamic mapping.
Let’s say you are indexing an internal blog page you want to make available to search. The mapping for this page could look something like the following code.
"mappings": {
"properties": {
"title": { "type" : "text" },
"publication_date": { "type" : "date" },
"authors":{ "type" : "text" },
"content": { "type" : "text" }
}
}
In this example, the document is structured into four fields, each with its own type. These fields represent different attributes of the blog page, making each aspect searchable for end users or applications. If multiple values for a field are passed, OpenSearch will treat it as an array.
Before searching the data, it’s important to understand how to interact with an OpenSearch cluster and how OpenSearch analyzes searches. When you search against an index, OpenSearch analyzes both the text stored in the index and the query itself using an analyzer. By default, OpenSearch uses the standard analyzer, which removes most punctuation and lowercases letters. This works well for simple searches.
For more complex ways of analyzing your search, OpenSearch offers built-in analyzers you can specify in the index mapping. You can also build your own custom analyzers for very specific and complex data. For example, if you’d like the blog index title attribute to separate words by whitespace only, you can define the mapping to use the built-in whitespace analyzer.
"mappings": {
"properties": {
"title": {
"type" : "text",
"analyser": "Whitespace"
},
}
}
If you create a unified search platform across multiple data types (such as video, audio, or PDF), consider creating a separate index for each type. Each data type can have a different structure depending on its processing and may need to be analyzed differently. Having a different index for each data type also makes it easier for end users to perform customized searches or search a specific data type on its own.
Since communication with OpenSearch is done using HTTP requests, interacting with the cluster happens through requests such as PUT, GET, and DELETE. These requests create indexes, add documents, or search through the index. For instance, to add a document to an index, you would use the following code.
PUT <index_name>/_doc/<id>
{ "A JSON": "document" }
As mentioned in the Data processing section, the processed data is stored in a DynamoDB table. Use Amazon DynamoDB Streams to trigger a Lambda function that creates indexes and adds data to the cluster. The function, written in Python, uses the popular requests library, which lets you send HTTP/1.1 requests from code.
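A minimal sketch of that indexing function follows, under stated assumptions: the records carry only string attributes (as in this post's item map), the endpoint placeholder is illustrative, and a production function would also sign its requests (for example, with SigV4).

```python
# Sketch: turn DynamoDB Streams records into OpenSearch documents. The
# HTTP call is shown as a comment because it needs a real endpoint and
# authentication; the record handling below is plain Python.

def deserialize_image(image: dict) -> dict:
    # Handles the string ("S") attributes used in this post; other
    # DynamoDB attribute types (N, M, L, ...) would need more cases.
    return {k: v["S"] for k, v in image.items() if "S" in v}

def handler(event: dict, context=None) -> list:
    docs = []
    for record in event.get("Records", []):
        if record.get("eventName") not in ("INSERT", "MODIFY"):
            continue  # ignore REMOVE events
        doc = deserialize_image(record["dynamodb"]["NewImage"])
        docs.append(doc)
        # To index, send the document to the cluster, for example:
        # requests.put(f"https://<endpoint>/{doc['type']}/_doc/<id>", json=doc)
    return docs
```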
While the described solution focuses on near real-time ingestion with automatic triggers, there is also the possibility for bulk uploads in OpenSearch (for example, for historical data). OpenSearch provides a bulk API with a different syntax than previously shown.
POST _bulk
{ "index": { "_index": "<index1>", "_id": "<id>" } }
{ "A JSON": "document" }
{ "index": { "_index": "<index2>", "_id": "<id>" } }
{ "A second JSON": "document" }
Search interface and querying
Now that the data is indexed, it is available to search. Searching is done using HTTP GET requests. At its core, a search query involves specifying the index you want to search within and defining the search criteria using the Query DSL (domain-specific language). OpenSearch also supports other query languages, such as SQL and Piped Processing Language (PPL), but this post focuses on Query DSL because it is the most popular.
Search types depend on the complexity of the search solution. For example, match queries ask OpenSearch to match input to text within specified fields. Imagine a student wants to find content with titles containing the words generative AI. In Query DSL, this translates to the following code.
GET /<index_name>/_search
{
  "query": {
    "match": {
      "title": "Generative AI"
    }
  }
}
This search returns all documents in the blog index that match. OpenSearch also provides a relevance score for each result, computed with a probabilistic ranking framework called BM25. Each individual document receives a relevance score under "_score", and the complete search returns a "max_score", representing the highest relevance among the returned documents for the given query. Documents in the search response are ordered from highest to lowest relevance.
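As a sketch of both sides of this exchange, the following builds a match query and pulls ranked hits out of a response; the sample response is a hand-written illustration of the real response shape.

```python
# Sketch: build a match query and read ranked hits from a search
# response. The sample response is a hand-written illustration.

def match_query(field: str, text: str) -> dict:
    return {"query": {"match": {field: text}}}

def ranked_titles(response: dict) -> list:
    hits = response["hits"]["hits"]  # already ordered by descending _score
    return [(h["_source"]["title"], h["_score"]) for h in hits]

sample_response = {
    "hits": {
        "max_score": 1.9,
        "hits": [
            {"_score": 1.9, "_source": {"title": "Intro to generative AI"}},
            {"_score": 0.7, "_source": {"title": "AI ethics"}},
        ],
    }
}
print(match_query("title", "Generative AI"))
print(ranked_titles(sample_response))
```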
For a unified search platform, users need to be able to search across multiple indexes to view all available resources and decide what suits them best. Using Query DSL, OpenSearch allows multi-index queries by specifying multiple indexes in the request. Additionally, in some scenarios, users might want to filter content, such as seeing only blogs or videos from the last 2 years. This more complex search is supported by OpenSearch’s Boolean expressions.
In the search shown in the following code block, the must clause combines multiple conditions, equivalent to an AND statement. It also specifies two indexes, blogs and videos, to search across. In this particular example, OpenSearch returns only documents published on or after January 1, 2023, whose content matches the words Generative AI.
GET /blogs,videos/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "content": "Generative AI"
          }
        },
        {
          "range": {
            "publication_date": {
              "gte": "2023-01-01"
            }
          }
        }
      ]
    }
  }
}
These searches illustrate how to interact with OpenSearch. Using AWS, you can perform those queries through Lambda functions using the Python requests library, as explained in the Data indexing section of this post. For end users to interact with OpenSearch, a different Lambda function can be put behind an API hosted on Amazon API Gateway so it can be called from a front-end application. More information can be found at Tutorial: Creating a search application with Amazon OpenSearch Service.
Conclusion
Implementing a unified search platform using Amazon OpenSearch Service provides educational institutions with a scalable, efficient, and cost-effective solution to manage and search their vast content libraries. By consolidating different data sources into a single searchable data store, educational institutions make it possible for students to quickly and easily access the resources they need, enhancing their learning experience. This architecture not only meets the demands of modern digital education but also positions institutions to adapt to future growth and evolving content needs.
To learn more or get started, refer to the Amazon OpenSearch Service Documentation and Amazon OpenSearch Service in the AWS Management Console.