Build multimodal search with Amazon OpenSearch Service

Multimodal search enables both text and image search capabilities, transforming how users access data through search applications. Consider building an online fashion retail store: you can enhance the search experience with an application that lets customers search with text, upload an image depicting a desired style, or combine the uploaded image with text to find the most relevant items. Multimodal search gives users more flexibility in how they find the most relevant information.

To enable multimodal search across text, images, and combinations of the two, you generate embeddings for both text-based image metadata and the image itself. Text embeddings capture document semantics, while image embeddings capture visual attributes that help you build rich image search applications.

Amazon Titan Multimodal Embeddings G1 is a multimodal embedding model that generates embeddings to facilitate multimodal search. These embeddings are stored and managed efficiently using specialized vector stores such as Amazon OpenSearch Service, which is designed to store and retrieve large volumes of high-dimensional vectors alongside structured and unstructured data. By using this technology, you can build rich search applications that seamlessly integrate text and visual information.

Amazon OpenSearch Service and Amazon OpenSearch Serverless support the vector engine, which you can use to store and run vector searches. In addition, OpenSearch Service supports neural search, which provides out-of-the-box machine learning (ML) connectors. These ML connectors enable OpenSearch Service to seamlessly integrate with embedding models and large language models (LLMs) hosted on Amazon Bedrock, Amazon SageMaker, and other remote ML platforms such as OpenAI and Cohere. When you use the neural plugin’s connectors, you don’t need to build additional pipelines external to OpenSearch Service to interact with these models during indexing and searching.

This blog post provides a step-by-step guide for building a multimodal search solution using OpenSearch Service. You will use ML connectors to integrate OpenSearch Service with the Amazon Bedrock Titan Multimodal Embeddings model to infer embeddings for your multimodal documents and queries. This post illustrates the process by showing you how to ingest a retail dataset containing both product images and product descriptions into your OpenSearch Service domain and then perform a multimodal search by using vector embeddings generated by the Titan multimodal model. The code used in this tutorial is open source and available on GitHub for you to access and explore.

Multimodal search solution architecture

We will provide the steps required to set up multimodal search using OpenSearch Service. The following image depicts the solution architecture.

Figure 1: Multimodal search architecture

The workflow depicted in the preceding figure is:

1. You download the retail dataset from Amazon Simple Storage Service (Amazon S3) and ingest it into an OpenSearch k-NN index using an OpenSearch ingest pipeline.
2. OpenSearch Service calls the Amazon Bedrock Titan Multimodal Embeddings model to generate multimodal vector embeddings for both the product description and image.
3. Through an OpenSearch Service client, you pass a search query.
4. OpenSearch Service calls the Amazon Bedrock Titan Multimodal Embeddings model to generate a vector embedding for the search query.
5. OpenSearch Service runs the neural search and returns the search results to the client.

Let’s look at steps 1, 2, and 4 in more detail.

Step 1: Ingestion of the data into OpenSearch

This step involves the following OpenSearch Service features:

Ingest pipelines – An ingest pipeline is a sequence of processors that are applied to documents as they’re ingested into an index. Here you use a text_image_embedding processor to generate combined vector embeddings for the image and image description.
k-NN index – The k-NN index introduces a custom data type, knn_vector, which allows users to ingest vectors into an OpenSearch index and perform different kinds of k-NN searches. You use the k-NN index to store both the general field data types, such as text, numeric, etc., and specialized field data types, such as knn_vector.

Steps 2 and 4: OpenSearch calls the Amazon Bedrock Titan model

OpenSearch Service uses the Amazon Bedrock connector to generate embeddings for the data. When you send the image and text as part of your indexing and search requests, OpenSearch uses this connector to exchange the inputs for the equivalent embeddings from the Amazon Bedrock Titan model. The highlighted blue box in the architecture diagram depicts the integration of OpenSearch with Amazon Bedrock using this ML connector feature. This direct integration eliminates the need for an additional component (for example, AWS Lambda) to facilitate the exchange between the two services.

Solution overview

In this post, you will build and run multimodal search using a sample retail dataset. You will use the same multimodal embeddings throughout and experiment with text-only search, image-only search, and combined text and image search in OpenSearch Service.

Prerequisites

Create an OpenSearch Service domain. For instructions, see Creating and managing Amazon OpenSearch Service domains. Make sure the following settings are applied when you create the domain, while leaving other settings as default.
- OpenSearch version is 2.13
- The domain has public access
- Fine-grained access control is enabled
- A master user is created
Set up a Python client to interact with the OpenSearch Service domain, preferably on a Jupyter Notebook interface.
Add model access in Amazon Bedrock. For instructions, see add model access.

Note that you need to refer to the Jupyter Notebook in the GitHub repository to run the following steps using Python code in your client environment. The following sections provide the sample blocks of code that contain only the HTTP request path and the request payload to be passed to OpenSearch Service at every step.
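Because each step below gives only a request path and a payload, it can help to see how the two might be combined into an HTTP request. The following is a minimal sketch, not the notebook's client code: it assumes the master user lives in the internal user database so that basic authentication works, and the endpoint, pipeline name, and credentials are hypothetical placeholders. It only builds the request and does not send it.

```python
import base64
import json
import urllib.request


def build_request(endpoint, path, payload, user, password, method="PUT"):
    """Build (but do not send) an authenticated JSON request for a domain."""
    url = f"{endpoint.rstrip('/')}/{path.lstrip('/')}"
    # Basic authentication header built from the master user credentials.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method=method,
    )


# Hypothetical values; replace them with your own endpoint and credentials.
request = build_request(
    "https://my-domain.example.com",
    "_ingest/pipeline/my-pipeline",
    {"description": "example"},
    "master-user",
    "master-password",
)
# To send the request: urllib.request.urlopen(request)
```

In practice, a dedicated OpenSearch client library handles authentication, retries, and response parsing for you; the sketch only shows where the path and payload from each step fit.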

Data overview and preparation

You will be using a retail dataset that contains 2,465 retail product samples that belong to different categories such as accessories, home decor, apparel, housewares, books, and instruments. Each product contains metadata including the ID, current stock, name, category, style, description, price, image URL, and gender affinity of the product. You will be using only the product image and product description fields in the solution.

A sample product image and product description from the dataset are shown in the following image:

Figure 2: Sample product image and description

In addition to the original product image, the textual description of the image provides additional metadata for the product, such as color, type, style, suitability, and so on. For more information about the dataset, visit the retail demo store on GitHub.

Step 1: Create the OpenSearch-Amazon Bedrock ML connector

The OpenSearch Service console provides a streamlined integration process that allows you to deploy an Amazon Bedrock-ML connector for multimodal search within minutes. OpenSearch Service console integrations provide AWS CloudFormation templates to automate the steps of Amazon Bedrock model deployment and Amazon Bedrock-ML connector creation in OpenSearch Service.

In the OpenSearch Service console, navigate to Integrations as shown in the following image and search for Titan multi-modal. This returns the CloudFormation template named Integrate with Amazon Bedrock Titan Multi-modal, which you will use in the following steps.

Figure 3: Configure domain

Select Configure domain and choose ‘Configure public domain’.
You will be automatically redirected to a CloudFormation template stack as shown in the following image, where most of the configuration is pre-populated for you, including the Amazon Bedrock model, the ML model name, and the AWS Identity and Access Management (IAM) role that is used by Lambda to invoke your OpenSearch domain. Update Amazon OpenSearch Endpoint with your OpenSearch domain endpoint and Model Region with the AWS Region in which your model is available.

Figure 4: Create a CloudFormation stack

Before you deploy the stack by choosing ‘Create Stack’, you need to grant the permissions that the stack requires to create the ML connector. The CloudFormation template creates a Lambda IAM role for you with the default name LambdaInvokeOpenSearchMLCommonsRole, which you can override if you want a different name. You need to map this IAM role as a backend role for the ml_full_access role in the OpenSearch Dashboards Security plugin so that the Lambda function can create the ML connector. To do so:
- Log in to OpenSearch Dashboards using the master user credentials that you created as part of the prerequisites. You can find the Dashboards endpoint on your domain dashboard on the OpenSearch Service console.
- From the main menu, choose Security, Roles, and select the ml_full_access role.
- Choose Mapped users, Manage mapping.
- Under Backend roles, add the ARN of the Lambda role (arn:aws:iam::<account-id>:role/LambdaInvokeOpenSearchMLCommonsRole) that needs permission to call your domain.
- Select Map and confirm the user or role shows up under Mapped users.

Figure 5: Set permissions in the OpenSearch Dashboards Security plugin

Return to the CloudFormation stack console, select the check box ‘I acknowledge that AWS CloudFormation might create IAM resources with custom names’, and choose ‘Create Stack’.
After the stack is deployed, it creates the Amazon Bedrock-ML connector (ConnectorId) and a model identifier (ModelId).

Figure 6: CloudFormation stack outputs

Copy the ModelId from the Outputs tab of the CloudFormation stack whose name starts with the prefix ‘OpenSearch-bedrock-mm-’ in your CloudFormation console. You will use this ModelId in the following steps.
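
Before moving on, you may want to confirm that the model behind the ModelId is ready to serve requests. The sketch below assumes that you fetch the model with a GET request to the ML Commons model path and that the response carries a model_state field; the sample response is abbreviated and its values are hypothetical.

```python
def model_path(model_id):
    """Request path for retrieving a model registered with ML Commons."""
    return f"_plugins/_ml/models/{model_id}"


def is_deployed(model_info):
    """Return True when the model reports that it is deployed."""
    return model_info.get("model_state") == "DEPLOYED"


# Abbreviated, hypothetical response body from a GET on model_path(...).
sample_response = {"name": "example-titan-model", "model_state": "DEPLOYED"}

path = model_path("<model_id>")
ready = is_deployed(sample_response)
```

If the model is not yet deployed, the ingest pipeline and neural queries in the following steps are likely to fail when they try to call it.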

Step 2: Create the OpenSearch ingest pipeline with the text_image_embedding processor

You can create an ingest pipeline with the text_image_embedding processor, which transforms the images and descriptions into embeddings during the indexing process.

In the following request payload, you provide the following parameters to the text_image_embedding processor to specify which index fields to convert to embeddings, which field stores the vector embeddings, and which ML model performs the vector conversion.

model_id (<model_id>) – The model identifier from the previous step.
embedding (<vector_embedding>) – The k-NN field that stores the vector embeddings.
field_map (<product_description> and <image_binary>) – The field names of the product description and the product image in binary format.

path = "_ingest/pipeline/<bedrock-multimodal-ingest-pipeline>"

# Replace each "<...>" placeholder string with your own value.
payload = {
    "description": "A text/image embedding pipeline",
    "processors": [
        {
            "text_image_embedding": {
                "model_id": "<model_id>",
                "embedding": "<vector_embedding>",
                "field_map": {
                    "text": "<product_description>",
                    "image": "<image_binary>"
                }
            }
        }
    ]
}
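
One way to check the pipeline before indexing the whole dataset is to run a single document through the ingest pipeline simulate endpoint. The following sketch only builds the path and request body; the pipeline name, description, and image string are hypothetical, and the default field names assume the product_description and image_binary fields used in this post.

```python
def build_simulate_request(pipeline_name, description, image_b64,
                           text_field="product_description",
                           image_field="image_binary"):
    """Return (path, payload) for simulating one document through a pipeline."""
    path = f"_ingest/pipeline/{pipeline_name}/_simulate"
    payload = {
        "docs": [
            {"_source": {text_field: description, image_field: image_b64}}
        ]
    }
    return path, payload


# Hypothetical pipeline name and document values.
path, payload = build_simulate_request(
    "my-multimodal-pipeline",
    "Black leather sandal",
    "<base64-encoded image>",
)
```

Sending this payload with a POST request should return the document as the pipeline would store it, including the generated embedding field, without writing anything to an index.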

Step 4: Create the k-NN index and ingest the retail dataset

Create the k-NN index and set the pipeline created in the previous step as the default pipeline. Set index.knn to True to perform an approximate k-NN search. The vector_embedding field must be mapped as a knn_vector, and its dimension must match the number of dimensions of the vector that the model produces.

Amazon Titan Multimodal Embeddings G1 lets you choose the size of the output vector (256, 512, or 1024). In this post, you will use the model's default 1,024-dimensional vectors. You can check the model's vector dimensions in the Amazon Bedrock console by choosing ‘Providers’ -> ‘Amazon’ tab -> ‘Titan Multimodal Embeddings G1’ tab -> ‘Model attributes’.

Given the small size of the dataset and to bias for better recall, you use the faiss engine with the hnsw algorithm and the default l2 space type for your k-NN index. For more information about different engines and space types, refer to k-NN index.

payload = {
    "settings": {
        "index.knn": True,
        "default_pipeline": "<ingest-pipeline>"
    },
    "mappings": {
        "properties": {
            "vector_embedding": {
                "type": "knn_vector",
                "dimension": 1024,  # must match the model's output vector size
                "method": {
                    "engine": "faiss",
                    "space_type": "l2",
                    "name": "hnsw",
                    "parameters": {}
                }
            },
            "product_description": {"type": "text"},
            "image_url": {"type": "text"},
            "image_binary": {"type": "binary"}
        }
    }
}
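
To make the l2 space type more concrete, the following sketch computes the squared Euclidean distance between two small vectors and converts it to a relevance score. It assumes the conversion score = 1 / (1 + d), where d is the squared L2 distance, which is how the k-NN documentation describes scoring for this space type; treat it as an illustration rather than the engine's implementation. The two-dimensional vectors are made up, whereas real embeddings in this post have 1,024 dimensions.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def l2_score(a, b):
    """Assumed score conversion for the l2 space type: 1 / (1 + d)."""
    return 1.0 / (1.0 + squared_l2(a, b))


query_vector = [0.0, 0.0]
document_vector = [3.0, 4.0]

distance = squared_l2(query_vector, document_vector)  # 25.0
score = l2_score(query_vector, document_vector)       # 1 / 26
```

Under this conversion, a smaller distance gives a higher score, and identical vectors score 1.0.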

Finally, you ingest the retail dataset into the k-NN index using a bulk request. For the ingestion code, refer to step 7, ‘Ingest the dataset into k-NN index using Bulk request’, in the Jupyter notebook.
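
As a rough illustration of what a bulk request body looks like, the sketch below builds the newline-delimited JSON that the bulk API expects: one action line followed by one document line per product. This is not the notebook's code. The index name and the keys of the input dictionaries (id, description, image_url, image_bytes) are hypothetical, and the sketch assumes the image is sent as a base64-encoded string in the image_binary field.

```python
import base64
import json


def build_bulk_body(index_name, products):
    """Build an NDJSON bulk body with one index action per product."""
    lines = []
    for product in products:
        # Action line: tells OpenSearch which index and document ID to use.
        lines.append(json.dumps(
            {"index": {"_index": index_name, "_id": product["id"]}}
        ))
        # Document line: the fields defined in the index mapping.
        lines.append(json.dumps({
            "product_description": product["description"],
            "image_url": product["image_url"],
            "image_binary": base64.b64encode(product["image_bytes"]).decode("ascii"),
        }))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical sample product; real image bytes would come from the dataset.
sample_products = [{
    "id": "1",
    "description": "Black leather sandal",
    "image_url": "https://example.com/images/1.jpg",
    "image_bytes": b"\xff\xd8\xff",
}]
bulk_body = build_bulk_body("retail-index", sample_products)
```

Because the index sets the ingest pipeline as its default pipeline, each document in the bulk request passes through the text_image_embedding processor without any extra parameters on the request.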

Step 5: Perform multimodal search experiments

Perform the following experiments to explore multimodal search and compare results. For text search, use the sample query “Trendy footwear for women” and set the number of results to 5 (size) throughout the experiments.

Experiment 1: Lexical search

This experiment shows you the limitations of simple lexical search and how the results can be improved using multimodal search.

Run a match query against the product_description field by using the following example query payload:

payload = {
    "query": {
        "match": {
            "product_description": {
                "query": "Trendy footwear for women"
            }
        }
    },
    "size": 5
}

Results:

Figure 7: Lexical search results

Observation:

As shown in the preceding figure, the first three results refer to a jacket, glasses, and a scarf, which are irrelevant to the query. They were returned because keywords in the query “Trendy footwear for women,” such as “trendy” and “women,” also appear in their product descriptions. Only the last two results are footwear items and therefore fulfill the intent of the query, which was to find products that match all words in the query.
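
The keyword overlap behind these results can be illustrated with a toy example. The sketch below is not how OpenSearch scores documents (it ignores text analysis and relevance scoring entirely); it only shows which query terms appear in a description. Both product descriptions are invented for illustration.

```python
import re


def tokens(text):
    """Lowercase a string and split it into a set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matched_terms(query, description):
    """Query terms that also occur in the description."""
    return tokens(query) & tokens(description)


def matches_all(query, description):
    """True only when every query term occurs in the description."""
    return tokens(query) <= tokens(description)


query = "Trendy footwear for women"
# Hypothetical descriptions, not taken from the dataset.
descriptions = {
    "jacket": "Trendy jacket for women",
    "sandal": "Strappy sandal, trendy summer footwear for women",
}

jacket_terms = matched_terms(query, descriptions["jacket"])
jacket_all = matches_all(query, descriptions["jacket"])
sandal_all = matches_all(query, descriptions["sandal"])
```

Here the jacket shares “trendy,” “for,” and “women” with the query and so can still be returned when any matching term is enough, even though it is not footwear. The match query also accepts an operator option that can require all terms to match, but that still depends on exact keywords rather than meaning.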

Experiment 2: Multimodal search with only text as input

In this experiment, you will use the Titan Multimodal Embeddings model that you deployed previously and run a neural search with only “Trendy footwear for women” (text) as input.

In the k-NN vector field (vector_embedding) of the neural query, you pass the model_id, query_text, and k value as shown in the following example. k denotes the number of results returned by the k-NN search.

payload = {
    "query": {
        "neural": {
            "vector_embedding": {
                "query_text": "Trendy footwear for women",
                "model_id": "<model_id>",
                "k": 5
            }
        }
    },
    "size": 5
}

Results:

Figure 8: Results from multimodal search using text

Observation:

As shown in the preceding figure, all five results are relevant because each represents a style of footwear. Additionally, the gender preference from the query (women) is also matched in all the results, which indicates that the Titan multimodal embeddings preserved the gender context in both the query and nearest document vectors.
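
To compare results across the experiments, it can help to reduce each search response to a short list of IDs, scores, and descriptions. The sketch below assumes the standard search response shape, with matching documents under hits.hits; the sample response and its values are hypothetical and heavily abbreviated.

```python
def summarize_hits(response, text_field="product_description"):
    """Return (id, score, description) tuples from a search response."""
    return [
        (hit["_id"], hit["_score"], hit["_source"].get(text_field, ""))
        for hit in response["hits"]["hits"]
    ]


# Hypothetical, abbreviated search response.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "12", "_score": 0.61,
             "_source": {"product_description": "Strappy summer sandal"}},
            {"_id": "48", "_score": 0.58,
             "_source": {"product_description": "Casual slip-on shoe"}},
        ]
    }
}

summary = summarize_hits(sample_response)
for doc_id, score, description in summary:
    print(doc_id, score, description)
```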

Experiment 3: Multimodal search with only an image as input

In this experiment, you will use only a product image as the input query.

You will use the same neural query and parameters as in the previous experiment but pass the query_image parameter instead of the query_text parameter. You need to convert the image into a base64-encoded binary string and pass that string to the query_image parameter:

Figure 9: Image of a woman’s sandal used as the query input

payload = {
    "query": {
        "neural": {
            "vector_embedding": {
                "query_image": "<query_image_binary>",
                "model_id": "<model_id>",
                "k": 5
            }
        }
    },
    "size": 5
}
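
The image conversion mentioned above can be sketched with the Python standard library. This assumes that the query_image parameter expects a base64-encoded string; the file name in the comment is hypothetical.

```python
import base64


def encode_image(image_bytes):
    """Encode raw image bytes as a base64 ASCII string."""
    return base64.b64encode(image_bytes).decode("ascii")


# The first three bytes of a JPEG file, used here in place of a real image.
query_image_binary = encode_image(b"\xff\xd8\xff")
print(query_image_binary)  # /9j/

# With a real file (hypothetical path):
# with open("sandal.jpg", "rb") as image_file:
#     query_image_binary = encode_image(image_file.read())
```

The resulting string takes the place of the <query_image_binary> placeholder in the payload.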

Results:

Figure 10: Results from multimodal search using an image

Observation:

As shown in the preceding figure, by passing an image of a woman’s sandal, you were able to retrieve similar footwear styles. Though this experiment provides a different set of results compared to the previous experiment, all the results are highly related to the search query. All the matching documents are similar to the searched product image, not only in terms of the product category (footwear) but also in terms of the style (summer footwear), color, and gender affinity of the product.

Experiment 4: Multimodal search with both text and an image

In this last experiment, you will run the same neural query but pass both the image of a woman’s sandal and the text “dark color” as inputs.

Figure 11: Image of a woman’s sandal used as part of the query input

As before, you will convert the image into its binary form before passing it to the query:

payload = {
    "query": {
        "neural": {
            "vector_embedding": {
                "query_image": "<query_image_binary>",
                "query_text": "dark color",
                "model_id": "<model_id>",
                "k": 5
            }
        }
    },
    "size": 5
}
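
Because experiments 2, 3, and 4 differ only in which inputs they pass, a small helper can build all three payloads. This is a sketch under the assumption that the neural query accepts query_text, query_image, or both, as the payloads in this post do; the function name and its defaults are hypothetical.

```python
def build_neural_query(model_id, query_text=None, query_image=None,
                       k=5, size=5, field="vector_embedding"):
    """Build a neural query payload from text, an image string, or both."""
    if query_text is None and query_image is None:
        raise ValueError("provide query_text, query_image, or both")
    clause = {"model_id": model_id, "k": k}
    if query_text is not None:
        clause["query_text"] = query_text
    if query_image is not None:
        clause["query_image"] = query_image
    return {"query": {"neural": {field: clause}}, "size": size}


# Placeholders stand in for the real model ID and encoded image string.
text_only = build_neural_query("<model_id>", query_text="Trendy footwear for women")
image_only = build_neural_query("<model_id>", query_image="<query_image_binary>")
combined = build_neural_query(
    "<model_id>", query_text="dark color", query_image="<query_image_binary>"
)
```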

Results:

Figure 12: Results of query using text and an image

Observation:

In this experiment, you augmented the image query with a text query to return dark, summer-style shoes. This experiment provided more comprehensive options by taking into consideration both text and image input.

Overall observations

Based on the experiments, all the variants of multimodal search provided more relevant results than a basic lexical search. After experimenting with text-only search, image-only search, and a combination of the two, it’s clear that the combination of text and image modalities provides more search flexibility and, as a result, more specific footwear options to the user.

Clean up

To avoid incurring continued AWS usage charges, delete the Amazon OpenSearch Service domain that you created and delete the CloudFormation stack starting with prefix ‘OpenSearch-bedrock-mm-’ that you deployed to create the ML connector.

Conclusion

In this post, we showed you how to use OpenSearch Service and the Amazon Bedrock Titan Multimodal Embeddings model to run multimodal search using both text and images as inputs. We also explained how the new multimodal processor in OpenSearch Service makes it easier for you to generate text and image embeddings using an OpenSearch ML connector, store the embeddings in a k-NN index, and perform multimodal search.

Learn more about ML-powered search with OpenSearch and set up your multimodal search solution in your own environment using the guidelines in this post. The solution code is also available on the GitHub repo.

About the Authors

Praveen Mohan Prasad is an Analytics Specialist Technical Account Manager at Amazon Web Services and helps customers with proactive operational reviews of analytics workloads. Praveen actively researches how to apply machine learning to improve search relevance.

Hajer Bouafif is an Analytics Specialist Solutions Architect at Amazon Web Services. She focuses on Amazon OpenSearch Service and helps customers design and build well-architected analytics workloads in diverse industries. Hajer enjoys spending time outdoors and discovering new cultures.

Aruna Govindaraju is an Amazon OpenSearch Specialist Solutions Architect and has worked with many commercial and open-source search engines. She is passionate about search, relevancy, and user experience. Her expertise with correlating end-user signals with search engine behavior has helped many customers improve their search experience. Her favourite pastime is hiking the New England trails and mountains.

AWS Big Data Blog
