AWS Machine Learning Blog
Indian language RAG with Cohere multilingual embeddings and Anthropic Claude 3 on Amazon Bedrock
Media and entertainment companies serve multilingual audiences with a wide range of content catering to diverse audience segments. These enterprises have access to massive amounts of data collected over their many years of operations. Much of this data is unstructured text and images. Conventional approaches to analyzing unstructured data for generating new content rely on the use of keyword or synonym matching. These approaches don’t capture the full semantic context of a document, making them less effective for users’ search, content creation, and several other downstream tasks.
Text embeddings use machine learning (ML) capabilities to capture the essence of unstructured data. These embeddings are generated by language models that map natural language text into their numerical representations and, in the process, encode contextual information in the natural language document. Generating text embeddings is the first step to many natural language processing (NLP) applications powered by large language models (LLMs) such as Retrieval Augmented Generation (RAG), text generation, entity extraction, and several other downstream business processes.

Converting text to embeddings using cohere multilingual embedding model
Despite the rising popularity and capabilities of LLMs, the language most often used to converse with the LLM, often through a chat-like interface, is English. And although progress has been made in adapting open source models to comprehend and respond in Indian languages, such efforts fall short of the English language capabilities displayed among larger, state-of-the-art LLMs. This makes it difficult to adopt such models for RAG applications based on Indian languages.
In this post, we showcase a RAG application that can search and query across multiple Indian languages using the Cohere Embed – Multilingual model and Anthropic Claude 3 on Amazon Bedrock. This post focuses on Indian languages, but you can use the approach with other languages that are supported by the LLM.
Solution overview
We use the Flores dataset [1], a benchmark dataset for machine translation between English and low-resource languages. This also serves as a parallel corpus, which is a collection of texts that have been translated into one or more languages.
With the Flores dataset, we can demonstrate that the embeddings and, subsequently, the documents retrieved from the retriever, are relevant for the same question being asked in multiple languages. However, given the sparsity of the dataset (approximately 1,000 lines per language from more than 200 languages), the nature and number of questions that can be asked against the dataset is limited.
After you have downloaded the data, load the data into the pandas data frame for processing. For this demo, we are restricting ourselves to Bengali, Kannada, Malayalam, Tamil, Telugu, Hindi, Marathi, and English. If you are looking to adopt this approach for other languages, make sure the language is supported by both the embedding model and the LLM that’s being used in the RAG setup.
Load the data with the following code:
The Cohere multilingual embedding model
Cohere is a leading enterprise artificial intelligence (AI) platform that builds world-class LLMs and LLM-powered solutions that allow computers to search, capture meaning, and converse in text. They provide ease of use and strong security and privacy controls.
The Cohere Embed – Multilingual model generates vector representations of documents for over 100 languages and is available on Amazon Bedrock. With Amazon Bedrock, you can access the embedding model through an API call, which eliminates the need to manage the underlying infrastructure and makes sure sensitive information remains securely managed and protected.
The multilingual embedding model groups text with similar meanings by assigning them positions in the semantic vector space that are close to each other. Developers can process text in multiple languages without switching between different models. This makes processing more efficient and improves performance for multilingual applications.
Text embeddings turn unstructured data into a structured form. This allows you to objectively compare, dissect, and derive insights from all these documents. Cohere’s new embedding models have a new required input parameter, input_type
, which must be set for every API call and include one of the following four values, which align towards the most frequent use cases for text embeddings:
- input_type=”search_document” – Use this for texts (documents) you want to store in your vector database
- input_type=”search_query” – Use this for search queries to find the most relevant documents in your vector database
- input_type=”classification” – Use this if you use the embeddings as input for a classification system
- input_type=”clustering” – Use this if you use the embeddings for text clustering
Using these input types provides the highest possible quality for the respective tasks. If you want to use the embeddings for multiple use cases, we recommend using input_type="search_document"
.
Prerequisites
To use the Claude 3 Sonnet LLM and the Cohere multilingual embeddings model on this dataset, ensure that you have access to the models in your AWS account under Amazon Bedrock, Model Access section and then proceed with installing the following packages. The following code has been tested to work with the Amazon SageMaker Data Science 3.0 Image, backed by an ml.t3.medium instance.
Create a search index
With all of the prerequisites in place, you can now convert the multilingual corpus into embeddings and store those in hnswlib, a header-only C++ Hierarchical Navigable Small Worlds (HNSW) implementation with Python bindings, insertions, and updates. HNSWLib is an in-memory vector store that can be saved to a file, which should be sufficient for the small dataset we are working with. Use the following code:
Embed and index documents
To embed and store the small multilingual dataset, use the Cohere embed-multilingual-v3.0
model, which creates embeddings with 1,024 dimensions, using the Amazon Bedrock runtime API:
Verify that the embeddings work
To test the solution, write a function that takes a query as input, embeds it, and finds the top N documents most closely related to it:
You can explore what the RAG stack does with a couple of queries in different languages, such as Hindi:
The index returns documents relevant to the search query from across languages:
You can now use these documents retrieved from the index as context while calling the Anthropic Claude 3 Sonnet model on Amazon Bedrock. In production settings with datasets that are several orders of magnitude larger than the Flores dataset, we can make the search results from the index even more relevant by using Cohere’s Rerank models.
Use the system prompt to outline how you want the LLM to process your query:
Let’s pass in the same query in multiple Indian languages:
Conclusion
This post presented a walkthrough for using Cohere’s multilingual embedding model along with Anthropic Claude 3 Sonnet on Amazon Bedrock. In particular, we showed how the same question asked in multiple Indian languages, is getting answered using relevant documents retrieved from a vector store
Cohere’s multilingual embedding model supports over 100 languages. It removes the complexity of building applications that require working with a corpus of documents in different languages. The Cohere Embed model is trained to deliver results in real-world applications. It handles noisy data as inputs, adapts to complex RAG systems, and delivers cost-efficiency from its compression-aware training method.
Start building with Cohere’s multilingual embedding model and Anthropic Claude 3 Sonnet on Amazon Bedrock today.
References
[1] Flores Dataset: https://github.com/facebookresearch/flores/tree/main/flores200
About the Author
Rony K Roy is a Sr. Specialist Solutions Architect, Specializing in AI/ML. Rony helps partners build AI/ML solutions on AWS.