AWS Big Data Blog

Optimize efficiency with language analyzers for scalable multilingual search in Amazon OpenSearch Service

Organizations manage content across multiple languages as they expand globally. Ecommerce platforms, customer support systems, and knowledge bases require efficient multilingual search capabilities to serve diverse user bases effectively. A unified search approach helps multinational organizations maintain centralized content repositories while making sure users can find and access relevant information regardless of their preferred language.

Building multi-language applications with language analyzers in OpenSearch commonly involves a significant challenge: multi-language documents require manual preprocessing. For every document, your application must first identify each field’s language, then categorize and label it, storing content in separate, predefined language fields (for example, name_en and name_es) so that language analyzers can improve search relevance. This client-side effort is complex: it adds workload for language detection, can slow data ingestion, and risks accuracy issues if languages are misidentified. Amazon OpenSearch Service 2.15+ removes this burden from your application with an AI-based ML inference processor that automatically identifies and tags document languages during ingestion.

By combining AI-based language detection with context-aware data modeling and intelligent analyzer selection, this automated solution minimizes manual language tagging and provides organizations with sophisticated multilingual search capabilities.

Using language identification in OpenSearch Service offers the following benefits:

  • Enhanced user experience – Users can now find relevant content regardless of the language they search in
  • Increased content discovery – The service can surface valuable content across language silos
  • Improved search accuracy – Language-specific analyzers provide better search relevance
  • Automated processing – You can reduce manual language tagging and classification

In this post, we share how to implement a scalable multilingual search solution using OpenSearch Service.

Solution overview

The solution eliminates manual language preprocessing by automatically detecting and handling multilingual content during document ingestion. Instead of manually creating separate language fields (notes_en, notes_es, and so on) or implementing custom language detection systems, the ML inference processor identifies languages and creates appropriate field mappings.

This automated approach improves accuracy compared to traditional manual methods and reduces development complexity and processing overhead, allowing organizations to focus on delivering better search experiences to their global users.

The solution comprises the following key components:

  • ML inference processor – Invokes ML models during document ingestion to enrich content with language metadata
  • Amazon SageMaker integration – Hosts pre-trained language identification models that analyze text fields and return language predictions
  • Language-specific indexing – Applies appropriate analyzers based on detected languages, providing proper handling of stemming, stop words, and character normalization
  • Connector framework – Enables secure communication between OpenSearch Service and Amazon SageMaker endpoints through AWS Identity and Access Management (IAM) role-based authentication

The following diagram illustrates the workflow of the language detection pipeline.

Figure 1: Workflow of the language detection pipeline

This example demonstrates text classification using XLM-RoBERTa-base for language detection on Amazon SageMaker. You have flexibility in choosing your models and can alternatively use the built-in language detection capabilities of Amazon Comprehend.

In the following sections, we walk through the steps to deploy the solution. For detailed implementation instructions, including code examples and configuration templates, refer to the comprehensive tutorial in the OpenSearch ML Commons GitHub repository.

Prerequisites

You must have the following prerequisites:

  • An OpenSearch Service domain running version 2.15 or later
  • Permissions to create Amazon SageMaker endpoints and AWS Identity and Access Management (IAM) roles and policies in your AWS account

Deploy the model

Deploy a pre-trained language identification model on Amazon SageMaker. The XLM-RoBERTa model provides robust multilingual language detection capabilities suitable for most use cases.

Configure the connector

Create an ML connector to establish a secure connection between OpenSearch Service and Amazon SageMaker endpoints, primarily for language detection tasks. The process begins with setting up authentication through IAM roles and policies, granting the permissions both services need to communicate securely.
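
The following is a minimal connector sketch in the documented SageMaker connector format; the Region, endpoint name, account ID, and IAM role ARN are placeholders for your own values:

POST /_plugins/_ml/connectors/_create
{
  "name": "sagemaker-language-identification-connector",
  "description": "Connector for a SageMaker-hosted language identification model",
  "version": 1,
  "protocol": "aws_sigv4",
  "parameters": {
    "region": "your_region",
    "service_name": "sagemaker"
  },
  "credential": {
    "roleArn": "arn:aws:iam::your_account_id:role/your_sagemaker_invoke_role"
  },
  "actions": [
    {
      "action_type": "predict",
      "method": "POST",
      "headers": {
        "content-type": "application/json"
      },
      "url": "https://runtime.sagemaker.your_region.amazonaws.com/endpoints/your_endpoint_name/invocations",
      "request_body": "{ \"inputs\": \"${parameters.inputs}\" }"
    }
  ]
}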

After you configure the connector with the appropriate endpoint URL and credentials, register and deploy the model in OpenSearch Service; its model ID is used in subsequent steps.

POST /_plugins/_ml/models/_register
{
  "name": "sagemaker-language-identification",
  "version": "1",
  "function_name": "remote",
  "description": "Remote model for language identification",
  "connector_id": "your_connector_id"
}

Sample response:

{
  "task_id": "hbYheJEBXV92Z6oda7Xb",
  "status": "CREATED",
  "model_id": "hrYheJEBXV92Z6oda7X7"
}
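
After registration completes, deploy the model using the model_id from the response (shown here with the placeholder used in the rest of this post):

POST /_plugins/_ml/models/your_model_id/_deploy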

After you configure the connector, you can test it by sending text to the model through OpenSearch Service, which returns the detected language (for example, sending “Say this is a test” returns en for English).

POST /_plugins/_ml/models/your_model_id/_predict
{
  "parameters": {
    "inputs": "Say this is a test"
  }
}

Sample response:

{
  "inference_results": [
    {
      "output": [
        {
          "name": "response",
          "dataAsMap": {
            "response": [
              {
                "label": "en",
                "score": 0.9411176443099976
              }
            ]
          }
        }
      ]
    }
  ]
}

Set up the ingest pipeline

Configure the ingest pipeline, which uses an ML inference processor to automatically detect the language of the content in the name and notes fields of incoming documents. After language detection, the pipeline creates new language-specific fields by copying the original content to new fields with language suffixes (for example, name_en for English content).

The pipeline uses an ml_inference processor to perform the language detection and copy processors to create the new language-specific fields, making it straightforward to handle multilingual content in your OpenSearch Service index.

PUT _ingest/pipeline/language_classification_pipeline
{
  "description": "ingest task details and classify languages",
  "processors": [
    {
      "ml_inference": {
        "model_id": "your_model_id",
        "input_map": [
          {
            "inputs": "name"
          },
          {
            "inputs": "notes"
          }
        ],
        "output_map": [
          {
            "predicted_name_language": "response[0].label"
          },
          {
            "predicted_notes_language": "response[0].label"
          }
        ]
      }
    },
    {
      "copy": {
        "source_field": "name",
        "target_field": "name_{{predicted_name_language}}",
        "ignore_missing": true,
        "override_target": false,
        "remove_source": false
      }
    },
    {
      "copy": {
        "source_field": "notes",
        "target_field": "notes_{{predicted_notes_language}}",
        "ignore_missing": true,
        "override_target": false,
        "remove_source": false
      }
    }
  ]
}
Sample response:

{
  "acknowledged": true
}

Configure the index and ingest documents

Create an index with the ingest pipeline that automatically detects the language of incoming documents and applies appropriate language-specific analysis. When documents are ingested, the system identifies the language of key fields, creates language-specific versions of those fields, and indexes them using the correct language analyzer. This allows for efficient and accurate searching across documents in multiple languages without requiring manual language specification for each document.

Here’s a sample index creation API call demonstrating different language mappings.

PUT /task_index
{
  "settings": {
    "index": {
      "default_pipeline": "language_classification_pipeline"
    }
  },
  "mappings": {
    "properties": {
      "name_en": { "type": "text", "analyzer": "english" },
      "name_es": { "type": "text", "analyzer": "spanish" },
      "name_de": { "type": "text", "analyzer": "german" },
      "notes_en": { "type": "text", "analyzer": "english" },
      "notes_es": { "type": "text", "analyzer": "spanish" },
      "notes_de": { "type": "text", "analyzer": "german" }
    }
  }
}

Next, ingest the following input document in German (the document ID is illustrative):

POST /task_index/_doc/1
{
  "name": "Kaufen Sie Katzenminze",
  "notes": "Mittens mag die Sachen von Humboldt wirklich."
}

The German text in the preceding document is processed using the German analyzer, which properly handles language-specific characteristics such as compound words and special characters.

After successful ingestion into OpenSearch Service, the resulting document appears as follows:

{
  "_source": {
    "predicted_notes_language": "de",
    "name_de": "Kaufen Sie Katzenminze",
    "notes": "Mittens mag die Sachen von Humboldt wirklich.",
    "predicted_name_language": "de",
    "name": "Kaufen Sie Katzenminze",
    "notes_de": "Mittens mag die Sachen von Humboldt wirklich."
  }
}
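
The search example in the next section assumes a Spanish document was also ingested through the same pipeline, which tags it and copies its content into name_es and notes_es; for instance (the document ID is illustrative):

POST /task_index/_doc/3
{
  "name": "comprar hierba gatera",
  "notes": "A Mittens le gustan mucho las cosas de Humboldt."
}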

Search documents

This step demonstrates the search capability after the multilingual setup. A multi_match query over the name_* fields searches across all language-specific name fields (name_en, name_es, and name_de) and successfully finds the Spanish document when searching for “comprar”, because that content was analyzed using the Spanish analyzer. This example shows how language-specific indexing enables accurate search results without you needing to specify which language you’re searching in.

GET /task_index/_search
{
  "query": {
    "multi_match": {
      "query": "comprar",
      "fields": ["name_*"]
    }
  }
}

This search correctly finds the Spanish document because the name_es field is analyzed using the Spanish analyzer:

{
  "hits": {
    "total": { "value": 1, "relation": "eq" },
    "max_score": 0.9331132,
    "hits": [
      {
        "_index": "task_index",
        "_id": "3",
        "_score": 0.9331132,
        "_source": {
          "name_es": "comprar hierba gatera",
          "notes": "A Mittens le gustan mucho las cosas de Humboldt.",
          "predicted_notes_language": "es",
          "predicted_name_language": "es",
          "name": "comprar hierba gatera",
          "notes_es": "A Mittens le gustan mucho las cosas de Humboldt."
        }
      }
    ]
  }
}

Cleanup

To avoid ongoing charges, delete the resources created in this tutorial:

  1. Delete the OpenSearch Service domain. This stops both storage costs for your indexed data and any associated compute charges.
  2. Delete the ML model and connector that link OpenSearch Service to your machine learning model (see the example calls after this list).
  3. Delete your Amazon SageMaker endpoints and associated resources.
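
For the OpenSearch Service side, the model and connector can be removed with the ML Commons APIs; a minimal sketch using the placeholder IDs from earlier steps:

POST /_plugins/_ml/models/your_model_id/_undeploy
DELETE /_plugins/_ml/models/your_model_id
DELETE /_plugins/_ml/connectors/your_connector_id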

Conclusion

Implementing multilingual search with OpenSearch Service can help organizations break down language barriers and unlock the full value of their global content. The ML inference processor provides a scalable, automated approach to language detection that improves search accuracy and user experience.

This solution addresses the growing need for multilingual content management as organizations expand globally. By automatically detecting document languages and applying appropriate linguistic processing, businesses can deliver comprehensive search experiences that serve diverse user bases effectively.


About the authors

Sunil Ramachandra

Sunil is a Senior Solutions Architect at AWS, enabling hyper-growth Independent Software Vendors (ISVs) to innovate and accelerate on AWS. He partners with customers to build highly scalable and resilient cloud architectures. When not collaborating with customers, Sunil enjoys spending time with family, running, meditating, and watching movies on Prime Video.

Mingshi Liu

Mingshi is a Machine Learning Engineer at AWS, primarily contributing to the OpenSearch, ML Commons, and Search Processors repositories. Her work focuses on developing and integrating machine learning features for search technologies and other open source projects.

Sampath Kathirvel

Sampath is a Senior Solutions Architect at AWS who guides leading ISV organizations in their cloud transformation journey. His expertise lies in crafting robust architectural frameworks and delivering strategic technical guidance to help businesses thrive in the digital landscape. With a passion for technology innovation, Sampath empowers customers to leverage AWS services effectively for their mission-critical workloads.