AWS Machine Learning Blog

Using Amazon Translate to provide language support to Amazon Kendra

Amazon Kendra is a highly accurate and easy-to-use intelligent search service powered by machine learning (ML). Amazon Kendra supports English. This post provides a set of techniques to provide non-English language support when using Amazon Kendra.

We demonstrate these techniques within the context of a question-answer chatbot use case (Q&A bot) where a user can submit a question in any language that Amazon Translate supports through the chatbot. Amazon Kendra searches across a number of documents and returns a result in the language of that query. Amazon Comprehend and Amazon Translate are essential to providing non-English language support.

Our Q&A bot implementation relies on Amazon Simple Storage Service (Amazon S3) to store the documents prior to their ingestion into Amazon Kendra, Amazon Comprehend to detect the query’s dominant language to enable proper query and response translation, Amazon Translate to translate the query and response to and from English, and Amazon Lex to build the conversational user interface and provide the conversational interactions.

All queries, except for English, are translated from their native language into English before being submitted to Amazon Kendra. The Amazon Kendra responses a user sees are also translated. We have stored predefined Spanish response translations while performing real-time translation on all other languages. We use metadata attributes associated with each ingested document to point to the predefined Spanish translations.

We use three use cases to illustrate these techniques and assume that all the languages needing to be translated are supported by Amazon Translate. First, for Spanish language users, each document (we use small documents for the Q&A bot scenario) is translated by Amazon Translate into Spanish and then vetted by a human. This pre-translated text is returned as the description for Amazon Kendra document ranking model results.

Second, on-the-fly translation of the reading comprehension model responses occurs for all language responses except for English. On-the-fly translation occurs for the document ranking model results except for English and Spanish. We go into more detail on how to implement on-the-fly translation for Amazon Kendra’s different models later in this post.

Third, for English speaking users, translation doesn’t occur, allowing both the query and Amazon Kendra’s responses to be passed to and from Amazon Kendra without change.

The following exchange illustrates the three use cases. We start with English followed by Spanish, French, and Italian.

Translation considerations and prerequisites

We perform the following steps on the document:

  1. Run the document through Amazon Translate to get a Spanish language version of the document as well as the title.
  2. Manually review the translation and make any changes desired.
  3. Create a metadata file where one of the attributes is the Spanish translation of the document.
  4. Ingest the English language document and the associated metadata file into Amazon Kendra.

The following code is the metadata file for the document:

{
    "Attributes": {
        "_created_at": "2020-10-28T16:48:26.059730Z",
        "_source_uri": "https://aws.amazon.com/kendra/faqs/",
        "spanish_text": "R: Amazon Kendra es un servicio de búsqueda empresarial muy preciso y fácil de usar que funciona con Machine Learning.",
        "spanish_title": "P: ¿Qué es Amazon Kendra?"
    },
    "Title": "Q: What is Amazon Kendra?",
    "ContentType": "PLAIN_TEXT"
}

In this case, we have some predefined attributes, such as _created_at and _source_uri, as well as custom attributes such as spanish_text and spanish_title.
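As a rough sketch of how such a metadata file could be produced in bulk, the following Python builds a dictionary in the same shape as the example. The helper names build_metadata and write_metadata are hypothetical, and the sketch assumes the Spanish title and text have already been translated and reviewed. It also assumes the Amazon S3 connector naming convention in which the metadata file sits next to the document as the document file name followed by .metadata.json.

```python
import json
from datetime import datetime, timezone


def build_metadata(title, spanish_title, spanish_text, source_uri, created_at=None):
    """Return a metadata dict shaped like the example file.

    created_at is expected to be a timezone-aware UTC datetime.
    """
    if created_at is None:
        created_at = datetime.now(timezone.utc)
    return {
        "Attributes": {
            "_created_at": created_at.strftime("%Y-%m-%dT%H:%M:%S.%fZ"),
            "_source_uri": source_uri,
            # Custom attributes holding the human-vetted translation
            "spanish_text": spanish_text,
            "spanish_title": spanish_title,
        },
        "Title": title,
        "ContentType": "PLAIN_TEXT",
    }


def write_metadata(document_path, metadata):
    """Write the metadata next to the document (assumed naming convention)."""
    metadata_path = document_path + ".metadata.json"
    with open(metadata_path, "w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps accented Spanish characters readable
        json.dump(metadata, handle, ensure_ascii=False, indent=4)
    return metadata_path
```

Keeping the builder separate from the file write makes it easy to check the generated attributes before uploading anything to the S3 bucket.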

In the case of queries in Spanish, you use these attributes to build the response to send back to the user. The fact that the title of the document is in itself a possible user query allows you to have control over the translations.

If your documents are in another language, you need to run Amazon Translate to translate the documents into English before ingestion into Amazon Kendra.

We have not tried translation in other scenarios where the document types and answers can vary widely. However, we believe that the techniques shown in this post allow you to try translation in other scenarios and evaluate the accuracy.

Amazon Kendra processing overview

Now that we have the documents squared away, we build a chatbot using Amazon Lex. The chatbot identifies the language using Amazon Comprehend, translates the query from the user’s language to English, submits a query to the Amazon Kendra index, and translates the result back to the language the query was in. You can apply this approach to any language that Amazon Translate supports.

We use the Amazon Kendra built-in Amazon S3 connector to ingest documents and the Amazon Kendra FAQ ingestion process for getting question-answer pairs into Amazon Kendra. The ingested documents are in English. We manually created a description of each document in Spanish and attached that Spanish description as a metadata attribute. Ideally, all the documents that you use are in English.

If these documents have an overview section, you can use Amazon Translate as the method of generating this metadata description attribute. If your documents are in another language, you need to run Amazon Translate to translate the documents into English before ingestion into Amazon Kendra. The following diagram illustrates our architecture.

We use the Amazon Kendra built-in Amazon S3 connector to ingest documents. If you also have FAQ documents, you also use the Amazon Kendra FAQ ingestion process.

Setting up your resources

In this section, we discuss the steps needed to implement this solution. See the appendix for the specifics of these steps. Understanding the AWS Lambda function is critical to seeing where and how the translation is implemented. We go into further detail on the translation specifics in the next section.

  1. Download the documents and metadata files, decompress the archive, and store them in an S3 bucket. You use this bucket as the source for your Amazon Kendra S3 connector.
  2. Set up Amazon Kendra:
    1. Create an Amazon Kendra index. For instructions, see Getting started with the Amazon Kendra SharePoint connector.
    2. Create an Amazon Kendra S3 data source.
    3. Add attributes.
    4. Ingest the example data source from Amazon S3 into Amazon Kendra.
  3. Set up the fulfillment Lambda function.
  4. Set up the chatbot.

Understanding translation in the fulfillment Lambda function

The Lambda function has been structured into three main sections to process and respond to the user’s query: language detection, submitting a query, and returning the translated result.

Language detection

In the first section, you use Amazon Comprehend to detect the dominant language. For this post, we obtain the user input from the inputTranscript key of the event submitted by Amazon Lex. If the confidence score that Amazon Comprehend returns for the detected language is 0.50 or lower, the function defaults to English. See the following code:

query = event['inputTranscript']
response = comprehend.detect_dominant_language(Text=query)
confidence = response["Languages"][0]['Score']
if confidence > 0.50:
    language = response["Languages"][0]['LanguageCode']
else:
    #Default to English if there isn't enough confidence
    language = "en"
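The same decision can be isolated in a small function that is easy to try without calling AWS. This is only a sketch: the name pick_language and its threshold and default parameters are hypothetical, and it assumes the response has the Languages list of LanguageCode and Score entries used in the code above. Unlike the snippet above, it also falls back to English when the list is empty.

```python
def pick_language(detection, threshold=0.5, default="en"):
    """Choose a language code from a detect_dominant_language response.

    Falls back to `default` when no language is returned or when the
    best score does not exceed `threshold`.
    """
    languages = detection.get("Languages", [])
    if not languages:
        return default
    # Take the highest-scoring candidate rather than relying on list order
    best = max(languages, key=lambda item: item["Score"])
    if best["Score"] > threshold:
        return best["LanguageCode"]
    return default


# Example with a hand-written response in the same shape
sample = {"Languages": [{"LanguageCode": "fr", "Score": 0.98}]}
detected = pick_language(sample)
```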

Submitting a query

Amazon Kendra currently supports documents and queries in English, so in order to submit your query, you have to translate it.

In the provided example code, after identifying the dominant language, you translate the query to English if needed. A simple check of whether the language is English would be enough. For illustration purposes, we include separate branches for Spanish and for any other language.

if language == "en":
    pass
elif language == "es":
    translated_query = translate.translate_text(Text=query, SourceLanguageCode="es", TargetLanguageCode="en")
    query = translated_query['TranslatedText']
else:
    try:
        translated_query = translate.translate_text(Text=query, SourceLanguageCode=language, TargetLanguageCode="en")
        query = translated_query['TranslatedText']
    except Exception as e:
        return(str(e))
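Because the Spanish branch and the generic branch make the same call, the logic can be collapsed into one helper. The sketch below is an assumption-laden simplification, not the code used in this post: to_english and StubTranslate are hypothetical names, and the client is passed in as a parameter so the function can be exercised offline with a stand-in object. In the Lambda function you would pass the boto3 Amazon Translate client instead.

```python
def to_english(query, language, translate_client):
    """Return the query in English, translating only when needed."""
    if language == "en":
        return query
    result = translate_client.translate_text(
        Text=query,
        SourceLanguageCode=language,
        TargetLanguageCode="en",
    )
    return result["TranslatedText"]


class StubTranslate:
    """Offline stand-in for the Amazon Translate client (illustration only)."""

    def __init__(self, canned_text):
        self.canned_text = canned_text
        self.calls = []

    def translate_text(self, **kwargs):
        # Record the arguments and return a fixed translation
        self.calls.append(kwargs)
        return {"TranslatedText": self.canned_text}


stub = StubTranslate("What is Amazon Kendra?")
english_query = to_english("¿Qué es Amazon Kendra?", "es", stub)
```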

Now that your query is in English, you can submit the query to Amazon Kendra:

response = kendra.query(
    QueryText=query,
    IndexId=index_id)

There are several options on how to work with the result from Amazon Kendra. For more information, see Analyzing the results in the Amazon Kendra Essentials Workshop. As a chatbot use case, we only work with the first result.

If the first result is from the reading comprehension model (result type Answer) and the language code is different than en (English), you translate the DocumentExcerpt, which is the value to be returned. See the following code:

answer_text = query_result['DocumentExcerpt']['Text']
if language == "en":
    pass
else:
    result = translate.translate_text(Text=answer_text, SourceLanguageCode="en", TargetLanguageCode=language)
    answer_text = result['TranslatedText']

If the first result is from the document ranking model (result type Document), recall that we pre-translated the Spanish language results and stored them in the metadata of each document.

The following code shows that:

  • If the language code is es (Spanish), the pre-translated content stored in the spanish_text metadata attribute is returned.
  • If the language code is en (English), the DocumentExcerpt value returned by Amazon Kendra is returned as is.
  • If the language code is neither es nor en, the content of DocumentExcerpt is translated to the language detected and returned.

synopsis = ""
document_title = ""
for key in query_result['DocumentAttributes']:
    #Collect the pre-translated Spanish attributes, if present
    if key['Key'] == 'spanish_text':
        synopsis = key['Value']['StringValue']
    if key['Key'] == 'spanish_title':
        document_title = key['Value']['StringValue']
if language == "es":
    answer_text = synopsis
    print('Title: ' + document_title)
elif language == "en":
    document_title = query_result['DocumentTitle']['Text']
    answer_text = query_result['DocumentExcerpt']['Text']
else:
    #Placeholder to translate the title if needed
    #document_title = query_result['DocumentTitle']['Text']
    #result = translate.translate_text(Text=document_title, SourceLanguageCode="en", TargetLanguageCode=language)
    #document_title = result['TranslatedText']
    answer_text = query_result['DocumentExcerpt']['Text']
    result = translate.translate_text(Text=answer_text, SourceLanguageCode="en", TargetLanguageCode=language)
    answer_text = result['TranslatedText']
response = answer_text
return response
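One way to make the attribute lookup less order-dependent is to turn the DocumentAttributes list into a dictionary first. The following is a minimal sketch with hypothetical helper names (attributes_to_dict, spanish_answer). Falling back to the English title and excerpt when a Spanish attribute is missing is our assumption here, not behavior from the Lambda function in this post.

```python
def attributes_to_dict(query_result):
    """Map attribute keys to their string values for one result item."""
    return {
        attribute["Key"]: attribute["Value"].get("StringValue")
        for attribute in query_result.get("DocumentAttributes", [])
    }


def spanish_answer(query_result):
    """Return (title, text) in Spanish, using the pre-translated attributes.

    Assumption: fall back to the English title or excerpt if an
    attribute is absent, so the user still gets a response.
    """
    attributes = attributes_to_dict(query_result)
    title = attributes.get("spanish_title") or query_result["DocumentTitle"]["Text"]
    text = attributes.get("spanish_text") or query_result["DocumentExcerpt"]["Text"]
    return title, text


# Hand-written result item in the shape used by the code in this post
sample_result = {
    "DocumentTitle": {"Text": "Q: What is Amazon Kendra?"},
    "DocumentExcerpt": {"Text": "A: Amazon Kendra is an enterprise search service."},
    "DocumentAttributes": [
        {"Key": "_source_uri", "Value": {"StringValue": "https://aws.amazon.com/kendra/faqs/"}},
        {"Key": "spanish_title", "Value": {"StringValue": "P: ¿Qué es Amazon Kendra?"}},
    ],
}
title, text = spanish_answer(sample_result)
```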

Returning the result

At this point, if you obtained a result, you should have it in the language in which the question was asked. The last portion of the Lambda function returns the result to Amazon Lex, which passes it on to the user’s conversational user interface:

if result == "":
    no_matches = "I'm sorry, I couldn't find matches for your query"
    result = translate.translate_text(Text=no_matches, SourceLanguageCode="en", TargetLanguageCode=language)
    result = result['TranslatedText']
else:
    #Truncate Text
    if len(result) > 340:
        result = result[:340]
        result = result.rsplit(' ', 1)
        result = result[0]+"..."
response = {
    "dialogAction": {
        "type": "Close",
        "fulfillmentState": "Fulfilled",
        "message": {
            "contentType": "PlainText",
            "content": result
        },
    }
}
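The truncation and the response envelope can also be written as two small helpers, which makes the 340-character limit easy to verify in isolation. This is a sketch with hypothetical function names (truncate, close_response). It mirrors the logic above: cut at the limit, then drop the last partial word and append an ellipsis.

```python
def truncate(text, limit=340):
    """Shorten text to at most `limit` characters, cutting at a word boundary."""
    if len(text) <= limit:
        return text
    # Drop the trailing partial word left by the hard cut
    return text[:limit].rsplit(" ", 1)[0] + "..."


def close_response(content):
    """Build the Close dialog action returned to Amazon Lex."""
    return {
        "dialogAction": {
            "type": "Close",
            "fulfillmentState": "Fulfilled",
            "message": {
                "contentType": "PlainText",
                "content": content,
            },
        }
    }


reply = close_response(truncate("aaa bbb ccc", limit=6))
```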

Conclusion

We have demonstrated a few techniques that you can use to enable Amazon Kendra to provide support for languages other than English. We recommend doing a small pilot and accuracy POC on ground truth questions and answers to determine if these techniques can enable your non-English language use cases.

To follow an interactive tutorial that can help you get started with Amazon Kendra, visit our Amazon Kendra Essentials+ Workshop. You can also visit the Amazon Kendra website to dive deep on features, connectors, videos, and more.

Appendix

In the above sections of this post we covered translation in Amazon Kendra for the reading comprehension and document ranking models. Below, we will cover translation in Amazon Kendra for FAQ matching.

Translations for the FAQ model

For Amazon Kendra FAQ matching, you can use either real-time or pre-translated responses. Pre-translated responses with human vetting likely provide better results. For pre-translated responses, complete the following steps:

  1. Create one row per language desired for each question.
  2. Create a language attribute that specifies what language the answer is in.
  3. Place the pre-translated response into the FAQ answer column.
  4. Use the language attribute as a query filter.
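The steps above can be sketched as follows. This is illustrative only: the attribute name language, the row field names, and the helper names are all assumptions, and the row dictionaries are a neutral in-memory shape rather than the exact FAQ file format. The filter uses the AttributeFilter parameter of the Amazon Kendra Query API with an EqualsTo condition.

```python
def faq_rows(question, answers_by_language):
    """Create one row per language for a single question."""
    return [
        {"question": question, "answer": answer, "language": code}
        for code, answer in answers_by_language.items()
    ]


def language_filter(language):
    """Attribute filter that keeps only results tagged with `language`."""
    return {"EqualsTo": {"Key": "language", "Value": {"StringValue": language}}}


def query_kwargs(index_id, query_text, language):
    """Keyword arguments for kendra.query(**kwargs) with the language filter."""
    return {
        "IndexId": index_id,
        "QueryText": query_text,
        "AttributeFilter": language_filter(language),
    }


rows = faq_rows(
    "What is Amazon Kendra?",
    {
        "en": "Amazon Kendra is an enterprise search service.",
        "es": "Amazon Kendra es un servicio de búsqueda empresarial.",
    },
)
kwargs = query_kwargs("<YOUR_AMAZON_KENDRA_INDEX_ID>", "What is Amazon Kendra?", "es")
```

With this layout, the query still reaches Amazon Kendra in English, and the filter selects the row whose answer is already in the user's language.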

Pre-translation considerations

This chatbot use case has documents with a small amount of text. This allows us to place the pre-translated document into an attribute. For larger files, we place pre-translated document summaries into the attribute instead. This allows us to return vetted summaries in the native language for each document ranking result. We can continue to use real-time translation for the reading comprehension model passages and suggested answers.

Pre-translation is only effective for the document ranking model and the FAQ model. The reading comprehension model doesn’t return associated attributes. The lack of attributes prevents the use of pre-translated content with the reading comprehension model and requires instead that you use on-the-fly translation for the reading comprehension model results.

Creating an Amazon Kendra data source and adding attributes

For this use case, we use two custom attributes that contain the revised translations to Spanish. These attributes are called spanish_title and spanish_text.

To add them into your index, follow these steps:

  1. On the Amazon Kendra console, on your new index, under Data management, choose Facet definition.
  2. Choose Add field.
  3. For Field name, enter the field name (spanish_text).
  4. For Data type, choose String.
  5. For Usage types, select Displayable.
  6. Choose Add.
  7. Repeat the process for the field spanish_title.
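If you prefer to script the console steps above, the same two fields could be described for the UpdateIndex API through its DocumentMetadataConfigurationUpdates parameter. This is a hedged sketch rather than a tested deployment script: the helper name spanish_field_updates is hypothetical, and the commented call assumes a boto3 Amazon Kendra client and an index ID that you supply.

```python
def spanish_field_updates():
    """Describe the two custom string fields as displayable index fields."""
    return [
        {
            "Name": name,
            "Type": "STRING_VALUE",
            # Displayable lets the value come back with query results
            "Search": {"Displayable": True},
        }
        for name in ("spanish_text", "spanish_title")
    ]


updates = spanish_field_updates()

# Assumed usage (requires AWS credentials and an existing index):
# import boto3
# kendra = boto3.client("kendra")
# kendra.update_index(
#     Id="<YOUR_AMAZON_KENDRA_INDEX_ID>",
#     DocumentMetadataConfigurationUpdates=updates,
# )
```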

Ingesting the example dataset

Now that you have an Amazon Kendra index, the custom index fields, and the sample documents in your S3 bucket, you create an S3 data source.

  1. On the Amazon Kendra console, on your new index, under Data management, choose Data sources.
  2. Choose Add connector.
  3. For My data source name, enter a name (for example, MyS3Connector).
  4. Choose Next.
  5. For Enter the data source location, enter the location of your S3 bucket.
  6. For IAM role, choose Create a new role.
  7. For Role name, enter a name for your role.
  8. For Frequency, choose Run on demand.
  9. Choose Next.
  10. Validate your settings and choose Add data source.
  11. When the process is complete, you can sync your data source by choosing Sync now.

At this point, you can test a sample query on the search console. For example, the following screenshot shows the results for the question “what is Amazon Kendra?”

Setting up the fulfillment Lambda function

For this use case, the multilingual chatbot requires a Lambda function to query the index as well as perform the translations if needed.

  1. On the Lambda console, choose Create function.
  2. Select Author from scratch.
  3. For Function name, enter a name.
  4. For Runtime, choose the latest Python version available.
  5. For Execution role, select Create a new role with basic Lambda permissions.
  6. Choose Create function.
  7. After creating the function, on the Permissions tab, choose your role to edit it.
  8. On the IAM console, choose Add inline policy.
  9. On the JSON tab, update the following policy to include your Amazon Kendra index ID (you can obtain it on the Amazon Kendra console in the Index section):
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "KendraQueries",
                "Effect": "Allow",
                "Action": "kendra:Query",
                "Resource": "arn:aws:kendra:<YOUR_REGION>:<YOUR_AWS_ACCOUNT_ID>:index/<YOUR_AMAZON_KENDRA_INDEX_ID>"
            },
            {
                "Sid": "ComprehendTranslate",
                "Effect": "Allow",
                "Action": [
                    "comprehend:DetectDominantLanguage",
                    "translate:TranslateText"
                ],
                "Resource": "*"
            }
        ]
    }
  10. Choose Review policy.
  11. For Name, enter a name.
  12. Choose Create policy.
  13. In the Lambda configuration, enter the following code into the function code (update your index_id). The code is also available to download.
    """
    Lexbot Lambda handler.
    """
    from urllib.request import Request, urlopen
    import json
    import boto3
    
    
    kendra = boto3.client('kendra')
    #Define your Index ID
    index_id = "<YOUR_AMAZON_KENDRA_INDEX_ID>"
    region = 'us-east-1'
    
    translate = boto3.client(service_name='translate', region_name=region, use_ssl=True)
    comprehend = boto3.client(service_name='comprehend', region_name=region, use_ssl=True)
    
    
    def query_index(query, language):
        print("Query: "+query)
        if language == "en":
            pass
        elif language == "es":
            translated_query = translate.translate_text(Text=query, SourceLanguageCode="es", TargetLanguageCode="en")
            query = translated_query['TranslatedText']
        else:
            try:
                translated_query = translate.translate_text(Text=query, SourceLanguageCode=language, TargetLanguageCode="en")
                query = translated_query['TranslatedText']
            except Exception as e:
                return(str(e))     
        response=kendra.query(
            QueryText = query,
            IndexId = index_id)
        print(response)
        #Return just the first result
        for query_result in response['ResultItems']:
            #Reading comprehension result
            if query_result['Type']=='ANSWER':
                    url = query_result['DocumentURI']
                    answer_text = query_result['DocumentExcerpt']['Text']
                    if language == "en":
                        pass
                    else:
                        result = translate.translate_text(Text=answer_text, SourceLanguageCode="en", TargetLanguageCode=language)
                        answer_text = result['TranslatedText']
                    response = answer_text
                    return response
            #Document Ranking result    
            if query_result['Type']=='DOCUMENT':
                if query_result['ScoreAttributes']['ScoreConfidence'] == "LOW":
                    response = ""
                    return(response)
                else:    
                    synopsis = ""
                    document_title = ""
                    answer_text= ""
                    url = ""
                    for key in query_result['DocumentAttributes']:
                        #Collect the pre-translated Spanish attributes, if present
                        if key['Key'] == 'spanish_text':
                            synopsis = key['Value']['StringValue']
                        if key['Key'] == 'spanish_title':
                            document_title = key['Value']['StringValue']
                    if language == "es":
                        answer_text = synopsis
                        print('Title: ' + document_title)
                    elif language == "en":
                        document_title = query_result['DocumentTitle']['Text']
                        answer_text = query_result['DocumentExcerpt']['Text']
                    else:
                        #Placeholder to translate the title if needed
                        #document_title = query_result['DocumentTitle']['Text']
                        #result = translate.translate_text(Text=document_title, SourceLanguageCode="en", TargetLanguageCode=language)
                        #document_title = result['TranslatedText']
                        answer_text = query_result['DocumentExcerpt']['Text']
                        result = translate.translate_text(Text=answer_text, SourceLanguageCode="en", TargetLanguageCode=language)
                        answer_text = result['TranslatedText']
                    response = answer_text
                    return response
        #No usable result was returned by Amazon Kendra
        return ""
        
    def lambda_handler(event, context):
        if(len(event['inputTranscript']) < 3):
            result = "Please try again"
        else:
            query = event['inputTranscript']
            response =  comprehend.detect_dominant_language(Text = query)
            confidence = response["Languages"][0]['Score']
            if confidence > 0.50:
                language = response["Languages"][0]['LanguageCode']
            else:
                #Default to english if there isn't enough confidence
                language = "en"
            result = query_index(query, language)
            if result == "":
                 no_matches = "I'm sorry, I couldn't find matches for your query"
                 result = translate.translate_text(Text=no_matches, SourceLanguageCode="en", TargetLanguageCode=language)
                 result = result['TranslatedText']
            else:
                #Truncate Text
                if len(result) > 340:
                    result = result[:340]
                    result = result.rsplit(' ', 1)
                    result = result[0]+"..."
        response = {
            "dialogAction": {
                "type": "Close",
                "fulfillmentState": "Fulfilled",
                "message": {
                  "contentType": "PlainText",
                  "content": result
                },
            }
        }
        print('result = ' + str(response))
        #Return the response so Amazon Lex can pass it on to the user
        return response
  14. Choose Deploy.

Setting up the chatbot

The chatbot that you create for this use case uses Lambda to fulfill the requests. Essentially, you create a fallback intent and pass the user input to the Lambda function.

To set up a chatbot on the console, complete the following steps:

  1. On the Amazon Lex console, under Bots, choose Create.
  2. Choose Custom bot.
  3. For Bot name, enter a name.
  4. For Language, choose English (US).
  5. Leave the other options at their defaults.
  6. Choose Create.

For this post, we use the fallback intent to process the queries sent to Amazon Kendra. First we need to create an intent.

  1. Choose Create intent.
  2. Enter a name for your intent and choose Add.
  3. Under Sample utterances, enter some sample utterances.
  4. Under Response, enter an example answer.
  5. Choose Save Intent.

Now you can build and test your bot (see the following screenshot).

  1. To import the fallback intent, next to Intents, choose the icon.
  2. Choose Search existing intents.
  3. Search for and choose the built-in intent AMAZON.FallbackIntent.
  4. Enter a name.
  5. Choose Add.
  6. For Fulfillment, select AWS Lambda function.
  7. For Lambda function, choose the function you created.
  8. Choose Save Intent.

Now you disable the clarification questions so you can use the fallback intent on the first attempt.

  1. Under Error handling, deselect Clarification prompts.
  2. Choose Save.
  3. Choose Build.

Testing

After the bot building process is complete, you can test your bot directly on the Amazon Lex console.

Now we issue the same query in French (“Qu’est-ce qu’Amazon Kendra?”) and we get the response back in French.

If you want to test your chatbot as a standalone web application, see Sample Amazon Lex Web Interface on GitHub.

You can also test the Amazon Lex integration with Slack or Facebook Messenger.


About the Authors

Juan Bustos is an AI Services Specialist Solutions Architect at Amazon Web Services, based in Dallas, TX. Outside of work, he loves spending time writing and playing music as well as trying random restaurants with his family.

David Shute is a Senior ML GTM Specialist at Amazon Web Services focused on Amazon Kendra. When not working, he enjoys hiking and walking on a beach.