AWS Machine Learning Blog

Translating your website or application automatically with Amazon Translate in your CI/CD pipeline

AWS allows you to deploy websites and applications globally in minutes. This means that both large enterprises and individual developers can reach users, and thus potential customers, all over the world. However, to provide the best experience, you should not only serve content close to your customers, but also make that content available in their native language.

Translating a website or an application is part of a process called localization and internationalization (L10N and I18N, respectively). To localize content, companies either hire translators (which requires specialized resources, more so if the number of target languages is large) or assign the task to the same developers who built the systems (which may produce sub-optimal translations and keep developers from more important tasks).

Amazon Translate is a neural machine translation service that delivers fast, high-quality, and affordable language translation. For more information about supported languages, see What Is Amazon Translate? Amazon Translate scales automatically to process large amounts of text; you can customize it to handle the specific details of your business and only pay for the amount of text you translate.

This post creates a website with a UI written in English and a continuous integration pipeline that uses Amazon Translate to localize it into Spanish automatically. The following diagram illustrates the architecture of the solution.


A primer on website localization

Localizing a website or application is a task usually shared between developers and translators. Developers are responsible for inserting placeholders or tags in the user interface wherever there is text to localize, whereas translators are responsible for translating those placeholders into the required languages.

To better isolate the responsibilities of each team and facilitate maintenance, developers usually produce separate files for the translators, which only contain translation pairs. The specific details on how the translation occurs and the format of these files depend on the language, framework, and technology stack of the component you are localizing, but the overall idea is usually the same. (For example, PHP sites that use Symfony rely on YAML files, and Java apps written on Spring likely use a .properties file.)

This post works with a simple Python website written in Flask and switches the language depending on the Accept-Language header sent by the web browser (the value of which depends on the user settings). The website uses a package called Babel for handling the actual translations. Babel is a set of utilities and wrappers on top of Python’s gettext module, which in turn is an abstraction layer over GNU gettext.
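
To illustrate how this language switch is wired up, the following is a minimal sketch of the Flask application setup. It assumes the Flask-Babel extension; the decorator style shown here is from Flask-Babel 2.x, whereas newer releases pass the selector function to the Babel constructor instead.

from flask import Flask, request
from flask_babel import Babel

app = Flask(__name__)
babel = Babel(app)

@babel.localeselector
def get_locale():
    # Match the languages the browser advertises in its Accept-Language
    # header against the ones the application supports.
    return request.accept_languages.best_match(['en', 'es'])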

With Babel, when you design a user interface or web page and want to translate a piece of text, you need to tell the translation engine to replace it. For example, you might have the following HTML paragraph:

<p>Hello</p>

To translate it, you just need to wrap the text string with the gettext() function:

<p>{{ gettext('Hello') }}</p>

You can also use the following shorthand alias _(), which is equivalent:

<p>{{ _('Hello') }}</p>

The gettext() function looks up the translation corresponding to the string “Hello” for the currently active language and returns that value. If the translation has not been defined, it returns the original text in English instead. (The curly brackets are the rendering tags that Jinja uses, which is the templating engine that comes bundled with Flask.)
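
The same functions are also available in Python view code, not just in templates. The following is a small illustrative sketch, again assuming Flask-Babel, that shows the fallback behavior described above:

from flask import Flask
from flask_babel import Babel, gettext

app = Flask(__name__)
babel = Babel(app)

@app.route('/greeting')
def greeting():
    # Same lookup as in templates: if the active language has no
    # translation for 'Hello', the original English string is returned.
    return gettext('Hello')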

Up to this point, you have indicated what you want to translate, but you also need to provide the actual translations. With gettext, translations are kept in Portable Object (PO) and Machine Object (MO) files, which have .po and .mo extensions, respectively. PO files look like the following text snippet:

#: file.html:23
msgid "Hello"
msgstr "Hola"

msgid is the original, untranslated string, and msgstr is the localized string. (The first line is just a comment indicating where the string was found.) PO files usually contain many blocks like this, one for each string that you need to localize, and each language has its own PO file.

MO files, on the other hand, are a binary representation of PO files, optimized for consumption by the gettext system at runtime. Translators work on PO files, and developers compile them into MO files before deploying the application.
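
You can also work with these files programmatically. The following sketch uses polib, the same library the translation script below relies on, to inspect the Spanish catalog and compile it to the binary format (the path shown is the one this project uses):

import polib

# Parse the Spanish catalog and print its translation pairs.
po = polib.pofile('app/translations/es/LC_MESSAGES/messages.po')
for entry in po:
    print(entry.msgid, '->', entry.msgstr)

# polib can also compile a catalog straight to the binary MO format,
# which is what pybabel compile does for every language at once.
po.save_as_mofile('app/translations/es/LC_MESSAGES/messages.mo')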

Automating the process

Now that you know how the translation process occurs, the next step is automating it. Luckily, Babel comes with a command line utility called pybabel that can help you.

The first step is generating a PO file to serve as a base template, also called a portable object template (POT) file. See the following command:

$ pybabel extract -F babel.cfg -o messages.pot .

This command goes over your source code files looking for calls to the gettext() function and writes each occurrence to the messages.pot file. babel.cfg is a configuration file that determines, among other things, which files to include in the search.
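
The exact contents of babel.cfg depend on your project layout. For a Flask project like this one, a minimal version might look like the following (the paths are illustrative):

[python: **.py]
[jinja2: app/templates/**.html]

The first line tells pybabel to scan every Python file for translatable strings, and the second tells it to parse the HTML templates with the Jinja2 extractor.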

The next step is to copy this template for each language you would like to localize your application into. See the following command:

$ pybabel init -i messages.pot -d app/translations -l es

This tells Babel to take the messages.pot template and make a copy to use for the Spanish translation. If you also want to include French, you could enter this command again and specify fr as the language, and so on.
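
For example, a short shell loop could initialize several catalogs at one time. Keep in mind that pybabel init creates each catalog from scratch; for catalogs that already contain translations, pybabel update merges new strings into them instead:

$ for lang in es fr; do pybabel init -i messages.pot -d app/translations -l $lang; done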

If you open the resulting PO file, you can see that it is not yet localized, so this is where the translators usually come in. But you can use Amazon Translate instead. This post provides a Python script called generate_translations.py to help you. This script reads each PO file, invokes the Amazon Translate API one time for every string, and saves the file with the appropriate translations in place.

See the following code:

#!/usr/bin/env python
import argparse
import boto3
import os
import polib
import time

DEFAULT_SOURCE_LANG = 'en'
DEFAULT_TRANSLATIONS_ROOT_DIR = 'app/translations'

def validate_translations_dir(dir):
    """Checks that the given translations directory exists.

    Args:
        dir: The relative path to the directory.

    Raises:
        Exception: The translations directory does not exist.
    """
    if not os.path.exists(dir):
        raise Exception("Translations directory '{}' does not exist".format(dir))

def get_pofile_path_for_language(lang, translations_dir):
    """Returns the relative path to the .PO file for the specified language.

    Args:
        lang: The language code.
        translations_dir: The relative path to the directory containing all
            translations-related files.

    Returns:
        A string with the relative path to the .PO file.
    """
    return "{}/{}/LC_MESSAGES/messages.po".format(translations_dir, lang)

def get_target_languages(translations_dir):
    """Returns all languages that the app should be translated into.

    Args:
        translations_dir: The relative path to the directory containing all
            translations-related files.

    Returns:
        A list of language codes. For example:

        ['es', 'fr']
    """
    target_languages = next(os.walk(translations_dir))[1]
    print('Detected languages:', target_languages)
    return target_languages

def translate_language(src_lang, dest_lang, pofile_path):
    """Translate the app strings from and to the specified languages by
    invoking the Amazon Translate API.

    When this function completes, the translated strings will be written to the
    .PO file of the destination language.

    Args:
        src_lang: The ISO code of the language to translate from, e.g., 'en'.
        dest_lang: The ISO code of the language to translate to, e.g., 'es'.
        pofile_path: The path to the .PO file containing the strings to be
            translated.
    """
    print("Translating from '{}' to '{}'".format(src_lang, dest_lang))

    translate = boto3.client('translate')

    po = polib.pofile(pofile_path)
    for entry in po:
        print("Translating entry '{}' to language '{}'".format(entry.msgid, dest_lang))

        # Calculate the time that the request takes
        time_before = time.time_ns()
        response = translate.translate_text(
            Text=entry.msgid,
            SourceLanguageCode=src_lang,
            TargetLanguageCode=dest_lang
        )
        time_after = time.time_ns()

        entry.msgstr = response['TranslatedText']

        # Wait if needed to avoid throttling: time.time_ns() measures in
        # nanoseconds, so keep at least 50 ms (50,000,000 ns) between calls
        time_diff = time_after - time_before
        if time_diff < 50_000_000:
            time.sleep(0.05)

    po.save(pofile_path)

def generate_translations_without_cache(src_lang, translations_dir):
    """Translate the app to all applicable target languages, without using
    a cache.

    Args:
        src_lang: The ISO code of the language to translate from, e.g., 'en'.
        translations_dir: The relative path to the directory containing all
            translations-related files.
    """
    target_languages = get_target_languages(translations_dir)

    for dest_lang in target_languages:
        pofile_path = get_pofile_path_for_language(dest_lang, translations_dir)
        translate_language(src_lang, dest_lang, pofile_path)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--source-language",
                        help="the ISO code of the source language to translate from (e.g., 'en'), defaults to '{}'".format(DEFAULT_SOURCE_LANG),
                        default=DEFAULT_SOURCE_LANG)
    parser.add_argument("--translations-dir",
                        help="the relative path to the directory containing all translation data, defaults to '{}'".format(DEFAULT_TRANSLATIONS_ROOT_DIR),
                        default=DEFAULT_TRANSLATIONS_ROOT_DIR)
    args = parser.parse_args()

    print("Source language is:      {}".format(args.source_language))
    print("Translations dir is:     {}".format(args.translations_dir))

    # Check that the translations folder exists.
    validate_translations_dir(args.translations_dir)

    generate_translations_without_cache(args.source_language, args.translations_dir)

It produces the following output:

$ ./generate_translations.py
Source language is:      en
Translations dir is:     app/translations
Detected languages: ['es']
Translating entry 'Hello' to language 'es'

At this point, the PO files have the strings translated into the appropriate language.

Finally, you need to compile everything into MO files. Again, you can use pybabel. See the following command:

$ pybabel compile -d app/translations/

After this step, your application displays content in Spanish if the request includes the appropriate value in the Accept-Language header. Otherwise, it falls back to English.
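
You can verify this behavior locally without changing your browser settings by sending the header explicitly, for example with curl (assuming the development server is listening on port 5000, Flask's default):

$ curl -H "Accept-Language: es" http://localhost:5000/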

Creating a CI/CD pipeline

You have scripted the process of localizing your website, but you still need to trigger it manually. What if you could do this every time a developer checks in new code?

Continuous integration/continuous delivery (CI/CD) pipelines are the perfect solution for this. As the name implies, development teams use CI/CD pipelines to automatically compile their code, run unit and integration tests, and even deploy the application whenever someone pushes a change to the code repository.

You can use the following AWS developer tools to build a CI/CD pipeline:

  • AWS CodeCommit – a fully managed source control service that hosts Git repositories
  • AWS CodeBuild – a fully managed build service that compiles source code, runs tests, and produces deployment artifacts
  • AWS CodeDeploy – a fully managed deployment service that automates code deployments
  • AWS CodePipeline – a fully managed continuous delivery service that orchestrates the source, build, and deploy stages

You can implement a basic pipeline with three stages. The following diagram illustrates this workflow, from the source stage, to the build stage, and to the deploy stage.

The source stage checks out new code whenever a change occurs; the deploy stage pushes a new version to the production environment. This post deploys the site to an Amazon ECS cluster, but you could use Amazon EC2 instances, AWS Lambda, or even on-premises servers.

The key part is the build stage because this is where the actual translation occurs. CodeBuild, the service that runs the builds, relies on a file called buildspec.yaml, which contains all the commands that need to run as part of the build process. For this project, the file has the following content:

version: 0.2

phases:
  install:
    runtime-versions:
      python: 3.7
  pre_build:
    commands:
      - pip install -r requirements.txt
  build:
    commands:
      # Extract translation strings from our application code
      - pybabel extract --omit-header -F babel.cfg -o messages.pot .
      # Generate one translation file per applicable language
      - pybabel init -i messages.pot -d app/translations -l es
      # Run our automated translation script
      - ./generate_translations.py
      # Finally, compile all translations so that they can be used by our app
      - pybabel compile -d app/translations/

      # Build and push the Docker image to ECR
      - $(aws ecr get-login --no-include-email --region us-east-1)
      - docker build -t translate-web-app .
      - docker tag translate-web-app:latest ${WEB_APP_IMAGE_URI}:latest
      - docker push ${WEB_APP_IMAGE_URI}:latest
  post_build:
    commands:
      - printf '[{"name":"WebAppContainer", "imageUri":"%s:latest"}]' $WEB_APP_IMAGE_URI > imagedefinitions.json

artifacts:
  files:
    - imagedefinitions.json

Three important things are happening:

  • You are declaring that your environment should include a Python 3.7 runtime
  • You are installing the required Python dependencies (such as Babel) in the pre_build phase
  • You are invoking the same commands described previously, in the appropriate order, to generate the localized versions of your texts

After this, you are ready to see how everything fits together. Your website currently looks like the following screenshot, with a welcome message in English.

Add a new text paragraph, just after the last one:

<p>{{ _('This is a newly added sentence. This should appear correctly translated as soon as the build process completes. Amazon Translate makes this very easy!') }}</p>

After you commit and push your changes, a new execution of your pipeline should be triggered automatically. When it reaches the build stage, you can check the logs in the CodeBuild console. The translation process should run without issues and produce the following output:

[…]
Translating entry 'Settings' to language 'es'
Translating entry 'Get help' to language 'es'
Translating entry 'Welcome to this site.' to language 'es'
Translating entry 'Use the links on the left to navigate through the site, or sign in using the button at the upper right corner.' to language 'es'
Translating entry 'All text shown in this application is automatically translated as part of your Continuous Integration / Continuous Delivery (CI/CD) flow.' to language 'es'
Translating entry 'This is a newly added sentence. This should appear correctly translated as soon as the build process completes. Amazon Translate makes this very easy!' to language 'es'

After the build stage completes successfully, wait for the deployment to finish. When it’s finished, you should see your new sentence after you refresh the page. The following screenshot shows the website with the updated text.

If you change the browser language to Spanish and reload the page, the whole site displays in Spanish, including your new paragraph. The following screenshot shows this translation.

Optimizing cost

Your website is now automatically and effortlessly localized into as many languages as you want, and Amazon Translate supports more than 50 different languages.

However, the localization process takes place during each build, even if the texts have not changed between commits. This is not cost-effective because you are paying to translate content that you already translated.

To improve this, you can use Amazon S3 as a cache to store the localized strings. You can also save the hash of the base POT file and compare it with the hash of the new POT file on every run. If the hashes match, the POT file has not changed, and you can download the translations from the cache, which avoids the need to invoke the Amazon Translate API. This saves you money and accelerates the build process because you don’t need to wait for all the translations to complete—they are already in the PO files.

If the POT file has changed, you can use Amazon Translate and upload the new PO files and the hash to S3, which effectively updates the cache.

The generate_translations.py script has been updated to support this new feature. It now looks like the following code:

#!/usr/bin/env python
import argparse
from botocore.exceptions import ClientError
import boto3
import hashlib
import os
import polib
import time

DEFAULT_SOURCE_LANG = 'en'
DEFAULT_TRANSLATIONS_ROOT_DIR = 'app/translations'

def validate_translations_dir(dir):
    """Checks that the given translations directory exists.

    Args:
        dir: The relative path to the directory.

    Raises:
        Exception: The translations directory does not exist.
    """
    if not os.path.exists(dir):
        raise Exception("Translations directory '{}' does not exist".format(dir))

def validate_cache_bucket(s3_client):
    """Checks that the S3 bucket to be used as the cache exists and can be
    accessed with the AWS credentials available to this script.

    Args:
        s3_client: The boto3 S3 client.

    Raises:
        Exception: The TRANSLATIONS_CACHE_BUCKET environment variable is not
        set, or the S3 bucket either does not exist or cannot be accessed.
    """
    # Check that the cache bucket actually exists in S3.
    bucket = os.environ['TRANSLATIONS_CACHE_BUCKET']
    try:
        s3_client.head_bucket(Bucket=bucket)
    except ClientError as e:
        raise Exception("The translations cache bucket '{}' does not exist or cannot be accessed".format(bucket))

def get_pofile_path_for_language(lang, translations_dir):
    """Returns the relative path to the .PO file for the specified language.

    Args:
        lang: The language code.
        translations_dir: The relative path to the directory containing all
            translations-related files.

    Returns:
        A string with the relative path to the .PO file.
    """
    return "{}/{}/LC_MESSAGES/messages.po".format(translations_dir, lang)

def get_target_languages(translations_dir):
    """Returns all languages that the app should be translated into.

    Args:
        translations_dir: The relative path to the directory containing all
            translations-related files.

    Returns:
        A list of language codes. For example:

        ['es', 'fr']
    """
    target_languages = next(os.walk(translations_dir))[1]
    print('Detected languages:', target_languages)
    return target_languages

def get_hash_from_file(filename):
    """Calculates and returns an SHA256 hash of the specified file.

    Args:
        filename: The path to the file whose hash is to be calculated.

    Returns:
        An SHA256 hash in hexadecimal representation.
    """
    sha256_hash = hashlib.sha256()
    with open(filename, "rb") as f:
        # Read the file in 4 KB blocks so that large files can be handled
        # without loading them into memory all at once.
        for byte_block in iter(lambda: f.read(4096), b""):
            sha256_hash.update(byte_block)

    return sha256_hash.hexdigest()

def should_cache_be_used(cache_bucket, s3_client):
    """Determines whether the cache represented by the specified S3 bucket
    is up-to-date and should be used in the translation process.

    The cache is considered to be up-to-date if the bucket contains a file with
    an SHA256 hash of the template file, and this hash matches the one of the
    current template file.

    Args:
        cache_bucket: The name of the S3 bucket to be used as the cache.
        s3_client: The boto3 S3 client.

    Returns:
        A tuple containing, in this order, a boolean indicating whether the
        cache is considered to be up-to-date and the current SHA256 hash of the
        messages file.
    """
    # Calculate the SHA256 hash of the original messages file.
    current_hash = get_hash_from_file('messages.pot')

    # Get the hash of the messages file from the cache bucket.
    use_cache = False
    try:
        hash_response = s3_client.get_object(Bucket=cache_bucket, Key='messages.pot.sha256')
        latest_hash = hash_response['Body'].read().decode('utf-8')

        if latest_hash == current_hash:
            print('Hashes match, will try to use cache first')
            use_cache = True
        else:
            print('Hashes do not match, cache will be skipped')

    except ClientError as e:
        print("'messages.pot.sha256' is not present in bucket")

    return (use_cache, current_hash)

def translate_language(src_lang, dest_lang, pofile_path):
    """Translate the app strings from and to the specified languages by
    invoking the Amazon Translate API.

    When this function completes, the translated strings will be written to the
    .PO file of the destination language.

    Args:
        src_lang: The ISO code of the language to translate from, e.g., 'en'.
        dest_lang: The ISO code of the language to translate to, e.g., 'es'.
        pofile_path: The path to the .PO file containing the strings to be
            translated.
    """
    print("Translating from '{}' to '{}'".format(src_lang, dest_lang))

    translate = boto3.client('translate')

    po = polib.pofile(pofile_path)
    for entry in po:
        print("Translating entry '{}' to language '{}'".format(entry.msgid, dest_lang))

        # Calculate the time that the request takes
        time_before = time.time_ns()
        response = translate.translate_text(
            Text=entry.msgid,
            SourceLanguageCode=src_lang,
            TargetLanguageCode=dest_lang
        )
        time_after = time.time_ns()

        entry.msgstr = response['TranslatedText']

        # Wait if needed to avoid throttling: time.time_ns() measures in
        # nanoseconds, so keep at least 50 ms (50,000,000 ns) between calls
        time_diff = time_after - time_before
        if time_diff < 50_000_000:
            time.sleep(0.05)

    po.save(pofile_path)

def generate_translations_with_cache(src_lang, translations_dir, cache_bucket):
    """Translate the app to all applicable languages, using the given S3 bucket
    name as cache when possible.

    Args:
        src_lang: The ISO code of the language to translate from, e.g., 'en'.
        translations_dir: The relative path to the directory containing all
            translations-related files.
        cache_bucket: The name of the S3 bucket to be used as the cache.
    """
    s3_client = boto3.client('s3')

    # Check that the cache bucket exists and can be accessed.
    validate_cache_bucket(s3_client)

    # Determine whether we should try to get translations from the cache first.
    # This will save us requests to the Amazon Translate API.
    use_cache, current_hash = should_cache_be_used(cache_bucket, s3_client)

    target_languages = get_target_languages(translations_dir)

    for dest_lang in target_languages:
        pofile_path = get_pofile_path_for_language(dest_lang, translations_dir)
        if use_cache:
            try:
                response = s3_client.get_object(Bucket=cache_bucket, Key="{}/messages.po".format(dest_lang))
                with open(pofile_path, "wb") as f:
                    f.write(response['Body'].read())
                print("Retrieved messages file from cache")
            except ClientError as e:
                print("Cache file not found, will regenerate")
                translate_language(src_lang, dest_lang, pofile_path)
        else:
            translate_language(src_lang, dest_lang, pofile_path)

        # Upload localized messages file to S3 bucket.
        print("Uploading to cache")
        s3_client.put_object(
            Bucket=cache_bucket,
            Key="{}/messages.po".format(dest_lang),
            Body=open(pofile_path, "rb")
        )

    # Update hash in cache bucket, if needed.
    if not use_cache:
        print("Uploading hash")
        s3_client.put_object(
            Bucket=cache_bucket,
            Key='messages.pot.sha256',
            Body=current_hash
        )

def generate_translations_without_cache(src_lang, translations_dir):
    """Translate the app to all applicable target languages, without using
    a cache.

    Args:
        src_lang: The ISO code of the language to translate from, e.g., 'en'.
        translations_dir: The relative path to the directory containing all
            translations-related files.
    """
    target_languages = get_target_languages(translations_dir)

    for dest_lang in target_languages:
        pofile_path = get_pofile_path_for_language(dest_lang, translations_dir)
        translate_language(src_lang, dest_lang, pofile_path)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--source-language",
                        help="the ISO code of the source language to translate from (e.g., 'en'), defaults to '{}'".format(DEFAULT_SOURCE_LANG),
                        default=DEFAULT_SOURCE_LANG)
    parser.add_argument("--translations-dir",
                        help="the relative path to the directory containing all translation data, defaults to '{}'".format(DEFAULT_TRANSLATIONS_ROOT_DIR),
                        default=DEFAULT_TRANSLATIONS_ROOT_DIR)
    parser.add_argument("--no-cache",
                        help="disable the translations cache",
                        action="store_true")
    args = parser.parse_args()

    print("Source language is:      {}".format(args.source_language))
    print("Translations dir is:     {}".format(args.translations_dir))

    # Check that the translations folder exists.
    validate_translations_dir(args.translations_dir)

    if args.no_cache:
        generate_translations_without_cache(args.source_language, args.translations_dir)
    else:
        if 'TRANSLATIONS_CACHE_BUCKET' not in os.environ:
            raise Exception("Environment variable TRANSLATIONS_CACHE_BUCKET is not set, please specify a value")

        generate_translations_with_cache(args.source_language,
                                         args.translations_dir,
                                         os.environ['TRANSLATIONS_CACHE_BUCKET'])

The first time you run your pipeline, the cache is empty and the logs show that Amazon Translate was invoked. See the following output:

$ ./generate_translations.py
Source language is:      en
Translations dir is:     app/translations
'messages.pot.sha256' is not present in bucket
Detected languages: ['es']
Translating from 'en' to 'es'
Translating entry 'Blog' to language 'es'
Translating entry 'Subscription plans' to language 'es'
Translating entry 'My account' to language 'es'
Translating entry 'Log off' to language 'es'

However, if you run it again (without changing anything), the cache is used instead. See the following output:

$ ./generate_translations.py
Source language is:      en
Translations dir is:     app/translations
Hashes match, will try to use cache first
Detected languages: ['es']
Retrieved messages file from cache

Conclusion

This post showed a straightforward and cost-effective way to make your content available to users who speak other languages by using Amazon Translate. You can use Amazon S3 as a translations cache to reduce your costs, and you can automate the process using CodeCommit, CodeBuild, CodeDeploy, and CodePipeline.

This post used a Python-based website as an example, but you can adapt the steps to other languages and frameworks because the overall idea remains the same.


About the Author

Carlos Afonso is a Solutions Architect based out of Madrid. He helps startups across Spain and Portugal build robust, fault-tolerant and cost-effective applications on the AWS cloud. When not talking about AWS, you can often find him coding for fun or attempting to craft his own beer (with varying degrees of success).