AWS Machine Learning Blog
Translating your website or application automatically with Amazon Translate in your CI/CD pipeline
AWS allows you to deploy websites and applications globally in minutes. This means that both large enterprises and individual developers can reach users, and thus potential customers, all over the world. However, to provide the best experience, you should not only serve content close to your customers, but also make that content available in their native language.
Translating a website or an application is part of a process called localization and internationalization (L10N and I18N, respectively). To localize content, companies either hire translators (which requires specialized resources, more so if the number of target languages is large) or assign the task to the same developers who built the systems (which may produce sub-optimal results and keep developers from more important tasks).
Amazon Translate is a neural machine translation service that delivers fast, high-quality, and affordable language translation. For more information about supported languages, see What Is Amazon Translate? Amazon Translate scales automatically to process large amounts of text; you can customize it to handle the specific details of your business and only pay for the amount of text you translate.
This post shows how to create a website with a UI written in English, along with a continuous integration pipeline that uses Amazon Translate to localize it into Spanish automatically. The following diagram illustrates the architecture of the solution.
A primer on website localization
Localizing a website or application is a task usually shared between developers and translators. Developers are responsible for marking the text that needs to be localized in the user interface with placeholders or tags, whereas the translators are responsible for translating those placeholders into the required languages.
To better isolate the responsibilities of each team and facilitate maintenance, developers usually produce separate files for the translators, which only contain translation pairs. The specific details on how the translation occurs and the format of these files depend on the language, framework, and technology stack of the component you are localizing, but the overall idea is usually the same. (For example, PHP sites that use Symfony rely on YAML files, and Java apps built with Spring typically use a .properties file.)
This post works with a simple Python website written in Flask that switches the language depending on the Accept-Language header sent by the web browser (whose value depends on the user's settings). The website uses a package called Babel to handle the actual translations. Babel is a set of utilities and wrappers on top of Python's gettext module, which in turn is an abstraction layer over GNU gettext.
With Babel, when you design a user interface or web page and want to translate a piece of text, you need to tell the translation engine to replace it. For example, you might have the following HTML paragraph:
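Using the "Hello" greeting that the rest of this post refers to, the paragraph could be as simple as:

```html
<p>Hello</p>
```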
To translate it, you just need to wrap the text string with the gettext() function:
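In the Jinja template, the wrapped version would look like this:

```html
<p>{{ gettext('Hello') }}</p>
```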
You can also use the shorthand alias _(), which is equivalent:
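That is:

```html
<p>{{ _('Hello') }}</p>
```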
The gettext() function looks up the translation corresponding to the string "Hello" for the currently active language and returns that value. If the translation has not been defined, it returns the original text in English instead. (The curly brackets are the rendering tags used by Jinja, the templating engine that comes bundled with Flask.)
Up to this point, you have indicated what you want to translate, but you also need to provide the actual translations. With gettext, translations are kept in Portable Object (PO) and Machine Object (MO) files, which have .po and .mo extensions, respectively. PO files look like the following text snippet:
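For example, the entry for the greeting in the Spanish PO file would look like this (the comment records where the string was found; the path shown is illustrative):

```
#: templates/index.html:4
msgid "Hello"
msgstr "Hola"
```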
Here, msgid is the original or untranslated string, and msgstr is the localized string. (The first line is just a comment.) PO files usually contain many blocks like this, one for each string that you need to localize, and each language has its own PO file.
MO files, on the other hand, are a binary representation of PO files, optimized for use by the gettext system. Translators work on PO files, and developers compile them into MO files before deploying the application.
Automating the process
Now that you know how the translation process occurs, the next step is automating it. Luckily, Babel comes with a command line utility called pybabel that can help you.
The first step is generating a PO file to serve as a base template, also called a portable object template (POT) file. See the following command:
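In this command, babel.cfg is the configuration file described next, messages.pot is the template to generate, and the final argument is the directory to scan:

```
pybabel extract -F babel.cfg -o messages.pot .
```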
This command goes over your source code files and looks for calls to the gettext() function; each occurrence is written to the messages.pot file. The babel.cfg file is a configuration file that determines, among other things, which files to include in the search.
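For a Flask project like this one, a minimal babel.cfg could simply tell Babel to scan the Python sources and the Jinja templates:

```
[python: **.py]
[jinja2: **/templates/**.html]
```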
The next step is to copy this template for each language you would like to localize your application into. See the following command:
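Assuming the message catalogs are kept in a directory named translations (the default location for Flask-Babel):

```
pybabel init -i messages.pot -d translations -l es
```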
This tells Babel to take the messages.pot template and make a copy to use for the Spanish translation. If you also want to include French, you could enter the command again and specify fr as the language, and so on.
If you open the resulting PO file, you can see that it is not yet localized, so this is where the translators usually come in. But you can use Amazon Translate instead. This post provides a Python script called generate_translations.py to help you. The script reads each PO file, invokes the Amazon Translate API one time for every string, and saves the file with the appropriate translations in place.
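A minimal sketch of such a script, assuming the translations/<language>/LC_MESSAGES/messages.po layout that pybabel creates and using the babel and boto3 packages, could look like the following code:

```python
import glob

import boto3
from babel.messages.pofile import read_po, write_po

SOURCE_LANGUAGE = "en"
translate = boto3.client("translate")


def translate_po_file(po_path, target_language):
    """Fill in every untranslated string in a PO file using Amazon Translate."""
    with open(po_path, "rb") as po_file:
        catalog = read_po(po_file, locale=target_language)

    for message in catalog:
        # Skip the header entry, plural forms, and strings that are already translated
        if not message.id or not isinstance(message.id, str) or message.string:
            continue
        response = translate.translate_text(
            Text=message.id,
            SourceLanguageCode=SOURCE_LANGUAGE,
            TargetLanguageCode=target_language,
        )
        message.string = response["TranslatedText"]
        print(f"[{target_language}] {message.id} -> {message.string}")

    with open(po_path, "wb") as po_file:
        write_po(po_file, catalog)


if __name__ == "__main__":
    # One catalog per language, for example translations/es/LC_MESSAGES/messages.po;
    # the language code is taken from the directory name.
    for po_path in glob.glob("translations/*/LC_MESSAGES/messages.po"):
        translate_po_file(po_path, po_path.split("/")[1])
```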
As it runs, the script prints one line for every string it translates.
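For example, with the "Hello" string among those marked for translation:

```
[es] Hello -> Hola
```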
At this point, the PO files have the strings translated into the appropriate language.
Finally, you need to compile everything into MO files. Again, you can use pybabel. See the following command:
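Assuming the same translations directory as before:

```
pybabel compile -d translations
```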
After this step, your application displays content in Spanish if the request includes the appropriate value in the Accept-Language header. Otherwise, it falls back to English.
Creating a CI/CD pipeline
You have scripted the process of localizing your website, but you still need to trigger it manually. What if you could do this every time a developer checks in new code?
Continuous integration/continuous delivery (CI/CD) pipelines are the perfect solution for this. As the name implies, development teams use CI/CD pipelines to automatically compile their code, run unit and integration tests, and even deploy the application whenever someone pushes a change to the code repository.
You can use the following AWS developer tools to build a CI/CD pipeline:
- AWS CodeCommit – Hosts your code as Git repositories
- AWS CodePipeline – Orchestrates the pipeline itself
- AWS CodeBuild – Executes the build commands
- AWS CodeDeploy – Deploys your website and makes it available to your users
You can implement a basic pipeline with three stages. The following diagram illustrates this workflow, from the source stage, to the build stage, and to the deploy stage.
The source stage checks out new code whenever a change occurs; the deploy stage pushes a new version to the production environment. This post deploys the site to an Amazon ECS cluster, but you could use Amazon EC2 instances, AWS Lambda, or even on-premises servers.
The key part is the build stage, because this is where the actual translation occurs. CodeBuild, the service you use to run the builds, relies on a file called buildspec.yaml, which contains all the commands that need to run as part of the build process.
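The exact buildspec depends on how you package and deploy the site; a sketch of the localization-related parts, assuming the project's dependencies are listed in a requirements.txt file, could look like this:

```yaml
version: 0.2

phases:
  install:
    runtime-versions:
      python: 3.7
  pre_build:
    commands:
      # Install Babel, boto3, Flask, and the rest of the project dependencies
      - pip install -r requirements.txt
  build:
    commands:
      # Extract the strings, create the Spanish catalog, translate it, and compile it.
      # If the Spanish catalog is already committed to the repository, use
      # pybabel update instead of pybabel init.
      - pybabel extract -F babel.cfg -o messages.pot .
      - pybabel init -i messages.pot -d translations -l es
      - python generate_translations.py
      - pybabel compile -d translations

artifacts:
  files:
    # Hand everything over to the deploy stage
    - '**/*'
```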
Three important things are happening:
- You are declaring that your environment should include a Python 3.7 runtime
- You are installing the required Python dependencies (such as Babel) in the pre_build phase
- You are invoking the same commands described previously, in the appropriate order, to generate the localized versions of your texts
After this, you are ready to see how everything fits together. Your website currently looks like the following screenshot, with a welcome message in English.
Add a new text paragraph, just after the last one:
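For example (any sentence works, as long as it goes through the translation function):

```html
<p>{{ _('This new paragraph will be translated automatically.') }}</p>
```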
After you commit and push your changes, a new execution of your pipeline should be triggered automatically. When it reaches the build stage, you can check the logs in the CodeBuild console. The translation process should run without issues and translate the new string along with the existing ones.
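With the sketch script from earlier, the relevant log lines look something like the following output:

```
[es] Hello -> Hola
[es] This new paragraph will be translated automatically. -> Este nuevo párrafo se traducirá automáticamente.
```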
After the build stage completes successfully, wait for the deployment to finish. When it’s finished, you should see your new sentence after you refresh the page. The following screenshot shows the website with the updated text.
However, if you change the browser language to Spanish and reload again, the whole site displays in Spanish, including your new paragraph. The following screenshot shows this translation.
Optimizing cost
Your website is automatically and effortlessly localized to as many languages as you want, and Amazon Translate can help you with more than 50 different languages.
However, the localization process takes place during each build, even if the texts have not changed between commits. This is not cost-effective because you are paying to translate content that you already translated.
To improve this, you can use Amazon S3 as a cache to store the localized strings. You can also save the hash of the base POT file and compare it with the new POT file on every run. If the hashes match, the POT file has not changed, and you can download the translations from the cache, which avoids the need to invoke the Amazon Translate API. This saves you money and accelerates the build process because you don’t need to wait for all the translations to complete—they are already in the PO files.
If the POT file has changed, you can use Amazon Translate and upload the new PO files and the hash to S3, which effectively updates the cache.
To support this new feature, you need to update the generate_translations.py script.
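In outline, the cached version could look like the following code (the bucket name and the cache key layout are illustrative; the hash of messages.pot and the translated PO files are both stored in the bucket):

```python
import glob
import hashlib

import boto3
from babel.messages.pofile import read_po, write_po
from botocore.exceptions import ClientError

SOURCE_LANGUAGE = "en"
CACHE_BUCKET = "my-translations-cache"  # illustrative bucket name
HASH_KEY = "messages.pot.sha256"        # illustrative key for the POT hash

translate = boto3.client("translate")
s3 = boto3.client("s3")


def pot_hash():
    """Return the SHA-256 hash of the current POT template."""
    with open("messages.pot", "rb") as pot_file:
        return hashlib.sha256(pot_file.read()).hexdigest()


def cached_hash():
    """Return the hash stored in the cache, or None if the cache is empty."""
    try:
        response = s3.get_object(Bucket=CACHE_BUCKET, Key=HASH_KEY)
        return response["Body"].read().decode("utf-8")
    except ClientError:
        return None


def translate_po_file(po_path, target_language):
    """Fill in every untranslated string in a PO file using Amazon Translate."""
    with open(po_path, "rb") as po_file:
        catalog = read_po(po_file, locale=target_language)
    for message in catalog:
        if not message.id or not isinstance(message.id, str) or message.string:
            continue
        response = translate.translate_text(
            Text=message.id,
            SourceLanguageCode=SOURCE_LANGUAGE,
            TargetLanguageCode=target_language,
        )
        message.string = response["TranslatedText"]
        print(f"[{target_language}] {message.id} -> {message.string}")
    with open(po_path, "wb") as po_file:
        write_po(po_file, catalog)


def main():
    current_hash = pot_hash()
    po_paths = glob.glob("translations/*/LC_MESSAGES/messages.po")

    if current_hash == cached_hash():
        # Cache hit: the texts have not changed, so reuse the PO files stored in S3
        print("POT file unchanged, reusing translations from the S3 cache")
        for po_path in po_paths:
            language = po_path.split("/")[1]
            s3.download_file(CACHE_BUCKET, f"{language}/messages.po", po_path)
        return

    # Cache miss: translate with Amazon Translate, then refresh the cache
    print("Translating strings with Amazon Translate and updating the cache")
    for po_path in po_paths:
        language = po_path.split("/")[1]
        translate_po_file(po_path, language)
        s3.upload_file(po_path, CACHE_BUCKET, f"{language}/messages.po")
    s3.put_object(Bucket=CACHE_BUCKET, Key=HASH_KEY, Body=current_hash.encode("utf-8"))


if __name__ == "__main__":
    main()
```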
If you run your pipeline now, the cache is empty and the logs show that Amazon Translate was used.
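With the sketch above, the CodeBuild logs contain something like the following output:

```
Translating strings with Amazon Translate and updating the cache
[es] Hello -> Hola
[es] This new paragraph will be translated automatically. -> Este nuevo párrafo se traducirá automáticamente.
```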
However, if you run it again without changing anything, the cache is used instead.
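Whereas the second run only needs the cache:

```
POT file unchanged, reusing translations from the S3 cache
```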
Conclusion
This post showed a straightforward and cost-effective way to make your content available to users that speak other languages by using Amazon Translate. You can use Amazon S3 as a translations cache to reduce your costs, and you can automate the process using CodeCommit, CodeBuild, CodeDeploy, and CodePipeline.
This post used a Python-based website as an example, but you can adapt the steps to other languages and frameworks because the overall idea remains the same.
About the Author
Carlos Afonso is a Solutions Architect based out of Madrid. He helps startups across Spain and Portugal build robust, fault-tolerant and cost-effective applications on the AWS cloud. When not talking about AWS, you can often find him coding for fun or attempting to craft his own beer (with varying degrees of success).