AWS Architecture Blog
Speed Up Translation Jobs with a Fully Automated Translation System Assistant
Like other industries, translation and localization companies face the challenge of providing fast delivery at a low cost. To address this challenge, organizations use Machine Translation (MT) to complement their translator teams. MT is the use of automated software that translates text without the need of human involvement.
One of the most recent advancements is Active Custom Translation (ACT). ACT helps tailor translated text to a specific language style or terminology, per customer specifications. In the past, organizations built custom models to include ACT in their translation system. Amazon Translate has an Active Custom Translation feature, which helps customers integrate configurable MT capabilities into their translation systems, without needing to build it themselves.
This blog describes an end-to-end automated translation flow, including guidelines to manage the data involved in the ACT process. The solution combines Amazon Translate with other Amazon Web Services (AWS) such as AWS DataSync and AWS Lambda. Before exploring this architecture, let’s explain a few basic concepts specific to the translation and localization industry.
Standard translation concepts
Translation Memory. It is common to reuse previously generated outputs as components for machine translation systems. This data is commonly called Translation Memory, and is stored and exchanged according to standardized formats (TMX, TSV, or CSV).
Source Text. Translation input data is commonly exchanged as XML Localization Interchange File Format (XLIFF) documents. Amazon Translate recently added the support of XLIFF documents for batch processing.
Figure 1 illustrates a standard translation flow involving machine translation and translation memory. Once the output has been reviewed and finalized, it is part of the company’s intellectual property (IP). It can then be reincorporated into the flywheel as an input to future translation jobs.
Translation assistant solution walkthrough
When using Amazon Translate in batch mode, you must:
- Gather together and make translation input data available to the Translation job
- Monitor the processing and retrieval of the output
- Implement improvised processes to integrate your Translation Management System (TMS) with AWS, as needed
As you can see, this can involve many manual steps. You must download huge files, upload them into Amazon Simple Storage Service (S3), and configure jobs. The solution shown in Figure 2 illustrates these automation activities.
Translation automation activities:
- Upload the translation job input data (source files, custom terminology, translation memory files).
- Initiate the preprocessing step. Scan input files and identify language pairs.
- Create an Amazon Simple Queue Service (SQS) message per language pairs and translation project.
- Create S3 buckets and prefixes for each translation job.
- Create an Amazon Translate job.
- Initiate a post-processing workflow, see Figure 3 (AWS Step Functions).
- Copy the Translation output into the output bucket.
- Publish an Amazon SNS notification to inform on job completion status.
- Download translated files back into customer environment.
In this scenario, translators are operating from their company’s internal infrastructure, although their TMS can also be hosted on the cloud. They first collect the translation input data from their TMS and drop the files onto a shared file server. These files can be XLIFF, TMX, or CSV. We use AWS DataSync to orchestrate and initiate the data transfer from on-premises into an Amazon S3 staging bucket. AWS DataSync provides a few advantages:
- A low code solution that manages the upload/download of translation data from/to AWS
- The ability to schedule the synchronization for both upstream and downstream and control the frequency. This allows for batching translation jobs and optimizes usage and cost for Amazon Translate
- A single point of access to translation data, which reduces the need to manage user accounts and grants access to the data
Once the files are uploaded into the input bucket, DataSync generates an event through Amazon EventBridge. This notification invokes an AWS Lambda function that pushes a message into an Amazon SQS queue. The message contains the list of files to be translated in the current batch. SQS decouples the data upload from the actual processing. Using this workflow provides scalability, service quota limit control, and better error handling.
The queue initiates another Lambda function that creates a file hierarchy in S3 for each translation job. File-naming conventions can be used as a key to separate jobs from each other. The function also prepares translation memory and custom terminology when required. Lastly, it creates and submits the translation job.
The post-processing AWS Step Functions workflow
Amazon Translate is able to generate events into EventBridge upon job completion or failure. We use this capability to invoke a post-processing AWS Step Functions workflow. For instance, some customers must flag machine translated segments within an XLIFF file, so their translators can quickly identify them for manual review.
The flow implemented in the state machine does the following (shown in Figure 3):
- Verifies output of Amazon Translate. Checks for completeness, confirms all segments successfully translated
- Enriches the translation data. Flags machine translated segments by comparing input and output
- Copies output to staging bucket. Prepares for final upload
- Sends SNS notifications to alert operators. Notifies that the batch is complete
This solution is entirely serverless, which frees you from maintaining the infrastructure or software platform. You can focus on the core business logic, and what really differentiates you from your competitors.
As the number of translation projects grow overtime, you can also take advantage of Amazon S3 storage classes to optimize document archiving. A translation service provider can define specific rules per customer or per project. These rules can be configured automatically as the data is copied into S3. The result is that files can be transferred to cheaper storage tiers with predefined retention periods.
Conclusion
In this blog, we’ve described a solution that helps you automate the collection and transfer of translation data. It also assists in the scheduling and orchestration of translation jobs. This leads to greater productivity, reduction in cost, and faster time-to-market. Using AWS, you can decrease maintenance, and create a highly scalable and cost-effective solution. Because of the AWS pay-as-you-go model, you can assess the price per project. This information can be used in your pricing model, and be passed along as service options to your own customers.
To get started with Amazon Translate or read more, check out these blogs: