AWS Machine Learning Blog
Translating documents, spreadsheets, and presentations in Office Open XML format using Amazon Translate
Now you can translate .docx, .xlsx, and .pptx documents using Amazon Translate.
Every organization creates documents, spreadsheets, and presentations to communicate and share information with a large group and keep records for posterity. These days, we interact with people who don’t share the same language as ours. The need for translating such documents has become even more critical in a globally interconnected world. Some large organizations hire a team of professional translators to help with document translation, which involves a lot of time and overhead cost. Multiple tools are available online that enable you to copy and paste text to get the translated equivalent in the language of your choice, but there are few secure and easy methods that allow for native support of translating such documents while keeping formatting intact.
Amazon Translate now supports translation of Office Open XML documents in DOCX, PPTX, and XLSX format. Amazon Translate is a fully managed neural machine translation service that delivers high-quality and affordable language translation in 55 languages. For the full list of languages, see Supported Languages and Language Codes. The document translation feature is available wherever batch translation is available. For more information, see Asynchronous Batch Processing.
In this post, we walk you through a step-by-step process to translate documents on the AWS Management Console. You can also run batch document translation through the Amazon Translate StartTextTranslationJob API, using the AWS Command Line Interface (AWS CLI) or the AWS SDK.
Solution overview
This post walks you through the following steps:
- Create an AWS Identity and Access Management (IAM) role that can access your Amazon Simple Storage Service (Amazon S3) buckets.
- Sort your documents by file type and language.
- Perform the batch translation.
Creating an IAM role to access your S3 buckets
In this post, we create a role that has access to all the S3 buckets in your account to translate documents, spreadsheets, and presentations. You provide this role to Amazon Translate to let the service access your input and output S3 locations. For more information, see AWS Identity and Access Management Documentation.
- Sign in to your personal AWS account.
- On the IAM console, under Access management, choose Roles.
- Choose Create role.
- Choose Another AWS account.
- For Account ID, enter your ID.
- Go to the next page.
- For Filter policies, search for and add the AmazonS3FullAccess policy.
- Go to the next page.
- Enter a name for the role, for example, TranslateBatchAPI.
- Go to the role you just created.
- On the Trust relationships tab, choose Edit trust relationship.
- Enter the following service principals:
For example, see the following screenshot.
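The service principals themselves are not reproduced in the text above. As a minimal sketch, assuming the role only needs to be assumable by Amazon Translate (whose service principal is translate.amazonaws.com), the trust relationship could be built as follows. The function name build_trust_policy is illustrative, and the original post may list additional principals.

```python
import json

# Service principal for Amazon Translate. Assumption: this is the only
# principal the role needs to trust; the original post may list others.
TRANSLATE_PRINCIPAL = "translate.amazonaws.com"


def build_trust_policy(service_principal: str = TRANSLATE_PRINCIPAL) -> dict:
    """Return a trust policy that lets the given service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": service_principal},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# JSON text that could be pasted into the Edit trust relationship editor.
trust_policy_json = json.dumps(build_trust_policy(), indent=2)
print(trust_policy_json)
```

The printed JSON is what you would paste into the trust relationship editor for the TranslateBatchAPI role.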
Sorting your documents
Amazon Translate batch translation works on documents stored in a folder inside an S3 bucket. It doesn’t process files saved in the root of the bucket, and it doesn’t process files in nested subfolders. You therefore need to upload the documents you want to translate to a folder inside an S3 bucket. Each batch job takes a single file format and a single source language, so sort the documents such that each Amazon S3 prefix contains only one file type (DOCX, PPTX, or XLSX) written in one language.
- On the Amazon S3 console, choose Create bucket.
- Walk through the steps to create your buckets.
For this post, we create two buckets: input-translate-bucket and output-translate-bucket.
The buckets contain the following folders for each file type:
docx
pptx
xlsx
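The sorting rule can be sketched in code. The following is a minimal example, assuming all files share one source language and that the folder names match the ones listed above. The helper plan_s3_keys is hypothetical; it only computes the S3 keys and does not upload anything.

```python
from pathlib import PurePosixPath

# File formats covered in this post; each gets its own S3 prefix.
SUPPORTED_EXTENSIONS = {"docx", "pptx", "xlsx"}


def plan_s3_keys(file_names):
    """Group files by format and return (plan, skipped).

    plan maps each format to the S3 keys its files should be uploaded to;
    skipped lists files whose extension is not a supported format.
    """
    plan = {ext: [] for ext in sorted(SUPPORTED_EXTENSIONS)}
    skipped = []
    for name in file_names:
        path = PurePosixPath(name)
        ext = path.suffix.lower().lstrip(".")
        if ext in plan:
            # Keep only the base name so no file lands in a nested folder.
            plan[ext].append(f"{ext}/{path.name}")
        else:
            skipped.append(name)
    return plan, skipped


plan, skipped = plan_s3_keys(
    ["report.docx", "slides/deck.PPTX", "budget.xlsx", "notes.txt"]
)
print(plan)
print(skipped)
```

Each key in the plan sits directly under its format folder, so every prefix holds a single file type and nothing is nested.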
Performing batch translation
To implement your batch translation, complete the following steps:
- On the Amazon Translate console, choose Batch Translation.
- Choose Create job.
For this post, we walk you through translating documents in DOCX format.
- For Name, enter BatchTranslation.
- For Source language, choose En.
- For Target language, choose Es.
- For Input S3 location, enter s3://input-translate-bucket/docx/.
- For File format, choose docx.
- For Output S3 location, enter s3://output-translate-bucket/.
- For Access permissions, select Use an existing IAM role.
- For IAM role, enter TranslateBatchAPI.
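The same job can be described through the AWS SDK. The following sketch, assuming the AWS SDK for Python (boto3), maps the console settings above onto the parameters of start_text_translation_job. The account ID 123456789012 is a placeholder, and the helper names build_job_request and start_job are illustrative.

```python
# Content types that Amazon Translate expects for Office Open XML input.
CONTENT_TYPES = {
    "docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}


def build_job_request(file_format, account_id, source="en", target="es"):
    """Build keyword arguments for start_text_translation_job."""
    return {
        "JobName": "BatchTranslation",
        "InputDataConfig": {
            "S3Uri": f"s3://input-translate-bucket/{file_format}/",
            "ContentType": CONTENT_TYPES[file_format],
        },
        "OutputDataConfig": {"S3Uri": "s3://output-translate-bucket/"},
        "DataAccessRoleArn": f"arn:aws:iam::{account_id}:role/TranslateBatchAPI",
        "SourceLanguageCode": source,
        "TargetLanguageCodes": [target],
    }


def start_job(request):
    """Submit the job and return its ID (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the request can be built without the SDK

    client = boto3.client("translate")
    response = client.start_text_translation_job(**request)
    return response["JobId"]


# Placeholder account ID; replace it with your own before calling start_job.
request = build_job_request("docx", account_id="123456789012")
print(request)
```

Calling start_job(request) submits the job. To translate presentations or spreadsheets, pass "pptx" or "xlsx" instead, which changes both the input prefix and the content type.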
Because this is an asynchronous translation, the translation begins after the machine resource for the translation is allocated. This can take up to 15 minutes. For more information about performing batch translation jobs, see Starting a Batch Translation Job.
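Because the job runs asynchronously, a script typically polls for its status. The following is a minimal sketch, assuming a boto3 Translate client and its describe_text_translation_job call. The wait_for_job helper and the stub client used for the demonstration are hypothetical, and the polling interval is an arbitrary choice.

```python
import time

# Job statuses after which the job no longer changes state.
TERMINAL_STATUSES = {"COMPLETED", "COMPLETED_WITH_ERROR", "FAILED", "STOPPED"}


def wait_for_job(client, job_id, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Poll the job until it reaches a terminal status and return that status."""
    for _ in range(max_polls):
        response = client.describe_text_translation_job(JobId=job_id)
        status = response["TextTranslationJobProperties"]["JobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls")


class StubTranslateClient:
    """Stand-in for boto3.client("translate") so the sketch runs offline."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def describe_text_translation_job(self, JobId):
        return {
            "TextTranslationJobProperties": {
                "JobId": JobId,
                "JobStatus": next(self._statuses),
            }
        }


stub = StubTranslateClient(["SUBMITTED", "IN_PROGRESS", "COMPLETED"])
final_status = wait_for_job(stub, "example-job-id", sleep=lambda seconds: None)
print(final_status)
```

With a real client, pass boto3.client("translate") and the job ID returned when the job was started, and keep the default sleep.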
The following screenshot shows the details of your BatchTranslation job.
When the translation is complete, you can find the output in a folder in your S3 bucket. See the following screenshot.
Conclusion
In this post, we discussed implementing asynchronous batch translation to translate documents in DOCX format. You can repeat the same procedure for spreadsheets and presentations. The translation is simple and you pay only for the number of characters (including spaces) you translate in each format. You can start translating office documents today in all Regions where batch translation is supported. If you’re new to Amazon Translate, try out the Free Tier, which offers 2 million characters per month for the first 12 months, starting from your first translation request.
About the Author
Watson G. Srivathsan is the Sr. Product Manager for Amazon Translate, AWS’s natural language processing service. On weekends you will find him exploring the outdoors in the Pacific Northwest.