AWS Machine Learning Blog

Translating PDF documents using Amazon Translate and Amazon Textract

In 1993, the Portable Document Format (PDF) was released to the world. Since then, companies across various industries have been creating, scanning, and storing large volumes of documents in this digital format. These documents and the content within them are vital to supporting your business. Yet in many cases, the content is text-heavy and often written in a different language. This limits the flow of information and can directly influence your organization’s business productivity and global expansion strategy. To address this, you need an automated solution to extract the contents within these PDFs and translate them quickly and cost-efficiently.

In this post, we show you how to create an automated and serverless content-processing pipeline for analyzing text in PDF documents using Amazon Textract and translating them with Amazon Translate.

Amazon Textract automatically extracts text and data from scanned documents. Amazon Textract goes beyond simple OCR to also identify the contents of fields in forms and information stored in tables. This allows Amazon Textract to read virtually any type of document and accurately extract text and data without needing any manual effort or custom code.

Once the text and data are extracted, you can translate them with Amazon Translate, a neural machine translation service that delivers fast, high-quality, and affordable language translation. Neural machine translation is a form of language translation automation that uses deep learning models to deliver more accurate and natural-sounding translation than traditional statistical and rule-based translation algorithms. The translation service is trained on a wide variety of content across different use cases and domains to perform well on many kinds of content. Its asynchronous batch processing capability enables you to translate a large collection of text or HTML documents with a single API call.

Solution overview

To be scalable and cost-effective, this solution uses serverless technologies and managed services. In addition to Amazon Textract and Amazon Translate, the solution uses the following services:

  • Amazon Simple Storage Service (Amazon S3) – Stores your documents and allows for central management with fine-tuned access controls.
  • Amazon Simple Notification Service (Amazon SNS) – Enables you to decouple microservices, distributed systems, and serverless applications with a highly available, durable, secure, fully managed pub/sub messaging service.
  • AWS Lambda – Runs code in response to triggers such as changes in data, changes in application state, or user actions. Because services like Amazon S3 and Amazon SNS can directly trigger a Lambda function, you can build a variety of real-time serverless data-processing systems.
  • AWS Step Functions – Coordinates multiple AWS services into serverless workflows.

Solution architecture

The architecture workflow contains the following steps:

  1. Users upload a PDF for analysis to Amazon S3.
  2. The Amazon S3 upload triggers a Lambda function.
  3. The function invokes Amazon Textract to extract text from the PDF in batch mode.
  4. Amazon Textract sends an SNS notification when the job is complete.
  5. A Lambda function reads the Amazon Textract response and stores the extracted text in Amazon S3.
  6. The Lambda function from the previous step invokes Amazon Translate in batch mode to translate the extracted texts into the target language.
  7. The Step Functions-based job poller polls for the translation job to complete.
  8. Step Functions sends an SNS notification when the translation is complete.
  9. A Lambda function reads the translated texts in Amazon S3 and generates a translated document in Amazon S3.

The following diagram illustrates this architecture.

Architecture diagram showing how uploading a PDF document to an S3 bucket triggers text extraction with Amazon Textract, followed by translation with Amazon Translate.
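The job poller in step 7 is built on Step Functions in this solution. As a rough illustration of the polling logic only, the following Python sketch checks the translation job status until it reaches a terminal state. The function name, the delay, and the attempt limit are assumptions made for this example, and the client is passed in so the logic can be exercised without AWS credentials.

```python
import time

# Job states after which Amazon Translate stops working on a batch job.
TERMINAL_STATUSES = {"COMPLETED", "COMPLETED_WITH_ERROR", "FAILED", "STOPPED"}


def wait_for_translation_job(translate_client, job_id, delay_seconds=60,
                             max_attempts=60, sleep=time.sleep):
    """Poll a batch translation job until it finishes (hypothetical helper).

    translate_client is expected to behave like boto3.client("translate").
    Returns the final job status, or raises TimeoutError if the job is
    still running after max_attempts checks.
    """
    for _ in range(max_attempts):
        response = translate_client.describe_text_translation_job(JobId=job_id)
        status = response["TextTranslationJobProperties"]["JobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(delay_seconds)  # wait before asking again
    raise TimeoutError(f"Translation job {job_id} did not finish in time")
```

In a Step Functions workflow, the same loop is typically expressed as a Wait state followed by a Choice state, which avoids keeping a Lambda function running while the job is in progress.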

For processing documents at scale, you can expand this solution to include Amazon Simple Queue Service (Amazon SQS) to queue the jobs and handle any potential failure related to throttling and default service concurrency limits. For more information about the limits in Amazon Translate and Amazon Textract, see Guidelines and Limits and Limits in Amazon Textract, respectively.

Deploying the solution with AWS CloudFormation

Prerequisite

The first step is to use an AWS CloudFormation template to provision the resources needed for the solution, including the AWS Identity and Access Management (IAM) roles, IAM policies, and SNS topics.

  1. Launch the AWS CloudFormation template by choosing the following (this creates the stack in the us-east-1 Region):
  2. For Stack name, enter a unique stack name for this account; for example, document-translate.
  3. For TargetLanguageCode, enter the language code that you want your translated documents in; for example, es for Spanish.

For more information about supported languages, see Supported Languages and Language Codes.

  4. In the Capabilities and transforms section, select the check boxes to acknowledge that AWS CloudFormation will create IAM resources and transform the AWS Serverless Application Model (AWS SAM) template.

AWS SAM templates simplify the definition of resources needed for serverless applications. When deploying AWS SAM templates in AWS CloudFormation, AWS CloudFormation performs a transform to convert the AWS SAM template into a CloudFormation template. For more information, see Transform.

  5. Choose Create stack.

Screenshot showing the CloudFormation launch page with example values for the stack name and input parameters

The stack creation may take up to 20 minutes, after which the status changes to CREATE_COMPLETE. You can see the name of the newly created S3 bucket on the Outputs tab.
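If you prefer to script the deployment instead of using the console, the console choices above map onto a CreateStack request. The following is a minimal sketch that only builds the request parameters. The helper name is hypothetical, the template URL is not given in this post and must be supplied by you, and the capability list assumes the template needs only CAPABILITY_IAM plus CAPABILITY_AUTO_EXPAND for the AWS SAM transform.

```python
def build_create_stack_request(stack_name, template_url, target_language_code):
    """Build keyword arguments for CloudFormation CreateStack (hypothetical helper).

    template_url is a placeholder argument; this post does not state where
    the template is hosted.
    """
    return {
        "StackName": stack_name,
        "TemplateURL": template_url,
        "Parameters": [
            {
                "ParameterKey": "TargetLanguageCode",
                "ParameterValue": target_language_code,
            },
        ],
        # Assumption: acknowledges IAM resource creation and the AWS SAM
        # transform, mirroring the two check boxes in the console.
        "Capabilities": ["CAPABILITY_IAM", "CAPABILITY_AUTO_EXPAND"],
    }
```

With boto3 installed and credentials configured, the request could be submitted with boto3.client("cloudformation", region_name="us-east-1").create_stack(**request). If the template creates IAM resources with custom names, CAPABILITY_NAMED_IAM would be required instead of CAPABILITY_IAM.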

Translating the document

To translate your document, upload a document in English to the input folder of the S3 bucket you created in the previous step.

screenshot of S3 bucket uploaded with the document for translation

For this post, we scanned the “Universal Declaration of Human Rights,” created by the United Nations.

screenshot of a sample scanned document (UN Declaration of Human Rights) in English

This upload event triggers the Lambda function <Stack name>-S3EventProcessor-<Random string>, which invokes the Amazon Textract startDocumentTextDetection API to extract the text from the scanned document.
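The solution's own function is not reproduced in this post. As an illustration of the call it describes, the following Python sketch reads the bucket and object key from an S3 event and starts an asynchronous text detection job. The function name and the sns_topic_arn and role_arn parameters are assumptions for this example; the client is passed in and is expected to behave like boto3.client("textract").

```python
from urllib.parse import unquote_plus


def start_text_detection(textract_client, s3_event, sns_topic_arn, role_arn):
    """Start an asynchronous Textract job for the uploaded object.

    Hypothetical helper: returns the JobId reported by Amazon Textract.
    """
    record = s3_event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    # Object keys in S3 event notifications are URL-encoded.
    key = unquote_plus(record["s3"]["object"]["key"])

    response = textract_client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        # Textract publishes the job completion status to this SNS topic.
        NotificationChannel={"SNSTopicArn": sns_topic_arn, "RoleArn": role_arn},
    )
    return response["JobId"]
```

The role passed in NotificationChannel must allow Amazon Textract to publish to the SNS topic; in this solution, that role and topic are among the resources the CloudFormation stack provisions.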

When Amazon Textract completes the batch job, it sends an SNS notification. The notification triggers the Lambda function <Stack name>-TextractSNSEventProcessor-<Random string>, which processes the Amazon Textract response page by page, extracts the LINE block elements, and stores them in the S3 bucket.
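To make this step concrete, the following Python sketch reads the job ID from the SNS notification and then pages through the Textract results, keeping only the text of LINE blocks grouped by page number. Both function names are hypothetical, and the client is passed in and is expected to behave like boto3.client("textract"). Writing the lines to Amazon S3 is left out.

```python
import json
from collections import defaultdict


def job_id_from_sns_event(sns_event):
    """Return the Textract JobId from an SNS-triggered Lambda event."""
    message = json.loads(sns_event["Records"][0]["Sns"]["Message"])
    if message.get("Status") != "SUCCEEDED":
        raise RuntimeError(f"Textract job did not succeed: {message.get('Status')}")
    return message["JobId"]


def collect_lines_by_page(textract_client, job_id):
    """Fetch all result pages and return {page_number: [line_text, ...]}."""
    lines = defaultdict(list)
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        response = textract_client.get_document_text_detection(**kwargs)
        for block in response.get("Blocks", []):
            # PAGE and WORD blocks are skipped; only LINE text is kept.
            if block["BlockType"] == "LINE":
                lines[block.get("Page", 1)].append(block.get("Text", ""))
        next_token = response.get("NextToken")
        if not next_token:
            return dict(lines)
```

Results for large documents are returned in several responses, so the loop follows NextToken until no further token is returned.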

Amazon Textract extracts LINE block elements with a BoundingBox. A sentence in the scanned document results in multiple LINE block elements. To make sure that Amazon Translate has the entire sentence in scope for translation, the solution combines multiple LINE block elements to recreate the sentence boundary in the source document. This is done by using the BreakIterator class available for Java. For more information, see Class BreakIterator.
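The solution itself relies on Java's BreakIterator for this. The following Python sketch is a simplified stand-in rather than an equivalent: it joins the LINE texts and splits on whitespace that follows a period, question mark, or exclamation mark. The function name is hypothetical.

```python
import re

# Whitespace that directly follows sentence-ending punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def lines_to_sentences(lines):
    """Merge Textract LINE texts and split them back into sentences.

    Simplified stand-in for a locale-aware sentence iterator: it will
    wrongly split after abbreviations such as "Dr." or "No.".
    """
    text = " ".join(line.strip() for line in lines if line.strip())
    return [sentence for sentence in _SENTENCE_END.split(text) if sentence]


lines = [
    "All human beings are born free and equal in dignity",
    "and rights. They are endowed with reason and",
    "conscience.",
]
for sentence in lines_to_sentences(lines):
    print(sentence)
# All human beings are born free and equal in dignity and rights.
# They are endowed with reason and conscience.
```

A locale-aware sentence iterator such as BreakIterator handles more boundary cases than this punctuation rule, which is why a regular expression like this one is only suitable for illustration.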

The sentences are then stored in the S3 bucket as individual objects. Finally, the Amazon Translate job startTextTranslationJob is invoked with the input S3 bucket location where the text to be translated is available.
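As an illustration of what the batch translation request looks like, the following Python sketch builds the parameters for the StartTextTranslationJob operation. The helper name, the folder prefixes, and the default source language of English are assumptions for this example; the actual prefixes used by the solution are not stated in this post.

```python
def build_translation_job_request(job_name, bucket, input_prefix, output_prefix,
                                  role_arn, target_language_code,
                                  source_language_code="en"):
    """Build keyword arguments for StartTextTranslationJob (hypothetical helper)."""
    return {
        "JobName": job_name,
        "InputDataConfig": {
            # Folder that holds the extracted sentences as text objects.
            "S3Uri": f"s3://{bucket}/{input_prefix}",
            "ContentType": "text/plain",
        },
        "OutputDataConfig": {"S3Uri": f"s3://{bucket}/{output_prefix}"},
        # Role that lets Amazon Translate read the input and write the output.
        "DataAccessRoleArn": role_arn,
        "SourceLanguageCode": source_language_code,
        "TargetLanguageCodes": [target_language_code],
    }
```

With boto3, the request could be submitted with boto3.client("translate").start_text_translation_job(**request), which returns a JobId that a poller can then track.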

The Amazon Translate job completion SNS notification from the job poller triggers the Lambda function <Stack name>-TranslateJobSNSEventProcessor-<Random string>. The function creates the editable document by combining the translated texts created by the Amazon Translate batch job in the output folder of the S3 bucket with the following naming convention: inputFileName-TargetLanguageCode.docx.
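The naming convention can be sketched as a small helper. This is an assumption-laden example: it treats inputFileName as the uploaded file name without its extension and assumes the output folder is named output, neither of which this post states explicitly.

```python
from pathlib import PurePosixPath


def translated_document_key(input_key, target_language_code, output_folder="output"):
    """Derive the S3 key of the translated .docx file (hypothetical helper)."""
    # Assumption: the original extension (for example .pdf) is dropped.
    stem = PurePosixPath(input_key).stem
    return f"{output_folder}/{stem}-{target_language_code}.docx"


print(translated_document_key("input/udhr.pdf", "es"))  # output/udhr-es.docx
```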

screenshot of S3 bucket output folder where translated document is located.

The following screenshot shows the document translated in Spanish.

screenshot of an input sample scanned document (UN Declaration of Human Rights) translated from English to Spanish as output

The solution also supports translating documents for right-to-left (RTL) script languages such as Arabic and Hebrew. The following screenshot shows the translated document in Arabic (language code: ar).

screenshot of an input sample scanned document (UN Declaration of Human Rights) translated from English to Arabic as output

For any pipeline failure, check the Amazon CloudWatch logs for the corresponding Lambda function and look for potential errors that caused the failure.

To translate into a different language, update the LANG_CODE environment variable for the <Stack name>-TextractSNSEventProcessor-<Random string> function, and then trigger the solution pipeline by uploading a new document into the input folder of the S3 bucket.
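The environment variable can be changed in the Lambda console or programmatically. The following Python sketch shows one way to do it; the helper name is hypothetical, and the client is passed in and is expected to behave like boto3.client("lambda"). It reads the existing variables first because updating the function configuration replaces the whole set of environment variables.

```python
def set_target_language(lambda_client, function_name, language_code):
    """Set LANG_CODE on a Lambda function while keeping its other variables."""
    config = lambda_client.get_function_configuration(FunctionName=function_name)
    # Copy the current variables; the update call replaces them all.
    variables = dict(config.get("Environment", {}).get("Variables", {}))
    variables["LANG_CODE"] = language_code
    lambda_client.update_function_configuration(
        FunctionName=function_name,
        Environment={"Variables": variables},
    )
    return variables
```

The function_name argument is the full generated name of the function as it appears in the Lambda console for your stack.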

Conclusion

In this post, we demonstrated how to extract text from PDF documents and translate them into an editable document in a different language using Amazon Translate asynchronous batch processing. For a low-latency, low-throughput solution translating smaller PDF documents, you can perform the translation through the real-time Amazon Translate API.
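For the real-time alternative, a minimal Python sketch might translate the extracted sentences one at a time with the TranslateText operation. The function name and the default source language are assumptions for this example, and the client is passed in and is expected to behave like boto3.client("translate").

```python
def translate_sentences(translate_client, sentences, target_language_code,
                        source_language_code="en"):
    """Translate sentences one by one with the real-time API (hypothetical helper)."""
    translated = []
    for sentence in sentences:
        response = translate_client.translate_text(
            Text=sentence,
            SourceLanguageCode=source_language_code,
            TargetLanguageCode=target_language_code,
        )
        translated.append(response["TranslatedText"])
    return translated
```

Each call is synchronous and subject to a per-request text size limit, so this approach suits smaller documents; for large volumes, the batch job described in this post is likely the better fit.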

The ability to process data at scale is becoming important to organizations across all industries. Managed machine learning services like Amazon Textract and Amazon Translate can simplify your document processing and translation needs, helping you focus on addressing core business needs while keeping overall IT costs manageable.

About the Authors

Siva Rajamani is a Boston-based Enterprise Solutions Architect for AWS. He enjoys working closely with customers and supporting their digital transformation and AWS adoption journey. His core areas of focus are Serverless, Application Integration, and Security. Outside of work, he enjoys outdoor activities and watching documentaries.

Sudhanshu Malhotra is a Boston-based Enterprise Solutions Architect for AWS. He’s a technology enthusiast who enjoys helping customers find innovative solutions to complex business challenges. His core areas of focus are DevOps, Machine Learning, and Security. When he’s not working with customers on their journey to the cloud, he enjoys reading, hiking, and exploring new cuisines.