AWS Machine Learning Blog

Store output in custom Amazon S3 bucket and encrypt using AWS KMS for multi-page document processing with Amazon Textract

Amazon Textract is a fully managed machine learning (ML) service that makes it easy to process documents at scale by automatically extracting printed text, handwriting, and other data from virtually any type of document. Amazon Textract goes beyond simple optical character recognition (OCR) to also identify the contents of fields in forms and information stored in tables. This enables businesses across many industries, including financial, medical, legal, and real estate, to easily process large numbers of documents for different business operations. Healthcare providers, for example, can use Amazon Textract to extract patient information from an insurance claim or values from a table in a scanned medical chart without requiring customization or human intervention. The blog post Automatically extract text and structured data from documents with Amazon Textract shows how to use Amazon Textract to automatically extract text and data from scanned documents without any ML experience.

Amazon Textract provides both synchronous and asynchronous API actions to extract document text and analyze the document text data. You can use synchronous APIs for single-page documents and low latency use cases such as mobile capture. Asynchronous APIs can process single-page or multi-page documents such as PDF documents with thousands of pages.

In this post, we show how to control the output location and the AWS Key Management Service (AWS KMS) key used to encrypt the output data when you use the Amazon Textract asynchronous API.

Amazon Textract asynchronous API

Amazon Textract provides asynchronous APIs to extract text and structured data from single-page documents (JPEG, PNG, or PDF) or multi-page documents in PDF format. Processing documents asynchronously allows your application to complete other tasks while it waits for the process to complete. You can use StartDocumentTextDetection and GetDocumentTextDetection to detect lines and words in a document, or use StartDocumentAnalysis and GetDocumentAnalysis to detect lines, words, forms, and table data in a document.

The following diagram shows the workflow of an asynchronous API action. We use AWS Lambda as an example of the compute environment calling Amazon Textract, but the general concept applies to other compute environments as well.

  1. You start by calling the StartDocumentTextDetection or StartDocumentAnalysis API with an Amazon Simple Storage Service (Amazon S3) object location that you want to process, and a few additional parameters.
  2. Amazon Textract gets the document from the S3 bucket and starts a job to process the document.
  3. As the document is processed, Amazon Textract saves and encrypts the inference results internally. When the job completes, it notifies you using an Amazon Simple Notification Service (Amazon SNS) topic.
  4. You can then call the corresponding GetDocumentTextDetection or GetDocumentAnalysis API to get the results in JSON format.
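
The notification in step 3 can be handled in the Lambda function from the diagram. The following is a minimal sketch, assuming the SNS message body is a JSON document with JobId, Status, and API fields (our reading of the Textract completion notification; check the current documentation before relying on it). The function name completed_jobs_from_sns_event is hypothetical.

```python
import json


def completed_jobs_from_sns_event(event):
    """Return the Textract jobs that an SNS-triggered Lambda event reports as succeeded."""
    jobs = []
    for record in event.get("Records", []):
        # SNS delivers the Textract notification as a JSON string in Sns.Message.
        message = json.loads(record["Sns"]["Message"])
        # Assumed fields: JobId, Status, API (for example "StartDocumentAnalysis").
        if message.get("Status") == "SUCCEEDED":
            jobs.append({"job_id": message["JobId"], "api": message.get("API")})
    return jobs
```

Each returned job_id can then be passed to the matching Get operation in step 4. Jobs with any other status are skipped here; a production handler would likely log or alert on them.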

Store and encrypt output of asynchronous API in custom S3 bucket

When you start an Amazon Textract job by calling StartDocumentTextDetection or StartDocumentAnalysis, you can pass the optional OutputConfig parameter to specify the S3 bucket (and, optionally, a prefix) for storing the output. Another optional parameter, KMSKeyId, specifies the AWS KMS customer master key (CMK) used to encrypt the output. The user calling the Start operation must have permission to use the specified CMK.

The following diagram shows the overall workflow when you use the output preference parameter with the Amazon Textract asynchronous API.

  1. You start by calling the StartDocumentTextDetection or StartDocumentAnalysis API with an S3 object location, an output S3 bucket name, an output prefix for the S3 path, a KMS key ID, and a few additional parameters.
  2. Amazon Textract gets the document from the S3 bucket and starts a job to process the document.
  3. As the document is processed, Amazon Textract stores the JSON output at the path in the output bucket and encrypts it using the KMS CMK that was specified in the start call.
  4. You get a job completion notification via Amazon SNS.
  5. You can then call the corresponding GetDocumentTextDetection or GetDocumentAnalysis API to get the JSON result. You can also get the JSON result directly from the output S3 bucket at the path with the following format: s3://{S3Bucket}/{S3Prefix}/{TextractJobId}/*.

Starting the asynchronous job with OutputConfig

The following code shows how you can start the asynchronous API job to analyze a document and store encrypted inference output in a custom S3 bucket:

import boto3
client = boto3.client('textract')
response = client.start_document_analysis(
    DocumentLocation={
        'S3Object': {
            'Bucket': 'string',
            'Name': 'string',
            'Version': 'string'
        }
    },
    ...
    OutputConfig={
        'S3Bucket': 'string',
        'S3Prefix': 'string'
    },
    KMSKeyId='string'
)
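
Because OutputConfig and KMSKeyId are both optional, it can help to assemble the request in one place and include them only when they are set. The following is a minimal sketch; build_start_analysis_request is a hypothetical helper, the bucket names reuse the textract-input and textract-output examples from this post, and the key alias is made up. FeatureTypes is a required parameter of start_document_analysis that the snippet above elides.

```python
def build_start_analysis_request(input_bucket, input_key, feature_types,
                                 output_bucket=None, output_prefix=None,
                                 kms_key_id=None):
    """Assemble keyword arguments for client.start_document_analysis()."""
    request = {
        "DocumentLocation": {
            "S3Object": {"Bucket": input_bucket, "Name": input_key}
        },
        # Valid values include "TABLES" and "FORMS".
        "FeatureTypes": list(feature_types),
    }
    if output_bucket:
        # Store the output in the caller's bucket instead of internally.
        output_config = {"S3Bucket": output_bucket}
        if output_prefix:
            output_config["S3Prefix"] = output_prefix
        request["OutputConfig"] = output_config
    if kms_key_id:
        # Key ID or alias of the CMK used to encrypt the output.
        request["KMSKeyId"] = kms_key_id
    return request


request = build_start_analysis_request(
    "textract-input", "documents/claim.pdf", ["TABLES", "FORMS"],
    output_bucket="textract-output", output_prefix="output",
    kms_key_id="alias/textract-output-key",  # hypothetical alias
)
```

With a Textract client created as shown above, the job would be started with client.start_document_analysis(**request), and the JobId in the response identifies the job in later calls.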

The following code shows how you can get the results of the asynchronous document analysis job:

response = client.get_document_analysis(JobId='string',MaxResults=123,NextToken='string')
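
For multi-page documents the results usually span several responses, so the call is repeated with the NextToken from the previous page. The following is a minimal sketch of that loop; get_all_blocks is a hypothetical helper that takes the Textract client as an argument and, for simplicity, treats any status other than SUCCEEDED as an error.

```python
def get_all_blocks(client, job_id, page_size=1000):
    """Collect the Blocks from every page of a finished analysis job."""
    blocks = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id, "MaxResults": page_size}
        if next_token:
            kwargs["NextToken"] = next_token
        page = client.get_document_analysis(**kwargs)
        status = page.get("JobStatus")
        if status != "SUCCEEDED":
            # IN_PROGRESS, FAILED, and PARTIAL_SUCCESS all end up here.
            raise RuntimeError(f"Job {job_id} is not complete: {status}")
        blocks.extend(page.get("Blocks", []))
        next_token = page.get("NextToken")
        if not next_token:
            # No token means this was the last page.
            return blocks
```

Calling get_all_blocks(client, job_id) after the completion notification arrives avoids polling; if you call it earlier, catch the RuntimeError and retry later.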

You can also use the AWS SDK to download the output directly from your custom S3 bucket.
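
As a minimal sketch of that direct download, the following helper lists the objects under {S3Prefix}/{TextractJobId}/ and parses each one as JSON. load_textract_output is a hypothetical name, the S3 client is passed in by the caller, and the code assumes nothing about how many objects Textract writes per job or how they are named; anything that does not parse as JSON is skipped.

```python
import json


def load_textract_output(s3_client, bucket, prefix, job_id):
    """Read every JSON object stored under {prefix}/{job_id}/ in the bucket."""
    job_prefix = f"{prefix.rstrip('/')}/{job_id}/"
    results = {}
    token = None
    while True:
        kwargs = {"Bucket": bucket, "Prefix": job_prefix}
        if token:
            kwargs["ContinuationToken"] = token
        listing = s3_client.list_objects_v2(**kwargs)
        for item in listing.get("Contents", []):
            body = s3_client.get_object(Bucket=bucket, Key=item["Key"])["Body"].read()
            try:
                results[item["Key"]] = json.loads(body)
            except ValueError:
                # Skip empty or non-JSON objects, if any are present.
                continue
        if not listing.get("IsTruncated"):
            return results
        token = listing["NextContinuationToken"]
```

When the output is encrypted with a customer managed CMK, the caller also needs kms:Decrypt on that key; Amazon S3 then decrypts the objects transparently on read.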

The following table shows how the Amazon Textract output is stored and encrypted based on the provided input parameters of OutputConfig and KMSKeyId.

OutputConfig | KMSKeyId | Amazon Textract Output
None | None | Amazon Textract output is stored internally by Amazon Textract and encrypted using AWS owned CMK.
Customer’s S3 bucket | None | Amazon Textract output is stored in customer’s S3 bucket and encrypted using SSE-S3.
None | Customer managed CMK | Amazon Textract output is stored internally by Amazon Textract and encrypted using Customer managed CMK.
Customer’s S3 bucket | Customer managed CMK | Amazon Textract output is stored in customer’s S3 bucket and encrypted using Customer managed CMK.

IAM permissions

When you use the Amazon Textract APIs to start an analysis or detection job, you must have access to the S3 object specified in your call. To take advantage of output preferences to write the output to an encrypted object in Amazon S3, you must have the necessary permissions for both the target S3 bucket and the CMK specified when you call the analysis or detection APIs.

The following example AWS Identity and Access Management (IAM) identity policy allows you to get objects from the textract-input S3 bucket with a prefix:

{
"Sid":"AllowTextractUserToReadInputData",
"Action":["s3:GetObject"],
"Effect":"Allow",
"Resource":["arn:aws:s3:::textract-input/documents/*"]
}

The following IAM identity policy allows you to write output to the textract-output S3 bucket with a prefix (output/ in this example):

{
"Sid":"AllowTextractUserToWriteOutputData",
"Action":["s3:PutObject"],
"Effect":"Allow",
"Resource":["arn:aws:s3:::textract-output/output/*"]
}
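
Statements like these are fragments; to attach them to a user or role, they need to sit inside a complete policy document. The following is a minimal sketch that generates one, assuming read access (s3:GetObject) on the input prefix and write access (s3:PutObject) on an output prefix. build_textract_identity_policy is a hypothetical helper, and the output/ prefix is an example value.

```python
import json


def build_textract_identity_policy(input_resource, output_resource):
    """Combine read-input and write-output statements into one IAM policy document."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowTextractUserToReadInputData",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [input_resource],
            },
            {
                "Sid": "AllowTextractUserToWriteOutputData",
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": [output_resource],
            },
        ],
    }


policy = build_textract_identity_policy(
    "arn:aws:s3:::textract-input/documents/*",
    "arn:aws:s3:::textract-output/output/*",  # example output prefix
)
print(json.dumps(policy, indent=2))
```

Generating the document from the same bucket and prefix values that you pass to the Start call is one way to keep the policy and OutputConfig from drifting apart.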

When placing objects into Amazon S3 using SSE-KMS, you need specific permissions on the CMK. The following key policy statement allows a user (textract-start) to use the CMK to protect the output files from an Amazon Textract analysis or detection job:

{
  "Sid": "Allow use of the key to write Textract output to S3",
  "Effect": "Allow",
  "Principal": {"AWS":"arn:aws:iam::111122223333:user/textract-start"},
  "Action": ["kms:DescribeKey", "kms:GenerateDataKey", "kms:ReEncrypt*", "kms:Decrypt"],
  "Resource": "*"
}

The following KMS key policy statement allows a user (textract-get) to get the output file that’s backed by SSE-KMS:

{
"Sid": "Allow use of the key to read S3 objects for output",
"Effect": "Allow",
"Principal": {"AWS": "arn:aws:iam::111122223333:user/textract-get"},
"Action": ["kms:Decrypt","kms:DescribeKey"],
"Resource": "*"
}

You must still have separate sections of the key policy to allow the management of the key.
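
To illustrate how the pieces fit together, the following sketch assembles a complete key policy from the two usage statements plus a key management statement. The management statement shown here follows the AWS default key policy, which delegates access control to IAM in the owning account; that is only one option, and a tighter administrator statement may suit your environment better. build_key_policy is a hypothetical helper.

```python
import json


def build_key_policy(account_id, writer_user, reader_user):
    """Assemble a key policy with management, write, and read statements."""
    def user_arn(name):
        return f"arn:aws:iam::{account_id}:user/{name}"

    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Management access, modeled on the default key policy.
                "Sid": "Enable IAM User Permissions",
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
                "Action": "kms:*",
                "Resource": "*",
            },
            {
                "Sid": "Allow use of the key to write Textract output to S3",
                "Effect": "Allow",
                "Principal": {"AWS": user_arn(writer_user)},
                # ReEncrypt* covers kms:ReEncryptFrom and kms:ReEncryptTo.
                "Action": ["kms:DescribeKey", "kms:GenerateDataKey",
                           "kms:ReEncrypt*", "kms:Decrypt"],
                "Resource": "*",
            },
            {
                "Sid": "Allow use of the key to read S3 objects for output",
                "Effect": "Allow",
                "Principal": {"AWS": user_arn(reader_user)},
                "Action": ["kms:Decrypt", "kms:DescribeKey"],
                "Resource": "*",
            },
        ],
    }


key_policy = build_key_policy("111122223333", "textract-start", "textract-get")
print(json.dumps(key_policy, indent=2))
```

The resulting JSON can be supplied when the key is created or applied later, for example with the AWS KMS PutKeyPolicy operation.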

For some workloads, you may need to provide a record of actions taken by a user, role, or an AWS service in Amazon Textract. Amazon Textract is integrated with AWS CloudTrail, which captures all API calls for Amazon Textract as events. For more information, see Logging Amazon Textract API Calls with AWS CloudTrail.

AWS KMS and Amazon S3 provide similar integration with CloudTrail. For more information, see Logging AWS KMS API calls with AWS CloudTrail and Logging Amazon S3 API calls using AWS CloudTrail, respectively. To get log visibility into Amazon S3 GETs and PUTs, you can enable CloudTrail data events for Amazon S3. This gives you end-to-end visibility into your document-processing lifecycle.

Conclusion

In this post, we showed you how to use the Amazon Textract asynchronous API with your own S3 bucket and AWS KMS CMK to store and encrypt Amazon Textract output. We also highlighted how you can use CloudTrail integration to get visibility into your overall document processing lifecycle.

For more information about different security controls in Amazon Textract, see Security in Amazon Textract.

About the Authors

Kashif Imran is a Principal Solutions Architect at Amazon Web Services. He works with some of the largest AWS customers who are taking advantage of AI/ML to solve complex business problems. He provides technical guidance and design advice to implement computer vision applications at scale. His expertise spans application architecture, serverless, containers, NoSQL and machine learning.

Peter M. O’Donnell is an AWS Principal Solutions Architect, specializing in security, risk, and compliance with the Strategic Accounts team. Formerly dedicated to a major US commercial bank customer, Peter now supports some of AWS’s largest and most complex strategic customers in security and security-related topics, including data protection, cryptography, incident response, and CISO engagement.