AWS Machine Learning Blog

Using Amazon Textract with AWS PrivateLink

Amazon Textract now supports Amazon Virtual Private Cloud (Amazon VPC) endpoints via AWS PrivateLink so you can securely initiate API calls to Amazon Textract from within your VPC and avoid using the public internet.

In this post, we show you how to access Amazon Textract APIs from within your VPC without traversing the public internet, and how to use VPC endpoint policies to restrict access to Amazon Textract.

Amazon Textract is a fully managed machine learning (ML) service that automatically extracts text and data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables.

You can use AWS PrivateLink to access Amazon Textract securely by keeping your network traffic within the AWS network, while simplifying your internal network architecture. It enables you to privately access Amazon Textract APIs from your VPC in a scalable manner by using interface VPC endpoints. An interface VPC endpoint is an elastic network interface in your subnet with a private IP address that serves as the entry point for all Amazon Textract API calls. It enables you to privately connect your VPC to supported AWS services and VPC endpoint services powered by AWS PrivateLink without requiring an internet gateway, NAT device, VPN connection, or AWS Direct Connect connection. Instances in your VPC don’t require public IP addresses to communicate with resources in the service, and traffic between your VPC and the service doesn’t leave the AWS network.

The following diagram illustrates the solution architecture.

Prerequisites

To get started, you need to have a VPC set up in the AWS Region of your choice. For instructions, see Getting started with Amazon VPC. In this post, we use the us-east-2 Region. You should also have an AWS account with sufficient access to create resources in the following services:

  • Amazon Textract
  • AWS PrivateLink

Solution overview

The walkthrough includes the following high-level steps:

  1. Create VPC endpoints.
  2. Use Amazon Textract via AWS PrivateLink.

Creating VPC endpoints

To create a VPC endpoint, complete the following steps. We use the us-east-2 Region in this post, so the console and URLs may differ depending on the Region you choose.

  1. On the Amazon VPC console, choose Endpoints.
  2. Choose Create Endpoint.
  3. For Service category, select AWS services.
  4. For Service Name, choose com.amazonaws.us-east-2.textract or com.amazonaws.us-east-2.textract-fips.
  5. For VPC, enter the VPC you want to use.
  6. For Availability Zone, select your preferred Availability Zones.
  7. For Enable DNS name, select Enable for this endpoint.

This creates a private hosted zone that is associated with your VPC. As a result, the Amazon Textract DNS hostname that the AWS Command Line Interface (AWS CLI) and Amazon Textract SDKs use by default (https://textract.Region.amazonaws.com) resolves to your VPC endpoint instead of the public endpoint.

  8. For Security group, choose the security group to associate with the endpoint network interface.

If you don’t specify a security group, the default security group for your VPC is associated.

  9. Choose Create Endpoint.

When the Status changes to available, your VPC endpoint is ready for use.

  10. Choose the Policy tab to apply more restrictive access control to the VPC endpoint.

The following example policy limits VPC endpoint access to only the DetectDocumentText API. An IAM principal, even with access to all Textract APIs, can still only access the specific API in the following policy using this VPC endpoint. This is an additional layer of access control applied at the VPC endpoint. You should apply the principle of least privilege when defining your own policy. For more information, see Controlling access to services with VPC endpoints.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "textract:DetectDocumentText"
            ],
            "Resource": [
                "*"
            ],
            "Effect": "Allow",
            "Principal": "*"
        }
    ]
}
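If you prefer to script the console steps above, the following is a minimal AWS CLI sketch rather than a definitive procedure. It assumes the policy shown above is saved locally in a file (named policy.json in the usage comment, a hypothetical name), and the function names and all resource IDs are placeholders introduced here for illustration. The functions only wrap `aws ec2 create-vpc-endpoint`; nothing runs until you call them.

```shell
# Build the PrivateLink service name for Amazon Textract in a Region.
# $1 = Region (for example, us-east-2), $2 = optional literal "fips"
textract_service_name() {
  if [ "${2:-}" = "fips" ]; then
    printf 'com.amazonaws.%s.textract-fips\n' "$1"
  else
    printf 'com.amazonaws.%s.textract\n' "$1"
  fi
}

# Create an interface VPC endpoint with private DNS enabled and an
# endpoint policy attached at creation time.
# $1 = Region, $2 = VPC ID, $3 = subnet ID, $4 = security group ID,
# $5 = path to the endpoint policy JSON file (assumed to exist locally)
create_textract_endpoint() {
  aws ec2 create-vpc-endpoint \
    --region "$1" \
    --vpc-id "$2" \
    --vpc-endpoint-type Interface \
    --service-name "$(textract_service_name "$1")" \
    --subnet-ids "$3" \
    --security-group-ids "$4" \
    --private-dns-enabled \
    --policy-document "file://$5"
}

# Example call (placeholder IDs, replace with your own):
#   create_textract_endpoint us-east-2 vpc-EXAMPLE subnet-EXAMPLE sg-EXAMPLE policy.json
```

Keeping the service name in its own helper makes it easy to switch to the FIPS variant by passing fips as the second argument.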

Now that you have set up your VPC endpoint, the following section shows you how to access Amazon Textract APIs from within that VPC using AWS PrivateLink.

Accessing Amazon Textract APIs via AWS PrivateLink

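After you set up the relevant VPC endpoint policies, you have two options to configure endpoints in order to access Amazon Textract APIs:

  • If you enabled private DNS for the endpoint, you can use the default Regional hostname (for example, textract.us-east-2.amazonaws.com), which resolves to your VPC endpoint.
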
The following code is an example AWS CLI command to run from within the VPC:

aws textract detect-document-text --document '{"S3Object":{"Bucket":"textract-test-bucket","Name":"example-doc.jpg"}}' --region us-east-2

  • You can also use the DNS name that was generated when creating the VPC endpoint. These DNS names are in the form of *.textract.us-east-2.vpce.amazonaws.com or *.textract-fips.us-east-2.vpce.amazonaws.com. For example: vpce-0f1aa01f0ce676709-il663k5n.textract.us-east-2.vpce.amazonaws.com.

The following code is an example AWS CLI command to run from within the VPC:

aws textract detect-document-text --document '{"S3Object":{"Bucket":"textract-test-bucket","Name":"example-doc.jpg"}}' --region us-east-2 --endpoint-url https://vpce-05e9d346575f9cb38-1wdh6mi2.textract.us-east-2.vpce.amazonaws.com
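The two options above can be wrapped in a few small shell helpers. This is a sketch under stated assumptions, not a definitive implementation: the function names are hypothetical, the lookup assumes the first DNS entry returned for the endpoint is its Regional endpoint-specific name (inspect the full DnsEntries list if in doubt), and the private-address check assumes your VPC uses RFC 1918 address ranges.

```shell
# Look up an endpoint-specific DNS name for a VPC endpoint.
# $1 = Region, $2 = VPC endpoint ID
# Assumption: the first DNS entry is the Regional endpoint-specific name.
endpoint_dns_name() {
  aws ec2 describe-vpc-endpoints \
    --region "$1" \
    --vpc-endpoint-ids "$2" \
    --query 'VpcEndpoints[0].DnsEntries[0].DnsName' \
    --output text
}

# Call DetectDocumentText through an explicit endpoint URL.
# $1 = Region, $2 = endpoint DNS name, $3 = S3 bucket, $4 = S3 object key
detect_text_via_endpoint() {
  aws textract detect-document-text \
    --region "$1" \
    --endpoint-url "https://$2" \
    --document "{\"S3Object\":{\"Bucket\":\"$3\",\"Name\":\"$4\"}}"
}

# Return success if an IPv4 address falls in an RFC 1918 private range
# (10.0.0.0/8, 172.16.0.0/12, or 192.168.0.0/16).
is_private_ipv4() {
  case "$1" in
    10.*|192.168.*) return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example usage from an instance inside the VPC (placeholder endpoint ID):
#   dns="$(endpoint_dns_name us-east-2 vpce-EXAMPLE)"
#   detect_text_via_endpoint us-east-2 "$dns" textract-test-bucket example-doc.jpg
```

To check the first option, you can resolve the default Regional hostname from inside the VPC (for example, with a DNS lookup tool available on your instance) and pass each returned address to is_private_ipv4. A private address suggests that the hostname resolves to the VPC endpoint.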

Conclusion

You have now successfully configured a VPC endpoint for Amazon Textract in your AWS account. Traffic to Amazon Textract APIs through that VPC endpoint stays within the AWS network. The VPC endpoint policy you configured also lets you restrict which Amazon Textract APIs are accessible from within that VPC.


About the Authors

Raj Copparapu is a Product Manager focused on putting machine learning in the hands of every developer.

Thomas joined Amazon Web Services in 2016, initially working on Application Auto Scaling before moving into his current role at Textract. Before joining AWS, he worked in engineering roles in the domains of computer graphics and networking. Thomas holds a master’s degree in engineering from the University of Leuven in Belgium.