AWS for Industries

FSI Service Spotlight: Featuring Amazon Textract

Editor’s note: This is the third in a monthly series for Financial Services Industry Service Spotlight.

Welcome to the Service Spotlight blog series. In this series, we plan to highlight five key considerations of a particular service that financial institutions should focus on to help streamline service approval. Each of the five areas will include specific guidance, which may need to be adapted to your specific use case and environment.

This edition of the Service Spotlight features Amazon Textract, a fully managed AI service that extracts text, handwriting, and other data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify and understand the relationships among the data in forms and tables.

Financial institutions are using Amazon Textract for a number of workloads across banking, capital markets, and insurance. In banking, BlueVine is a financial technology company that provides financing to small- and medium-sized businesses, and it developed a product that uses Amazon Textract to automate the processing of Paycheck Protection Program (PPP) loan applications. In capital markets, PitchBook is a Morningstar company that tracks every aspect of the public and private equity markets, including venture capital, private equity, and M&A. PitchBook uses Amazon Textract in its research process and has improved its processing of PDF documents by as much as 60%. Lastly, in insurance, nib Group is an Australian healthcare fund that provides insurance to over 1.6 million people. nib Group uses Amazon Textract to automate its claims processing pipeline, resulting in a great customer experience while increasing its operational efficiencies.

Achieving Compliance with Amazon Textract

Security is a shared responsibility between AWS and you. AWS is responsible for protecting the infrastructure that runs the AWS services in the AWS Cloud and also provides you with services that you can use securely. Your responsibility is determined by the AWS service that you use. On the customer’s side of the shared responsibility model, customers should first determine their requirements for network connectivity, encryption, and access to other AWS resources. We will dive deeper into those topics in the upcoming sections.

Amazon Textract falls under the scope of the following compliance programs with regard to the AWS side of the shared responsibility model.

  • SOC 1, 2, 3
  • PCI
  • ISO/IEC 27001:2013, 27017:2015, 27018:2019
  • ISO 9001:2015
  • OSPAR
  • MTCS

In the following sections, we cover topics on the customer side of the shared responsibility model.

Data Protection with Amazon Textract

Encryption is typically employed to protect data. Although customers can access the Amazon Textract API using Transport Layer Security (TLS) 1.0, AWS recommends accessing the API over TLS 1.2 instead.

Amazon Textract also works with AWS Key Management Service (AWS KMS) so that customers can specify how their data is encrypted while it is being processed and how the results are encrypted, both within the Amazon Textract service and in Amazon Simple Storage Service (Amazon S3). If customers ask Amazon Textract to analyze data through its synchronous API, their data is stored and processed only in memory.

Financial services customers may require that the underlying AWS services do not store any customer data for service improvements or may store data only if it is encrypted with customer-managed keys for temporary periods such as model training or processing. Amazon Textract supports the use of AWS KMS customer-managed keys to encrypt any data stored at rest for the duration of processing. Likewise, the data in the customer-managed output bucket can also be encrypted via a customer-managed key.
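
As a rough illustration of the point above, the following Python sketch builds the parameters for an asynchronous StartDocumentAnalysis request that names both a customer-managed output bucket and a customer-managed KMS key. The bucket names, prefix, and key ARN are placeholders, and the helper names are hypothetical. `start_analysis` assumes it is handed a boto3 Textract client, which is not created here.

```python
from typing import Any, Dict, List, Optional


def build_start_analysis_request(
    input_bucket: str,
    document_key: str,
    output_bucket: str,
    output_prefix: str,
    kms_key_id: str,
    feature_types: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for the Textract StartDocumentAnalysis API."""
    if not kms_key_id:
        # Refuse to build a request that would fall back to default encryption.
        raise ValueError("a customer-managed KMS key is required")
    return {
        "DocumentLocation": {
            "S3Object": {"Bucket": input_bucket, "Name": document_key}
        },
        "FeatureTypes": feature_types or ["TABLES", "FORMS"],
        # Results are written to a bucket that the customer owns ...
        "OutputConfig": {"S3Bucket": output_bucket, "S3Prefix": output_prefix},
        # ... and encrypted with the customer-managed key.
        "KMSKeyId": kms_key_id,
    }


def start_analysis(textract_client: Any, **request: Any) -> str:
    """Start the job with a boto3 Textract client and return its JobId."""
    response = textract_client.start_document_analysis(**request)
    return response["JobId"]
```

With boto3 installed and credentials configured, `start_analysis(boto3.client("textract"), **request)` would submit the job. Keeping the request builder separate makes it easy to unit test the encryption settings without calling AWS.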

Fig 1: Network diagram for Amazon Textract. Control plane API calls for Amazon Textract can be made via the VPC endpoint, as detailed in the next section. For the data plane, Amazon Textract accesses data in customer S3 buckets.

Isolation of Compute Environments with Amazon Textract

Amazon Textract is a managed service that doesn’t have any compute resources on the customer’s side of the shared responsibility model. As a managed service, Amazon Textract is protected by the AWS global network security procedures that are described in the AWS Architecture Center: Security, Identity, & Compliance.

Customers can also establish a private connection between their VPC and Amazon Textract by creating an interface VPC endpoint. Interface endpoints are powered by AWS PrivateLink, a technology that enables you to privately access Amazon Textract APIs without an internet gateway, NAT device, VPN connection, or AWS Direct Connect connection. Instances in your VPC don’t need public IP addresses to communicate with Amazon Textract APIs. The use of interface VPC endpoints also ensures that traffic between your VPC and Amazon Textract does not leave the Amazon network. Amazon Textract also supports policy enforcement on VPC endpoints to restrict usage of Amazon Textract within your VPC. The following is an example of an endpoint policy for Amazon Textract. When attached to a VPC endpoint, this policy grants access to the specified Amazon Textract actions for all principals, but only if the principal belongs to your AWS organization.

{
   "Statement":[
      {
         "Principal":"*",
         "Effect":"Allow",
         "Action":[
            "textract:StartDocumentTextDetection",
            "textract:AnalyzeDocument",
            "textract:DetectDocumentText",
            "textract:GetDocumentAnalysis",
            "textract:GetDocumentTextDetection",
            "textract:StartDocumentAnalysis"
         ],
         "Resource":"*",
         "Condition": {
            "StringEquals": {
                "aws:PrincipalOrgID": [
                    "o-aabbccxxyyzz"
                ]
            }
         }
      }
   ]
}
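
One way to put the endpoint and its policy in place is sketched below in Python. It only builds the request for the Amazon EC2 CreateVpcEndpoint API. The VPC, subnet, and security group IDs, the Region, and the helper names are illustrative assumptions, and the organization ID is the placeholder from the policy above.

```python
import json
from typing import Any, Dict, List

TEXTRACT_ACTIONS = [
    "textract:StartDocumentTextDetection",
    "textract:AnalyzeDocument",
    "textract:DetectDocumentText",
    "textract:GetDocumentAnalysis",
    "textract:GetDocumentTextDetection",
    "textract:StartDocumentAnalysis",
]


def build_endpoint_policy(org_id: str) -> Dict[str, Any]:
    """Recreate the endpoint policy shown above for a given organization ID."""
    return {
        "Statement": [
            {
                "Principal": "*",
                "Effect": "Allow",
                "Action": TEXTRACT_ACTIONS,
                "Resource": "*",
                "Condition": {"StringEquals": {"aws:PrincipalOrgID": [org_id]}},
            }
        ]
    }


def build_endpoint_request(
    region: str,
    vpc_id: str,
    subnet_ids: List[str],
    security_group_ids: List[str],
    org_id: str,
) -> Dict[str, Any]:
    """Build keyword arguments for the EC2 CreateVpcEndpoint API."""
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.textract",
        "SubnetIds": subnet_ids,
        "SecurityGroupIds": security_group_ids,
        # Private DNS lets SDK calls resolve to the endpoint without code changes.
        "PrivateDnsEnabled": True,
        "PolicyDocument": json.dumps(build_endpoint_policy(org_id)),
    }
```

The resulting dictionary could be passed to `boto3.client("ec2").create_vpc_endpoint(**request)`. Whether private DNS should be enabled depends on your VPC configuration.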

Automating Audits with APIs with Amazon Textract

Financial institutions may be required to periodically audit their AWS services for usage, user activities, and any resource changes as part of their standard IT security and compliance policies. API calls made in your AWS environment by users, IAM roles, or other AWS services can be logged using AWS CloudTrail. Amazon Textract supports the following API calls that are logged as events in CloudTrail:

  • AnalyzeDocument
  • DetectDocumentText
  • StartDocumentAnalysis
  • StartDocumentTextDetection
  • GetDocumentAnalysis
  • GetDocumentTextDetection

AnalyzeDocument and DetectDocumentText are synchronous API calls, whereas the APIs beginning with “Start” are asynchronous APIs that require the user to provide an input and output data store to save the results of a job. You may, for example, want to ensure that when asynchronous API calls are made, users supply the appropriate output buckets in addition to KMS keys (using the KMSKeyId parameter) at the start of the call. For privacy reasons, CloudTrail will not log the image bytes or bounding box information, but rather log only the location of the document in Amazon S3.

For example, here is a sample CloudTrail event for a StartDocumentAnalysis API call that extracts tables and forms from an SEC filing for Amazon.

{
    "eventVersion": "1.08",
    "userIdentity": {
        "type": "IAMUser",
        "principalId": "*****************",
        "arn": "arn:aws:iam::<account-num>:user/john.doe",
        "accountId": "<account-num>",
        "accessKeyId": "*************",
        "userName": "john.doe"
    },
    "eventTime": "2021-01-19T18:41:25Z",
    "eventSource": "textract.amazonaws.com",
    "eventName": "StartDocumentAnalysis",
    "awsRegion": "us-west-2",
    "sourceIPAddress": "205.251.233.179",
    "userAgent": "aws-cli/1.18.192 Python/3.7.3 Darwin/18.7.0 botocore/1.19.32",
    "requestParameters": {
        "documentLocation": {
            "s3Object": {
                "bucket": "sagemaker-project-1234",
                "name": "Amazon10K.pdf"
            }
        },
        "featureTypes": [
            "TABLES",
            "FORMS"
        ],
        "notificationChannel": {
            "sNSTopicArn": "arn:aws:sns:us-west-2:<account-num>:Textract_analyze_document",
            "roleArn": "arn:aws:iam::<account-num>:role/service-role/AmazonSageMaker-ExecutionRole-20190823T110499"
        }
    },
    "responseElements": {
        "jobId": "73af8679a6e341ac37104df58a2a96a3a33aafb32df5ac8dd83608445a2f697f"
    },
    "requestID": "f3739fdb-9a47-4f7c-b211-f7273ff92efc",
    "eventID": "61697da4-f569-45ff-8a4f-ad769d1d0911",
    "readOnly": false,
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "eventCategory": "Management",
    "recipientAccountId": "<account-num>"
}
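
An audit over events like the one above could be sketched in Python as follows. The check flags asynchronous Start calls that do not supply a KMS key or an output bucket. The exact casing of those parameters inside `requestParameters` is an assumption, since they do not appear in the sample event, so the lookup is case-insensitive. The function and finding names are hypothetical.

```python
from typing import Any, Dict, List, Optional

ASYNC_START_EVENTS = {"StartDocumentAnalysis", "StartDocumentTextDetection"}


def _get_ci(mapping: Dict[str, Any], name: str) -> Optional[Any]:
    """Look up a key without regard to case (CloudTrail casing is assumed)."""
    for key, value in mapping.items():
        if key.lower() == name.lower():
            return value
    return None


def audit_textract_event(event: Dict[str, Any]) -> List[str]:
    """Return findings for one CloudTrail event; an empty list means no issue."""
    findings: List[str] = []
    if event.get("eventSource") != "textract.amazonaws.com":
        return findings
    if event.get("eventName") not in ASYNC_START_EVENTS:
        return findings

    params = event.get("requestParameters") or {}
    if not _get_ci(params, "KMSKeyId"):
        findings.append("missing KMSKeyId")

    output_config = _get_ci(params, "OutputConfig")
    if not isinstance(output_config, dict) or not _get_ci(output_config, "S3Bucket"):
        findings.append("missing OutputConfig.S3Bucket")
    return findings
```

Applied to the sample event above, which carries neither parameter, this sketch would report both findings. Events for synchronous calls or for other services are ignored.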

In addition to logging API calls, financial services customers may also want to automate the monitoring of any resource state changes or configuration changes for AWS services. AWS Config enables you to assess, audit, and evaluate the configurations of underlying services such as Amazon Textract using Custom Rules. Custom Rules are AWS Lambda functions that evaluate the underlying logic of the user-defined rule. AWS Config can trigger the Lambda function periodically; the function validates the rule and returns the results to AWS Config.
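
The shape of such a periodic Custom Rule can be sketched in Python as follows. This is a skeleton under stated assumptions rather than a complete rule. The `check` callable is a hypothetical placeholder for whatever control you want to verify, such as the presence of a Textract VPC endpoint, and `config_client` is assumed to be a boto3 AWS Config client supplied by the Lambda handler.

```python
import json
from typing import Any, Callable, Dict


def build_evaluation(event: Dict[str, Any], compliant: bool, annotation: str) -> Dict[str, Any]:
    """Build one account-level evaluation from a periodic AWS Config invocation."""
    # AWS Config passes the invoking event to the function as a JSON string.
    invoking_event = json.loads(event["invokingEvent"])
    return {
        "ComplianceResourceType": "AWS::::Account",
        "ComplianceResourceId": event["accountId"],
        "ComplianceType": "COMPLIANT" if compliant else "NON_COMPLIANT",
        "Annotation": annotation,
        "OrderingTimestamp": invoking_event["notificationCreationTime"],
    }


def evaluate_rule(
    event: Dict[str, Any],
    config_client: Any,
    check: Callable[[], bool],
) -> Dict[str, Any]:
    """Run the hypothetical check and report the result back to AWS Config."""
    compliant = check()
    annotation = "Textract control satisfied" if compliant else "Textract control not satisfied"
    evaluation = build_evaluation(event, compliant, annotation)
    config_client.put_evaluations(
        Evaluations=[evaluation],
        ResultToken=event["resultToken"],
    )
    return evaluation


# In a Lambda function, the handler would delegate to evaluate_rule, for example:
#   def lambda_handler(event, context):
#       return evaluate_rule(event, boto3.client("config"), my_check)
```

Passing the client and the check in as arguments keeps the evaluation logic testable outside of Lambda.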

Operational Access and Security with Amazon Textract

When processing datasets that are subject to PCI compliance, you may need to opt out of having your documents used to improve the quality of Amazon Textract. Many financial services customers may choose to either contact AWS Support or implement an organization-wide opt-out policy in AWS Organizations, attached to the organization root, that applies to all applicable AI services:

{
    "services": {
        "@@operators_allowed_for_child_policies": ["@@none"],
        "default": {
            "@@operators_allowed_for_child_policies": ["@@none"],
            "opt_out_policy": {
                "@@operators_allowed_for_child_policies": ["@@none"],
                "@@assign": "optOut"
            }
        }
    }
}

Alternatively, you can restrict this to a single service such as Amazon Textract:

{
    "services": {
        "textract": {
            "opt_out_policy": {
                "@@assign": "optOut",
                "@@operators_allowed_for_child_policies": ["@@none"]
            }
        }
    }
}
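
As a sketch of how either policy might be applied programmatically, the Python below creates an AI services opt-out policy and attaches it to a target through the AWS Organizations API. It assumes a boto3 Organizations client is passed in, that the AI services opt-out policy type is already enabled for the organization, and that `root_id` is your own root or OU ID. The function names and the policy name are hypothetical.

```python
import json
from typing import Any, Dict


def build_opt_out_policy(service: str = "default") -> Dict[str, Any]:
    """Build an opt-out policy for one service, or for all with 'default'."""
    return {
        "services": {
            service: {
                "opt_out_policy": {
                    # Prevent child policies from overriding the opt-out.
                    "@@operators_allowed_for_child_policies": ["@@none"],
                    "@@assign": "optOut",
                }
            }
        }
    }


def apply_opt_out(
    org_client: Any,
    root_id: str,
    service: str = "default",
    name: str = "ai-services-opt-out",
) -> str:
    """Create the policy, attach it to the target, and return the policy ID."""
    response = org_client.create_policy(
        Content=json.dumps(build_opt_out_policy(service)),
        Description="Opt out of content use for AI service improvement",
        Name=name,
        Type="AISERVICES_OPT_OUT_POLICY",
    )
    policy_id = response["Policy"]["PolicySummary"]["Id"]
    org_client.attach_policy(PolicyId=policy_id, TargetId=root_id)
    return policy_id
```

Calling `apply_opt_out(client, root_id, service="textract")` would correspond to the single-service variant above.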

Customers may also need to know who has access to what data and to limit user access as much as possible. For example, if the initial extraction of information from unstructured documents is performed by data engineers, who then hand off the results of the job to data scientists for training natural language processing (NLP) models, you may want to limit access to Amazon Textract APIs to data engineers using IAM permissions. Furthermore, using IAM global condition keys, you can ensure that control plane API calls are made via your VPC endpoint. For example, the following IAM policy, which can be attached to the principal making API calls to Amazon Textract, denies any API call to Amazon Textract that is not made via the Textract VPC endpoint:

{
   "Statement":[
      {
         "Effect":"Deny",
         "Action":[
            "textract:StartDocumentTextDetection",
            "textract:AnalyzeDocument",
            "textract:DetectDocumentText",
            "textract:GetDocumentAnalysis",
            "textract:GetDocumentTextDetection",
            "textract:StartDocumentAnalysis"
         ],
         "Resource":"*",
         "Condition": {
            "StringNotEquals": {
                "aws:sourceVpce": [
                    "vpce-111bbccc"
                ]
            }
         }
      }
   ]
}

Service control policies (SCPs) are a type of organization policy that you can use to manage permissions in your organization. SCPs offer central control over the maximum available permissions for all accounts in your organization. SCPs help you to ensure your accounts stay within your organization’s access control guidelines.

The following example SCP denies the asynchronous Amazon Textract actions, so that only the synchronous actions remain available.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Allow-Textract-Sync-Api-Only",
            "Effect": "Deny",
            "Action": [
                "textract:GetDocumentAnalysis",
                "textract:GetDocumentTextDetection",
                "textract:StartDocumentAnalysis",
                "textract:StartDocumentTextDetection"
            ],
            "Resource": "*"
        }
    ]
}

Bucket policies are resource-based policies that can be placed directly on a bucket to manage access to the data contained within it. You can restrict access to data in your bucket to requests that arrive through your VPC endpoint or that are made by Amazon Textract by using the global condition keys aws:sourceVpce and aws:CalledViaFirst, as shown in the following example:

{
   "Version": "2012-10-17",
   "Id": "Policy1415115909152",
   "Statement": [
     {
       "Sid": "Access-to-VPCE-and-Textract-Only",
       "Principal": "*",
       "Action": ["s3:GetObject",
                  "s3:PutObject"],
       "Effect": "Deny",
       "Resource": ["arn:aws:s3:::vpc-restricted-bucket",
                    "arn:aws:s3:::vpc-restricted-bucket/*"],
       "Condition": {
         "StringNotEquals": {
           "aws:sourceVpce": "vpce-01ad5da5",
           "aws:CalledViaFirst": "textract.amazonaws.com"
         }
       }
     }
   ]
}
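
The Deny statement above combines two keys under one StringNotEquals operator, which can be easy to misread. The Python sketch below is a simplified model of that evaluation and not the actual IAM policy engine. It treats the two keys as combined with a logical AND and treats an absent key as a non-match, so the negated operator evaluates to true. The function names are hypothetical.

```python
from typing import Dict, Optional

# Condition from the bucket policy above; key names are lowercased
# because IAM condition key names are not case-sensitive.
DENY_CONDITION = {
    "aws:sourcevpce": "vpce-01ad5da5",
    "aws:calledviafirst": "textract.amazonaws.com",
}


def string_not_equals(actual: Optional[str], expected: str) -> bool:
    """Negated string match: true when the key is absent or the value differs."""
    return actual != expected


def is_denied(request_context: Dict[str, str]) -> bool:
    """Return True when the Deny statement applies to the request."""
    context = {key.lower(): value for key, value in request_context.items()}
    # Every key under the operator must satisfy StringNotEquals (logical AND).
    return all(
        string_not_equals(context.get(key), expected)
        for key, expected in DENY_CONDITION.items()
    )
```

Under this model, a request through the listed VPC endpoint or a request whose first calling service is Amazon Textract is not denied, while a request that matches neither condition is denied. The statement only removes access. An Allow elsewhere is still needed for the request to succeed.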

Conclusion

In this post, we reviewed Amazon Textract and highlighted key information that can help FSI customers accelerate the approval of the service within these five categories: achieving compliance, data protection, isolation of compute environments, automating audits with APIs, and operational access and security. While not a one-size-fits-all approach, the guidance can be adapted to meet your organization’s security and compliance requirements and provide a consolidated list of key areas for Amazon Textract.

In the meantime, be sure to visit our AWS Financial Services Industry blog channel and stay tuned for more financial services news and best practices.

Alvin Huang

Alvin Huang is a Capital Markets Specialist for Worldwide Financial Services Business Development at Amazon Web Services with a focus on data lakes and analytics, and artificial intelligence and machine learning. Alvin has over 19 years of experience in the financial services industry, and prior to joining AWS, he was an Executive Director at J.P. Morgan Chase & Co, where he managed the North America and Latin America trade surveillance teams and led the development of global trade surveillance. Alvin also teaches a Quantitative Risk Management course at Rutgers University and serves on the Rutgers Mathematical Finance Master’s program (MSMF) Advisory Board.

Gene Ting

Gene Ting is a principal solutions architect at Amazon Web Services. He is focused on helping enterprise customers build and operate workloads securely on AWS. In his free time, Gene enjoys teaching kids technology and sports, as well as following the latest on cybersecurity.

Stefan Natu

Stefan Natu is a Principal Machine Learning Specialist at Amazon Web Services. He is focused on helping enterprise customers build, secure and operationalize machine learning solutions on AWS. His academic background is in theoretical physics, and in the past, he worked on a number of data science problems in retail and energy verticals. In his spare time, he enjoys reading machine learning blogs, traveling, playing the guitar, and exploring the food scene in New York City.