AWS Machine Learning Blog

Automatically extract text and structured data from documents with Amazon Textract

June 2021 – This post has been updated with the latest use cases and capabilities for Amazon Textract. 

Documents are a primary tool for record keeping, communication, collaboration, and transactions across many industries, including financial, medical, legal, and real estate. The millions of mortgage applications and hundreds of millions of W-2 tax forms processed each year are just a few examples of such documents. A lot of information is locked in unstructured documents, and it usually takes time-consuming, complex processes to enable search and discovery, business process automation, and compliance control for these documents.

In this post, we show how you can take advantage of Amazon Textract to automatically extract text and data from scanned documents without any machine learning (ML) experience. While AWS takes care of building, training, and deploying advanced ML models in a highly available and scalable environment, you take advantage of these models with simple-to-use API actions. We cover the following use cases in this post:

  • Text detection from documents
  • Form and table extraction and processing
  • Multi-column detection and reading order
  • Natural language processing and document classification
  • Natural language processing for medical documents
  • Document translation
  • Search and discovery
  • Compliance control with document redaction
  • PDF document processing

Amazon Textract overview

Before we get started with the use cases, let’s review and introduce some of the core features. Amazon Textract goes beyond simple optical character recognition (OCR) to also identify the contents of fields in forms, information stored in tables, handwritten text, and check boxes. This allows you to use Amazon Textract to instantly read almost any type of document and accurately extract text and data without the need for any manual effort or custom code.

The following images show an example document using Amazon Textract on the AWS Management Console on the Forms output tab.

To quickly download a .zip file containing the output, choose Download results. You can choose various formats, including raw JSON, text, and CSV files for forms and tables.

In addition to the detected content, Amazon Textract provides information such as confidence scores and bounding boxes for detected elements. This gives you control over how you consume extracted content and integrate it into various business applications.
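
As a minimal sketch of how the confidence scores can be consumed, the following Python keeps only the LINE blocks at or above a threshold. The helper name confident_lines, the threshold of 90, and the sample response are illustrative assumptions, not part of Amazon Textract; the field names (Blocks, BlockType, Confidence, Text) follow the JSON response format.

```python
# Hypothetical helper (not part of any AWS package): keep only LINE blocks
# whose confidence score meets a threshold chosen by the caller.
def confident_lines(response, min_confidence=90.0):
    """Return (text, confidence) pairs for LINE blocks at or above min_confidence."""
    results = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue
        confidence = block.get("Confidence", 0.0)
        if confidence >= min_confidence:
            results.append((block["Text"], confidence))
    return results


# Made-up sample shaped like an Amazon Textract response (values are illustrative).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1"},
        {"BlockType": "LINE", "Id": "l1", "Text": "Amazon.com, Inc.", "Confidence": 99.2},
        {"BlockType": "LINE", "Id": "l2", "Text": "smudged text", "Confidence": 61.5},
    ]
}

for text, confidence in confident_lines(sample_response):
    print("{:.1f}\t{}".format(confidence, text))
```

Lines that fall below the threshold could be routed to a human reviewer instead of being dropped; the right threshold depends on your documents and tolerance for errors.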

Amazon Textract provides both synchronous and asynchronous API actions to extract document text and analyze the document text data. Synchronous APIs can be used for single-page documents and low-latency use cases such as mobile capture. Asynchronous APIs can be used for multipage documents such as PDF documents with thousands of pages. For more information, see the Amazon Textract API Reference.

Use cases overview

You can easily take advantage of Amazon Textract API operations using the AWS SDK to build powerful, smart applications. We also use Amazon Textract Helper, Amazon Textract Caller, Amazon Textract PrettyPrinter, and Amazon Textract Response Parser for some of the following use cases. These packages are published to PyPI to speed up development and integration even further.

Text detection from documents

We start with a simple example of how to detect text from a document. We use the following image as an input document to Amazon Textract. The sample image isn’t good quality, but Amazon Textract can still detect the text with accuracy.

The easiest way to extract information from this document programmatically is through installing Amazon Textract Helper:

python -m pip install amazon-textract-helper

Then we call Amazon Textract to extract information from the document and display the results by running the command line tool:

amazon-textract --input-document "s3://amazon-textract-public-content/blogs/amazon-textract-sample-text-amazon-dot-com.png" --pretty-print LINES

The following screenshot shows our output.

The command line tool uses the Amazon Textract Caller, Amazon Textract PrettyPrinter, and Amazon Textract Overlayer packages to generate the results.

The original Amazon Textract response is JSON with the following structure:

{
    "Blocks": [
        {
            "Geometry": {
                "BoundingBox": {
                    "Width": 1.0, 
                    "Top": 0.0, 
                    "Left": 0.0, 
                    "Height": 1.0
                }, 
                "Polygon": [
                    {
                        "Y": 0.0, 
                        "X": 0.0
                    }, 
                    {
                        "Y": 0.0, 
                        "X": 1.0
                    }, 
                    {
                        "Y": 1.0, 
                        "X": 1.0
                    }, 
                    {
                        "Y": 1.0, 
                        "X": 0.0
                    }
                ]
            }, 
            "Relationships": [
                {
                    "Type": "CHILD", 
                    "Ids": [
                        "2602b0a6-20e3-4e6e-9e46-3be57fd0844b", 
                        "82aedd57-187f-43dd-9eb1-4f312ca30042", 
                        "52be1777-53f7-42f6-a7cf-6d09bdc15a30", 
                        "7ca7caa6-00ef-4cda-b1aa-5571dfed1a7c"
                    ]
                }
            ], 
            "BlockType": "PAGE", 
            "Id": "8136b2dc-37c1-4300-a9da-6ed8b276ea97"
        }..... 
        
    ], 
    "DocumentMetadata": {
        "Pages": 1
    }
}

Amazon Textract Response Parser makes it easier to deserialize the JSON response and use it in your program, the same way Amazon Textract Helper and Amazon Textract PrettyPrinter use it. The GitHub repository shows some examples.
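
To illustrate how the blocks in the JSON relate to each other, here is a minimal sketch in plain Python that follows the CHILD relationships from each PAGE block to its LINE blocks. It doesn't use Amazon Textract Response Parser; the function name lines_by_page and the sample data are assumptions made for this example.

```python
# Hypothetical helper: walk PAGE -> CHILD -> LINE relationships in a response.
def lines_by_page(response):
    """Return one list of line texts per PAGE block, in response order."""
    blocks_by_id = {block["Id"]: block for block in response["Blocks"]}
    pages = []
    for block in response["Blocks"]:
        if block["BlockType"] != "PAGE":
            continue
        page_lines = []
        for relationship in block.get("Relationships", []):
            if relationship["Type"] != "CHILD":
                continue
            for child_id in relationship["Ids"]:
                child = blocks_by_id.get(child_id)
                if child and child["BlockType"] == "LINE":
                    page_lines.append(child["Text"])
        pages.append(page_lines)
    return pages


# Made-up sample with the same shape as the response shown earlier.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "page-1",
         "Relationships": [{"Type": "CHILD", "Ids": ["line-1", "line-2"]}]},
        {"BlockType": "LINE", "Id": "line-1", "Text": "Amazon.com, Inc. is located in Seattle, WA"},
        {"BlockType": "LINE", "Id": "line-2", "Text": "It was founded July 5th, 1994"},
    ],
    "DocumentMetadata": {"Pages": 1},
}

for page_number, page in enumerate(lines_by_page(sample_response), start=1):
    print("Page {}: {}".format(page_number, page))
```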

Form and table extraction and processing

Amazon Textract can provide the inputs required to automatically process forms and tables without human intervention. For example, a bank could write code to read PDFs of loan applications. The information contained in the document could be used to initiate all the necessary background and credit checks to approve the loan so that customers can get instant results for their application rather than having to wait several days for manual review and validation.

The following image is an employment application with form fields, check boxes, and a table.

The following command extracts the forms and tables from the employment application:

export AWS_DEFAULT_REGION=us-east-2; amazon-textract --input-document "s3://amazon-textract-public-content/blogs/employeeapp20210510.png" --pretty-print FORMS TABLES --features FORMS TABLES

The preceding commands produce the following output to visualize the structure of the information.

The key-value pairs from the FORMS output are rendered as a table with Key and Value headlines to allow for easier processing.

For example, including the --pretty-print-table-format=csv parameter outputs the data in CSV format (run amazon-textract --help for a list of other formats):

export AWS_DEFAULT_REGION=us-east-2; amazon-textract --input-document "s3://amazon-textract-public-content/blogs/employeeapp20210510.png" --pretty-print FORMS TABLES --features FORMS TABLES --pretty-print-table-format=csv

The following screenshot shows the output.
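
The key-value pairs in the FORMS output can also be read directly from the raw JSON. The following is a hedged sketch, assuming a response from an analysis with the FORMS feature: KEY_VALUE_SET blocks with the KEY entity type point to their VALUE block through a VALUE relationship, and both reach their words through CHILD relationships. The function names and the sample data are made up for illustration.

```python
# Hypothetical helpers: rebuild key-value pairs from a FORMS response.
def text_of(block, blocks_by_id):
    """Join the WORD (or selection status) children of a block into one string."""
    parts = []
    for relationship in block.get("Relationships", []):
        if relationship["Type"] != "CHILD":
            continue
        for child_id in relationship["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                parts.append(child["SelectionStatus"])
    return " ".join(parts)


def key_value_pairs(response):
    """Map each form key to the text of the value block it points to."""
    blocks_by_id = {block["Id"]: block for block in response["Blocks"]}
    pairs = {}
    for block in response["Blocks"]:
        if block["BlockType"] != "KEY_VALUE_SET" or "KEY" not in block.get("EntityTypes", []):
            continue
        value_text = ""
        for relationship in block.get("Relationships", []):
            if relationship["Type"] == "VALUE":
                value_text = " ".join(
                    text_of(blocks_by_id[value_id], blocks_by_id)
                    for value_id in relationship["Ids"]
                )
        pairs[text_of(block, blocks_by_id)] = value_text
    return pairs


# Made-up sample with one form field.
sample_response = {
    "Blocks": [
        {"BlockType": "WORD", "Id": "w1", "Text": "Full"},
        {"BlockType": "WORD", "Id": "w2", "Text": "Name:"},
        {"BlockType": "WORD", "Id": "w3", "Text": "Jane"},
        {"BlockType": "WORD", "Id": "w4", "Text": "Doe"},
        {"BlockType": "KEY_VALUE_SET", "Id": "k1", "EntityTypes": ["KEY"],
         "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                           {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
        {"BlockType": "KEY_VALUE_SET", "Id": "v1", "EntityTypes": ["VALUE"],
         "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w4"]}]},
    ]
}

print(key_value_pairs(sample_response))
```

In practice, Amazon Textract Response Parser does this work for you; the sketch only shows what the relationships encode.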

Amazon Textract can detect tables and their content. A company can extract all the amounts from an expense report (as in the following screenshot) and apply rules, such as any expense more than $1,000 needs extra review.

The following code uses the CSV output from the command line tool and the sample expense report to print the content of each cell, along with a warning message if any expense is more than $1,000:

import csv
import sys
from tabulate import tabulate

reader = csv.reader(sys.stdin)

def is_float(value):
    """Return True if value can be parsed as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False

all_rows = []
for row in reader:
    warning = ""
    # Only rows with more than four columns carry an amount in the fifth column
    if len(row) > 4:
        if row[4] and is_float(row[4]):
            # Flag amounts above $1,000, skipping the Total row
            if float(row[4]) > 1000.00 and row[3] and row[3].strip() != 'Total':
                warning = "Warning - value > $1000.00 and requires review."
        row.append(warning)
    all_rows.append(row)
print(tabulate(all_rows, tablefmt='github'))

Save this code as test-csv.py or copy it from Amazon Simple Storage Service (Amazon S3) at s3://amazon-textract-public-content/blogs/test-csv.py. Then use the following command:

export AWS_DEFAULT_REGION=us-east-2; amazon-textract --input-document "s3://amazon-textract-public-content/blogs/expense-report-example.png" --features TABLES --pretty-print TABLES --pretty-print-table-format csv | python test-csv.py

We receive the following output.

To recap, we started with a document image, called Amazon Textract to identify and receive the table structure and information, applied business logic on the data, and triggered a business process based on the information.
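
The same rule can be applied to the raw JSON instead of the CSV output. The following sketch assumes a response from an analysis with the TABLES feature, in which TABLE blocks list CELL blocks as children and each CELL carries RowIndex and ColumnIndex. The function names, the column position of the amount, and the sample values are assumptions for illustration.

```python
# Hypothetical helpers: rebuild table rows from a TABLES response and flag
# rows whose amount exceeds a limit.
def cell_text(cell, blocks_by_id):
    """Join the WORD children of a CELL block."""
    words = []
    for relationship in cell.get("Relationships", []):
        if relationship["Type"] == "CHILD":
            for child_id in relationship["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def table_rows(response):
    """Return each TABLE block as a list of rows of cell text."""
    blocks_by_id = {block["Id"]: block for block in response["Blocks"]}
    tables = []
    for block in response["Blocks"]:
        if block["BlockType"] != "TABLE":
            continue
        cells = {}
        for relationship in block.get("Relationships", []):
            if relationship["Type"] != "CHILD":
                continue
            for cell_id in relationship["Ids"]:
                cell = blocks_by_id[cell_id]
                if cell["BlockType"] == "CELL":
                    cells[(cell["RowIndex"], cell["ColumnIndex"])] = cell_text(cell, blocks_by_id)
        if not cells:
            tables.append([])
            continue
        row_count = max(row for row, _ in cells)
        column_count = max(column for _, column in cells)
        tables.append([
            [cells.get((row, column), "") for column in range(1, column_count + 1)]
            for row in range(1, row_count + 1)
        ])
    return tables


def rows_needing_review(rows, amount_column, limit=1000.0):
    """Return rows whose amount is above limit, skipping any Total row."""
    flagged = []
    for row in rows:
        if any(cell.strip() == "Total" for cell in row):
            continue
        try:
            amount = float(row[amount_column].replace("$", "").replace(",", ""))
        except (IndexError, ValueError):
            continue  # header rows and empty cells are not amounts
        if amount > limit:
            flagged.append(row)
    return flagged


# Build a made-up response for a three-row, two-column table.
sample_cells = [["Laptop", "1500.00"], ["Taxi", "42.10"], ["Total", "1542.10"]]
sample_blocks, cell_ids = [], []
for r, row in enumerate(sample_cells, start=1):
    for c, text in enumerate(row, start=1):
        word_id, cell_id = "w-{}-{}".format(r, c), "c-{}-{}".format(r, c)
        sample_blocks.append({"BlockType": "WORD", "Id": word_id, "Text": text})
        sample_blocks.append({"BlockType": "CELL", "Id": cell_id, "RowIndex": r, "ColumnIndex": c,
                              "Relationships": [{"Type": "CHILD", "Ids": [word_id]}]})
        cell_ids.append(cell_id)
sample_blocks.append({"BlockType": "TABLE", "Id": "t-1",
                      "Relationships": [{"Type": "CHILD", "Ids": cell_ids}]})
sample_response = {"Blocks": sample_blocks}

rows = table_rows(sample_response)[0]
print(rows_needing_review(rows, amount_column=1))
```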

Multi-column detection and reading order

Traditional OCR solutions read left to right and don’t detect multiple columns, so they may generate incorrect reading order for multi-column documents. In addition to detecting text, Amazon Textract provides additional geometry information that you can use to detect multiple columns and print the text in reading order.

The following image is a two-column document. Similar to the earlier example, the image isn’t good quality, but Amazon Textract still performs well.

The following example code processes the document with Amazon Textract and takes advantage of geometry information to print the text in reading order:

import boto3
# Document
s3BucketName = "amazon-textract-public-content"
documentName = "blogs/two-column-image.jpg"

# Amazon Textract client
textract = boto3.client('textract')

# Call Amazon Textract
response = textract.detect_document_text(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    })

#print(response)

# Detect columns and print lines
columns = []
lines = []
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        bbox = item["Geometry"]["BoundingBox"]
        bbox_left = bbox["Left"]
        bbox_right = bbox["Left"] + bbox["Width"]
        bbox_centre = bbox["Left"] + bbox["Width"] / 2
        column_found = False
        for index, column in enumerate(columns):
            # Midpoint of the column's horizontal extent
            column_centre = (column['left'] + column['right']) / 2

            if (column['left'] < bbox_centre < column['right']) or (bbox_left < column_centre < bbox_right):
                # Bbox appears inside the column
                lines.append([index, item["Text"]])
                column_found = True
                break
        if not column_found:
            # Start a new column from this line's horizontal extent
            columns.append({'left': bbox_left, 'right': bbox_right})
            lines.append([len(columns) - 1, item["Text"]])

# The sort is stable, so lines keep the order Amazon Textract returned them in within each column
lines.sort(key=lambda x: x[0])
for line in lines:
    print(line[1])

The following image shows the output of the detected text in the correct reading order.

Natural language processing and document classification

Customer emails, support tickets, product reviews, social media, even advertising copy all represent insights into customer sentiment that can be put to work for your business. A lot of such content contains images or scanned versions of documents. After text is extracted from these documents, you can use Amazon Comprehend to detect sentiment, entities, key phrases, syntax, and topics. You can also train Amazon Comprehend to detect custom entities based on your business domain. You can then use these insights to classify documents, automate business process workflows, and ensure compliance.

The following example code processes the first image sample we used earlier with Amazon Textract to extract text and then uses Amazon Comprehend to detect sentiment and entities:

import boto3

# Document
s3BucketName = "amazon-textract-public-content"
documentName = "blogs/simple-document-image.jpg"

# Amazon Textract client
textract = boto3.client('textract')

# Call Amazon Textract
response = textract.detect_document_text(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    })

#print(response)

# Print text
print("\nText\n========")
text = ""
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print ('\033[94m' +  item["Text"] + '\033[0m')
        text = text + " " + item["Text"]

# Amazon Comprehend client
comprehend = boto3.client('comprehend')

# Detect sentiment
sentiment =  comprehend.detect_sentiment(LanguageCode="en", Text=text)
print ("\nSentiment\n========\n{}".format(sentiment.get('Sentiment')))

# Detect entities
entities =  comprehend.detect_entities(LanguageCode="en", Text=text)
print("\nEntities\n========")
for entity in entities["Entities"]:
    print ("{}\t=>\t{}".format(entity["Type"], entity["Text"]))

The following image shows the output text along with the text analysis from Amazon Comprehend. It found the sentiment to be neutral and detected “Amazon” as an organization, “Seattle, WA” as a location, and “July 5th, 1994” as a date, along with other entities.
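
As a small sketch of how these results could feed a classification or routing step, the following groups detected entities by type and drops low-confidence ones. The function name group_entities, the 0.8 cutoff, and the scores in the sample are assumptions; the field names (Entities, Type, Text, Score) follow the Amazon Comprehend response.

```python
from collections import defaultdict


# Hypothetical helper: group entity texts by type, ignoring low scores.
def group_entities(entities_response, min_score=0.8):
    """Return {entity type: [texts]} for entities at or above min_score."""
    grouped = defaultdict(list)
    for entity in entities_response.get("Entities", []):
        if entity.get("Score", 0.0) >= min_score:
            grouped[entity["Type"]].append(entity["Text"])
    return dict(grouped)


# Sample shaped like a detect_entities response; the scores are made up.
sample_entities = {
    "Entities": [
        {"Type": "ORGANIZATION", "Text": "Amazon", "Score": 0.99},
        {"Type": "LOCATION", "Text": "Seattle, WA", "Score": 0.98},
        {"Type": "DATE", "Text": "July 5th, 1994", "Score": 0.97},
        {"Type": "QUANTITY", "Text": "one", "Score": 0.42},
    ]
}

for entity_type, texts in group_entities(sample_entities).items():
    print("{}: {}".format(entity_type, ", ".join(texts)))
```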

Natural language processing for medical documents

An important way to improve patient care and accelerate clinical research is by understanding and analyzing the insights and relationships that are “trapped” in free-form medical text. These can include hospital admission notes and patient medical history.

In this example, we use Amazon Textract to extract text from the following document. We then use Amazon Comprehend Medical to extract medical entities, such as medical condition, medication, dosage, strength, and protected health information (PHI).

The following example code detects different medical entities:

import boto3

# Document
s3BucketName = "amazon-textract-public-content"
documentName = "blogs/medical-notes.png"

# Amazon Textract client
textract = boto3.client('textract')

# Call Amazon Textract
response = textract.detect_document_text(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    })

#print(response)

# Print text
print("\nText\n========")
text = ""
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print ('\033[94m' +  item["Text"] + '\033[0m')
        text = text + " " + item["Text"]

# Amazon Comprehend Medical client
comprehend = boto3.client('comprehendmedical')

# Detect medical entities
entities =  comprehend.detect_entities(Text=text)
print("\nMedical Entities\n========")
for entity in entities["Entities"]:
    print("- {}".format(entity["Text"]))
    print ("   Type: {}".format(entity["Type"]))
    print ("   Category: {}".format(entity["Category"]))
    if(entity["Traits"]):
        print("   Traits:")
        for trait in entity["Traits"]:
            print ("    - {}".format(trait["Name"]))
    print("\n")

The following image and text block show the output of the detected text with information categorized by type. It detected “40yo” as the age, with the category Protected Health Information. It also detected different medical conditions, including sleeping trouble, rash, boggy inferior turbinates, and erythematous eruption. It recognized different medications and anatomy information.

Medical Entities
========
- 40yo
   Type: AGE
   Category: PROTECTED_HEALTH_INFORMATION
- Sleeping trouble
   Type: DX_NAME
   Category: MEDICAL_CONDITION
   Traits:
    - SYMPTOM
- Clonidine
   Type: GENERIC_NAME
   Category: MEDICATION
- Rash
   Type: DX_NAME
   Category: MEDICAL_CONDITION
   Traits:
    - SYMPTOM
- face
   Type: SYSTEM_ORGAN_SITE
   Category: ANATOMY
- leg
   Type: SYSTEM_ORGAN_SITE
   Category: ANATOMY
- Vyvanse
   Type: BRAND_NAME
   Category: MEDICATION
- Clonidine
   Type: GENERIC_NAME
   Category: MEDICATION
- HEENT
   Type: SYSTEM_ORGAN_SITE
   Category: ANATOMY
- Boggy inferior turbinates
   Type: DX_NAME
   Category: MEDICAL_CONDITION
   Traits:
    - SIGN
- inferior
   Type: DIRECTION
   Category: ANATOMY
- turbinates
   Type: SYSTEM_ORGAN_SITE
   Category: ANATOMY
- oropharyngeal lesion
   Type: DX_NAME
   Category: MEDICAL_CONDITION
   Traits:
    - SIGN
    - NEGATION
- Lungs
   Type: SYSTEM_ORGAN_SITE
   Category: ANATOMY
- clear Heart
   Type: DX_NAME
   Category: MEDICAL_CONDITION
   Traits:
    - SIGN
- Heart
   Type: SYSTEM_ORGAN_SITE
   Category: ANATOMY
- Regular rhythm
   Type: DX_NAME
   Category: MEDICAL_CONDITION
   Traits:
    - SIGN
- Skin
   Type: SYSTEM_ORGAN_SITE
   Category: ANATOMY
- erythematous eruption
   Type: DX_NAME
   Category: MEDICAL_CONDITION
   Traits:
    - SIGN
- hairline
   Type: SYSTEM_ORGAN_SITE
   Category: ANATOMY

Document translation

Many organizations localize content for international users, such as websites and applications. They must translate large volumes of documents efficiently. You can use Amazon Textract with Amazon Translate to extract text and data and then translate them into other languages.

The following code example shows translating the text in the first image to German:

import boto3

# Document
s3BucketName = "amazon-textract-public-content"
documentName = "blogs/simple-document-image.jpg"

# Amazon Textract client
textract = boto3.client('textract')

# Call Amazon Textract
response = textract.detect_document_text(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    })

#print(response)

# Amazon Translate client
translate = boto3.client('translate')

print ('')
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print ('\033[94m' +  item["Text"] + '\033[0m')
        result = translate.translate_text(Text=item["Text"], SourceLanguageCode="en", TargetLanguageCode="de")
        print ('\033[92m' + result.get('TranslatedText') + '\033[0m')
    print ('')

The following image shows the output of the detected text, translated to German line by line.
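
The preceding example makes one translation request per line. As a hedged sketch of an alternative, the following groups lines into larger batches that stay under a byte budget, so that each batch could be sent in a single request. The function name batch_lines and the default max_bytes value are assumptions; check the Amazon Translate quotas for the actual request size limit.

```python
# Hypothetical helper: pack lines into newline-joined batches under a byte budget.
def batch_lines(lines, max_bytes=5000):
    """Group lines so each joined batch is at most max_bytes of UTF-8.

    A single line longer than max_bytes still becomes its own batch.
    """
    batches = []
    current = []
    current_size = 0
    for line in lines:
        size = len(line.encode("utf-8")) + 1  # +1 for the joining newline
        if current and current_size + size > max_bytes:
            batches.append("\n".join(current))
            current, current_size = [], 0
        current.append(line)
        current_size += size
    if current:
        batches.append("\n".join(current))
    return batches


sample_lines = ["aaaa", "bbbb", "cc"]
print(batch_lines(sample_lines, max_bytes=10))
```

Each batch could then be passed as the Text argument of translate_text, and the result split on newlines to recover per-line translations. Whether line breaks survive translation unchanged is something to verify for your language pair.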

Search and discovery

Extracting structured data from documents and creating a smart index using Amazon Elasticsearch Service (Amazon ES) allows you to search through millions of documents quickly. For example, a mortgage company could use Amazon Textract to process millions of scanned loan applications in a matter of hours and have the extracted data indexed in Amazon ES. This would allow them to create search experiences like searching for loan applications where the applicant name is John Doe, or searching for contracts where the interest rate is 2%.

The following code example extracts text from the first image and stores it in Amazon ES, where you can search it using Kibana:

import boto3
from elasticsearch import Elasticsearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth

def indexDocument(bucketName, objectName, text):

    # Update host with the endpoint of your Elasticsearch cluster
    host = "searchxxxxxxxxxxxxxxxx.us-east-1.es.amazonaws.com"
    region = 'us-east-1'

    if(text):
        service = 'es'
        ss = boto3.Session()
        credentials = ss.get_credentials()
        region = ss.region_name

        awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, region, service, session_token=credentials.token)

        es = Elasticsearch(
            hosts = [{'host': host, 'port': 443}],
            http_auth = awsauth,
            use_ssl = True,
            verify_certs = True,
            connection_class = RequestsHttpConnection
        )

        document = {
            "name": "{}".format(objectName),
            "bucket" : "{}".format(bucketName),
            "content" : text
        }

        es.index(index="textract", doc_type="document", id=objectName, body=document)

        print("Indexed document: {}".format(objectName))

# Document
s3BucketName = "amazon-textract-public-content"
documentName = "blogs/simple-document-image.jpg"

# Amazon Textract client
textract = boto3.client('textract')

# Call Amazon Textract
response = textract.detect_document_text(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    })

#print(response)

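# Print detected text and collect it for indexing
text = ""
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print ('\033[94m' +  item["Text"] + '\033[0m')
        # Separate lines with a space so words at line boundaries don't run together
        text = text + " " + item["Text"]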

indexDocument(s3BucketName, documentName, text)

# You can view index documents in Kibana Dashboard

The following image shows the output of extracted text in Kibana search results.

You can also build a custom UI experience by taking advantage of the Amazon ES APIs. Earlier in the post, you learned how to extract forms and tables; you can index that structured data similarly to enable smart search.
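
One way to index structured data is to add the extracted form fields to the document body next to the full text. The following is a minimal sketch under stated assumptions: the helper names, the fields property, and the way labels are normalized are choices made for this example, not an Amazon ES requirement.

```python
import re


# Hypothetical helper: turn a form label such as "Full Name:" into "full_name".
def normalize_field_name(key):
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", key).strip("_")
    return cleaned.lower()


# Hypothetical helper: build the body to index, with form fields as properties.
def build_index_document(bucket_name, object_name, text, form_fields):
    fields = {}
    for key, value in form_fields.items():
        name = normalize_field_name(key)
        if name:  # skip labels that contain no letters or digits
            fields[name] = value
    return {
        "name": object_name,
        "bucket": bucket_name,
        "content": text,
        "fields": fields,
    }


# Made-up form fields for illustration.
sample_document = build_index_document(
    "amazon-textract-public-content",
    "blogs/employeeapp20210510.png",
    "Employment Application ...",
    {"Full Name:": "Jane Doe", "Phone Number (home)": "555-0100"},
)
print(sample_document["fields"])
```

With a body like this, a search can target a specific field (for example, fields.full_name) instead of matching anywhere in the content.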

Compliance control with document redaction

Amazon Textract identifies data types and form labels automatically, and AWS helps secure the underlying infrastructure, so that you can maintain compliance with information controls. For example, an insurer could use Amazon Textract to feed a workflow that automatically redacts personally identifiable information (PII) for review before archiving claim forms. Amazon Textract returns the key, value, and location of each field, so your code can select the fields that require protection.

The following code example extracts all the form fields in the employment application used earlier and redacts all the address fields:

import boto3
from trp import Document
from PIL import Image, ImageDraw

# Document
s3BucketName = "amazon-textract-public-content"
documentName = "blogs/employeeapp20210510.png"

# Amazon Textract client
textract = boto3.client('textract')

# Call Amazon Textract
response = textract.analyze_document(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    },
    FeatureTypes=["FORMS"])

doc = Document(response)

# Redact document
# Image.open expects a local copy of the image at this relative path
img = Image.open(documentName)

width, height = img.size

if(doc.pages):
    page = doc.pages[0]
    for field in page.form.fields:
        if(field.key and field.value and "address" in field.key.text.lower()):
        #if(field.key and field.value):
            print("Redacting => Key: {}, Value: {}".format(field.key.text, field.value.text))
            
            x1 = field.value.geometry.boundingBox.left*width
            y1 = field.value.geometry.boundingBox.top*height-2
            x2 = x1 + (field.value.geometry.boundingBox.width*width)+5
            y2 = y1 + (field.value.geometry.boundingBox.height*height)+2

            draw = ImageDraw.Draw(img)
            draw.rectangle([x1, y1, x2, y2], fill="Black")

# Save in the current directory, using only the file name from the S3 key
img.save("redacted-{}".format(documentName.split("/")[-1]))

The following output is the redacted version of the employment application.
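
The coordinate arithmetic in the preceding example can be isolated into small helpers. The following sketch works on the raw BoundingBox dictionary from the JSON response (keys Left, Top, Width, Height, all relative to the page size) and clamps the rectangle to the image. The function names, the padding value, and the list of sensitive terms are assumptions for illustration.

```python
import math


# Hypothetical helper: decide whether a form key names a field to redact.
def should_redact(key_text, sensitive_terms=("address",)):
    key = key_text.lower()
    return any(term in key for term in sensitive_terms)


# Hypothetical helper: convert a relative bounding box to a padded pixel rectangle.
def to_pixel_box(bounding_box, image_width, image_height, padding=2):
    """Return (left, top, right, bottom) in pixels, clamped to the image."""
    left = math.floor(bounding_box["Left"] * image_width) - padding
    top = math.floor(bounding_box["Top"] * image_height) - padding
    right = math.ceil((bounding_box["Left"] + bounding_box["Width"]) * image_width) + padding
    bottom = math.ceil((bounding_box["Top"] + bounding_box["Height"]) * image_height) + padding
    return (max(0, left), max(0, top), min(image_width, right), min(image_height, bottom))


# Made-up bounding box on a 1000 x 800 pixel image.
sample_box = {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.125}
print(should_redact("Mailing Address:"), to_pixel_box(sample_box, 1000, 800))
```

The resulting tuple can be passed to ImageDraw's rectangle method in place of the hand-computed coordinates.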

PDF document processing (asynchronous API operations)

For the earlier examples, you used images with the synchronous API operations. Now we process PDF files using the asynchronous API operations.

With the amazon-textract command line tool, you can pass in a PDF (the location for the PDF has to be on Amazon S3) and the underlying implementation calls the asynchronous API for StartDocumentTextDetection or StartDocumentAnalysis to start an Amazon Textract job:

amazon-textract --input-document "s3://amazon-textract-public-content/blogs/Amazon-Textract-Pdf.pdf" --pretty-print LINES

The following screenshot shows our output.

When you use the asynchronous API from a Python program or the Python Interpreter, it looks like the following code:

from textractcaller.t_call import call_textract
from textractprettyprinter.t_pretty_print import get_lines_string

response = call_textract(input_document="s3://amazon-textract-public-content/blogs/Amazon-Textract-Pdf.pdf")
print(get_lines_string(response))

We get the following output.

First, StartDocumentTextDetection or StartDocumentAnalysis is called to start an Amazon Textract job. When the job finishes, Amazon Textract publishes its completion status to an Amazon Simple Notification Service (Amazon SNS) topic, if you specified one when starting the job. You can then use GetDocumentTextDetection or GetDocumentAnalysis to get the results from Amazon Textract.
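
For illustration, the following sketch retrieves the results of a text detection job with a simple polling loop, which stands in for the Amazon SNS notification, and follows NextToken to collect every page of results. The function name, the polling interval, and the retry limit are assumptions; the client is passed in as an argument, so any object with a get_document_text_detection method works.

```python
import time


# Hypothetical helper: poll a text detection job and collect all result blocks.
def wait_for_text_detection(client, job_id, poll_seconds=5, max_polls=120, sleep=time.sleep):
    """Poll until the job leaves IN_PROGRESS, then page through the results."""
    for _ in range(max_polls):
        response = client.get_document_text_detection(JobId=job_id)
        status = response["JobStatus"]
        if status != "IN_PROGRESS":
            break
        sleep(poll_seconds)
    else:
        raise TimeoutError("Job {} did not finish in time".format(job_id))

    if status not in ("SUCCEEDED", "PARTIAL_SUCCESS"):
        raise RuntimeError("Job {} ended with status {}".format(job_id, status))

    # Results can span several responses; follow NextToken until it is absent.
    blocks = list(response.get("Blocks", []))
    next_token = response.get("NextToken")
    while next_token:
        response = client.get_document_text_detection(JobId=job_id, NextToken=next_token)
        blocks.extend(response.get("Blocks", []))
        next_token = response.get("NextToken")
    return blocks


# Example usage (requires AWS credentials, so it is left commented out):
# import boto3
# textract = boto3.client("textract")
# job = textract.start_document_text_detection(
#     DocumentLocation={"S3Object": {"Bucket": "amazon-textract-public-content",
#                                    "Name": "blogs/Amazon-Textract-Pdf.pdf"}})
# blocks = wait_for_text_detection(textract, job["JobId"])
```

For production workloads, reacting to the Amazon SNS notification avoids repeated status requests; polling is simply the shortest way to show the sequence of calls.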

Conclusion

In this post, we showed you how to use Amazon Textract to automatically extract text and data from scanned documents without any ML experience. We covered use cases in fields such as finance, healthcare, and HR, but there are many other opportunities in which the ability to unlock text and data from unstructured documents could be useful.

You can start using Amazon Textract in the Regions US East (Ohio), US East (Northern Virginia), US West (N. California), US West (Oregon), Asia Pacific (Mumbai), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Canada (Central), EU (Frankfurt), EU (Ireland), EU (London), EU (Paris), AWS GovCloud (US-East), and AWS GovCloud (US-West).

To learn more about Amazon Textract, read about processing single-page and multipage documents, working with block objects, and code samples.


Kashif Imran is a Solutions Architect at Amazon Web Services. He works with some of the largest strategic AWS customers to provide technical guidance and design advice. His expertise spans application architecture, serverless, containers, NoSQL and machine learning.

Martin Schade is a Senior ML Product SA with the Amazon Textract team. He has 20+ years of experience with internet-related technologies, engineering, and architecting solutions. He joined AWS in 2014, first guiding some of the largest AWS customers on the most efficient and scalable use of AWS services, and later focused on AI/ML with an emphasis on computer vision. At the moment, he is obsessed with extracting information from documents.