AWS Machine Learning Blog

Building an NLP-powered search index with Amazon Textract and Amazon Comprehend

September 8, 2021: Amazon Elasticsearch Service has been renamed to Amazon OpenSearch Service. See details.

Organizations in all industries have a large number of physical documents. It can be difficult to extract text from a scanned document when it contains formats such as tables, forms, paragraphs, and check boxes. Organizations have been addressing these problems with Optical Character Recognition (OCR) technology, but it requires templates for form extraction and custom workflows.

Extracting and analyzing text from images or PDFs is a classic machine learning (ML) and natural language processing (NLP) problem. When extracting the content from a document, you want to maintain the overall context and store the information in a readable and searchable format. Creating a sophisticated algorithm requires a large amount of training data and compute resources. Building and training a perfect machine learning model could be expensive and time-consuming.

This blog post walks you through creating an NLP-powered search index with Amazon Textract and Amazon Comprehend as an automated content-processing pipeline for storing and analyzing scanned image documents. For pdf document processing, please refer AWS Sample github repository to use Textractor.

This solution uses serverless technologies and managed services to be scalable and cost-effective. The services used in this solution include:

Architecture

  1. Users upload OCR image for analysis to Amazon S3.
  2. Amazon S3 upload triggers AWS Lambda.
  3. AWS Lambda invokes Amazon Textract to extract text from image.
  4. AWS Lambda sends the extracted text from image to Amazon Comprehend for entity and key phrase extraction.
  5. This data is indexed and loaded into Amazon Elasticsearch.
  6. Kibana gets indexed data.
  7. Users log into Amazon Cognito.
  8. Amazon Cognito authenticates to Kibana to search documents.

Deploying the architecture with AWS CloudFormation

The first step is to use an AWS CloudFormation template to provision the necessary IAM role and AWS Lambda function to interact with the Amazon S3, AWS Lambda, Amazon Textract, and Amazon Comprehend APIs.

  1. Launch the AWS CloudFormation template in the US-East-1 (Northern Virginia) Region:
  2. You will see the below information on the Create stack screen:
    Stack name: document-search
    CognitoAdminEmail: abc@amazon.com
    DOMAINNAME: documentsearchapp.Edit the CognitoAdminEmail with your email address. You will receive your temporary Kibana credentials in an email.
  3. Scroll down to Capabilities and check both the boxes to provide acknowledgement that AWS CloudFormation will create IAM resources. For more information, see AWS IAM resources.
  4. Scroll down to Transforms and choose Create Change Set.The AWS CloudFormation template uses AWS SAM, which simplifies how to define functions and APIs for serverless applications, as well as features for these services like environment variables. When deploying AWS SAM templates in an AWS CloudFormation template, you need to perform a transform step to convert the AWS SAM template.
  5. Wait a few seconds for the change set to finish computing changes. Your screen should look as follows with Action, Logical Id, Physical Id, Resource Type, and Replacement. Finally, click on the Execute button, which will let AWS CloudFormation launch resources in the background.
  6. The following screenshot of the Stack Detail page shows the Status of the CloudFormation stack as CREATE_IN_PROGRESS. Wait up to 20 minutes for the Status to change to CREATE_COMPLETE. In Outputs, copy the value of S3KeyPhraseBucket and KibanaLoginURL.

Uploading documents to the S3 bucket

To upload your documents to your newly created S3 bucket in the above step, complete the following:

  1. Click on the Amazon S3 Bucket URL you copied from the CloudFormation Output.
  2. Download the example dataset demo-data.zip from the GitHub repo. This dataset contains a variety of images that include forms, a scanned page with paragraphs, and a two-column document.
  3. Unzip the data.
  4. Upload the files in the demo-data folder to the Amazon S3 bucket starting with document-search-blog-s3-<Random string>.

For more information, see How Do I Upload Files and Folders to an S3 Bucket?

After the upload completes, you can see four image files: Employment_application.JPG, expense.png, simple-document-image.jpg, and two-column-image.jpg in the S3 bucket.

Uploading the data to S3 triggers a Lambda S3 event notification to invoke a Lambda function. You can find event triggers configured in your Amazon S3 bucket properties under Advanced settings-> Events. You will see a Lambda function starting with document-search-blog-ComprehendKeyPhraseAnalysis-<Random string>. This Lambda function performs the following:

  • Extracts text from images using Amazon Textract.
  • Performs key phrase extraction using Amazon Comprehend.
  • Searches text using Amazon ES.

The following code example extracts text from images using Amazon Textract:

textract = boto3.client(

service_name='textract',

region_name=region)

#Main lambda handler
def handler(event, context):
    # Get the object from the lambda event and show its content type
       bucket = event['Records'][0]['s3']['bucket']['name']
       key = unquote_plus(event['Records'][0]['s3']['object']['key'])
       s3.Bucket(bucket).download_file(Key=key,Filename='/tmp/{}')
# Read document content and extract text using Textract
with open('/tmp/{}', 'rb') as document:
    imageBytes = bytearray(document.read())
response=textract.analyze_document(Document={'Bytes':imageBytes},FeatureTypes=["TABLES", "FORMS"])

The following code example extracts key phrases using Amazon Comprehend:

#Initializing comprehend
comprehend = boto3.client(service_name='comprehend', region_name=region)
#Detect Keyphrases and Entities
keyphrase_response = comprehend.detect_key_phrases(Text=text, LanguageCode='en')
detect_entity= comprehend.detect_entities(Text=text, LanguageCode='en')

You can index the response received from Amazon Textract and Amazon Comprehend and load it into Amazon ES to create an NLP-powered search index. Refer to the below code:

 #Connection to Elasticsearch
         
        es=connectES()
#Saving Index to Elastocsearch endpoint in primary lambda handler    

        es.index(index="document", doc_type="_doc", body=searchdata)

For more information, see the GitHub repo.

Visualizing and searching documents with Kibana

To visualize and search documents using Kibana, perform the following steps.

  1. Find the email in your inbox with the subject line “Your temporary password.” Check your junk folder if you don’t see an email in your inbox.
  2. Go to the Kibana login URL copied from the AWS CloudFormation output.
  3. Login using your email address as the Username and the temporary password from the confirmation email as the Password.  Click on Sign In.

    Note: If you didn’t receive the email or missed it while deploying AWS CloudFormation, choose Sign up.
  4. On the next page, enter a new password.
  5. On the Kibana landing page, from the menu, choose Discover.
  6. On the Management/Kibana page, you will see Step 1 of 2: Define index pattern, for Index pattern, enter document*. The message “Success! Your index pattern matches 1 index.” appears. Choose Next step. Click Create Index Pattern.
    After a few seconds, you can see the document index page.
  7. Choose Discover from the menu again. In the below screenshot you can see the document attributes.
    s3link:s3 location of uploaded documents in S3, 
    KeyPhrases: Key phrases from the documents uploaded in S3, 
    Entity: it can be DATE, PERSON, ORGANIZATION etc, 
    text: raw text from documents, 
    table:tables extracted from documents, and forms:form extracted from documents 

  8. To view specific fields for each entry, hover over the field in the left sidebar and click Add.

This moves the fields to the Selected Fields menu. Your Kibana dashboard formulates data in an easy-to-read format. The following screenshot shows the view after adding Entity.DATE, Entity.Location, Entity.PERSON, S3link and text in the Selected Fields menu:

To look at your original document, choose s3link.

Note: You can also add forms and table to view and search tables and forms.

Conclusion

This post demonstrated how to extract and process data from an image document and visualize it to create actionable insights.

Processing scanned image documents helps you uncover large amounts of data, which can provide new business prospects. With managed ML services like Amazon Textract and Amazon Comprehend, you can gain insights into your previously undiscovered data. For example, you could build a custom application to get a text from a scanned legal document, purchase receipts, and purchase orders.

Data at scale is relevant in every industry. Whether you process data from images or PDFs, AWS can simplify your data ingestion and analysis while keeping your overall IT costs manageable.

If this blog post helps you or inspires you to solve a problem, we would love to hear about it! The code for this solution is available on the GitHub repo for you to use and extend. Contributions are always welcome!

 


About the authors

Saurabh Shrivastava is a partner solutions architect and big data specialist working with global systems integrators. He works with AWS partners and customers to provide them with architectural guidance for building scalable architecture in hybrid and AWS environments. He enjoys spending time with his family, outdoor activity, and traveling to new destinations to discover new cultures.

 

 

 

Mona Mona is an AI/ML specialist solutions architect working with the AWS Public Sector Team. She works with AWS customers to help them with the adoption of Machine Learning on a large scale. She enjoys painting and cooking in her free time.