AWS for Industries

Intelligent processing of energy industry PDF reports with Amazon Textract

Customers in the energy and utilities industry process PDF documents in a number of use cases. Here are two examples we have worked on:

  • Example 1 – In upstream and midstream, service companies such as SPL Inspection Services generate PDF reports during meter inspections. They send these reports to customers as email attachments. Customers may want to digitally extract data points from these reports and build business analytics over time. We will use a sample SPL report to show how this can be done.
  • Example 2 – In commodity and energy trading, counterparties generate PDF reports for contract confirmations and trade validations, then send those reports via emails. Once customers receive these emails and attachments, they sometimes enter a 30- to 50-character long alphanumeric confirmation code from the PDF to initiate the trade validation process. Data entry errors can cause delay and complexity.

Today, a sample customer workflow may be a variation of the steps below:

  1. External vendors, counterparties, or customers send PDF attachments in emails.
  2. Someone within the customer’s team manually downloads PDF attachments.
  3. These PDF attachments are manually uploaded to a central place for cross-team collaboration and record trail and storage. This central place could be a repository or a file share folder.
  4. Relevant teams access these reports and read and analyze them.
  5. Information is manually entered into other digital systems, which could be databases, for further processing, such as analytics reporting or business transactions.

In this blog, we show an architecture for intelligent processing of PDF reports using Amazon Textract. With Amazon Textract, customers can extract text, data points, location, grade, or other important information from PDF documents. Key customer benefits of our architecture include automating the manual workflow steps above, building analytics capabilities, and transforming business processes. It also eliminates the need to use cumbersome PDF parsing libraries and handle their exceptions.

Time to read 7 minutes
Learning level Advanced (300)
Services used Amazon Textract, Amazon S3, Amazon SES, Amazon Route 53, AWS Lambda, Amazon DynamoDB, Amazon Athena, Amazon QuickSight

Overview of solution

Architecture Diagram 1 shows three capabilities in this solution:

  1. Automatically extract PDF attachments in emails and store PDFs in Amazon S3.
  2. Extract information from PDF reports using Amazon Textract and store it in Amazon DynamoDB.
  3. As new data arrives, generate analytics reports with Amazon Athena and Amazon QuickSight.

Amazon Textract can extract all or a subset of values from reports. These values are stored in a key-value NoSQL database called Amazon DynamoDB. Customers can build a dashboard in Amazon QuickSight and visualize these values for trending and anomalies.

Amazon Textract visualization

Diagram 1 – Reference architecture: Store and visualize information from PDFs using Amazon Textract, Amazon DynamoDB and Amazon QuickSight

Capabilities in this solution can be decoupled, meaning customers can pick and choose the parts most relevant to them. For example, we talked to customers who were familiar with Amazon Textract but were interested in the capability to process emails and attachments. Other customers already have analytics pipelines and reporting tools but need a way to digitally extract information from documents. Customers have the flexibility to mix, match, and enhance this solution.

Walkthrough

Step 1. Process email attachments

An email receiving pipeline, shown in Diagram 2, can be implemented to automate email handling. Customers can set up a custom domain using Amazon SES and Amazon Route 53; please refer to the implementation guide for more details. Emails will land in a customer-owned Amazon Simple Storage Service (Amazon S3) bucket. We highly recommend customers encrypt data at rest. One way to do this is to encrypt the S3 bucket using AWS Key Management Service (AWS KMS) with customer managed keys.

aws lambda reference architecture

Diagram 2 – Reference architecture: Set Up an Email-Receiving Pipeline

AWS Lambda then processes the email as a JSON message and stores the PDF document attachment in the S3 bucket. Please see the following sample code.

import json
import boto3
import email
import os

def lambda_handler(event, context):
    # Initiate boto3 client
    s3 = boto3.client('s3')
    # Get the S3 object contents
    objectData = s3.get_object(Bucket='<bucket_name>', Key='<item_name>')
    emailContent = objectData['Body'].read().decode('utf-8')
    # The S3 object content is the raw SES email; get the message content and
    # attachment using the email package. The code below extracts one attachment.
    message = email.message_from_string(emailContent)
    try:
        attachment = message.get_payload()[1]
        # Write the attachment to a temporary location
        with open('/tmp/<file>.pdf', 'wb') as f:
            f.write(attachment.get_payload(decode=True))
        # Upload the file at the temporary location to the destination S3 bucket
        try:
            s3.upload_file('/tmp/<file>.pdf', '<bucket_name>', '<key_name>')
        except FileNotFoundError:
            print('<failure message>')
        # Clean up the file from the temporary location
        os.remove('/tmp/<file>.pdf')
        return {
            'statusCode': 200,
            'body': json.dumps('<success message>')
        }
    except Exception as e:
        # Handle the exception, for example log and re-raise
        print(f'Failed to process attachment: {e}')
        raise
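The attachment-extraction logic above can be exercised locally without AWS. The sketch below is a minimal, hypothetical example (the file name and payload bytes are made up): it builds a MIME message with a PDF attachment using the standard library, then extracts the attachment bytes the same way the Lambda function does with the S3 object body.

```python
import email
from email.message import EmailMessage

# Build a sample email with a PDF attachment (payload bytes are illustrative)
msg = EmailMessage()
msg['Subject'] = 'Calibration report'
msg.set_content('Report attached.')
pdf_bytes = b'%PDF-1.4 sample'
msg.add_attachment(pdf_bytes, maintype='application', subtype='pdf',
                   filename='report.pdf')

# Round-trip through a string, as the Lambda does with the S3 object content
parsed = email.message_from_string(msg.as_string())
attachment = parsed.get_payload()[1]  # index 0 is the body, index 1 the attachment
extracted = attachment.get_payload(decode=True)
print(attachment.get_filename(), extracted == pdf_bytes)
```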

Step 2. Extract information from PDF reports

There are two ways for the solution to start processing PDF documents. One scenario is that customers receive emails with PDF attachments. Emails are processed, and PDF reports are stored in an S3 bucket as described above. The other scenario is that customers already have PDF reports stored somewhere today. They can batch upload these PDF reports to an S3 bucket.

Convert PDF to image
We use a sample report provided by SPL for testing purposes. See the sample report below; the highlighted parts are the example fields and values that are extracted and stored.

calibration report

Sample SPL report – all information and numbers are illustrative only

When PDF attachments are processed from the email and stored in an S3 bucket, an S3 object creation event is generated. This event triggers the AWS Lambda function in Diagram 1. In the Lambda function, as shown in the sample code below, the pypdfium2 library takes the PDF document and converts each page into an image, which is stored in a temporary location. Lambda then calls synchronous Amazon Textract APIs to process the image and return the values. For more on synchronous and asynchronous use cases and patterns, see the Conclusion section. Converting the PDF to an image also handles edge cases such as extracting 30- to 50-character long alphanumeric strings. If customers have PDFs with multiple pages, the Lambda function in the solution will need to loop through all pages until completion.

from trp import Document
import pypdfium2 as pdfium
import boto3

s3 = boto3.client('s3')

with pdfium.PdfContext('<pdf_file>') as pdf:
    page_count = pdfium.FPDF_GetPageCount(pdf)
    # Iterate over all pages, render each one, and save it in a temporary location
    for page_index in range(page_count):
        pil_image = pdfium.render_page(pdf, page_index=page_index)
        pil_image.save('/tmp/<page_name>')
        pil_image.close()
        # Upload to S3 for processing via Textract
        s3.upload_file('/tmp/<page_name>', '<bucket_name>', '<key>')

During development, we tested two libraries for converting the PDF to an image: pypdfium2 and poppler. We chose pypdfium2 because of the ease of package deployment. Please refer to the steps to deploy AWS Lambda using .zip file archives with dependencies. We used the Python 3.7 runtime in the Lambda environment; double-check the compatibility of your Python version with the pypdfium2 version being installed in your own implementation. We use the Amazon Textract Response Parser library for ease of parsing the response returned by the AnalyzeDocument API.

client = boto3.client('textract')
response = client.analyze_document(
    Document={
        'S3Object': {
            'Bucket': '<bucket_name>',
            'Name': '<document_name>'
        }
    },
    FeatureTypes=['FORMS', 'TABLES'])
docData = Document(response)

Extract data using Amazon Textract and write to DynamoDB
After converting the PDF into JPEG, the Lambda function processes the JPEG file(s) using Amazon Textract by calling the AnalyzeDocument API. In the sample code above, we use both the “FORMS” and “TABLES” FeatureTypes. The “FORMS” FeatureType extracts information from the highlighted fields of the sample SPL report. The “TABLES” FeatureType can be used to extract checkboxes and values in tables. For checkboxes, Amazon Textract has a selection elements feature. For the values in a table, detected tables are returned as Block objects in the response. Refer to the documentation for more details.

The API call returns a JSON key-value message, which is written to an Amazon DynamoDB table using the PutItem API. Compare the item in DynamoDB with the fields highlighted in the report.
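To illustrate the shape of this data, the sketch below walks a hand-built fragment of an AnalyzeDocument FORMS response (the field name, value, and `<report_id>` placeholder are illustrative; a real response also carries geometry, confidence scores, and many more blocks) to recover a key-value pair and format it as a DynamoDB PutItem attribute map. The Textract Response Parser used in the solution performs this traversal for you; the sketch only makes the block relationships visible.

```python
# Minimal, hand-built fragment of an AnalyzeDocument FORMS response
response = {'Blocks': [
    {'Id': 'k1', 'BlockType': 'KEY_VALUE_SET', 'EntityTypes': ['KEY'],
     'Relationships': [{'Type': 'VALUE', 'Ids': ['v1']},
                       {'Type': 'CHILD', 'Ids': ['w1']}]},
    {'Id': 'v1', 'BlockType': 'KEY_VALUE_SET', 'EntityTypes': ['VALUE'],
     'Relationships': [{'Type': 'CHILD', 'Ids': ['w2']}]},
    {'Id': 'w1', 'BlockType': 'WORD', 'Text': 'Serial#'},
    {'Id': 'w2', 'BlockType': 'WORD', 'Text': 'ABC123'},
]}

blocks = {b['Id']: b for b in response['Blocks']}

def text_of(block):
    # Join the WORD children of a KEY or VALUE block into one string
    words = []
    for rel in block.get('Relationships', []):
        if rel['Type'] == 'CHILD':
            words += [blocks[i]['Text'] for i in rel['Ids']]
    return ' '.join(words)

# Collect each KEY block's text together with its linked VALUE block's text
fields = {}
for b in response['Blocks']:
    if b['BlockType'] == 'KEY_VALUE_SET' and 'KEY' in b.get('EntityTypes', []):
        value_ids = [i for r in b.get('Relationships', [])
                     if r['Type'] == 'VALUE' for i in r['Ids']]
        for vid in value_ids:
            fields[text_of(b)] = text_of(blocks[vid])

# DynamoDB PutItem expects typed attribute values; '<report_id>' is a
# placeholder partition key
item = {'report_id': {'S': '<report_id>'},
        **{k: {'S': v} for k, v in fields.items()}}
print(fields)  # {'Serial#': 'ABC123'}
```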

putitem api

We are using Amazon DynamoDB in this solution to handle anticipated variations in PDF formats. Because DynamoDB is schemaless, a new attribute can simply be added to an item if the report template changes or if there is a new value to store.

Example of items in DynamoDB table

Example of items in DynamoDB table – data for illustration only

Human in the loop

There are scenarios where customers may consider augmenting AI with additional human review. One scenario is that customers may define a range of acceptable values for key elements based on their own business process; for example, an inspection of physical assets and temperatures to determine whether they are within manufacturing limits. Another scenario worth noting involves the confidence scores inherent to machine learning technologies. For example, in financial transactions such as energy and commodity trading, customers expect contract numbers and counterparties to be exact. If your use cases have these requirements, customers can consider adding Amazon Augmented AI (Amazon A2I). Customers can create a human review workflow that sends a document to a human reviewer when important key values fall below certain confidence thresholds. After human review, customers can decide to continue as usual; if the values are not as expected, further review or reprocessing is required.
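A minimal sketch of such a threshold check, assuming the extracted fields carry Textract confidence scores (the field names, values, and thresholds below are all hypothetical):

```python
# Hypothetical extracted fields with Textract confidence scores (0-100)
extracted = {
    'Contract Number': {'value': 'CN-48A9X', 'confidence': 82.5},
    'Counterparty':    {'value': 'Acme Energy', 'confidence': 99.1},
}

# Per-field thresholds; fields that must be exact get a high bar
THRESHOLDS = {'Contract Number': 95.0, 'Counterparty': 95.0}

def needs_human_review(fields, thresholds, default=90.0):
    """Return the names of fields whose confidence falls below their threshold."""
    return [name for name, f in fields.items()
            if f['confidence'] < thresholds.get(name, default)]

flagged = needs_human_review(extracted, THRESHOLDS)
print(flagged)  # ['Contract Number']
```

In a full implementation, a non-empty result would start an Amazon A2I human loop for the document instead of just printing the field names.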

Step 3. Data analytics and visualization

In the analytics layer, our solution uses Amazon Athena federated query to query the data stored in DynamoDB directly using SQL; Amazon Athena provides a data connector for DynamoDB. For our sample SPL report, Amazon Textract returns the tables, which we store in DynamoDB as a nested array of key-value pairs per column, for example, “Static Pressure:Found:Test: {[value 1], [value 2], …}”. Additional parsing will be required for different customer use cases. We also considered an alternative architecture that customers can evaluate for their own situations: export the DynamoDB table to S3, then use Athena to build interactive queries on the data in S3. This design could be useful to customers who already have a data lake in S3. For our use case, we preferred to maintain the flexibility of the DynamoDB table and use Athena to interactively update the query for the required data.
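As a sketch of the additional parsing mentioned above, the colon-delimited column keys can be split back into table coordinates before querying. This follows the “Static Pressure:Found:Test” key format from our sample report; the stored values and the output row shape are illustrative assumptions, not the solution's exact schema.

```python
# A stored item attribute: colon-delimited table path mapped to cell values
stored = {'Static Pressure:Found:Test': ['value 1', 'value 2']}

rows = []
for key, values in stored.items():
    # Split the table path: section, row group, and column (illustrative names)
    section, state, column = key.split(':')
    for i, v in enumerate(values, start=1):
        rows.append({'section': section, 'state': state,
                     'column': column, 'run': i, 'value': v})

print(rows[0])
# {'section': 'Static Pressure', 'state': 'Found', 'column': 'Test',
#  'run': 1, 'value': 'value 1'}
```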

For data visualization, we built a dashboard using Amazon QuickSight. Customers can configure QuickSight to use Athena as a data source and, depending on their use case, schedule data refreshes at a chosen interval. With the release of Amazon QuickSight SPICE Incremental Refresh, customers can schedule incremental data refreshes as granular as every 15 minutes. Another feature to consider is Amazon QuickSight Q, which lets customers use natural language queries to build a graph automatically.

quicksight dashboard

Example of QuickSight dashboard – illustration only

Conclusion

In this blog, we presented an intelligent document processing architecture for energy industry use cases using Amazon Textract. We showed capabilities to process PDF document attachments in emails, extract information from PDF, and build analytics reporting.

There are other feature ideas for customers to evaluate in their own specific use cases:

  1. We used synchronous APIs and image conversion in this solution. Be aware of the 10 MB size limit for PDF and JPEG files in synchronous operations. If customers have use cases where the PDF documents have a large number of pages, we recommend testing the asynchronous Amazon Textract APIs, for example the StartDocumentAnalysis API. For these use cases, customers may need to modify the architecture with additional AWS services, such as Amazon SQS for queuing (especially with a large number of documents) and AWS Step Functions to coordinate Lambda function executions. Please see a reference architecture on how you may design for these use cases. Our customer TC Energy also deployed a solution that could be relevant to your use case.
  2. To provide internal audiences, such as business users, with visual access to data, you can evaluate embedding the QuickSight dashboard in your internal web app.
  3. If customers have use cases beyond email attachments, you can develop an internal web app for business users to upload documents without providing backend access.

For additional use cases, customers can check out AWS Intelligent Document Processing solution. If you have other ideas or feedback, please reach out to us or leave a comment. To learn more about how you can transform the core and build the future of your energy business, see AWS Energy.

Amruta Karnik

Amruta is a Senior Solutions Architect with AWS Energy. She enjoys working with customers mapping out their Data Analytics journey. She has development experience building enterprise software applications and integrations. Her hobbies include going scuba diving with her family.

Mu Li

Li is a Senior Manager of Solutions Architecture with AWS Energy. He’s passionate about working with customers to achieve business outcomes using technology. Li has worked with customers to launch the Production Monitoring & Surveillance solution, deploy OpenLink Endur on AWS, and implement AWS-native IoT and Machine Learning workloads. Outside of work, Li enjoys spending time with his family, following Houston sports teams, and reading on business and technology.