AWS Partner Network (APN) Blog

How KNIME Users Can Build Intelligent Workflows By Accessing AWS Services Through Boto3 SDK Integration

By James Yi, Sr. AI/ML Partner Solution Architect – AWS
By Stephen Rauner, Partner Technology Manager – KNIME

For many organizations, one of the biggest changes brought on by the pandemic is the sense of urgency to modernize and reinvent their business workflows to drastically cut costs, increase agility, and improve productivity.

To quickly build intelligent data-driven workflows, organizations need business analysts to work with data scientists and development teams to unlock useful insights from unstructured or semi-structured data. However, these groups often struggle to find common ground for collaboration.

KNIME is an AWS Partner with a qualified software offering that provides a visual programming approach to data science.

KNIME’s end-to-end data science product portfolio helps bridge the gap between the ideation and productionalization steps of data science projects, while also assisting in the communication of key data science aspects between teams.

With KNIME’s graphical user interface, stakeholders and analysts from all sides of the business will be able to collaborate with data science teams to rapidly build and deploy data-driven solutions that integrate with Amazon Web Services (AWS) decision support tools and services.

In a previous AWS blog post, we introduced several direct integrations for AWS services built natively in the KNIME Analytics Platform:

  • The KNIME Database Extension provides a set of nodes allowing end users to connect and interact with databases in Amazon Relational Database Service (Amazon RDS), Amazon Athena, and Amazon Redshift—all whilst making use of the database’s resources.
  • To make use of AWS artificial intelligence (AI) and machine learning (ML) services, KNIME provides a node extension that allows low-code integration with services such as Amazon Comprehend, Amazon Personalize, and Amazon Translate.
  • Using these components, developers can visually set up no-code workflows to translate a text file using Amazon Translate, or advanced text-processing workflows using Amazon Comprehend, with just a few clicks.

In this post, we will emphasize another powerful integration included in KNIME via its native integration with Python—the AWS SDK for Python (Boto3).

Boto3 SDK Integration with KNIME

The Boto3 software development kit (SDK) provides an object-oriented API as well as low-level access to AWS services such as Amazon Simple Storage Service (Amazon S3), Amazon Elastic Compute Cloud (Amazon EC2), Amazon DynamoDB, Amazon Textract, Amazon Rekognition, and more.

You can find the most up-to-date information, including a list of supported services, in the AWS documentation.

The integration with the Boto3 SDK gives KNIME customers access to more than 200 featured AWS services and expands KNIME’s growing solutions catalog on AWS.

By leveraging AWS, KNIME, and this integration, data science application developers can lower time-to-value with the following approaches:

  • Adopt the KNIME Analytics Platform for development of your workflows. This can be provisioned pre-installed using the KNIME Analytics Platform for AWS on AWS Marketplace, or downloaded locally from the KNIME website.
  • Enhance your applications with 200+ AWS services by using the Boto3 SDK within the KNIME Python nodes.
  • Evaluate and test your workflows within the security of the AWS development environment.
  • Deploy your workflows to a KNIME Server environment for production use. For the various approaches on how to do this, see KNIME Server on AWS Marketplace.
  • Utilize the KNIME Server’s web portal to expose your workflows as applications, abstracting the end-user experience into data applications without the need for web development.

Figure 1 – Boto3 SDK integration gives KNIME customers access to 200+ AWS services.

Automatic Invoice Processing

In this section, we’ll share an example addressing automatic invoice processing to demonstrate the integration of KNIME with Amazon Textract through Boto3 SDK.

Documents are a primary tool for recordkeeping, communication, collaboration, and transactions across many industries, including financial, retail, medical, legal, and real estate.

A lot of information is locked in unstructured documents. Receipts and invoices are documents critical to small and medium businesses (SMBs), startups, and enterprises for managing their accounts payable processes. Manual invoice and receipt processing is time-consuming, error-prone, expensive, and difficult to process at scale.

Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables.

Most recently, AWS announced specialized support for extracting data from invoices and receipts using Amazon Textract.

Alongside its integration with AWS services, KNIME provides a robust text-processing extension. This set of nodes allows workflow developers to apply analysis techniques to their textual data, such as text extracted by Amazon Textract from scanned documents and other file formats. Texts can be filtered, tagged, stemmed, and converted into vectors for further machine learning applications.

For convenience, KNIME has installed and pre-configured Boto3 within the KNIME Server on AWS Marketplace. Thus, KNIME users can directly spin up KNIME Server AMI instances with Python and Boto3 already configured.

The latest release contains Boto3 version 1.20.4, and new versions of the Boto3 library are reviewed for subsequent releases of KNIME software.

The integration of KNIME and Amazon Textract through Boto3 can enable different workflows across multiple industries:

  • Mortgage and insurance companies: Automatically process thousands of paper applications and contracts in a few minutes.
  • Retail and financial companies: Intelligently handle receipts and invoices in different formats and send structured results into financial systems.
  • Banking and financial companies: Handle bank statements and financial reports with tables across one or multiple pages.
  • Healthcare industry: Extract useful text content from tens of thousands of medical study results stored in PDF or image files.
  • Manufacturing and energy industries: Use the integration together with Amazon Comprehend to generate workflows for domain-specific document classifications.

In this example, we introduce a workflow that automatically backs up invoice and receipt images to the AWS Cloud, analyzes the document, and extracts specific information such as vendor name, price, and invoice number.

Below is the visualization of this workflow in KNIME Analytics Platform. You can directly download this example workflow from KNIME Hub.

Figure 2 – Invoice automation processing using Amazon Textract through Boto3 SDK integration.

Users can upload invoice images from their local environment with an Amazon S3 connector, and analyze the documents with Amazon Textract through the Boto3 SDK by using the KNIME Python Script node.

The Amazon Textract output is post-processed in KNIME and then delivered into different channels, driving:

  • Updates on Amazon DynamoDB.
  • Creation of an Excel file output.
  • Reframing the output in JSON format to leverage digital workers (such as Robotic Process Automation or RPA) enabled, for example, by Blue Prism.
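
If the DynamoDB update and the JSON output were produced from a Python Script node rather than from dedicated KNIME nodes, the step might look roughly like the sketch below. This is only an illustration under stated assumptions: the helper names, the `invoice_id` partition key, and the table name passed to `put_invoice_item` are hypothetical and do not come from the example workflow.

```python
import json


def build_invoice_item(invoice_id, kvs):
    """Build a flat DynamoDB item from extracted key-value pairs.

    `invoice_id` is a hypothetical partition key chosen for this sketch.
    """
    item = {"invoice_id": invoice_id}
    for label, value in kvs.items():
        # Turn a printed label such as "Vendor Name" into "vendor_name".
        name = "_".join(label.strip().lower().split())
        # Skip empty names and never overwrite the partition key.
        if name and name != "invoice_id":
            item[name] = value
    return item


def put_invoice_item(table_name, item, region_name="us-east-1"):
    """Write one item to a DynamoDB table (table name is an assumption)."""
    # Imported lazily so the helper above can be tried without AWS access.
    import boto3

    table = boto3.resource("dynamodb", region_name=region_name).Table(table_name)
    table.put_item(Item=item)


# Hand-written sample values, not real Textract output.
item = build_invoice_item(
    "INV-0042",
    {"Vendor Name": "Example Supplies Ltd.", "Total": "118.00"},
)

# The same dictionary can be serialized as JSON for downstream consumers.
payload = json.dumps(item, indent=2)
print(payload)
```

Calling `put_invoice_item("invoices", item)` would then write the record, provided a table with that name and key schema exists and the environment has suitable AWS credentials.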

Figure 3 – Invoice example for the demo of automatic invoice processing.

Automatic invoice processing steps per our example include:

  • Connect to an Amazon S3 bucket through Amazon Authentication and an S3 connector.
  • List the invoice/receipt images from the local folder and upload them into S3. Users specify the file transfer source and destination in the Transfer Files node.

Figure 4 – The configuration in KNIME’s Transfer Files node.

  • Use the Python Script node to run the Python code that analyzes the invoice image through the Amazon Textract AnalyzeExpense API and returns the key-value pairs identified in the document. Below is the sample Python code:
import boto3
import pandas as pd


def get_label_and_value(field):
    """Return the detected label and value of one summary field."""
    # Prefer the label printed on the document; fall back to the field type.
    if "LabelDetection" in field:
        key = str(field["LabelDetection"]["Text"])
    else:
        key = str(field["Type"]["Text"])

    # Some fields have a label but no detected value.
    if "ValueDetection" in field:
        val = str(field["ValueDetection"]["Text"])
    else:
        val = "N/A"
    return key, val


def process_labels_values(response):
    """Collect the summary fields of all expense documents into one dict."""
    kvs = {}
    for expense_doc in response["ExpenseDocuments"]:
        for summary_field in expense_doc["SummaryFields"]:
            key, val = get_label_and_value(summary_field)
            kvs[key] = val
    return kvs


# Find the S3 bucket and key of the invoice file from the Transfer Files output.
s3_bucket, s3_key = input_table_1['Destination Path'][0].split('/', 1)[-1].split('/', 1)
textract = boto3.client(service_name='textract', region_name='us-east-1')

response = textract.analyze_expense(
    Document={
        'S3Object': {
            'Bucket': s3_bucket,
            'Name': s3_key,
        }
    }
)
kvs = process_labels_values(response)

# One-row table with one column per detected label.
output_table_1 = pd.json_normalize(kvs)
  • Post-process the output from Amazon Textract and send the final results to different destinations. Below is the result in JSON format.

Figure 5 – Output of the automatic invoice processing in JSON format.
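
To make the key-value extraction step easier to follow without calling AWS, here is a minimal sketch that runs the same label and value logic on a small, hand-written response fragment. The fragment only mimics the shape of an AnalyzeExpense response (`ExpenseDocuments`, `SummaryFields`, `Type`, `LabelDetection`, `ValueDetection`); its values are invented for illustration, and the function name `summary_fields_to_dict` is hypothetical.

```python
import json


def get_label_and_value(field):
    """Return (label, value) for one expense summary field."""
    label = field.get("LabelDetection", {}).get("Text")
    if label is None:
        # No printed label was detected: fall back to the field type.
        label = field.get("Type", {}).get("Text", "UNKNOWN")
    # Fields without a detected value are reported as "N/A".
    value = field.get("ValueDetection", {}).get("Text", "N/A")
    return str(label), str(value)


def summary_fields_to_dict(response):
    """Flatten the summary fields of every expense document into one dict."""
    kvs = {}
    for doc in response.get("ExpenseDocuments", []):
        for field in doc.get("SummaryFields", []):
            label, value = get_label_and_value(field)
            kvs[label] = value
    return kvs


# Hand-written fragment for illustration only; not real Textract output.
sample_response = {
    "ExpenseDocuments": [
        {
            "SummaryFields": [
                {   # No label on the page, so the type is used as the key.
                    "Type": {"Text": "VENDOR_NAME"},
                    "ValueDetection": {"Text": "Example Supplies Ltd."},
                },
                {   # Label and value both detected.
                    "Type": {"Text": "INVOICE_RECEIPT_ID"},
                    "LabelDetection": {"Text": "Invoice No."},
                    "ValueDetection": {"Text": "INV-0042"},
                },
                {   # Label detected, value missing.
                    "Type": {"Text": "TOTAL"},
                    "LabelDetection": {"Text": "Total"},
                },
            ]
        }
    ]
}

kvs = summary_fields_to_dict(sample_response)
print(json.dumps(kvs, indent=2))
```

Because later fields overwrite earlier ones with the same label, a document that repeats a label keeps only the last value; a production workflow might prefer to collect repeated labels into lists.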

Conclusion

Modernizing organizations need to quickly uncover insights hidden in unstructured or semi-structured data with machine learning, and convert them into intelligent data-driven workflows that end users can easily adopt.

KNIME removes the friction from building, sharing, and deploying data science work, and enables stakeholders and analysts from all sides of the business to collaborate with data science teams to rapidly build and deploy data-driven solutions.

In this post, we have emphasized the powerful integration of KNIME with the AWS SDK for Python (Boto3). This integration gives KNIME customers access to more than 200 featured AWS services and expands KNIME’s growing solutions catalog on AWS.

The automatic invoice processing example, built on Amazon Textract through Boto3, demonstrated how KNIME users can apply this integration to build intelligent workflows that access AWS services.
