Leveraging Amazon Rekognition and Amazon Comprehend on Dataiku Data Science Platform
By Rumi Olsen, Sr. AI/ML Partner Solutions Architect – AWS
By Meena Thandavarayan, Sr. AI/ML Partner Solutions Architect – AWS
Approximately 80% of an organization’s data is semi-structured or unstructured, according to Gartner. Unlocking the value from unstructured data can help organizations stay competitive and make smarter business decisions.
Many businesses have started to leverage natural language processing (NLP) and computer vision to derive insights from unstructured data such as text, images, and videos. Building a model for such use cases, however, often remains a challenge for citizen data scientists and analysts due to resource constraints.
With deeper integration of artificial intelligence (AI) tools from Amazon Web Services (AWS), Dataiku enables citizen data scientists and analysts to augment their analytics workflow with pretrained NLP and computer vision models.
In this post, we’ll explore how you can use Amazon Comprehend and Amazon Rekognition plugins on Dataiku Data Science Studio (DSS) to build a simple workflow of NLP and computer vision use cases, respectively.
Dataiku is an AWS Partner with the Machine Learning Competency that orchestrates the entire machine learning (ML) lifecycle and makes it accessible to data scientists and analysts alike.
An AI and analytics production platform, Dataiku was named a Leader in both the Gartner 2020 Magic Quadrant for Data Science and Machine Learning Platforms, and the Forrester Wave: Multimodal Predictive Analytics and Machine Learning Solutions, Q3 2020.
AWS AI Services
Amazon Comprehend is a fully managed NLP service that enables you to extract insights from the content of documents. It supports custom classification and enables you to build custom classifiers that are specific to your requirements, without requiring machine learning expertise.
Amazon Rekognition is a fully managed service that provides computer vision capabilities for analyzing images and video at scale, using deep learning technology without requiring ML expertise.
Unstructured data is ubiquitous across all industries, and (generally speaking) images and videos are used for computer vision use cases and text data is used for natural language processing.
Start from a business problem, and then look for the machine learning use cases that partially or fully address it. For example, NLP could be used to improve customer satisfaction by determining the sentiment of a conversation between a customer and a sales support representative.
Computer vision, meanwhile, can be used to address production management or quality control problems in manufacturing.
Here are some other use cases for NLP and computer vision:
- Medical data in the form of images such as scans and MRIs, patients’ medical histories and text records, and other imagery or video used for diagnosis.
- Product and customer sentiment can be extracted from service center calls, online reviews, social media references, and survey responses.
- Financial sector processes such as procure-to-pay or order-to-cash often deal with paper-based invoices, quotes, orders, and receipts.
- Energy companies have a wealth of information in well files, piping and instrumentation diagrams (P&IDs), and images.
- Utility companies rely on manual inspections of captured images of utility infrastructure and their surrounding environments.
By enabling Amazon Comprehend or Amazon Rekognition as plugins in Dataiku, citizen data science teams gain intuitive tools for wrangling, analyzing, and visualizing unstructured datasets without building their own models from scratch. Business analysts, meanwhile, can uncover new insights and make data-driven decisions.
The architecture diagram in Figure 1 highlights how Dataiku integrates with AWS. There are multiple AWS services plugins available through Dataiku DSS. In this post, we are going to talk specifically about the Amazon Rekognition and Amazon Comprehend plugins.
There are three categories of how AWS services integrate with Dataiku DSS:
- DSS ML platform: Design node, automation node, and API deployer hosted on Amazon Elastic Compute Cloud (Amazon EC2) instances.
- Data lake: AWS Glue as a metastore, and Amazon Athena for SQL queries against Amazon Simple Storage Service (Amazon S3) data.
- AWS AI services: Amazon Rekognition and Amazon Comprehend for pretrained computer vision and NLP models.
Figure 1 – Dataiku DSS reference architecture.
Predict Customer Sentiment Using Amazon Comprehend Plugin
Let’s first look at how you can leverage Amazon Comprehend in Dataiku. We’ll be using the Amazon product review dataset where the business goal is to predict customer sentiment so marketing or customer teams can take the best preventative action.
The dataset consists of over 130 million consumer reviews from the Amazon.com marketplace, spanning 1995 to 2015. The dataset includes metadata, but only the review text is needed for this use case.
Amazon Comprehend capabilities in Dataiku allow users to analyze unstructured text content for language detection, named entity recognition, key phrase extraction, and sentiment analysis.
Note that plugins are enabled for Dataiku projects by a Dataiku administrator. If you’re unable to find the plugin in your project, ask your admin to enable it.
Once installed, the Amazon Comprehend plugin is configured to call the Amazon Comprehend API through your AWS account.
The preset information required to enable this connection from the Amazon Comprehend plugin includes your AWS credentials (access key, secret key, and region) and Amazon Comprehend configuration such as API quotas and parallelization settings.
Figure 2 – AWS AI services plugin setup.
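For readers who want to see what these preset fields correspond to outside of Dataiku, here is a minimal Python sketch. It is not the plugin's implementation: the `AwsPreset` class, its field names, and the `call_in_parallel` helper are hypothetical names chosen for illustration. Only `boto3.client(...)` with its credential and region keyword arguments is a real API.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class AwsPreset:
    """Hypothetical stand-in for the fields a plugin preset holds."""
    access_key: str
    secret_key: str
    region: str
    max_workers: int = 4  # assumed name for the parallelization setting


def make_client(preset, service="comprehend"):
    """Build a boto3 client from the preset (boto3 is imported lazily)."""
    import boto3  # requires `pip install boto3`

    return boto3.client(
        service,
        aws_access_key_id=preset.access_key,
        aws_secret_access_key=preset.secret_key,
        region_name=preset.region,
    )


def call_in_parallel(fn, items, max_workers):
    """Apply fn to every item with at most max_workers concurrent calls.

    Results come back in the same order as items.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fn, items))
```

Capping the number of workers is one simple way to limit how many requests are in flight at once. A production setup would likely also need retries when the service throttles requests.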
Once the configuration is complete, you can implement the use case.
Figure 3 – Amazon Comprehend workflow.
Next, load the dataset into your Dataiku project. For this use case, we’re interested in the “Text Column” feature of the dataset, which contains the customer comments for the products.
Data preparation is an optional step for this use case. You can choose to filter the products of interest and remove any unwanted columns.
You can now drag and drop Amazon Comprehend to the workflow. For our use case, we’ll use the “Sentiment Analysis” option.
Note that Amazon Comprehend provides capabilities to create custom entity detection and custom classifiers. The initial release of the Amazon Comprehend plugin for Dataiku supports only the built-in classifiers. Future upgrades will include custom classifiers.
Run the workflow and capture the results. The Amazon Comprehend plugin provides the sentiment prediction (“Positive”, “Negative”, “Neutral”) and a sentiment score for every review of the product.
Figure 4 – Amazon Comprehend results.
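To make the output more concrete, the sketch below shows roughly what a single sentiment call looks like when made directly against the Amazon Comprehend API with boto3. It assumes the plugin relies on the DetectSentiment operation, which the post does not state explicitly. The helper name `detect_review_sentiment` is hypothetical. Note that the API itself can also return a fourth label, `MIXED`, alongside positive, negative, and neutral.

```python
def detect_review_sentiment(client, text, language_code="en"):
    """Return (label, score) for one review using Comprehend DetectSentiment.

    `client` is a boto3 Comprehend client, e.g. boto3.client("comprehend").
    """
    response = client.detect_sentiment(Text=text, LanguageCode=language_code)
    label = response["Sentiment"]  # "POSITIVE", "NEGATIVE", "NEUTRAL" or "MIXED"
    # SentimentScore keys are capitalized: "Positive", "Negative", ...
    score = response["SentimentScore"][label.capitalize()]
    return label, score
```

Passing the client in as an argument keeps the function easy to exercise with a stub, so it can be tried without AWS credentials.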
The sentiment predictions can be charted for each product to inform your business strategy: whether to retire, retain, replace, or re-engineer it.
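Before charting, the per-review predictions need to be rolled up per product. In Dataiku this would typically be done with visual recipes and charts. The plain-Python sketch below only illustrates the aggregation itself. The function name and the `(product_id, sentiment_label)` row shape are assumptions made for the example.

```python
from collections import Counter, defaultdict


def sentiment_share_by_product(rows):
    """rows: iterable of (product_id, sentiment_label) pairs.

    Returns {product_id: {label: fraction_of_reviews}}.
    """
    counts = defaultdict(Counter)
    for product_id, label in rows:
        counts[product_id][label] += 1
    return {
        product_id: {label: n / sum(c.values()) for label, n in c.items()}
        for product_id, c in counts.items()
    }
```

A product whose share of negative reviews is persistently high would then stand out as a candidate for closer review.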
Predicting Unsafe Content Using Amazon Rekognition Plugin
Now, let’s look at how to leverage Amazon Rekognition in Dataiku. We’ll be using the COCO dataset, where the business goal is to detect objects and moderate unsafe content so customer teams can take action and filter the images.
COCO is a large-scale object detection, segmentation, and captioning dataset. It features 330,000 images, more than 200,000 of which are labeled. It has 80 object categories with 1.5 million object instances and 250,000 people with key points.
The Amazon Rekognition plugin for Dataiku allows users to analyze images and videos for object detection and labelling, text detection, and unsafe content moderation.
Install the Amazon Rekognition plugin in your Dataiku project, and load the dataset into the project. Our interest for this use case is detecting the objects in each image and identifying whether any image contains unsafe content.
You can now drag and drop Amazon Rekognition to the workflow. For our use case, we’ll use the object detection and unsafe content moderation options.
An illustration of an Amazon Rekognition pipeline is shown below. Load the dataset and add the Amazon Rekognition plugin to the workflow twice: once for object detection and once for unsafe content moderation.
Figure 5 – Amazon Rekognition workflow.
For object detection, the result set provides a folder of images with bounding boxes drawn around the detected objects, each with a confidence score. In addition, a .csv file is generated with detailed information and the API responses for programmatic use.
Figure 6 – Amazon Rekognition label detection.
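As a rough illustration of where the bounding boxes and confidence scores come from, the sketch below calls the Amazon Rekognition DetectLabels operation through boto3 and converts a returned box to pixel coordinates. It assumes the plugin uses DetectLabels, which the post does not confirm. The helper names `detect_objects` and `box_to_pixels` are hypothetical. Rekognition reports each box as ratios of the image width and height, so a conversion step is needed before drawing.

```python
def detect_objects(client, image_bytes, min_confidence=80.0):
    """Call Rekognition DetectLabels and flatten the labeled instances.

    `client` is a boto3 Rekognition client, e.g. boto3.client("rekognition").
    Returns a list of (label_name, confidence, bounding_box) tuples, where
    bounding_box holds Left/Top/Width/Height as ratios of the image size.
    """
    response = client.detect_labels(
        Image={"Bytes": image_bytes}, MinConfidence=min_confidence
    )
    found = []
    for label in response["Labels"]:
        # Labels without localized instances carry an empty Instances list.
        for instance in label.get("Instances", []):
            found.append(
                (label["Name"], instance["Confidence"], instance["BoundingBox"])
            )
    return found


def box_to_pixels(box, image_width, image_height):
    """Convert a ratio-based box to (left, top, right, bottom) in pixels."""
    left = round(box["Left"] * image_width)
    top = round(box["Top"] * image_height)
    right = round((box["Left"] + box["Width"]) * image_width)
    bottom = round((box["Top"] + box["Height"]) * image_height)
    return left, top, right, bottom
```

The pixel tuple can then be handed to whatever drawing library is in use to render the box on the image.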
For unsafe content moderation, Amazon Rekognition returns a confidence score for content it flags, such as suggestive content; a higher score indicates a higher likelihood that the content is unsafe. Using the score, you can set a threshold and choose to blur any image that scores above it, or filter that image out of downstream processing.
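The thresholding step can be sketched in a few lines of Python. This is an illustration under assumptions, not the plugin's code: it calls the Amazon Rekognition DetectModerationLabels operation through boto3, and the helper name `moderation_decision` and the default threshold of 80 are arbitrary choices for the example.

```python
def moderation_decision(client, image_bytes, threshold=80.0):
    """Return (is_unsafe, top_score) for one image.

    `client` is a boto3 Rekognition client, e.g. boto3.client("rekognition").
    An image is treated as unsafe when any moderation label's confidence
    exceeds the threshold.
    """
    response = client.detect_moderation_labels(Image={"Bytes": image_bytes})
    scores = [label["Confidence"] for label in response["ModerationLabels"]]
    top_score = max(scores, default=0.0)  # no labels returned means score 0
    return top_score > threshold, top_score
```

Images for which the first element is `True` could be blurred or dropped from the downstream flow. The right threshold depends on how costly false positives and false negatives are for the use case.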
With AWS artificial intelligence capabilities integrated into Dataiku Data Science Studio (DSS), citizen data scientists can collaborate using pretrained natural language processing and computer vision models to deliver end-to-end AI projects and initiatives.
To get started, download Dataiku DSS on AWS Marketplace.
Dataiku – AWS Partner Spotlight
Dataiku is an AWS Machine Learning Competency Partner that provides the Data Science Studio (DSS), a development environment for data science and data preparation.