Data insights from SAP with Amazon SageMaker AutoML and QuickSight

Introduction

Enterprise applications generate a lot of data. Analyzing this data helps stakeholders make informed decisions. Enterprises use AI/ML for automating business processes, finding patterns at scale, and more.

SAP is one of the most extensively used ERP solutions for different industries of varied scales and complexities. SAP systems are often integrated with external systems to pull data into SAP systems for different business processes. Customers need a unified view of their data to drive real-time visibility and better decision-making beyond their SAP ecosystem.

AWS helps customers at every stage of their ML adoption journey with a comprehensive set of artificial intelligence (AI) and ML services, infrastructure, and implementation resources, along with more than 200 other AWS services.

In this blog, I will describe and illustrate how you can use Amazon SageMaker and Amazon QuickSight to break data silos and apply AI/ML and intelligent data visualization to external data ingested into SAP systems. I will use a publicly available housing dataset ingested into an SAP system to predict housing prices for future periods and different locations.

Overview

I will start by extracting data from the SAP system with Amazon AppFlow, the AWS native ETL service, and staging it on Amazon Simple Storage Service (Amazon S3). I will use Amazon SageMaker Autopilot, a fully managed ML development environment, to prepare the ML data and to build, train, and deploy the ML model. Then I will use QuickSight for intelligent visualization of the predicted data for better analysis. Finally, I will use SageMaker with QuickSight to augment newly extracted SAP data through batch transformation.

Figure 1. Data Pipelines Architecture

Walkthrough

I have divided the solution into four steps:

Step 1 – Data preparation and feature engineering
Step 2 – Model development, training, tuning and deployment with SageMaker AutoML
Step 3 – Data inference and visualisation of predicted data with QuickSight
Step 4 – Data augmentation of newly ingested QuickSight Enterprise edition data with SageMaker

Prerequisites

An AWS account with appropriate IAM permissions to work with Amazon S3, AWS Lambda, SageMaker, Amazon QuickSight, Amazon AppFlow, and Amazon Simple Notification Service (Amazon SNS)
SageMaker domain and domain user profile to launch Amazon SageMaker Studio
SAP System as a data source.

Step 1 – Data preparation and feature engineering

I. SAP data preparation and Extraction to AWS

You can use either of the following two options to prepare your sample SAP dataset:

- SAP NetWeaver Enterprise Procurement Model (EPM), if you want to use procurement scenarios. Expose the HANA CDS view with SAP OData services, which allows you to extract the data with Amazon AppFlow.
- Public dataset available from Kaggle, if you want to use any publicly available dataset. Load the data from Kaggle into an SAP HANA table. The data from Kaggle is in CSV format and can be imported into SAP HANA using the IMPORT FROM CSV statement. Create an SAP ABAP CDS view in the ABAP Development Tools (ADT). You can add the annotation @OData.publish: true to create an OData service, which can be used to extract data with Amazon AppFlow.

There are different options for data extraction from SAP systems to AWS. Here I am using Amazon AppFlow, which extracts data from the application layer using SAP OData services, preserves business logic, and captures delta changes, while also supporting write-back to SAP. For more details on Amazon AppFlow data extraction, please refer to Extract data from SAP ERP and BW with Amazon AppFlow.

I have configured two data flows for data extraction to Amazon S3.

Flow: SAP – Housing Modeling Data (Figure 1) – Data extraction during the model preparation/training/retraining lifecycle, runs on demand.
Flow: SAP – California Housing Prediction Data (Figure 1) – Data extraction for inference from new data, runs on schedule.
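
The on-demand flow can also be started from code instead of the console. The following is a minimal sketch, not the setup used in this post: it assumes a flow named SAP-Housing-Modeling-Data (a hypothetical name; use the name you gave your own flow) and calls the boto3 appflow client's start_flow operation. The client is passed in as a parameter so the helper can be tried out with a stand-in object.

```python
def start_on_demand_flow(appflow_client, flow_name):
    """Start an on-demand Amazon AppFlow flow and return its execution ID."""
    response = appflow_client.start_flow(flowName=flow_name)
    # For on-demand flows the response carries the ID of the new run;
    # scheduled flows are only activated and return no execution ID.
    return response.get("executionId")


# Usage with real AWS credentials (hypothetical flow name):
#   import boto3
#   run_id = start_on_demand_flow(boto3.client("appflow"), "SAP-Housing-Modeling-Data")
#   print("Started flow run:", run_id)
```

The scheduled flow needs no such call once it is activated, since Amazon AppFlow runs it on its own schedule.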

II. ML Data preparation

I am using SageMaker notebooks to perform feature engineering and split the data into train and test datasets. You may also use SageMaker Data Wrangler, which simplifies data preparation and feature engineering and lets you complete each step of the data preparation workflow from a single visual interface.

Notebook Code Snippet

# Import libraries
import boto3
import sagemaker
from sagemaker import get_execution_role
import pandas as pd
from sklearn.model_selection import train_test_split

# Define IAM role
role = get_execution_role()

# Download data from S3
bucket_name = 'sagemaker-aiml'  # Replace with your own bucket name
try:
    # SAP data extracted by Amazon AppFlow
    model_data = pd.read_csv(f's3://{bucket_name}/california_housing/onpremises-modelling-data/california_housing_data.csv', index_col=0)
    print('Success: Data loaded into dataframe.')
except Exception as e:
    print('Data load error: ', e)
    raise  # Stop here; the steps below need model_data

# Split train and test data
train_data, test_data = train_test_split(model_data, test_size=0.2, random_state=42)

# Drop the target attribute median_house_value from the data used for prediction
prediction_data = test_data.drop(['median_house_value'], axis=1)

# Save data to CSV files
train_data.to_csv('automl-train-data.csv', header=True, sep=',')  # Need to keep column names
test_data.to_csv('automl-test.csv', header=True, sep=',')
prediction_data.to_csv('prediction-data.csv', header=True, sep=',')

# Upload data to S3
prefix1 = 'sagemaker/automl-dm/input'
prefix2 = 'california_housing/onpremises-prediction-data'
sess = sagemaker.Session()
uri = sess.upload_data(path='automl-train-data.csv', key_prefix=prefix1, bucket=bucket_name)
uri = sess.upload_data(path='prediction-data.csv', key_prefix=prefix2, bucket=bucket_name)

Step 2 – Data modeling, training, tuning and deployment with Amazon SageMaker AutoML

In Step 2, I will use the training data available in the S3 bucket – sagemaker/automl-dm/input/ (Figure 1) to prepare an ML model with Amazon SageMaker AutoML.

Steps to run AutoML with SageMaker:

Provide a name for the Autopilot experiment
Insert the S3 bucket that is storing the training data under Input data
Prepare and provide an S3 bucket to store AutoML related output files

On the next page, set the target column (attribute to predict) and leave the rest as default.

Choose Auto for the training method and algorithms, and let SageMaker choose the training method based on the dataset.

On the next page, enter an endpoint name and leave the rest as default.

Once the experiment is completed, SageMaker will auto deploy the best performing model with the endpoint name we provided in the previous step.
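
The console steps above have a programmatic counterpart in the SageMaker CreateAutoMLJob API. The following sketch builds the keyword arguments for boto3's create_auto_ml_job call. The job name, S3 URIs, and role ARN in the usage comment are hypothetical placeholders; the target column median_house_value and the endpoint name California-Housing-Pricing come from the notebook snippets in this post. It is one possible configuration under those assumptions, not necessarily what the console generates.

```python
def build_automl_request(job_name, train_s3_uri, output_s3_uri, role_arn,
                         target_column, endpoint_name):
    """Build keyword arguments for the SageMaker create_auto_ml_job call."""
    return {
        "AutoMLJobName": job_name,
        "InputDataConfig": [{
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": train_s3_uri}
            },
            "TargetAttributeName": target_column,  # attribute to predict
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "RoleArn": role_arn,
        # Deploy the best candidate to a named endpoint when the job finishes.
        "ModelDeployConfig": {
            "AutoGenerateEndpointName": False,
            "EndpointName": endpoint_name,
        },
    }


# Usage with real AWS credentials (all values except the target column and
# endpoint name are hypothetical):
#   import boto3
#   request = build_automl_request(
#       "california-housing-automl",
#       "s3://your-bucket/sagemaker/automl-dm/input/",
#       "s3://your-bucket/sagemaker/automl-dm/output/",
#       "arn:aws:iam::XXXXXXXXXXXX:role/your-sagemaker-role",
#       "median_house_value",
#       "California-Housing-Pricing",
#   )
#   boto3.client("sagemaker").create_auto_ml_job(**request)
```

The problem type and objective metric are left out on purpose, so Autopilot infers them from the data, which mirrors leaving the console settings on Auto.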

On the SageMaker landing page, navigate to Inference and choose Endpoints. When the status of the model endpoint changes to InService, the model is ready to be invoked through the endpoint.

(Optional) The snippet below shows how you can invoke the model from your Jupyter notebook and compare the actual and predicted values to estimate the accuracy of the model.
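
The endpoint status can also be checked from code rather than the console. The sketch below polls the SageMaker describe_endpoint operation through a boto3 client passed in by the caller. The polling interval and the maximum number of polls are arbitrary assumptions; boto3 also ships a built-in endpoint_in_service waiter that could be used instead.

```python
import time


def wait_until_in_service(sagemaker_client, endpoint_name, poll_seconds=30, max_polls=60):
    """Poll describe_endpoint until the endpoint is InService.

    Returns True once the endpoint is InService and False if polling runs out.
    Raises RuntimeError if the endpoint reports the Failed status.
    """
    for _ in range(max_polls):
        status = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return True
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        time.sleep(poll_seconds)
    return False


# Usage with real AWS credentials:
#   import boto3
#   wait_until_in_service(boto3.client("sagemaker"), "California-Housing-Pricing")
```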

Notebook Code Snippet

# Predict test data
import boto3
from sklearn.metrics import mean_squared_error

ep_name = 'California-Housing-Pricing'
sm_rt = boto3.Session().client('sagemaker-runtime')

# Compare the actual vs predicted value
actuals, predictions = [], []
with open('automl-test.csv') as f:
    lines = f.readlines()
    for l in lines[1:]:                # Skip header
        l = l.rstrip('\n').split(',')  # Split CSV line into features
        label = l[-2]                  # Store 'median_house_value' label
        proxi = l[-1]                  # Store 'ocean_proximity' value
        l = l[:-2]                     # Remove label and proximity
        l = ','.join(l)                # Rebuild CSV line without label
        l = l + ',' + proxi            # Rebuild CSV line with proximity
        response = sm_rt.invoke_endpoint(EndpointName=ep_name, ContentType='text/csv', Accept='text/csv', Body=l)
        response = response['Body'].read().decode('utf-8').strip()
        print('Actual Value :', label, ' ', 'Predicted Value:', response)
        # Assumes the endpoint returns a single numeric value per row
        actuals.append(float(label))
        predictions.append(float(response))

print('Mean squared error:', mean_squared_error(actuals, predictions))

Step 3 – Data inference and visualisation of predicted data with Amazon QuickSight

Set up your Amazon QuickSight account in the same Region where Amazon SageMaker and Amazon S3 were configured.
Create a new dataset

Enter a name for the data source and prepare a JSON manifest file that points to the S3 location used to store the prediction data – california_housing/onpremises-prediction-data/ (Figure 1). Here’s what my my_s3_manifest.json looks like:

{
    "fileLocations": [
        {
            "URIs": [
                "s3://sourav-aiml/california_housing/onpremises-prediction-data/prediction-data.csv"
            ]
        },
        {
            "URIPrefixes": [
                "s3://sourav-aiml/california_housing/onpremises-prediction-data/prediction-data.csv"
            ]
        }
    ],
    "globalUploadSettings": {
        "format": "CSV",
        "delimiter": ",",
        "textqualifier": "'",
        "containsHeader": "true"
    }
}
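
If the list of files changes often, the manifest can be generated rather than edited by hand. The helper below is a hypothetical sketch that produces the same structure as the manifest above, with one difference that is my own choice: it lists each object once under URIs and omits URIPrefixes, which is intended for buckets and folders rather than single files.

```python
import json


def build_s3_manifest(uris, delimiter=",", contains_header=True):
    """Return a QuickSight S3 manifest (as a dict) for the given CSV objects."""
    return {
        "fileLocations": [{"URIs": list(uris)}],
        "globalUploadSettings": {
            "format": "CSV",
            "delimiter": delimiter,
            "textqualifier": "'",
            # The manifest above writes this flag as a string, so do the same.
            "containsHeader": "true" if contains_header else "false",
        },
    }


manifest = build_s3_manifest(
    ["s3://sourav-aiml/california_housing/onpremises-prediction-data/prediction-data.csv"]
)
manifest_json = json.dumps(manifest, indent=4)
print(manifest_json)
```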

Choose Augment with SageMaker

Choose the model deployed by Amazon SageMaker in Step 2.
Provide a name for the analysis.
For the Schema field, I prepared a JSON document that lists the input attributes (input data for the model) and the output attribute (the data we want to predict), as shown below.

Here’s what my my_Sagemaker_model_schema.json looks like:

{
    "inputContentType": "CSV",
    "outputContentType": "CSV",
    "input": [
        {"name": "longitude",
         "type": "DECIMAL"
        },
        {"name": "latitude",
         "type": "DECIMAL"
        },
        {"name": "housing_median_age",
         "type": "INTEGER"
        },
        {"name": "total_rooms",
         "type": "INTEGER"
        },
        {"name": "total_bedrooms",
         "type": "INTEGER"
        },
        {"name": "population",
         "type": "INTEGER"
        },
        {"name": "households",
         "type": "INTEGER"
        },
        {"name": "median_income",
         "type": "DECIMAL"
        },
        {"name": "ocean_proximity",
         "type": "STRING"
        }
    ],
    "output": [
        {
            "name": "House Median Value",
            "type": "DECIMAL"
        }
    ],
    "description": "description",
    "version": "v1",
    "instanceCount": 1,
    "instanceTypes": [
        "ml.c5.2xlarge"
    ],
    "defaultInstanceType": "ml.c5.2xlarge"

}

Review the input data mappings. Each field in the schema should match a field in the dataset.

Review the output data mappings.
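
The mapping check can be done before opening QuickSight. The following sketch compares the input field names of a schema with the header row of a CSV file and reports any that are missing. The function name and the tiny sample data are illustrative assumptions, not part of QuickSight.

```python
import csv
import io


def unmatched_schema_fields(schema, csv_text):
    """Return schema input names that do not appear in the CSV header row."""
    header = next(csv.reader(io.StringIO(csv_text)))
    columns = {name.strip() for name in header}
    return [field["name"] for field in schema["input"] if field["name"] not in columns]


# Illustrative sample: the CSV lacks the ocean_proximity column.
example_schema = {"input": [{"name": "longitude", "type": "DECIMAL"},
                            {"name": "ocean_proximity", "type": "STRING"}]}
example_csv = "longitude,latitude\n-122.23,37.88\n"
print(unmatched_schema_fields(example_schema, example_csv))  # prints ['ocean_proximity']
```

An empty list means every schema input has a matching column in the dataset.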

As a final step to infer the predicted data, under Visualize, choose Visual types along with the fields to include in the visualization.

Below is a tabular visual showing House Median Value against all fields.

The AutoGraph visual type helps with real-time visualization of data by coordinates. Here I have used House Median Value vs. locations (latitude, longitude).

Step 4 – Data augmentation of newly ingested QuickSight Enterprise edition data with Amazon SageMaker

In this section I will show how you can use the ML model prepared in Step 2 for inference of new data from the SAP system.

Amazon AppFlow captures new data from the SAP system. Once the new data is extracted to Amazon S3, a Lambda function triggers data ingestion into QuickSight. With batch transformation from Amazon SageMaker, the predicted data is then populated in the analysis dashboard with low latency.

Lambda sends a notification once the data ingestion is completed and the predicted data is available. It also notifies of any errors during data ingestion.

Here’s what my California_Housing_Data_Ingest_QuickSight Lambda function looks like:

import time
import uuid

import boto3

# Replace with your own AWS account ID and QuickSight dataset ID
ACCOUNT_ID = 'XXXXXXXXXXXX'
DATASET_ID = '254f371a-65ed-4294-9f3c-71b49d0f8517'

def lambda_handler(event, context):
    ingestion_id = str(uuid.uuid4())
    client = boto3.client('quicksight')
    # Start a new ingestion (refresh) of the QuickSight dataset
    client.create_ingestion(DataSetId=DATASET_ID, IngestionId=ingestion_id, AwsAccountId=ACCOUNT_ID)
    # Poll until the ingestion reaches a terminal state
    while True:
        response_ingestion = client.describe_ingestion(AwsAccountId=ACCOUNT_ID, DataSetId=DATASET_ID, IngestionId=ingestion_id)
        ingestion_status = response_ingestion['Ingestion']['IngestionStatus']
        if ingestion_status in ('COMPLETED', 'FAILED', 'CANCELLED'):
            break
        time.sleep(5)  # Wait between polls instead of busy-looping
    if ingestion_status != 'COMPLETED':
        # Fail the invocation so that the failure notification is sent
        raise RuntimeError('Data ingestion ended with status ' + ingestion_status)

    sns = boto3.client('sns')
    sns.publish(
        TopicArn='arn:aws:sns:eu-central-1:XXXXXXXXXXXXX:California_Housing_Data_Ingestion_Status',
        Message='Data Ingestion is completed',
        Subject='Data Ingestion Status',
        MessageStructure='string',
    )
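
The handler above does not inspect the event it receives. When Amazon S3 invokes the function, the event lists the objects that were created, which can be useful for logging or for skipping unrelated uploads. The helper below is a small sketch of how those details could be read; the function name is hypothetical and it is not part of the function shown in this post.

```python
from urllib.parse import unquote_plus


def objects_from_s3_event(event):
    """Return (bucket, key) pairs for each record in an S3 event notification."""
    pairs = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(s3_info["object"]["key"])
        pairs.append((s3_info["bucket"]["name"], key))
    return pairs


# Example event shape, trimmed to the fields used above.
sample_event = {"Records": [{"s3": {
    "bucket": {"name": "your-bucket"},
    "object": {"key": "california_housing/new+data.csv"},
}}]}
print(objects_from_s3_event(sample_event))
```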

Amazon SNS configuration for notification on failure of the Lambda function
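
The post does not state which mechanism delivers the failure notification. One way to set it up, assuming the function is invoked asynchronously (as it is by an S3 trigger), is an on-failure destination that points at an SNS topic. The sketch below uses the boto3 Lambda client's put_function_event_invoke_config operation; the topic ARN in the usage comment is a hypothetical placeholder, and the function's execution role would also need permission to publish to that topic.

```python
def configure_failure_destination(lambda_client, function_name, topic_arn):
    """Send failed asynchronous invocations of a Lambda function to an SNS topic."""
    return lambda_client.put_function_event_invoke_config(
        FunctionName=function_name,
        DestinationConfig={"OnFailure": {"Destination": topic_arn}},
    )


# Usage with real AWS credentials (hypothetical topic ARN):
#   import boto3
#   configure_failure_destination(
#       boto3.client("lambda"),
#       "California_Housing_Data_Ingest_QuickSight",
#       "arn:aws:sns:eu-central-1:XXXXXXXXXXXX:your-failure-topic",
#   )
```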

Amazon S3 configuration to configure trigger for the Lambda function
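
The same trigger can be expressed in code. The sketch below builds the notification configuration accepted by the boto3 S3 client's put_bucket_notification_configuration operation, limited to newly created CSV objects under a given prefix. The Lambda ARN and bucket name in the usage comment are hypothetical placeholders, and the prefix and suffix filter is an assumption about how you might scope the trigger.

```python
def build_lambda_trigger_config(function_arn, prefix, suffix=".csv"):
    """Build an S3 notification configuration that invokes a Lambda function."""
    return {
        "LambdaFunctionConfigurations": [{
            "LambdaFunctionArn": function_arn,
            "Events": ["s3:ObjectCreated:*"],  # fire on every new object
            "Filter": {"Key": {"FilterRules": [
                {"Name": "prefix", "Value": prefix},
                {"Name": "suffix", "Value": suffix},
            ]}},
        }]
    }


# Usage with real AWS credentials (hypothetical ARN and bucket name):
#   import boto3
#   config = build_lambda_trigger_config(
#       "arn:aws:lambda:eu-central-1:XXXXXXXXXXXX:function:California_Housing_Data_Ingest_QuickSight",
#       "california_housing/onpremises-prediction-data/",
#   )
#   boto3.client("s3").put_bucket_notification_configuration(
#       Bucket="your-bucket", NotificationConfiguration=config)
```

Amazon S3 also needs permission to invoke the function. The console adds this automatically when you create the trigger there; when configuring it from code, it has to be granted separately.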

Conclusion

Amazon SageMaker AutoML, a low-code/no-code ML development environment, helps companies start their ML journey and shorten the delivery of ML solutions to hours or days without much prior ML knowledge. Amazon SageMaker pay-as-you-go pricing lets customers explore AI/ML without significant upfront investment, increasing business agility. This blog demonstrates one of many possible ways to integrate your SAP environment with AWS and use it with the broad AWS portfolio of AI/ML and analytics services, along with more than 200 other AWS services.

You may also explore our other ML services, such as Amazon Forecast, Amazon Textract, Amazon Translate, and Amazon Comprehend, which can be integrated with SAP for different use cases.

Visit the AWS for SAP page to learn why thousands of customers trust AWS to migrate and innovate with SAP.
