Why third-party data must be part of your machine learning strategies
By: Uday Narayanan and Mahesh Gurram | October 7th, 2022

Use case
For this blog, we will consider a hypothetical retail company that has built its data lake on Amazon S3 and stores the sales data from its multiple store locations there. The company has been seeing spikes in sales on certain days and wants to understand whether events happening around its stores are causing these spikes. This insight would also help its marketing team allocate budgets for upcoming events, either as sponsorships or as advertising dollars. In addition, the company wants to build a machine learning model that predicts how many employees should be on staff on a given day, based on the events happening around the store, so that it is neither over- nor understaffed on any given day. To get the local event data, we subscribe to the Demand Intelligence – Global Intelligent Event Data API dataset from PredictHQ on AWS Data Exchange.
Prerequisites
- Customer has an AWS account.
- Customer has relevant permissions to subscribe to AWS Data Exchange.
- Sales data is in an Amazon S3 data lake.
- Customer has subscribed to the PredictHQ dataset in AWS Data Exchange. Details about subscribing to a product can be found here.
Enabling data-driven insights through Amazon QuickSight dashboards with third-party data
In our example of a retail company, the customer already has sales data in an Amazon S3 data lake. They have also already subscribed to the PredictHQ dataset in AWS Data Exchange to get event data.
We will now explore how we can use Amazon QuickSight visual reporting to answer the question: Are the spikes in sales due to events happening around the store?
Below is the high-level architecture for how to do this:

In summary, there are four steps:
- Import the sales data from the Amazon S3 data lake into Amazon QuickSight.
- Import the event data (the PredictHQ dataset from AWS Data Exchange) into a separate dataset in Amazon QuickSight.
- Join the sales data from the data lake (step 1) and the event data (step 2) on sales date and event date.
- Visualize the sales data with color coding to identify all days when there were events around the store.
Step-by-step
Step 1: Import the sales data from the customer's data lake
Customer sales data for a specific store in an Amazon S3 bucket.

Now let us import the sales data into Amazon QuickSight. There are multiple ways to import data from Amazon S3 into Amazon QuickSight; in this case, we will use a manifest file. A manifest file is a JSON object that tells Amazon QuickSight how to import the file and its corresponding metadata. Here, the sales data is a CSV file with a header row, stored in an Amazon S3 bucket.
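A minimal example of such a manifest is shown below; the bucket and object names are placeholders for your own sales data file:

{
  "fileLocations": [
    { "URIs": ["s3://your-sales-bucket/sales-data/store-sales.csv"] }
  ],
  "globalUploadSettings": {
    "format": "CSV",
    "delimiter": ",",
    "containsHeader": "true"
  }
}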

Import sales data into Amazon QuickSight using a manifest file.

Now, the sales dataset is imported into Amazon QuickSight.

Step 2: Import event data from Amazon S3 to Amazon QuickSight
Very similarly, we will import the event data from Amazon S3 into Amazon QuickSight. Now that we have both the sales data and the event data in Amazon QuickSight, we can join them.
Step 3: Join the sales data and event data in Amazon QuickSight
We start with the event data and visualize it.
Let us add a join condition. In this case, we need all records from “Sales” so that we can see whether there was an event on each particular date.
We select “Right Outer Join” (all records from Sales and matching records from Event; left table: Event, right table: Sales) with the join condition event.start = Sales.Date.
Now that the data is joined, we can see the first few rows and the dates that have events. Wherever the event data is null (bottom half of the screenshot), there were no events on those dates.
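For readers who like to reason about the join in code, a rough pandas equivalent of this QuickSight join is sketched below; the file names and column names are placeholders matching the datasets described above:

import pandas as pd

# Hypothetical local extracts of the two QuickSight datasets
sales = pd.read_csv('sales.csv')    # contains a 'Date' column
events = pd.read_csv('events.csv')  # contains a 'start' (event date) column

# Keep every sales record and attach matching event records, if any.
# Rows with no event on that date will have NaN in the event columns.
joined = sales.merge(events, how='left', left_on='Date', right_on='start')
print(joined.head())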


We will also remove any missing or bad data.

Using PredictHQ’s third-party data from AWS Data Exchange, we were able to uncover more insights than we could with the sales data alone. The customer now understands how events impact daily sales, allowing them to make informed business decisions.
Building a machine learning model to predict the required staff count
- Build a Machine Learning Model using Amazon SageMaker Canvas.
- Create an Amazon SageMaker endpoint.
- Call the Amazon SageMaker endpoint from an AWS Lambda function.
- Front the AWS Lambda function with an API Gateway.
- Call the API Gateway from your web application.

Step 1: Build a machine learning model using Amazon SageMaker Canvas
Now that we have the third-party data from AWS Data Exchange downloaded and available in our Amazon S3 bucket, along with our first-party sales data, we are ready to start building the machine learning model. For the purposes of this blog, we will use Amazon SageMaker Canvas to build the model. Amazon SageMaker Canvas expands access to machine learning by providing business analysts with a visual point-and-click interface that allows them to generate accurate ML predictions on their own, without requiring any machine learning experience or having to write a single line of code. To get started with Amazon SageMaker Canvas, you can follow the getting started guide.
NOTE: Amazon SageMaker Canvas is not the only option. You can also use Amazon SageMaker Autopilot, build your model in Amazon SageMaker Studio, or use any other method you currently use to build models.
Once the Amazon SageMaker Canvas environment is set up, we will import our dataset, which is stored in our data lake on Amazon S3.
Once the dataset is imported, we create the model by clicking the “New Model” option and selecting the dataset that we just imported. On the Build tab, select staff_count as the column to predict; this is the column our model is going to predict. For Model Type, select Numeric. Other options, such as 2-category, 3+ category, or time series forecasting models, can also be chosen depending on the use case. Finally, click Standard Build. There are other build options, but in order to create an Amazon SageMaker endpoint we need a Standard Build.
NOTE: The model build will take around 2-4 hours to complete.
Step 2: Create an Amazon SageMaker endpoint
Once the model is built, you can test it by running predictions within Amazon SageMaker Canvas. For this use case, we will deploy the model as an Amazon SageMaker endpoint. To generate the endpoint, go to the Analyze tab of the model and click Share, as shown below. Next, click Create SageMaker Studio link. Pasting the link in the browser opens a Model Overview page, which shows the best model. Amazon SageMaker Canvas ran the dataset through multiple models and performed hyperparameter tuning to arrive at the best model.
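If you prefer to create the endpoint programmatically rather than through the Studio UI, a minimal boto3 sketch is shown below. It assumes a SageMaker Model has already been registered from the Canvas best-model artifacts (the model name and instance type are placeholders); the endpoint name matches the one used by the Lambda function in the next step:

import boto3

sm = boto3.client('sagemaker')

# Assumes a SageMaker Model (placeholder name below) was already registered
# from the Canvas best-model artifacts, for example via SageMaker Studio.
sm.create_endpoint_config(
    EndpointConfigName='webinar-staff-prediction-config',
    ProductionVariants=[{
        'VariantName': 'AllTraffic',
        'ModelName': 'webinar-staff-prediction-best-model',
        'InstanceType': 'ml.m5.large',
        'InitialInstanceCount': 1
    }]
)

sm.create_endpoint(
    EndpointName='webinar-staff-prediction-model',
    EndpointConfigName='webinar-staff-prediction-config'
)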

Step 3: Call the Amazon SageMaker endpoint from an AWS Lambda function
The AWS Lambda function below takes a date as input, calls the PredictHQ API through AWS Data Exchange to get the total expected attendance for events around the store on that date, and then invokes the Amazon SageMaker endpoint to predict the staff count.

import boto3
import json

def lambda_handler(event, context):
    # Get the date from the input passed in by API Gateway
    date_1 = event['description']
    total_attendance = 0
    base_path = 'events/'

    # Set up the AWS Data Exchange boto3 client
    adx = boto3.client('dataexchange')

    # Call the PredictHQ API to get the events for the given date and geography.
    # We are looking for all events within a 3.5-mile radius of the client's store.
    response = adx.send_api_asset(
        AssetId='<Replace with the asset ID from your subscription>',
        DataSetId='<Replace with the dataset ID from your subscription>',
        Method='GET',
        Path=base_path,
        RevisionId='<Replace with the appropriate revision ID from your subscription>',
        QueryStringParameters={
            'active.gte': '{}'.format(date_1),
            'active.lte': '{}'.format(date_1),
            'within': '3.5mi@35.993248,-78.9021923'
        }
    )

    # Extract the relevant fields
    data = json.loads(response['Body'])
    data_results = data['results']

    # Loop through all events in the date range and get the total attendance
    for event_record in data_results:
        try:
            attendance = event_record['phq_attendance']
            total_attendance += attendance
            print(attendance)
        except KeyError:
            print('Attendance info missing. Considered as 0')

    # Invoke the Amazon SageMaker endpoint to get a prediction
    sm_rt = boto3.client('sagemaker-runtime')
    print('total_attendance = ', total_attendance)

    # Build the CSV payload the model expects: event category and total attendance
    payload = 'concert,' + str(total_attendance)
    sm_response = sm_rt.invoke_endpoint(
        EndpointName='webinar-staff-prediction-model',
        ContentType='text/csv',
        Accept='text/csv',
        Body=payload
    )

    # The prediction is returned as a string; extract it and convert it to a float
    sm_pred = float(sm_response['Body'].read().decode('utf-8'))

    # Clamp the staff count between 10 and 100 and round it to a whole number
    fn_calc_staff_count = lambda x: 10 if x < 10 else (100 if x > 100 else round(x))
    staff_count = fn_calc_staff_count(sm_pred)

    text = 'You should have {} employees on staff on {}'.format(staff_count, date_1)
    return {
        'statusCode': 200,
        'body': json.dumps(text)
    }
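Before fronting the function with API Gateway, you can test it directly from the AWS Lambda console with a test event like the one below (the date is just an example):

{
  "description": "2022-10-15"
}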
Step 4: Front the AWS Lambda Function with an API Gateway
Once the AWS Lambda function has been created, it can be fronted with an Amazon API Gateway. The details of this configuration are out of scope for this blog; however, you can find them in the Using AWS Lambda with Amazon API Gateway section of the documentation.
Step 5: Call the API Gateway from your web application
Finally, this API Gateway can be called from within your web application to get real-time insights from our machine learning model.
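As an illustration, here is a minimal Python sketch of calling such an endpoint; the invoke URL and date are placeholders, and it assumes the API Gateway method passes the JSON request body through to the Lambda function as its event:

import requests

# Hypothetical API Gateway invoke URL; replace with your own deployment's URL
API_URL = 'https://<api-id>.execute-api.us-east-1.amazonaws.com/prod/staffing'

# The Lambda function expects the date in the 'description' field
response = requests.post(API_URL, json={'description': '2022-10-15'})
print(response.json())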
Conclusion
