Why third-party data must be part of your machine learning strategies

By: Uday Narayanan and Mahesh Gurram | October 7th, 2022 

While machine learning (ML) and analytics continue to help drive decision-making for organizations, utilizing only first-party data can lead to data gaps that may introduce risk or uncertainty. Incorporating third-party data into your analytics and ML strategies can help close these gaps by refining analytics, delivering effective insights, and bringing confidence to forecasting across your organization.
 
In this blog post, we will show you how to combine third-party data subscribed to via Amazon Web Services (AWS) Data Exchange with your first-party data to perform predictive analysis using Amazon SageMaker and to build dashboards with Amazon QuickSight.

Continue reading the article below, or register to download the associated on-demand webinar or eBook.


Use case

For this blog, we will consider a hypothetical retail company that has built its data lake on Amazon S3, where it stores sales data from its multiple store locations. The company has been seeing spikes in sales on certain days and wants to understand whether events happening around a store are causing these spikes. This insight will also help the marketing team allocate relevant budgets for upcoming events, whether for sponsorships or advertising dollars. In addition, the company wants to build a machine learning model that predicts the number of employees they should have on staff on a given day, based on the events happening around the store, so that no location is over- or understaffed. To get the local event data, we subscribe to the Demand Intelligence – Global Intelligent Event Data API dataset from AWS Data Exchange.

Prerequisites

  • Customer has an AWS account.
  • Customer has relevant permissions to subscribe to AWS Data Exchange products.
  • Sales data is in an Amazon S3 data lake.
  • Subscribed to the PredictHQ dataset in AWS Data Exchange. Details about subscribing to a product can be found here.

Enabling data-driven insights through Amazon QuickSight dashboards with third-party data

In our example of a retail company, the customer already has sales data in an Amazon S3 data lake. They have also already subscribed to the PredictHQ dataset in AWS Data Exchange to get event data.

We will now explore how we can use Amazon QuickSight visual reporting to answer the question: Are the spikes in sales due to events happening around the store?

Below is the high-level architecture for how to do this:

data lake tech diagram

In summary, we have 4 steps:

  • Import sales data from Amazon S3 Data Lake into Amazon QuickSight.
  • Import event data (PredictHQ dataset from AWS Data Exchange) into a different dataset in Amazon QuickSight.
  • Join the sales data from the data lake (step 1) and the event data (step 2) on event date and sales date.
  • Visualize the sales data with color coding to identify all days when there were events around the store.

Step-by-step

Step 1: Import the sales data from the customer data lake

The customer's sales data for a specific store is stored in an Amazon S3 bucket.

data in s3 bucket

Now let us import the sales data into Amazon QuickSight. There are multiple ways to import data from Amazon S3 into Amazon QuickSight; for this case, we will use a manifest file. A manifest file is a JSON object that tells Amazon QuickSight where the data lives and how to import it. In this case, the sales data is a CSV file with a header row stored in an Amazon S3 bucket, as shown below.

csv header code example
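For reference, a minimal manifest for a single CSV file with a header row might look like the sketch below; the bucket and object key are placeholders for your own data:

{
    "fileLocations": [
        {
            "URIs": [
                "s3://<your-sales-bucket>/sales/store_sales.csv"
            ]
        }
    ],
    "globalUploadSettings": {
        "format": "CSV",
        "delimiter": ",",
        "containsHeader": "true"
    }
}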

Import sales data into Amazon QuickSight using a manifest file.

setting up new s3 data source screen shot

Now, the sales dataset is imported into Amazon QuickSight.

finish dataset creation screen shot

Step 2: Import event data from Amazon S3 to Amazon QuickSight

Similarly, we will import the event data from Amazon S3 into Amazon QuickSight. Now that we have both the sales data and the event data in Amazon QuickSight, we can join them.

Step 3: Join the sales data and event data in Amazon QuickSight

We can start with the event data and visualize it.

Let us add a join condition. In this case, we need all records from “Sales” to see if we had an event on that particular date.

We are going to select “Right Outer Join” (all records from Sales and matching records from Event, with Event as the left table and Sales as the right table) with the join condition event.start = Sales.Date.

Now that we have joined the data, we can see the first few rows and the dates that have events. Wherever the event data is null (bottom half of the screenshot), there were no events on those dates.

joined data tables
Step 3.1: Create a new calculated field - “Did an event happen?”
 
We need to know whether an event happened on a specific day. To do so, we will add a new column titled “Event” as a “Calculated field”: it will be “Yes” if an event happened around the store on that day and “No” if there was no event.
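A minimal sketch of that calculated field, assuming we test the start column that came from the event data (it is null on dates with no matching event), could be:

ifelse(isNull({start}), 'No', 'Yes')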
 
As we can see, we have the new column added to the end of our dataset.
new column
Step 4: Generate visualization
 
Now that we have the sales figures and know whether an event happened on each date, we can visualize the data.
 
The columns we are going to use are the following: Date (From Sales), Sales (From Sales), and Event (Calculated field derived from Events data).
We will also remove any missing or bad data.
data visualization screenshot
The dark blue lines on the graph represent dates when there were events around this store. From the visualization, we can answer the question “Did sales increase due to events around the store?” with “yes.”

Using PredictHQ’s third-party data from AWS Data Exchange, we were able to find more insights than we could with the sales data alone. Now the customer understands how events impact daily sales, allowing them to make informed business decisions.

Building a machine learning model to predict staffing needs

Our client, the retail company, would like to build a machine learning model to predict the number of employees they should have on staff on any given day. Several factors contribute to this, and the client already has most of the information they need in their data lake. What they are missing are the external factors: local events, such as sporting events or concerts, also influence the number of customers that come to the store. As the client does not have this data, they have subscribed to the same PredictHQ dataset from AWS Data Exchange we mentioned earlier.
 
We will be following the below approach to build the model:
 
  1. Build a Machine Learning Model using Amazon SageMaker Canvas.
  2. Create an Amazon SageMaker endpoint.
  3. Call the Amazon SageMaker endpoint from an AWS Lambda function.
  4. Front the AWS Lambda function with an API Gateway.
  5. Call the API Gateway from your web application.

Before getting started, we will download the historical event data from AWS Data Exchange. Since the delivery mechanism for the dataset we have subscribed to is API-based, we can use the send-api-asset API call to retrieve the historical data. The extracted third-party data is then cleansed and merged with our first-party data using an ETL tool such as AWS Glue or Amazon EMR, and the result is stored in Amazon S3. Below is a sample of what our merged data looks like.
merged data
The first two columns come from the third-party PredictHQ dataset and describe the event and its attendance on a given day. The third column is first-party historical data that shows how many staff should have been on duty that day. For simplicity, we have kept just these three columns, but the scope can easily be expanded. A simplified sketch of the merge step is shown below.
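The blog uses an ETL tool such as AWS Glue or Amazon EMR for this step; purely as a lightweight illustration of the same transformation, the pandas sketch below joins the two sources on date. The file paths and the date column name are hypothetical, and reading s3:// paths with pandas requires the s3fs package.

import pandas as pd

# Hypothetical file locations; adjust to match your own data lake layout
events = pd.read_csv('s3://<your-bucket>/third-party/predicthq_events.csv')       # columns: date, event, phq_attendance
staffing = pd.read_csv('s3://<your-bucket>/first-party/historical_staffing.csv')  # columns: date, staff_count

# Join the third-party event data with the first-party staffing history on date
merged = events.merge(staffing, on='date', how='inner')

# Keep only the columns the model will train on and write the result back to Amazon S3
merged[['event', 'phq_attendance', 'staff_count']].to_csv(
    's3://<your-bucket>/training/merged_training_data.csv', index=False)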

Step 1: Build a machine learning model using Amazon SageMaker Canvas

Now we have the third-party data from AWS Data Exchange downloaded and available in our Amazon S3 bucket, along with our first-party sales data. This means we are ready to start building the machine learning model. For the purpose of this blog, we will use Amazon SageMaker Canvas to build the model. Amazon SageMaker Canvas expands access to machine learning by providing business analysts with a visual point-and-click interface. This allows them to generate accurate ML predictions on their own — without requiring any machine learning experience or having to write a single line of code. In order to get started with Amazon SageMaker Canvas you can follow the getting started guide.

NOTE: Amazon SageMaker Canvas is not the only option. You can also use Amazon SageMaker AutoPilot, build your model using Amazon SageMaker Studio, or any other methods you currently use to build models.

Once the Amazon SageMaker Canvas environment is set up, we will import our dataset, which is stored in our data lake on Amazon S3.

Once the dataset is imported, we create the model by clicking the “New Model” option and selecting the dataset that we just imported. On the Build tab, select staff_count as the column to predict. For Model Type, select Numeric; other options, such as 2-category, 3-category, or time series forecasting models, can be chosen based on the use case. Finally, click Standard Build. There are other build options, but in order to create an Amazon SageMaker endpoint we need a Standard build.

NOTE: The model build will take around 2-4 hours to complete.

Step 2: Create an Amazon SageMaker endpoint

Once the model is built, you can test it out by running predictions within Amazon SageMaker Canvas. For this use case, we will deploy the model as an Amazon SageMaker endpoint. To generate the endpoint, go to the Analyze tab of the model and click Share, as shown below. Next, click the Create SageMaker Studio link. Pasting the link into the browser opens a Model Overview page, which shows the best model; Amazon SageMaker Canvas ran the dataset through multiple models and performed hyperparameter tuning to arrive at it.

model overview
Click on the best model, then click Deploy Model. Under the Real Time Predictions setting, give your endpoint a name, select the instance type, and choose the instance count. Then click Deploy Model to deploy the model. The deployed models can be found under SageMaker Endpoints in the AWS console.
 
Step 3: Call the SageMaker endpoint from an AWS Lambda function
 
Once the endpoint is created, it can be called from within your code. Our client would like to call this endpoint on demand to get real-time predictions. For this purpose, we will write an AWS Lambda function that calls the AWS Data Exchange API to get the third-party data in real time. The code snippet below takes a date as input, calls the send_api_asset API to get the event information from PredictHQ, and then passes that information to the Amazon SageMaker endpoint, which predicts the number of employees that should be on staff on that day.
import boto3
import json

def lambda_handler(event, context):
    
    # Get the date passed in from the API Gateway request
    date_1 = event['description']
    
    total_attendance = 0
    base_path = 'events/'
    
    # Set up the ADX boto3 client
    adx = boto3.client('dataexchange')
    
    # Call the PredictHQ API to get all events on the given date within a 3.5-mile radius of the client's store
    response = adx.send_api_asset(
        AssetId='<Replace with asset id from your subscription>',
        DataSetId='<Replace with dataset id from your subscription>',
        Method='GET',
        Path=base_path,
        RevisionId='<Replace with the appropriate revision id from your subscription>',
        QueryStringParameters={
            'active.gte': '{}'.format(date_1),
            'active.lte': '{}'.format(date_1),
            'within': '3.5mi@35.993248,-78.9021923'
        }
    )
    
    # Extract the relevant fields
    data = json.loads(response['Body'])
    data_results = data['results']
    
    
    # Loop through all events returned for the date and sum the expected attendance
    for event_result in data_results:

        try:
            attendance = event_result['phq_attendance']
            total_attendance = attendance + total_attendance
            print(attendance)
        except KeyError:
            print('Attendance info missing. Counted as 0.')
    
    # Set up the Amazon SageMaker runtime client to get predictions from the endpoint
    sm_rt = boto3.client('sagemaker-runtime')
    
    print('total_attendance = ', total_attendance)
    
    # Build the CSV payload the model expects: event type and total attendance
    payload = 'concert,' + str(total_attendance)

    sm_response = sm_rt.invoke_endpoint(
        EndpointName='webinar-staff-prediction-model',
        ContentType='text/csv',
        Accept='text/csv',
        Body=payload
        )
    
    
    # The prediction comes back as a string; extract it and convert to float
    sm_pred = float(sm_response['Body'].read().decode('utf-8'))

    # Clamp the staff count between 10 and 100 employees and round it to a whole number
    fn_calc_staff_count = (lambda x : 10 if x < 10 else (100 if x > 100 else round(x)))
    staff_count = fn_calc_staff_count(sm_pred)
    
    text = 'You should have {} employees on staff on {}'.format(staff_count, date_1)
    
    return {
        'statusCode': 200,
        'body': json.dumps(text)
    }

Step 4: Front the AWS Lambda function with an Amazon API Gateway

Once the AWS Lambda function has been created, it can be fronted with an Amazon API Gateway. The details of this configuration are out of scope for this blog, but you can find them in the Using AWS Lambda with Amazon API Gateway section of the documentation.

Step 5: Call the API Gateway from your web application

Finally, this API Gateway can be called from within your web application to get real-time insights from our machine learning model.
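As a simple sketch, assuming a non-proxy integration that passes the request body straight through to the AWS Lambda function above, a Python client call might look like this; the invoke URL is a placeholder for your own deployed stage:

import requests

# Hypothetical API Gateway invoke URL; replace with your own deployed stage and resource
api_url = 'https://<api-id>.execute-api.<region>.amazonaws.com/prod/staffing'

# The Lambda function above reads the date from the "description" field of the incoming event
response = requests.post(api_url, json={'description': '2022-10-15'})
print(response.json())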

Conclusion

In this blog, we saw how you can use third-party data from AWS Data Exchange, along with your first-party data, to build analytics with Amazon QuickSight and machine learning models with Amazon SageMaker. This analysis would not have been possible without reliable information from data providers like PredictHQ. Because the customer already uses AWS Data Exchange, they did not have to sign a new contract with the provider, and the subscription appears on their consolidated monthly AWS bill instead of becoming yet another contract to maintain.

Connect with AWS Data Exchange

Find data sets
Discover and subscribe to over 3,500 third-party data sets.

Speak with a data expert
Speak with a data expert to find solutions that enhance your business.

Register for a workshop
Get hands-on guidance on how to use AWS Data Exchange.