AWS for Industries
Create a 360-degree view of your consumers using AWS Entity Resolution and Amazon Neptune
Marketers and advertisers need a unified view of consumer data to drive highly relevant marketing and advertising experiences across web, mobile, contact center, and social media channels. For example, if a consumer is shopping for a pair of sneakers on a brand’s website, the marketer would like to surface the most relevant products to save the consumer time and effort. According to McKinsey, 71 percent of consumers expect brands to deliver personalized interactions, and 76 percent of consumers get frustrated when they do not get personalized interactions. However, to deliver a personalized experience, companies need to ingest, match, and query consumer data across multiple touchpoints, including web, mobile, email, social media, and other channels, to create a unified view to understand the consumer.
This blog post describes a composable architecture pattern on Amazon Web Services (AWS) that helps data engineering teams build ingestion, matching, and querying solutions to empower marketers with a 360-degree view of their consumers. In this post, you will learn how to connect related consumer information to develop a unified view of the consumer with higher accuracy, lower costs, and complete configurability using two services: AWS Entity Resolution, which helps you more easily match, link, and enhance related customer, product, business, or healthcare records stored across multiple applications, channels, and data stores, and Amazon Neptune, a managed graph database designed for superior scalability and availability.
A unified consumer view helps marketers deliver highly accurate personalization campaigns, thereby increasing consumer engagement and brand trust. Today, companies spend months of development time building data matching and data querying solutions to connect related consumer records gathered through different channels, such as email interactions, in-store purchases, and website visits. Additionally, once built, these solutions need to be kept up to date with the latest changes in first-party data management systems. For example, a system must continually connect incoming, anonymous consumer records with known consumer identities. These solutions are costly to build, maintain, and keep up to date with changes in consumer data, which means they can be inaccurate, less durable, and inflexible to use and maintain.
Introduction to the AWS services used
AWS Entity Resolution offers advanced matching techniques, such as rule-based, machine learning (ML) model–powered, and data service provider matching to help you more accurately link related sets of consumer information, product codes, or business data codes.
Amazon Neptune is a managed graph database that supports applications with highly connected datasets with millisecond latencies, such as mapping consumer behaviors to campaigns, building recommendation engines, representing consumer journeys, and visualizing a unified view of the consumer. A 360-degree consumer graph in Amazon Neptune can drive more accurate results for product or content recommendation and householding by traversing consumer behaviors and relationships.
Illustrative example: Connecting anonymous consumer data with known consumer data using first-party identifiers
More often than not, users interacting with digital assets are anonymous visitors: they navigate through various pages on the web or on mobile without sharing any identifiable information such as name, email, or phone. Anecdotal data from Forbes indicates that about 90 percent of website traffic is composed of anonymous visitors. The first step for identifying anonymous visitors is capturing website traffic. You can capture identified user and anonymous visitor traffic through clickstream events from the web or mobile and store the events in data lakes such as Amazon Simple Storage Service (Amazon S3), an object storage service built to store and retrieve virtually any amount of data from anywhere. Next, you need to match and link these anonymous visitors with other known visitor records to develop a complete, or unified, view of your consumers. With a comprehensive understanding of their consumers, marketers can then create personalized messages and campaigns to increase awareness and engagement.
Let’s understand this with an example. If someone visits your page a couple of times in a day but does not sign in or perform any interaction that requires providing personal information (such as an email address), you have a series of clickstream events but no identifying record to attribute these events to a specific user. This is reflected by Click 1 and Click 2 in figure 1 below. However, as soon as the user signs in or makes a purchase (Click 3), they provide identity information, which gives you an opportunity to attribute all the historical clickstream events to that user and better understand their access pattern.
Note that in the figure below, MatchRule1 represents a rule configured within AWS Entity Resolution to link the incoming events and match them to a particular group (Group 1).
Taking this example further in figure 2, if multiple users from the same household access your page from a common device (Click 4) or different devices (Click 5), event records that are collected from users’ clickstream data can be used to link the sessions together. This linkage gives more information about the household consumer journey.
To build such a solution, consider this high-level design, where the clickstream events originate from a website or an app and stream into your data lake. As these events arrive, AWS Entity Resolution resolves the events into their appropriate match groups using the rules that you have defined within the service. A match group is a group of records resolved to belong together. This output from AWS Entity Resolution serves as an input for Amazon Neptune, which builds a property graph to understand consumer relationships. Figure 3 below describes an overall architecture and design to implement such a solution. The property graph within Amazon Neptune acts as the basis for performing analytics and answers questions like the following:
- What are the consumer’s underlying metadata elements?
- How many devices are shared between consumers?
- What are the various addresses shared by multiple consumers connected by a shared device?
A typical clickstream event generated through a user interaction contains information such as the IP address, the timestamp of the event, and the device information (such as the device family, the browser and operating system and their versions, and so on). Additionally, it contains elements such as the login_id, but the value is empty for an anonymous user unless they sign in or perform an action that leads to this value being non-empty. Let us consider the following schema and sample records that represent clickstream events:
Sample event records
event_id | user_agent | accept_language | ip_address | zip_code | timestamp | login_id | s_cookie |
Click1 | Mozilla/5.0 (iPod; U; CPU iPhone OS 4_0 like Mac OS X; mai-IN) AppleWebKit/531.21.6 (KHTML, like Gecko) Version/4.0.5 Mobile/8B111 Safari/6531.21.6 | en-US | 192.168.117.155 | 1408 | 8/22/2023 13:05 | | |
Click2 | Mozilla/5.0 (iPod; U; CPU iPhone OS 4_0 like Mac OS X; mai-IN) AppleWebKit/531.21.6 (KHTML, like Gecko) Version/4.0.5 Mobile/8B111 Safari/6531.21.6 | en-US | 192.168.117.155 | 1408 | 8/22/2023 13:06 | | fb89a211-78b7-42f4-a792-9538bfb2c13f |
For AWS Entity Resolution to resolve this data schema, you first need to create a table in AWS Glue, a serverless data integration service. This table points to the Amazon S3 bucket that holds the incoming clickstream data. Next, you need to define a schema mapping within AWS Entity Resolution that tells the service how to interpret the data. Because several attributes of a clickstream event are not personally identifiable information but are important for resolution, they are marked as Custom String with an appropriate MatchKey name. The figure below shows a schema mapping defined within AWS Entity Resolution for the clickstream event.
A few things to note: three fields have the same MatchKey, userIdentifier, because one or more of these fields are important in resolving the match group of the user. Consider the following scenarios:
- two or more events possibly belong to the same match group if their IP address (ip_address) is the same
- two or more events belong to the same user if their cookie ID (s_cookie) is the same
- two or more events belong to the same user if they have the same login ID (login_id)
Thus, using the same MatchKey for all three fields lets the service compare one or more of these fields during the resolution process.
A couple of fields, namely timestamp and page_class, have been marked as pass-through columns because they do not participate in the resolution but may be required later, when the output from AWS Entity Resolution is ingested into downstream sources such as Amazon Neptune.
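The schema mapping described above can also be expressed as data. The following is a minimal sketch, not the configuration used in this post: it assumes that the console's Custom String type corresponds to the `STRING` attribute type in the API, that `event_id` serves as the unique ID, and that pass-through columns are simply fields without a MatchKey. The helper name `build_mapped_input_fields` is hypothetical.

```python
# Field names come from the post. The type strings and the choice of
# event_id as the unique ID are assumptions to verify against the API reference.
RESOLUTION_FIELDS = ["ip_address", "s_cookie", "login_id"]
PASS_THROUGH_FIELDS = ["timestamp", "page_class"]


def build_mapped_input_fields(match_key="userIdentifier"):
    """Build the field list for a schema mapping (hypothetical helper)."""
    # Assumption: event_id uniquely identifies each clickstream record.
    fields = [{"fieldName": "event_id", "type": "UNIQUE_ID"}]
    # All three resolution fields share one MatchKey so that any of them
    # can contribute to a match.
    for name in RESOLUTION_FIELDS:
        fields.append({"fieldName": name, "type": "STRING", "matchKey": match_key})
    # Pass-through fields carry no MatchKey: they are kept in the output
    # but do not take part in resolution.
    for name in PASS_THROUGH_FIELDS:
        fields.append({"fieldName": name, "type": "STRING"})
    return fields
```

A list like this could then be passed as `mappedInputFields` to the `create_schema_mapping` operation of the boto3 `entityresolution` client, together with a schema name of your choosing.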
Next, you need to create a rule-based matching workflow within the AWS Entity Resolution service. This workflow is set up with the clickstream data (represented as an AWS Glue table) as the input source along with the schema mapping defined earlier. A processing cadence of Automatic is selected so that, as new data arrives, the matching workflow keeps resolving the new data against the previously resolved match groups. During this process, the service identifies whether the new data belongs to an existing match group; if not, the service forms a new match group.
Please note that the input source Amazon S3 bucket should have the notification in Amazon EventBridge, a serverless event bus, turned on for the Automatic processing cadence.
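One way to turn on EventBridge notifications for the input bucket is through the Amazon S3 API. The sketch below assumes a boto3 S3 client is passed in; the function name and the bucket name in the usage note are hypothetical. Be aware that this call replaces the bucket's whole notification configuration, so any existing notification settings would need to be included in the same request.

```python
def enable_eventbridge_notifications(s3_client, bucket_name):
    """Send the bucket's event notifications to Amazon EventBridge.

    Caution: put_bucket_notification_configuration overwrites the
    existing notification configuration of the bucket.
    """
    s3_client.put_bucket_notification_configuration(
        Bucket=bucket_name,
        # An empty EventBridgeConfiguration block is what enables delivery.
        NotificationConfiguration={"EventBridgeConfiguration": {}},
    )
```

For example, `enable_eventbridge_notifications(boto3.client("s3"), "my-clickstream-bucket")` would apply the setting to a bucket with that placeholder name.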
The rule-based matching workflow is configured with a single rule that uses the userIdentifier MatchKey, as shown in the figure below. This rule evaluates the incoming clickstream data to link and match records with the same characteristics to a single match group. All records within the same match group are assigned the same MatchID. A MatchID is the unique ID generated by AWS Entity Resolution and applied to all the records within each match group.
The AWS Entity Resolution service writes the output of the matching workflow in an Amazon S3 bucket specified during the workflow creation process.
The output of the matching workflow contains all the input fields (by default) along with other system-generated fields: MatchRule, MatchID, and InputSourceARN. MatchRule is the name of the rule, if any, that produced the match, while MatchID is the unique ID generated and assigned by the AWS Entity Resolution service to each record. If two or more records are matched on a rule, they belong to the same match group and have the same MatchID.
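To build intuition for how records end up sharing a MatchID, here is a small illustrative sketch in plain Python. It is not how AWS Entity Resolution is implemented; it is a union-find stand-in that groups records whenever they share a non-empty value in any of the three fields. As a simplifying assumption, each field is compared only with the same field on other records, and the group label is just the first event ID seen in the group rather than a generated MatchID.

```python
MATCH_FIELDS = ("ip_address", "s_cookie", "login_id")


def resolve_match_groups(records):
    """Map each event_id to a group label (illustrative, not the service's logic)."""
    parent = {r["event_id"]: r["event_id"] for r in records}

    def find(x):
        # Follow parent links to the group's root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # (field, value) -> first event_id carrying that value
    for record in records:
        for field in MATCH_FIELDS:
            value = record.get(field)
            if not value:
                continue  # empty values never cause a match
            key = (field, value)
            if key in first_seen:
                union(first_seen[key], record["event_id"])
            else:
                first_seen[key] = record["event_id"]

    return {r["event_id"]: find(r["event_id"]) for r in records}
```

Run against the sample events in this post, Click1, Click2, and Click3 fall into one group because they share an IP address, while an event with a different IP address, cookie, and login stays in its own group.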
Explanation of the results
For easy readability, we are including only a few of the columns in this example.
During Run-1 of the matching workflow, there are two clickstream events (Click 1, Click 2) that are grouped together based on their IP address, 192.168.117.155, because the MatchRule, Rule 1, uses ip_address, s_cookie, or login_id to group events together under the same MatchID.
Input (Run-1):
event_id | user_agent | accept_language | ip_address | zip_code | timestamp | login_id | s_cookie |
Click1 | Mozilla/5.0 (iPod; U; CPU iPhone OS 4_0 like Mac OS X; mai-IN) AppleWebKit/531.21.6 (KHTML, like Gecko) Version/4.0.5 Mobile/8B111 Safari/6531.21.6 | en-US | 192.168.117.155 | 1408 | 8/22/2023 13:05 | | |
Click2 | Mozilla/5.0 (iPod; U; CPU iPhone OS 4_0 like Mac OS X; mai-IN) AppleWebKit/531.21.6 (KHTML, like Gecko) Version/4.0.5 Mobile/8B111 Safari/6531.21.6 | en-US | 192.168.117.155 | 1408 | 8/22/2023 13:06 | | fb89a211-78b7-42f4-a792-9538bfb2c13f |
Output (Run-1):
recordid | event_id | matchrule | ip_address | login_id | s_cookie | timestamp | user_agent | matchid |
Click1 | Click1 | Rule 1 | 192.168.117.155 | | | 8/22/2023 13:05 | Mozilla/5.0 (iPod; U; CPU iPhone OS 4_0 like Mac OS X; mai-IN) AppleWebKit/531.21.6 (KHTML, like Gecko) Version/4.0.5 Mobile/8B111 Safari/6531.21.6 | a3806aa4fb593e7f856fba938c24ab19 |
Click2 | Click2 | Rule 1 | 192.168.117.155 | | fb89a211-78b7-42f4-a792-9538bfb2c13f | 8/22/2023 13:06 | Mozilla/5.0 (iPod; U; CPU iPhone OS 4_0 like Mac OS X; mai-IN) AppleWebKit/531.21.6 (KHTML, like Gecko) Version/4.0.5 Mobile/8B111 Safari/6531.21.6 | a3806aa4fb593e7f856fba938c24ab19 |
Subsequently, as new clickstream events arrive, one of the events (Click3) has the same ip_address as Click1 and Click2 and contains a login_id. AWS Entity Resolution matches the new event with the previous two events, and the new event inherits the same MatchID, as seen in the output table. The output also lists the previously associated records of that match group (Click1 and Click2), with the recordid, matchrule, and matchid columns populated and all other columns empty.
Input (Run-2):
event_id | user_agent | ip_address | timestamp | login_id | s_cookie |
Click3 | Mozilla/5.0 (iPod; U; CPU iPhone OS 4_0 like Mac OS X; mai-IN) AppleWebKit/531.21.6 (KHTML, like Gecko) Version/4.0.5 Mobile/8B111 Safari/6531.21.6 | 192.168.117.155 | 8/23/2023 13:38 | john@doe.com | fb89a211-78b7-42f4-a792-9538bfb2c13f |
Output (Run-2):
recordid | event_id | matchrule | ip_address | login_id | s_cookie | timestamp | user_agent | matchid |
Click1 | | Rule 1 | | | | | | a3806aa4fb593e7f856fba938c24ab19 |
Click2 | | Rule 1 | | | | | | a3806aa4fb593e7f856fba938c24ab19 |
Click3 | Click3 | Rule 1 | 192.168.117.155 | john@doe.com | fb89a211-78b7-42f4-a792-9538bfb2c13f | 8/23/2023 13:38 | Mozilla/5.0 (iPod; U; CPU iPhone OS 4_0 like Mac OS X; mai-IN) AppleWebKit/531.21.6 (KHTML, like Gecko) Version/4.0.5 Mobile/8B111 Safari/6531.21.6 | a3806aa4fb593e7f856fba938c24ab19 |
Data connections within Amazon Neptune
Once data has been matched by AWS Entity Resolution and stored in Amazon S3, marketers can build downstream applications to derive actionable insights with a consolidated view of consumers. Product recommendation, product activation, and consumer householding are common use cases that link disambiguated users and their captured behaviors. Amazon Neptune is one such service that can help identify additional relationships and links between disambiguated entities.
For example, take the recommendation domain. A common way to make recommendations is collaborative filtering, a technique grounded in the analysis of user behavior data, such as product ratings. A notable challenge in collaborative filtering is the “cold start” problem, which arises when a recommendation engine is unable to generate suggestions for users without a historical data footprint. This issue is particularly prevalent in scenarios involving anonymous consumers, as in the case of clickstream data sources. The cold start challenge can be addressed by capturing disambiguated new anonymous customers, potential households, and their behaviors and “linking” them to a known user profile. AWS Entity Resolution can disambiguate the known and unknown users, and a graph database like Amazon Neptune can generate relationships to form a historical footprint. Figure 1 provides an illustrative workflow for how a graph database forms a single view of a consumer with anonymous and known user events. Amazon Neptune can store the links between Click1 (anonymous metadata), Click2 (anonymous metadata), and Click3 (known user login, john@doe.com).
Let’s see how to move data from AWS Entity Resolution to Amazon Neptune. For this part of the solution, you will need an Amazon Neptune cluster to store the disambiguated data. To bulk load data, query, and visualize the graph, you will need Amazon Neptune workbench, an interactive development environment (IDE) that provides Jupyter and JupyterLab notebooks for running and visualizing code, hosted on Amazon SageMaker, which is used to build, train, and deploy ML models for any use case. The IDE also provides “notebook magics” that can simplify queries and Amazon Neptune management operations. An Amazon Neptune cluster and an Amazon Neptune workbench can be created using this Quick Start template for AWS CloudFormation, which lets you model, provision, and manage AWS and third-party resources by treating infrastructure as code. Users are responsible for the charges for any AWS services used in the example. For help with estimating costs, visit the AWS Pricing Calculator.
Choosing a graph data model depends on the questions that we want to answer. We will use a graph model to help answer the question, “How many consumers or groups share a device?” This will create a data model with four node types and three edge types.
Nodes
- Group: The MatchID assigned by AWS Entity Resolution
- Device: The combined device details from the clickstream event
- IP: The IP address from the clickstream event
- Login: A user’s login ID
Edges
- HAS_IP: The relationship from a Group node to an IP node
- HAS_DEVICE: The relationship from a Group node to a Device node
- HAS_LOGIN: The relationship from a Group node to a Login node
Once the data model has been created, we can begin transforming the data from the AWS Entity Resolution output Amazon S3 bucket into the defined data model in the Amazon Neptune workbench. The code samples below will transform the data into the Gremlin bulk loader format and write back into the Amazon S3 bucket under the bulkload/nodes or bulkload/edges prefix.
Create nodes and edges.
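As a rough illustration of this transformation, the sketch below turns matched rows into two CSV documents in the Gremlin bulk load format, which uses `~id` and `~label` columns for nodes and `~id`, `~from`, `~to`, and `~label` columns for edges. It is a simplified stand-in, not the code used for this post: the function name is hypothetical, the rows are assumed to be dictionaries keyed by the output column names, and a `device` value is assumed to have been derived from the user agent beforehand. Writing the results to the bulkload/nodes and bulkload/edges prefixes is left out.

```python
import csv
import io


def to_bulk_load_csv(rows):
    """Return (nodes_csv, edges_csv) strings in Gremlin bulk load format."""
    nodes = {}  # ~id -> ~label (a dict also removes duplicates)
    edges = {}  # ~id -> (~from, ~to, ~label)
    for row in rows:
        group_id = row.get("matchid")
        if not group_id:
            continue
        nodes[group_id] = "Group"
        targets = (
            ("IP", "HAS_IP", row.get("ip_address")),
            ("Device", "HAS_DEVICE", row.get("device")),  # assumed pre-parsed
            ("Login", "HAS_LOGIN", row.get("login_id")),
        )
        for label, edge_label, value in targets:
            if not value:
                continue  # skip empty columns, such as login_id for anonymous events
            node_id = f"{label}-{value}"
            nodes[node_id] = label
            edges[f"{group_id}-{edge_label}-{node_id}"] = (group_id, node_id, edge_label)

    nodes_out = io.StringIO()
    writer = csv.writer(nodes_out, lineterminator="\n")
    writer.writerow(["~id", "~label"])
    for node_id, label in nodes.items():
        writer.writerow([node_id, label])

    edges_out = io.StringIO()
    writer = csv.writer(edges_out, lineterminator="\n")
    writer.writerow(["~id", "~from", "~to", "~label"])
    for edge_id, (source, target, label) in edges.items():
        writer.writerow([edge_id, source, target, label])

    return nodes_out.getvalue(), edges_out.getvalue()
```

Because nodes and edges are keyed by their IDs, several events that share an IP address or device produce a single node and a single edge per group, which keeps the load files free of duplicate IDs.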
Once all transformations are complete, perform a bulk load with notebook magics. You will need the Amazon Neptune IAM role and the Amazon S3 bucket details to perform the load.
If the load fails, validate the status and errors of the load using the load_status notebook magic.
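Outside the notebook, a bulk load can also be started by sending a JSON request to the cluster's loader HTTP endpoint. The sketch below only builds that request body; it does not send it. The function name and the values in the usage note are placeholders, and the parameter choices (CSV format, not failing on the first error) are assumptions to adjust for your own load.

```python
import json


def build_loader_request(source_s3_uri, iam_role_arn, region):
    """Build the JSON body for a Neptune bulk load request (not sent here)."""
    return {
        "source": source_s3_uri,       # S3 prefix holding the CSV files
        "format": "csv",               # Gremlin bulk load files are CSV
        "iamRoleArn": iam_role_arn,    # role that lets Neptune read the bucket
        "region": region,              # Region of the S3 bucket
        "failOnError": "FALSE",        # assumption: keep loading past bad rows
    }


def to_json(request_body):
    """Serialize the request body for an HTTP POST."""
    return json.dumps(request_body, sort_keys=True)
```

For example, `to_json(build_loader_request("s3://my-bucket/bulkload/nodes/", "arn:aws:iam::123456789012:role/NeptuneLoadFromS3", "us-east-1"))` yields a body that could be posted to the `/loader` path of the cluster endpoint. The response contains a load ID that can be used to check the status of the load.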
Visualization and querying of disambiguated identities
Amazon Neptune supports multiple visualization options, including Graph Explorer, an open-source graph visualization tool, and Amazon Neptune workbench. The Amazon Neptune workbench also supports graph query language exploration in addition to running code. This post showcases example queries and graph visualizations in the Amazon Neptune workbench to answer some common questions and understand the interconnections between the resolved identities.
“Given a group, find the metadata elements associated with the group and the RecordIDs that were consolidated by AWS Entity Resolution.”
Results:
Group | MetadataType | ElementId | RecordId |
a3806aa4fb593e7f856fba938c24ab19 | IPAddress | IP-192.168.117.155 | [Click1, Click2, Click3] |
a3806aa4fb593e7f856fba938c24ab19 | Device | Device-Mobile Safari4.0.5iOS | [‘670f456b-3182-41b9-8df7-99cb1a2730a4’, ‘Click1’, ‘Click2’, ‘Click3’] |
a3806aa4fb593e7f856fba938c24ab19 | Login | Login-john@doe.com | [Click3] |
Visualization:
“Given a group, find all other groups or consumers that share the same Device.”
Results:
Group | Device |
f10509080d994778a8989142ca0891ea | IE7.0Windows |
16b102c92ec84f2780a9ad0e807c8370 | IE7.0Windows |
849f2dbd7f1b4d23ac1abc82bc84bbb5 | IE7.0Windows |
Visualization:
“Given a Device, get all groups that have used the device and all IP addresses associated with the group.”
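To make the shape of this last traversal concrete, here is a plain-Python sketch that walks the same relationships over an in-memory edge list. It is a stand-in for a graph query, not a Gremlin query and not the query used in this post; the function name is hypothetical, and the edge tuples and IP addresses in the usage example are made-up illustration data.

```python
from collections import defaultdict


def groups_and_ips_for_device(edges, device_id):
    """Return {group_id: [ip_node_ids]} for every group linked to the device.

    edges is an iterable of (group_id, edge_label, target_id) tuples that
    follow the data model above (HAS_DEVICE, HAS_IP, HAS_LOGIN).
    """
    groups_on_device = set()
    ips_by_group = defaultdict(set)
    for group_id, edge_label, target_id in edges:
        if edge_label == "HAS_DEVICE" and target_id == device_id:
            groups_on_device.add(group_id)   # step 1: Device -> Groups
        elif edge_label == "HAS_IP":
            ips_by_group[group_id].add(target_id)
    # step 2: for each group found, collect its IP nodes
    return {g: sorted(ips_by_group.get(g, ())) for g in sorted(groups_on_device)}


# Hypothetical edges for illustration only.
sample_edges = [
    ("group-a", "HAS_DEVICE", "Device-IE7.0Windows"),
    ("group-a", "HAS_IP", "IP-203.0.113.10"),
    ("group-b", "HAS_DEVICE", "Device-IE7.0Windows"),
    ("group-b", "HAS_IP", "IP-198.51.100.4"),
    ("group-c", "HAS_DEVICE", "Device-Mobile Safari4.0.5iOS"),
    ("group-c", "HAS_IP", "IP-192.168.117.155"),
]
shared = groups_and_ips_for_device(sample_edges, "Device-IE7.0Windows")
```

In a graph database the same two hops are expressed as a traversal from the Device node to its Group nodes and then out along the HAS_IP edges, which avoids scanning every edge as this sketch does.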