Create a 360-degree view of your consumers using AWS Entity Resolution and Amazon Neptune

Marketers and advertisers need a unified view of consumer data to drive highly relevant marketing and advertising experiences across web, mobile, contact center, and social media channels. For example, if a consumer is shopping for a pair of sneakers on a brand’s website, the marketer would like to surface the most relevant products to save the consumer time and effort. According to McKinsey, 71 percent of consumers expect brands to deliver personalized interactions, and 76 percent of consumers get frustrated when they do not get personalized interactions. However, to deliver a personalized experience, companies need to ingest, match, and query consumer data across multiple touchpoints, including web, mobile, email, social media, and other channels, to create a unified view to understand the consumer.

This blog post describes a composable architecture pattern on Amazon Web Services (AWS) that helps data engineering teams build ingestion, matching, and querying solutions to empower marketers with a 360-degree view of their consumers. In this blog, you will learn how to connect related consumer information to develop a unified view of the consumer with higher accuracy, lower costs, and complete configurability using AWS Entity Resolution—which helps you more easily match, link, and enhance related customer, product, business, or healthcare records stored across multiple applications, channels, and data stores—and Amazon Neptune, a serverless graph database designed for superior scalability and availability.

A unified consumer view helps marketers deliver highly accurate personalization campaigns, thereby increasing consumer engagement and brand trust. Today, companies spend months of development time building data matching and data querying solutions to connect related consumer records gathered through different channels, such as email interactions, in-store purchases, and website visits. Additionally, once built, these solutions need to be kept up to date with the latest changes in first-party data management systems. For example, a system must continually connect incoming, anonymous consumer records with known consumer identities. These solutions are costly to build, maintain, and keep up to date with changes in consumer data, which means they can be inaccurate, less durable, and inflexible to use and maintain.

Introduction to the AWS services used

AWS Entity Resolution offers advanced matching techniques, such as rule-based, machine learning (ML) model–powered, and data service provider matching to help you more accurately link related sets of consumer information, product codes, or business data codes.

Amazon Neptune is a managed graph database that supports applications with highly connected datasets with millisecond latencies, such as mapping consumer behaviors to campaigns, building recommendation engines, representing consumer journeys, and visualizing a unified view of the consumer. A 360-degree consumer graph in Amazon Neptune can drive more accurate results for product or content recommendation and householding by traversing consumer behaviors and relationships.

Illustrative example: Connecting anonymous consumer data with known consumer data using first-party identifiers

More often than not, users interacting with digital assets are anonymous visitors: they navigate through various pages on the web or on mobile without sharing any identifiable information such as name, email, or phone. Anecdotal data from Forbes indicates that about 90 percent of website traffic is composed of anonymous visitors. The first step for identifying anonymous visitors is capturing website traffic. You can capture identified user and anonymous visitor traffic through clickstream events from the web or mobile and store the events in data lakes such as Amazon Simple Storage Service (Amazon S3), which is an object storage service built to retrieve virtually any amount of data from anywhere. Next, you need to match and link these anonymous visitors with other known visitor records to develop a complete, or unified, view of your consumers. With a comprehensive understanding of their consumers, marketers can then create personalized messages and campaigns to increase awareness and engagement.

Let’s understand this with an example. If someone visits your page a couple of times in a day but does not sign in or perform any interaction that requires providing personal information (such as an email address), you have a series of clickstream events but no identifying record to attribute them to a specific user. This is reflected by Click 1 and Click 2 in figure 1 below. However, as soon as the user signs in or makes a purchase (Click 3), they provide identity information, which makes it possible to attribute all the historical clickstream events to that user and better understand their access pattern.

Note that in the figure below, MatchRule1 represents a rule configured within AWS Entity Resolution to link the incoming events and match them to a particular group (Group 1).

Taking this example further in figure 2, if multiple users from the same household access your page from a common device (Click 4) or different devices (Click 5), event records that are collected from users’ clickstream data can be used to link the sessions together. This linkage gives more information about the household consumer journey.

To build such a solution, consider this high-level design, where the clickstream events originate from a website or an app and stream into your data lake. As these events arrive, AWS Entity Resolution resolves the events into their appropriate match groups using the rules that you have defined within the service. A match group is a group of records resolved to belong together. This output from AWS Entity Resolution serves as an input for Amazon Neptune, which builds a property graph to understand consumer relationships. Figure 3 below describes an overall architecture and design to implement such a solution. The property graph within Amazon Neptune acts as the basis for performing analytics and answers questions like the following:

What are the consumer’s underlying metadata elements?
How many devices are shared between consumers?
What are the various addresses shared by multiple consumers connected by a shared device?

A typical clickstream event generated through a user interaction contains information such as the IP address, the timestamp of the event, and the device information (such as the device family, the browser and operating system and their versions, and so on). Additionally, it contains elements such as the login_id, but the value is empty for an anonymous user until they sign in or perform an action that populates it. Let us consider the following schema and sample records that represent a clickstream event:

Sample event records

event_id	user_agent	accept_language	ip_address	zip_code	timestamp	login_id	s_cookie
Click1	Mozilla/5.0 (iPod; U; CPU iPhone OS 4_0 like Mac OS X; mai-IN) AppleWebKit/531.21.6 (KHTML, like Gecko) Version/4.0.5 Mobile/8B111 Safari/6531.21.6	en-US	192.168.117.155	1408	8/22/2023 13:05
Click2	Mozilla/5.0 (iPod; U; CPU iPhone OS 4_0 like Mac OS X; mai-IN) AppleWebKit/531.21.6 (KHTML, like Gecko) Version/4.0.5 Mobile/8B111 Safari/6531.21.6	en-US	192.168.117.155	1408	8/22/2023 13:06		fb89a211-78b7-42f4-a792-9538bfb2c13f
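As a minimal sketch of how such an event could be represented in code, the following Python models the eight columns of the sample table and treats an empty login_id as the marker of an anonymous visitor. The class name ClickEvent, the helper parse_event, and the shortened user agent string are illustrative assumptions, not part of the solution itself.

```python
from dataclasses import dataclass
from typing import Optional

FIELD_COUNT = 8  # number of columns in the sample table


@dataclass(frozen=True)
class ClickEvent:
    """One clickstream event, using the column names from the sample table."""
    event_id: str
    user_agent: str
    accept_language: str
    ip_address: str
    zip_code: str
    timestamp: str
    login_id: Optional[str] = None  # empty until the visitor signs in
    s_cookie: Optional[str] = None  # empty until a session cookie is captured

    @property
    def is_anonymous(self) -> bool:
        # A visitor stays anonymous while no login_id has been captured.
        return not self.login_id


def parse_event(line: str) -> ClickEvent:
    """Parse one tab-separated row; missing or blank fields become None."""
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (FIELD_COUNT - len(fields))  # pad short rows
    values = [f.strip() or None for f in fields[:FIELD_COUNT]]
    return ClickEvent(*values)


# Placeholder user agent; the sample rows carry the full browser string.
UA = "Mozilla/5.0 (example)"
click1 = parse_event(f"Click1\t{UA}\ten-US\t192.168.117.155\t1408\t8/22/2023 13:05")
click2 = parse_event(
    f"Click2\t{UA}\ten-US\t192.168.117.155\t1408\t8/22/2023 13:06\t\t"
    "fb89a211-78b7-42f4-a792-9538bfb2c13f"
)
print(click1.is_anonymous, click2.s_cookie)
```

Both sample events are anonymous at this point: Click1 carries only an IP address, and Click2 adds a cookie but still no login_id.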

For AWS Entity Resolution to resolve this data schema, you first need to create a table in AWS Glue, a serverless data integration service. This table points to the Amazon S3 bucket that holds the incoming clickstream data. Next, you need to define a schema mapping within AWS Entity Resolution that tells the service how to interpret the data. Because several attributes of a clickstream event are not personally identifiable information but are important for resolution, they are marked as Custom String with an appropriate MatchKey name. The figure below shows a schema mapping defined within AWS Entity Resolution for the clickstream event.

A few things to note: three fields have the same MatchKey userIdentifier because one or more of these fields are important in resolving the match group of the user. Consider the following scenarios:

two or more events possibly belong to the same match group if their IP address (ip_address) is the same
two or more events belong to the same user if their cookie ID (s_cookie) is the same
two or more events belong to the same user if they have the same login ID (login_id)

Thus, using the same MatchKey for all three fields lets the service compare one or more of these fields during the resolution process.

A couple of fields, namely timestamp and page_class, have been marked as pass-through columns because they do not participate in the resolution but may be required later, when the output from AWS Entity Resolution is ingested into downstream sources such as Amazon Neptune.
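The mapping described above can be sketched as plain data. The Python below only builds a list of field descriptions; it does not call any AWS API. The key names (fieldName, type, matchKey) and the type values (UNIQUE_ID, STRING) are intended to mirror the CreateSchemaMapping request and should be verified against the current API reference. Using event_id as the unique ID and treating a field without a matchKey as pass-through are assumptions made for illustration.

```python
# Fields that share the userIdentifier MatchKey, as described in the text.
MATCH_FIELDS = ("ip_address", "s_cookie", "login_id")
# Fields carried through to the output without taking part in matching.
PASS_THROUGH_FIELDS = ("timestamp", "page_class")


def build_mapped_input_fields(unique_id_field="event_id"):
    """Return a list of field descriptions for the clickstream schema.

    Assumption: the dictionary keys follow the shape of the service's
    schema mapping request; check them before using this with an SDK.
    """
    fields = [{"fieldName": unique_id_field, "type": "UNIQUE_ID"}]
    for name in MATCH_FIELDS:
        # "Custom String" in the console is assumed to correspond to STRING.
        fields.append(
            {"fieldName": name, "type": "STRING", "matchKey": "userIdentifier"}
        )
    for name in PASS_THROUGH_FIELDS:
        # Assumption: no matchKey means the column is passed through.
        fields.append({"fieldName": name, "type": "STRING"})
    return fields


mapped_fields = build_mapped_input_fields()
match_key_fields = [f["fieldName"] for f in mapped_fields if "matchKey" in f]
print(match_key_fields)
```

Keeping the mapping as data makes it easy to review which columns take part in matching before the schema mapping is created in the console or through an SDK.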

Next, you need to create a rule-based matching workflow within AWS Entity Resolution. This workflow is set up with the clickstream data (represented as an AWS Glue table) as the input source along with the schema mapping defined earlier. A processing cadence of Automatic is selected so that, as new data arrives, the matching workflow keeps resolving it against the previously resolved match groups. During this process, the service determines whether the new data belongs to an existing match group; if not, the service forms a new match group.

Note that for the Automatic processing cadence, the input source Amazon S3 bucket must have notifications to Amazon EventBridge, a serverless event bus, turned on.

The rule-based matching workflow uses a single rule on the userIdentifier MatchKey, as shown in the figure below. This rule evaluates the incoming clickstream data to link and match records with the same characteristics to a single match group. All records within the same match group are assigned the same MatchID. A MatchID is the unique ID generated by AWS Entity Resolution and applied to all the records within each match group.

The AWS Entity Resolution service writes the output of the matching workflow in an Amazon S3 bucket specified during the workflow creation process.

The output of the matching workflow contains all the input fields (by default) along with three system-generated fields: MatchRule, MatchID, and InputSourceARN. MatchRule is the name of the rule, if any, that produced the match. MatchID is the unique ID that AWS Entity Resolution generates for a match group and assigns to every record in it. If two or more records are matched on a rule, they belong to the same match group and have the same MatchID.
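To build intuition for how a single rule over ip_address, s_cookie, and login_id produces match groups, here is a minimal Python sketch that links events sharing any identifier value using a union-find structure. This is an illustration only and not the algorithm the service uses. It assumes values are compared within the same field, and the event Click9 with its IP address is a hypothetical record added to show a separate group.

```python
IDENTIFIER_FIELDS = ("ip_address", "s_cookie", "login_id")


def group_events(events):
    """Return {event_id: group_label}; events sharing any identifier share a label."""
    parent = {}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # (field, value) -> first event_id that carried it
    for event in events:
        event_id = event["event_id"]
        parent.setdefault(event_id, event_id)
        for field in IDENTIFIER_FIELDS:
            value = event.get(field)
            if not value:
                continue  # empty identifiers never link events
            key = (field, value)
            if key in first_seen:
                union(first_seen[key], event_id)
            else:
                first_seen[key] = event_id
    return {event_id: find(event_id) for event_id in parent}


events = [
    {"event_id": "Click1", "ip_address": "192.168.117.155"},
    {"event_id": "Click2", "ip_address": "192.168.117.155",
     "s_cookie": "fb89a211-78b7-42f4-a792-9538bfb2c13f"},
    {"event_id": "Click3", "ip_address": "192.168.117.155",
     "s_cookie": "fb89a211-78b7-42f4-a792-9538bfb2c13f",
     "login_id": "john@doe.com"},
    {"event_id": "Click9", "ip_address": "10.0.0.1"},  # hypothetical visitor
]
groups = group_events(events)
print(groups)
```

The group label plays the role of the MatchID here: the three events that share an IP address receive the same label, while the hypothetical Click9 forms its own group.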

Explanation of the results

For easy readability, we are including only a few of the columns in this example.

During Run-1 of the matching workflow, two clickstream events (Click1 and Click2) are grouped together because they share the IP address 192.168.117.155. The matching rule, Rule 1, compares ip_address, s_cookie, and login_id and assigns the same MatchID to events that match on any of them.

Input (Run-1):

event_id	user_agent	accept_language	ip_address	zip_code	timestamp	login_id	s_cookie
Click1	Mozilla/5.0 (iPod; U; CPU iPhone OS 4_0 like Mac OS X; mai-IN) AppleWebKit/531.21.6 (KHTML, like Gecko) Version/4.0.5 Mobile/8B111 Safari/6531.21.6	en-US	192.168.117.155	1408	8/22/2023 13:05
Click2	Mozilla/5.0 (iPod; U; CPU iPhone OS 4_0 like Mac OS X; mai-IN) AppleWebKit/531.21.6 (KHTML, like Gecko) Version/4.0.5 Mobile/8B111 Safari/6531.21.6	en-US	192.168.117.155	1408	8/22/2023 13:06		fb89a211-78b7-42f4-a792-9538bfb2c13f

Output (Run-1):

recordid	event_id	matchrule	ip_address	login_id	s_cookie	timestamp	user_agent	matchid
Click1	Click1	Rule 1	192.168.117.155			8/22/2023 13:05	Mozilla/5.0 (iPod; U; CPU iPhone OS 4_0 like Mac OS X; mai-IN) AppleWebKit/531.21.6 (KHTML, like Gecko) Version/4.0.5 Mobile/8B111 Safari/6531.21.6	a3806aa4fb593e7f856fba938c24ab19
Click2	Click2	Rule 1	192.168.117.155		fb89a211-78b7-42f4-a792-9538bfb2c13f	8/22/2023 13:06	Mozilla/5.0 (iPod; U; CPU iPhone OS 4_0 like Mac OS X; mai-IN) AppleWebKit/531.21.6 (KHTML, like Gecko) Version/4.0.5 Mobile/8B111 Safari/6531.21.6	a3806aa4fb593e7f856fba938c24ab19

Subsequently, new clickstream events arrive, and one of them (Click3) has the same ip_address as Click1 and Click2 and also contains a login_id. AWS Entity Resolution matches Click3 with the previous two events, so it inherits their MatchID, as seen in the output table. The output also lists the records previously associated with that match group (Click1 and Click2) with only recordid, matchrule, and matchid populated and all other columns empty.

Input (Run-2):

event_id	user_agent	ip_address	timestamp	login_id	s_cookie
Click3	Mozilla/5.0 (iPod; U; CPU iPhone OS 4_0 like Mac OS X; mai-IN) AppleWebKit/531.21.6 (KHTML, like Gecko) Version/4.0.5 Mobile/8B111 Safari/6531.21.6	192.168.117.155	8/23/2023 13:38	john@doe.com	fb89a211-78b7-42f4-a792-9538bfb2c13f

Output (Run-2):

recordid	event_id	matchrule	ip_address	login_id	s_cookie	timestamp	user_agent	matchid
Click1		Rule 1						a3806aa4fb593e7f856fba938c24ab19
Click2		Rule 1						a3806aa4fb593e7f856fba938c24ab19
Click3	Click3	Rule 1	192.168.117.155	john@doe.com	fb89a211-78b7-42f4-a792-9538bfb2c13f	8/23/2023 13:38	Mozilla/5.0 (iPod; U; CPU iPhone OS 4_0 like Mac OS X; mai-IN) AppleWebKit/531.21.6 (KHTML, like Gecko) Version/4.0.5 Mobile/8B111 Safari/6531.21.6	a3806aa4fb593e7f856fba938c24ab19
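Because a later run repeats earlier records with most columns empty, a downstream consumer may want to fold each run into a running view keyed by recordid. The following Python is a minimal sketch of that idea under the assumption that empty values should never overwrite attributes captured in an earlier run; the function name apply_run is hypothetical, and the user_agent and timestamp columns are omitted for brevity.

```python
MATCH_ID = "a3806aa4fb593e7f856fba938c24ab19"


def apply_run(state, output_rows):
    """Fold one run's output rows into state, a dict of {recordid: row}."""
    for row in output_rows:
        record_id = row["recordid"]
        merged = dict(state.get(record_id, {}))
        for column, value in row.items():
            # Keep earlier attributes when the new row leaves a column empty.
            if value not in ("", None):
                merged[column] = value
        state[record_id] = merged
    return state


run1 = [
    {"recordid": "Click1", "event_id": "Click1", "matchrule": "Rule 1",
     "ip_address": "192.168.117.155", "login_id": "", "s_cookie": "",
     "matchid": MATCH_ID},
    {"recordid": "Click2", "event_id": "Click2", "matchrule": "Rule 1",
     "ip_address": "192.168.117.155", "login_id": "",
     "s_cookie": "fb89a211-78b7-42f4-a792-9538bfb2c13f", "matchid": MATCH_ID},
]
run2 = [
    {"recordid": "Click1", "event_id": "", "matchrule": "Rule 1",
     "ip_address": "", "login_id": "", "s_cookie": "", "matchid": MATCH_ID},
    {"recordid": "Click2", "event_id": "", "matchrule": "Rule 1",
     "ip_address": "", "login_id": "", "s_cookie": "", "matchid": MATCH_ID},
    {"recordid": "Click3", "event_id": "Click3", "matchrule": "Rule 1",
     "ip_address": "192.168.117.155", "login_id": "john@doe.com",
     "s_cookie": "fb89a211-78b7-42f4-a792-9538bfb2c13f", "matchid": MATCH_ID},
]

state = apply_run(apply_run({}, run1), run2)
print(sorted(state))
```

After both runs, the view holds all three records with their original attributes and the current MatchID. A non-empty matchid in a later run overwrites the earlier one for the same record.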

Data connections within Amazon Neptune

Once data has been matched by AWS Entity Resolution and stored in Amazon S3, marketers can build downstream applications to derive actionable insights with a consolidated view of consumers. Product recommendation, product activation, and consumer householding are common use cases that link disambiguated users and their captured behaviors. Amazon Neptune is one such service that can help identify additional relationships and links between disambiguated entities.

For example, take the recommendation domain. A common way to make recommendations is collaborative filtering, a technique grounded in the analysis of user behavior data, such as product ratings. A notable challenge in collaborative filtering is the “cold start” problem, which arises when a recommendation engine is unable to generate suggestions for users without a historical data footprint. This issue is particularly prevalent in scenarios involving anonymous consumers, as in the case of clickstream data sources. The cold start challenge can be addressed by capturing disambiguated new anonymous customers, potential households, and their behaviors and “linking” them to a known user profile. AWS Entity Resolution can disambiguate the known and unknown users, and a graph database like Amazon Neptune can generate relationships to form a historical footprint. Figure 1 provides an illustrative workflow for how a graph database forms a single view of a consumer with anonymous and known user events. Amazon Neptune can store the links between Click1 (anonymous metadata), Click2 (anonymous metadata), and Click3 (known user login, john@doe.com).
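The linking step can be sketched in a few lines of Python: once every record carries a match group ID, the events of a group can be attributed to whichever login appears in that group, which gives a newly identified user a historical footprint. The function name attribute_events is hypothetical, and the rows are a trimmed version of the resolved records from the example.

```python
from collections import defaultdict


def attribute_events(rows):
    """Return {login_id: [recordid, ...]} using the match group as the link."""
    events_by_group = defaultdict(list)
    logins_by_group = defaultdict(set)
    for row in rows:
        events_by_group[row["matchid"]].append(row["recordid"])
        if row.get("login_id"):
            logins_by_group[row["matchid"]].add(row["login_id"])

    history = {}
    for match_id, logins in logins_by_group.items():
        for login in logins:
            # Every event in the group, anonymous or not, joins the history.
            history.setdefault(login, []).extend(events_by_group[match_id])
    return history


resolved = [
    {"recordid": "Click1", "matchid": "a3806aa4fb593e7f856fba938c24ab19"},
    {"recordid": "Click2", "matchid": "a3806aa4fb593e7f856fba938c24ab19"},
    {"recordid": "Click3", "matchid": "a3806aa4fb593e7f856fba938c24ab19",
     "login_id": "john@doe.com"},
]
history = attribute_events(resolved)
print(history)
```

A group that contains more than one login would attribute the same events to each of them, which is one simple way to represent a household in this sketch.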

Let’s see how to move data from AWS Entity Resolution to Amazon Neptune. For this part of the solution, you will need an Amazon Neptune cluster to store the disambiguated data. To bulk load data, query, and visualize the graph, you will need Amazon Neptune workbench—an interactive development environment (IDE) that provides Jupyter and Jupyter notebooks for running and visualizing code, hosted on Amazon SageMaker, which is used to build, train, and deploy ML models for any use case. The IDE also provides “notebook magics” that can simplify queries and Amazon Neptune management operations. An Amazon Neptune cluster and an Amazon Neptune workbench can be created using this Quick Start template for AWS CloudFormation, which lets you model, provision, and manage AWS and third-party resources by treating infrastructure as code. Users are responsible for the charges for any AWS services used in the example. For help with estimating costs, visit the AWS Pricing Calculator.

Choosing a graph data model depends on the questions that we want to answer. We will use a graph model to help answer the question, “How many consumers or groups share a device?” This will create a data model with four node types and three edge types.

Nodes

Group: The MatchID assigned by AWS Entity Resolution
Device: The combined device details from the clickstream event
IPAddress: The IP address from the clickstream event
Login: A user’s login ID

Edges

HAS_IP: The relationship from the Group node to an IP node
HAS_DEVICE: The relationship from a Group node to a Device node
HAS_LOGIN: The relationship from a Group node to a Login node
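As a small sketch of this data model, the Python below derives the nodes and edges for a single resolved record. The ID prefixes (Group-, IP-, Device-, Login-) and the labels follow the transformation code in this post; the function name record_to_graph and the lowercase column names of the example row are assumptions.

```python
# (source column, node ID prefix, node label, edge label)
TARGETS = (
    ("ip_address", "IP-", "IPAddress", "HAS_IP"),
    ("system_id", "Device-", "Device", "HAS_DEVICE"),
    ("login_id", "Login-", "Login", "HAS_LOGIN"),
)


def record_to_graph(row):
    """Return (nodes, edges) for one resolved record.

    nodes is a set of (node_id, label) pairs;
    edges is a set of (edge_label, from_id, to_id) triples.
    """
    group_id = f"Group-{row['matchid']}"
    nodes = {(group_id, "Group")}
    edges = set()
    for column, prefix, label, edge_label in TARGETS:
        value = row.get(column)
        if not value:
            continue  # anonymous events simply have no Login node
        node_id = prefix + value
        nodes.add((node_id, label))
        edges.add((edge_label, group_id, node_id))
    return nodes, edges


click3 = {
    "matchid": "a3806aa4fb593e7f856fba938c24ab19",
    "ip_address": "192.168.117.155",
    "system_id": "Mobile Safari4.0.5iOS",
    "login_id": "john@doe.com",
}
nodes, edges = record_to_graph(click3)
print(len(nodes), len(edges))
```

Because nodes and edges are sets keyed by ID, running this over every record of a match group yields one Group node with one edge per distinct IP address, device, and login.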

Once the data model has been created, we can begin transforming the data from the AWS Entity Resolution output Amazon S3 bucket into the defined data model in the Amazon Neptune workbench. The code samples below will transform the data into the Gremlin bulk loader format and write back into the Amazon S3 bucket under the bulkload/nodes or bulkload/edges prefix.

# Read from AWS Entity Resolution output s3 bucket
import awswrangler as wr
df = wr.s3.read_csv('s3://s3_path/s3_filename')

Create nodes and edges.

## Create group nodes
sor = df[['MatchID']].drop_duplicates().dropna()
sor['~id'] = 'Group-' + sor['MatchID']
sor['~label'] = 'Group'
# The column header must match the name written to the bulk load file
sor['MatchID:String(single)'] = sor['MatchID']
wr.s3.to_csv(sor, 's3://s3_path/bulkload/nodes/groups.csv', columns=['~id', '~label', 'MatchID:String(single)'], index=False)

## Create login nodes
lg = df[['login_id', 'RecordId']].drop_duplicates().dropna(subset=['login_id'])
lg['~id'] = 'Login-' + lg['login_id']
lg['~label'] = 'Login'
# Rename to the bulk loader's name:type(cardinality) header format
lg.rename(columns={'RecordId': 'RecordId:String', 'login_id': 'login_id:String(single)'}, inplace=True)
wr.s3.to_csv(lg, 's3://s3_path/bulkload/nodes/login.csv', columns=['~id', '~label', 'login_id:String(single)', 'RecordId:String'], index=False)

## Create device nodes
import numpy as np
# Assumes the user_agent string has already been parsed into these columns
device_cols = ['user_agent_device_family', 'user_agent_os_version', 'user_agent_os_family']
# Concatenate the available device details; rows with no details become NaN
df['system_id'] = df[device_cols].fillna('').astype(str).agg(''.join, axis=1).replace('', np.nan)
device_nodes = df[['system_id', 'RecordId']].drop_duplicates().dropna(subset=['system_id'])
device_nodes['~id'] = 'Device-' + device_nodes['system_id']
device_nodes['~label'] = 'Device'
device_nodes.rename(columns={'RecordId': 'RecordId:String', 'system_id': 'system_id:String(single)'}, inplace=True)
wr.s3.to_csv(device_nodes[['~id', '~label', 'system_id:String(single)', 'RecordId:String']], 's3://s3_path/bulkload/nodes/Devices.csv', index=False)

## Create IP nodes
ip = df[['ip_address', 'RecordId']].drop_duplicates().dropna(subset=['ip_address'])
ip['~id'] = 'IP-' + ip['ip_address']
ip['~label'] = 'IPAddress'
ip.rename(columns={'RecordId': 'RecordId:String', 'ip_address': 'ip_address:String(single)'}, inplace=True)
wr.s3.to_csv(ip[['~id', '~label', 'ip_address:String(single)', 'RecordId:String']], 's3://s3_path/bulkload/nodes/ips.csv', index=False)

## Group to login edges
has_login= df[['MatchID', 'login_id']].drop_duplicates().dropna(subset='login_id')
has_login['~to'] = "Login-"+ has_login['login_id']
has_login['~from'] = "Group-"+ has_login['MatchID']
has_login['~label'] = "HAS_LOGIN"
has_login['~id'] = has_login['~label'] +'-' + has_login['~from'] + has_login['~to']
wr.s3.to_csv(has_login, 's3://s3_path/bulkload/edges/has_login.csv', columns = ['~id', '~label', '~from', '~to'], index = False)

## Group to Ip edges
has_ip= df[['MatchID', 'ip_address']].drop_duplicates().dropna()
has_ip['~to'] = "IP-"+ has_ip['ip_address']
has_ip['~from'] = "Group-"+ has_ip['MatchID']
has_ip['~label'] = "HAS_IP"
has_ip['~id'] = has_ip['~label'] +'-' + has_ip['~from'] + has_ip['~to']
wr.s3.to_csv(has_ip, 's3://s3_path/bulkload/edges/has_ip.csv', columns = ['~id', '~label', '~from', '~to'], index = False)

## Group to Device edges
in_group = df[['system_id', 'MatchID']].drop_duplicates().dropna()
in_group['~to'] = "Device-"+ in_group['system_id']
in_group['~from'] = "Group-"+ in_group['MatchID']
in_group['~label'] = "HAS_DEVICE"
in_group['~id'] = in_group['~label'] +'-' + in_group['~from'] + in_group['~to']
wr.s3.to_csv(in_group, 's3://s3_path/bulkload/edges/has_device.csv', columns = ['~id', '~label', '~from', '~to'], index = False)

Once all transformations are complete, perform a bulk load with notebook magics. You will need the Amazon Neptune IAM role and the Amazon S3 bucket details to perform the load.

%load -s s3://s3_path/bulkload --store-to loadres

If the load fails, validate the status and errors of the load using the load_status notebook magic.

%load_status {loadres['payload']['loadId']} --errors

Visualization and querying of disambiguated identities

Amazon Neptune supports multiple visualization options, including Graph Explorer, an open-source graph visualization tool, and Amazon Neptune workbench. The Amazon Neptune workbench also supports graph query language exploration in addition to running code. This blog will showcase example queries and graph visualizations in the Amazon Neptune workbench to answer some common questions and understand the interconnections between the resolved identities.

“Given a group, find the metadata elements associated with the group and the RecordIDs that were consolidated by AWS Entity Resolution.”

%%oc
Match (g:Group {`~id`:'Group-a3806aa4fb593e7f856fba938c24ab19'})-[:HAS_IP | :HAS_DEVICE | :HAS_LOGIN]->(p)
return g.MatchID as Group, labels(p) as MetadataType, p.`~id` as ElementId, collect(p.RecordId) as RecordId

Results:

Group	MetadataType	ElementId	RecordId
a3806aa4fb593e7f856fba938c24ab19	IPAddress	IP-192.168.117.155	[Click1, Click2, Click3]
a3806aa4fb593e7f856fba938c24ab19	Device	Device-Mobile Safari4.0.5iOS	[670f456b-3182-41b9-8df7-99cb1a2730a4, Click1, Click2, Click3]
a3806aa4fb593e7f856fba938c24ab19	Login	Login-john@doe.com	[Click3]

Visualization:

“Given a group, find all other groups or consumers that share the same Device.”

%%oc
Match (g:Group {`~id`:'Group-f10509080d994778a8989142ca0891ea'})-[:HAS_DEVICE]->(d)
with d
Match (g2)-[:HAS_DEVICE]->(d)
return g2.MatchID as Group, d.system_id as Device

Results:

Group	Device
f10509080d994778a8989142ca0891ea	IE7.0Windows
16b102c92ec84f2780a9ad0e807c8370	IE7.0Windows
849f2dbd7f1b4d23ac1abc82bc84bbb5	IE7.0Windows
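The traversal in this query can be cross-checked outside the database. The Python below is a minimal sketch that answers the same question from a list of HAS_DEVICE edges; the function name groups_sharing_device is hypothetical, and the last edge is a made-up pair added to show a group that does not share the device.

```python
from collections import defaultdict


def groups_sharing_device(has_device_edges, group_id):
    """Return {device_id: [group_id, ...]} for every device used by group_id.

    has_device_edges is an iterable of (group_id, device_id) pairs.
    """
    groups_by_device = defaultdict(set)
    devices_by_group = defaultdict(set)
    for group, device in has_device_edges:
        groups_by_device[device].add(group)
        devices_by_group[group].add(device)
    # Hop from the group to its devices, then back out to all groups.
    return {
        device: sorted(groups_by_device[device])
        for device in devices_by_group.get(group_id, ())
    }


edges = [
    ("Group-f10509080d994778a8989142ca0891ea", "Device-IE7.0Windows"),
    ("Group-16b102c92ec84f2780a9ad0e807c8370", "Device-IE7.0Windows"),
    ("Group-849f2dbd7f1b4d23ac1abc82bc84bbb5", "Device-IE7.0Windows"),
    ("Group-hypothetical", "Device-OtherDevice"),  # made-up, unrelated pair
]
shared = groups_sharing_device(edges, "Group-f10509080d994778a8989142ca0891ea")
print(shared)
```

As in the query results, the starting group appears in its own answer because it also has an edge to the shared device.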

Visualization:

“Given a Device, get all groups that have used the device and all IP addresses associated with the group.”

%%oc
Match (g:Group)-[:HAS_DEVICE]->(d:Device {`~id`:'Device-Mobile Safari3.0.5iOS'})
with g, d
Match (g)-[:HAS_IP]->(addr)
return g.MatchID as Group, addr.ip_address as IPAddress, d.system_id as Device

Results:

Group	IPAddress	Device
638fbd290d4b37db83f5748d23512c04	192.168.234.47	Mobile Safari3.0.5iOS
1f699577718938a59c16d8259b156458	192.168.12.207	Mobile Safari3.0.5iOS
8564d3a526843115a81b72a17f2884a2	192.168.139.21	Mobile Safari3.0.5iOS
8f5374bcd3f53a3c93c2db0c53baf66e	172.23.181.226	Mobile Safari3.0.5iOS

Visualization:

Conclusion

AWS Entity Resolution and Amazon Neptune can be integrated to understand the networks and interconnections between anonymous consumers. With a 360-degree consumer graph in Amazon Neptune, businesses can use advanced analytics for near real-time product recommendations, such as expediting product recommendations based on the historical data of identified consumers.

This solution architecture can help improve data quality by identifying duplicates and can drive innovation, analysis, and compliance by understanding the relationships between each entity.

To get started, visit AWS Entity Resolution and Amazon Neptune.

You can also use a managed service for your end-to-end data management needs, including data ingestion from more than 80 software-as-a-service application data connectors, unified profile creation (including entity resolution to remove duplicate profiles), and low-latency data access using Amazon Connect Customer Profiles, which empowers agents with the customer insights they need to deliver personalized customer service. With a complete view of relevant customer information in a single place, companies can provide more personalized customer service, deliver more relevant campaigns, and improve customer satisfaction. You can read how to build unified customer profiles using Amazon Connect, an AI-powered contact center, or watch how Choice Hotels has used Amazon Connect Customer Profiles to build unified traveler profiles.

AWS for Industries
