AWS for Industries

How to build serverless entity resolution workflows on AWS

Consumers expect companies to engage with them in a highly relevant and personalized way regardless of channel. In a study done by Twilio, 60 percent of consumers say that they will make a repeat purchase after a personalized experience. A study from McKinsey & Company indicates that more than 70 percent of consumers expect a personalized journey, and organizations that implement personalization realize a 10–15 percent lift in revenue.

Today, marketers and advertisers need a unified view of consumer data to drive personalized marketing and advertising campaigns across web, mobile, contact center, and social media channels. For example, if a consumer just bought a product on a brand’s website, marketers want to avoid sending them a promotional email for the same product and instead want to delight them with complementary products, increasing consumer engagement, loyalty, and trust in the brand. However, marketers often must rationalize multiple, often disparate consumer-level records across different channels, lines of business, and partners. These records have sparse or missing fields, misspellings, and incorrect or stale information, making them difficult to rationalize. Experian estimates as many as 94 percent of organizations suspect that their consumer and prospect data might be inaccurate, including duplication rates between 10 and 30 percent for companies that don’t have data quality initiatives in place. To address these challenges, companies need easy-to-use, configurable, and secure entity resolution capabilities to accurately match, link, and enhance consumer records.

This blog post describes a composable architecture pattern that helps you build a serverless end-to-end entity resolution solution using AWS Entity Resolution. AWS Entity Resolution helps companies match, link, and enhance related records across multiple applications, channels, and data stores using flexible, configurable workflows. This post focuses on building an automated data pipeline that can ingest and prepare data (near real-time and batch-based), perform matching, and retrieve matches in near real time using AWS Entity Resolution. Customers can also use Amazon Connect Customer Profiles, a managed service for end-to-end data management needs, including data ingestion from 80+ SaaS application data connectors, unified profile creation with entity resolution to remove duplicate profiles, and low-latency data access. With a complete view of relevant customer information in a single place, companies can provide more personalized customer service, deliver more relevant campaigns, and improve customer satisfaction. You can read how to build unified customer profiles with Amazon Connect or watch how Choice Hotels has used Customer Profiles to build unified traveler profiles.

High-level example

Let’s use the example of AnyCompany, a leading e-commerce brand, as context for the proposed solution. AnyCompany has over 100 sub-brands across consumer packaged goods (CPGs), electronics, travel, and many more. They want to deliver a personalized experience to their customers and build customer loyalty. Using a composable architecture pattern, AnyCompany builds a serverless solution that ingests records from multiple sources (customer relationship management (CRM), customer data platform (CDP), content management system (CMS), and master data management (MDM) systems) and creates a unified view of their customer.

Proposed solution architecture

The following architecture diagram and description provide an overview of the end-to-end flow for data ingestion, preparation, and resolution using various purpose-built serverless AWS services.

Step I – Historical data processing

  1. On Day 0 of the workflow setup, systems of engagement (source systems) contain historical consumer information that will be resolved using AWS Entity Resolution.
  2. Using batch data ingestion services or AWS Data Connector Solution, the historical data is loaded into the Amazon Simple Storage Service (Amazon S3) Raw Zone. See AWS Cloud Data Ingestion Patterns and Practices to learn more.
  3. To prepare the historical data for AWS Entity Resolution, use an Amazon EventBridge rule to run an AWS Step Functions Standard workflow, which orchestrates a data engineering pipeline. The Amazon EventBridge rule can be scheduled to launch at a particular frequency to process batch data sources.
  4. Within the AWS Step Functions Standard workflow, an AWS Glue job transforms the data stored within the raw Amazon S3 location. Use this step to validate, normalize, and secure personally identifiable information (PII).
    1. See Guidance for Customizing Normalization Library for AWS Entity Resolution for building a customized data normalization workflow.
    2. See Guidance for Preparing and Validating Records for Entity Resolution on AWS for creating a data preparation workflow that includes validation and standardization of data.
  5. The normalized and validated data is stored within a Clean Zone S3 bucket in CSV or Parquet format and cataloged in the AWS Glue Data Catalog as an AWS Glue table.
  6. Configure the Clean Zone S3 bucket to generate EventBridge notifications. See Use Amazon S3 Event Notifications with Amazon EventBridge for more details.
  7. Create an AWS Entity Resolution matching workflow using a rule-based matching technique that operates at an automatic processing cadence. This enables the service to resolve identities from various data sets as they arrive in the Clean Zone. AWS Entity Resolution links and matches records to create unique profiles, called MatchGroups. Each MatchGroup is assigned a unique persistent ID (MatchId).
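As a sketch of step 7, the rule-based matching workflow can be created through the AWS SDK. All names below (workflow, schema, bucket, role, account) are hypothetical placeholders, and the request shape assumes the boto3 `entityresolution` client's `create_matching_workflow` operation; the `IMMEDIATE` incremental run type corresponds to the automatic processing cadence described above.

```python
# Minimal sketch of a rule-based AWS Entity Resolution matching workflow.
# All ARNs and names are hypothetical placeholders.
workflow_config = {
    "workflowName": "anycompany-consumer-matching",
    "inputSourceConfig": [
        {
            # Glue table created in step 5 (Clean Zone catalog entry)
            "inputSourceARN": "arn:aws:glue:us-east-1:111122223333:table/clean_zone_db/consumers",
            "schemaName": "consumer-schema",
        }
    ],
    "outputSourceConfig": [
        {
            "outputS3Path": "s3://anycompany-er-output/matches/",
            "output": [{"name": "email"}, {"name": "phone"}],
        }
    ],
    "resolutionTechniques": {
        "resolutionType": "RULE_MATCHING",
        "ruleBasedProperties": {
            "attributeMatchingModel": "ONE_TO_ONE",
            # Rules are evaluated in order; the first match wins
            "rules": [
                {"ruleName": "email-and-phone", "matchingKeys": ["email", "phone"]},
                {"ruleName": "email-only", "matchingKeys": ["email"]},
            ],
        },
    },
    # Automatic processing cadence: resolve new Clean Zone data as it arrives
    "incrementalRunConfig": {"incrementalRunType": "IMMEDIATE"},
    "roleArn": "arn:aws:iam::111122223333:role/EntityResolutionWorkflowRole",
}

# To create the workflow (requires AWS credentials and permissions):
# import boto3
# client = boto3.client("entityresolution")
# response = client.create_matching_workflow(**workflow_config)
```

Records that satisfy the first rule are linked on both email and phone; records that only share an email fall through to the second, looser rule.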

Step II – Near real-time lookup

  1. Use Amazon API Gateway to host REST APIs that serve the near real-time systems of engagement for their identity resolution lookup needs.
  2. Use Synchronous Express Workflows for AWS Step Functions to orchestrate micro-services to achieve existing entity lookup and other business rule validations in near real time. See New Synchronous Express Workflows for AWS Step Functions for the detailed steps to create a Synchronous Express Workflow with API Gateway integration.
  3. The AWS Step Functions workflow calls one or more AWS Lambda functions in sequence or in parallel to accomplish validation of PII data such as email, address, and phone numbers.
  4. The normalized and validated PII data is sent to the AWS Entity Resolution GetMatchId action as an input to match against the previously created MatchGroups. For example, AnyCompany might want to know if a visitor on their site is a known consumer so that they can provide a contextual experience. In this case, the data collected by the first-party (1P) cookie can be sent to AWS Entity Resolution through the GetMatchId API to match it against the known MatchGroups. If a match is found, the corresponding MatchId is sent back as a response.
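Steps 3 and 4 can be sketched as a small Lambda-style function: light PII normalization followed by a GetMatchId lookup. The workflow name is a hypothetical placeholder, the call assumes the boto3 `entityresolution` client's `get_match_id` operation, and the client is injected as a parameter so the function can be exercised without AWS access.

```python
import re

def normalize(record):
    """Light-touch PII normalization before lookup: lowercase/trim email,
    keep digits only for phone. A real pipeline would do more (see the
    normalization guidance linked above)."""
    out = dict(record)
    if "email" in out:
        out["email"] = out["email"].strip().lower()
    if "phone" in out:
        out["phone"] = re.sub(r"\D", "", out["phone"])
    return out

def lookup_match_id(record, client, workflow_name="anycompany-consumer-matching"):
    """Send normalized PII to the AWS Entity Resolution GetMatchId action
    and return the MatchId, or None if no MatchGroup matched."""
    response = client.get_match_id(
        workflowName=workflow_name,
        record=normalize(record),
        applyNormalization=True,  # let the service apply its own normalization too
    )
    return response.get("matchId") or None

# In a Lambda handler, the client would be created once per container:
# import boto3
# er_client = boto3.client("entityresolution")
# def handler(event, context):
#     return {"matchId": lookup_match_id(event["record"], er_client)}
```

Injecting the client keeps the normalization and lookup logic unit-testable with a stub, which fits the micro-service orchestration pattern of the Synchronous Express Workflow.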

Step III – Ongoing incremental processing

  1. If a match is found, the system of engagement consumes the MatchId and uses it to inform application decisioning. This ID can be persisted in the application database or used to enrich event data generated for downstream asynchronous processing.
  2. Every incoming incremental data feed from relevant sources (batch or real-time) is sent to AWS Entity Resolution to ensure that the MatchGroups are kept current. In the near real-time lookup flow, a Lambda function is configured to send incoming lookup requests to Amazon Kinesis Data Streams.
  3. Use Amazon Kinesis Data Firehose to write to the streaming Raw S3 Bucket. This data is then processed by the same workflow that was set up during the historical processing. The incremental data is passed to AWS Entity Resolution to match it against the previously created MatchGroups. If any incremental record resolves to an existing MatchGroup, it inherits the same MatchId, otherwise it might create a new MatchGroup with its own MatchId. In scenarios where an existing record is updated as part of the incremental run, the new information is evaluated against the existing MatchGroup and, based on the information change, it can either split into a new MatchGroup or retain its existing MatchGroup.
  4. AWS Entity Resolution generates output files to the S3 bucket. Using AWS Glue, this output is packaged and sent to systems of engagement for activation and personalization. This post-processing might involve merging records from multiple sources into a single golden record, among other things. It also includes processing any workflow error files that the service generates for human triage.
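The Lambda fan-out in step 2 can be sketched as follows: the lookup record, enriched with its MatchId when one was found, is forwarded to Kinesis Data Streams so the batch pipeline keeps MatchGroups current. The stream name and partition-key choice are hypothetical; the call uses the standard Kinesis `put_record` operation, and the client is injected for testability.

```python
import json

def forward_to_stream(record, match_id, client, stream_name="anycompany-er-incremental"):
    """Enrich the lookup record with its MatchId (if any) and forward it to
    Kinesis Data Streams for downstream incremental processing."""
    payload = dict(record)
    if match_id:
        payload["matchId"] = match_id
    return client.put_record(
        StreamName=stream_name,
        Data=json.dumps(payload).encode("utf-8"),
        # Partitioning by email (a hypothetical choice) keeps updates for
        # one consumer ordered within a single shard
        PartitionKey=payload.get("email", "unknown"),
    )

# In the lookup Lambda, this would run after GetMatchId:
# import boto3
# kinesis = boto3.client("kinesis")
# forward_to_stream(event["record"], match_id, kinesis)
```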

Security

Use the following AWS services to implement security and access control:

  1. AWS Identity and Access Management (IAM): Enforce least-privilege access to specific resources and operations
  2. AWS Key Management Service (AWS KMS): Manage the life cycle of encryption keys used for securing data at rest and in transit
  3. AWS Secrets Manager: Provide secure storage of secrets such as passwords and API keys
  4. Amazon CloudWatch: Capture logs and metrics across all services used in this solution in a central location
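As an illustration of least-privilege IAM in this solution, the near real-time lookup Lambda can be scoped to a single matching workflow and a single KMS key. The account ID, Region, workflow name, and key ID below are all placeholders, and the exact ARN formats should be verified against the AWS Entity Resolution documentation.

```python
import json

# Hypothetical least-privilege policy for the lookup Lambda: it may only
# call GetMatchId on one workflow and decrypt with one KMS key.
lookup_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowGetMatchId",
            "Effect": "Allow",
            "Action": ["entityresolution:GetMatchId"],
            "Resource": "arn:aws:entityresolution:us-east-1:111122223333:matchingworkflow/anycompany-consumer-matching",
        },
        {
            "Sid": "AllowDecryptWorkflowKey",
            "Effect": "Allow",
            "Action": ["kms:Decrypt"],
            "Resource": "arn:aws:kms:us-east-1:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab",
        },
    ],
}

# JSON document ready to attach as an inline policy via the console or CLI
policy_json = json.dumps(lookup_policy, indent=2)
```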

Conclusion

In this post, we showed how you can use AWS services to build a serverless entity resolution workflow through the example of AnyCompany’s use case to serve personalized experiences to consumers. We introduced three main capabilities of AWS Entity Resolution that allow you to build data matching workflows:

  1. Near real-time entity lookup
  2. Automatic incremental processing
  3. On-demand batch workflows

For more information, see AWS Entity Resolution Features.

We also explained how other AWS serverless services can be combined to build microservice orchestration for near real-time system integration and batch data processing workflows. See AWS Entity Resolution resources to learn more.

Ranjith Krishnamoorthy

Ranjith is the WW Head of Data Platforms on the Advertising and Marketing Technology Industry Solutions team. In this role, his focus is on helping AWS customers achieve their advertising and marketing business goals using AWS services, AWS Solutions, Solutions Guidance, and partner solutions. He brings more than 20 years of broad global experience working for large enterprises (telecom, retail, manufacturing), independent software vendors, and system integrators to solve customer challenges. His goal is to dive deep and provide an unbiased view to help customers choose the right cloud and data analytics technology and architecture to address their business and technology challenges. He is currently working on bringing AWS Solutions to market that address use cases in privacy-enhanced data collaboration, audience and customer data management, and real-time advertising.

Punit Shah

Punit is a Senior Solutions Architect at Amazon Web Services, where he is focused on helping customers build their data and analytics strategy on the cloud. In his current role, he assists customers in building a strong data foundation to solve advertising and marketing use cases, using AWS services like AWS Entity Resolution, Amazon Connect, and Amazon Neptune. He has more than 15 years of industry experience building large data lakes.

Shobhit Gupta

Shobhit is a Head of Product at Amazon Web Services. He has expertise in building data management services for machine learning across industries such as healthcare, retail, financial services, and the public sector. At AWS he works with teams at the intersection of data and machine learning, such as AWS Clean Rooms, Amazon Connect, AWS Entity Resolution, and Amazon Personalize. He is a serial entrepreneur with more than 10 years of experience scaling companies in mobile applications, big data, and the Internet of Things (IoT). He has also spent time in management consulting, advising clients in the public sector, healthcare, and retail.