AWS Big Data Blog
Incremental data matching using AWS Lake Formation and AWS Glue
AWS Lake Formation provides a machine learning (ML) capability (FindMatches transform) to identify duplicate or matching records in your dataset, even when the records don’t have a common unique identifier and no fields match exactly. Customers across many industries have come to rely on this feature for linking datasets like patient records, customer databases, and TV shows. The initial release of the FindMatches transform identified matching records within a single dataset. When you had a new dataset, you had to merge it with the existing clean dataset and rerun matching against the complete merged dataset.
We’re excited to announce the Lake Formation FindMatches incremental matching feature (Find Incremental Matches), which enables you to match incremental records against existing matched datasets.
In this post, you learn how to use the Find Incremental Matches capability to match prospects data with existing customer datasets for the marketing department of a fictional company. The dataset used for this post is synthetically generated.
Overview of solution
The marketing department of our fictional company is responsible for organizing promotion campaigns every month and developing communications content to promote services and products to prospects (potential new customers). A list of prospects is generated by multiple internal business processes and also from multiple third-party services.
At the end of each month, the marketing team ends up with hundreds of thousands of prospects. The team then has the herculean task of identifying unique prospects by removing duplicates and existing customers from the list.
The prospect list purchased from the third-party service doesn’t have any common unique identifiers like Social Security number (SSN) or driver’s license, which makes these tasks arduous to do manually.
You can use the ML capabilities of Lake Formation to address this challenge. The Find Incremental Matches transform enables you to identify duplicate or matching records in your dataset, even when the records don’t have a common unique identifier and no fields match exactly.
Specifically, the new incremental match capability provides the flexibility to match hundreds of thousands of new prospects with the existing database of prospects and customers without merging the two databases. Moreover, by conducting matches only between the new and existing datasets, the Find Incremental Matches optimization reduces computation time, which also reduces cost.
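To get a feel for why comparing only new records saves work, the following sketch counts candidate record pairs under a deliberately naive all-pairs model. This is only an illustration: the dataset sizes are hypothetical, and the post does not describe how the transform selects pairs internally, so the numbers should not be read as the transform's actual workload.

```python
from math import comb


def full_rematch_pairs(existing: int, incremental: int) -> int:
    """Candidate pairs when the merged dataset is rematched from scratch."""
    return comb(existing + incremental, 2)


def incremental_pairs(existing: int, incremental: int) -> int:
    """Candidate pairs when only new records are compared:
    new-vs-existing plus new-vs-new."""
    return existing * incremental + comb(incremental, 2)


# Hypothetical sizes, not taken from the post's dataset.
existing, incremental = 1_000_000, 200_000

saved = full_rematch_pairs(existing, incremental) - incremental_pairs(existing, incremental)

# In this naive model, the saving is exactly the existing-vs-existing pairs,
# which were already resolved by the earlier matching run.
print(saved == comb(existing, 2))
```

In this model, the larger the existing dataset is relative to the new batch, the larger the share of comparisons that can be skipped.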
The following screenshot shows a sample of the existing customers dataset.
The following screenshot shows a sample of the incremental prospect dataset.
In this post, you perform the following steps for incremental matching:
- Run an AWS Glue extract, transform, and load (ETL) job for initial matching.
- Run an AWS Glue ETL job for incremental matching.
- Verify output data from Amazon Simple Storage Service (Amazon S3) with Amazon Athena.
The first step of initial matching is mandatory in order to perform incremental matching.
Prerequisites
To create resources for incremental matching in AWS Glue, launch the following AWS CloudFormation stack in the us-east-1 Region:
This stack creates the following resources:
- An S3 bucket that stores the inputs and outputs of matching
- The AWS Glue database marketing-demo
- AWS Glue tables for existing and incremental customers:
  - existing_customers – Raw customer data
  - cleaned_existing_customers – Matched and cleaned customer data. This is the output generated by the InitialMatching job.
  - incremental_prospects – New incremental prospects data for matching
  - unique_prospects – Final output of unique prospects as required by this post’s use case
- The AWS Glue ML transform incremental-match-blog-transform
- AWS Glue jobs for initial matching and incremental matching:
  - InitialMatching – For matching and transforming existing_customers to cleaned_existing_customers
  - IncrementalMatching – For incrementally matching new prospects data with cleaned_existing_customers and identifying unique prospects
- IAM roles
Run an AWS Glue ETL job for initial matching
Before we perform the incremental matching, we need to clean the existing customer datasets by running an AWS Glue ETL job:
- On the AWS Glue console, choose Jobs in the navigation pane.
- Select the job InitialMatching.
- On the Action menu, choose Run job.
This job uses the FindMatches transform to identify unique and matched customers from the existing_customers table and writes them to the cleaned_existing_customers table. The transform adds another column named match_id to identify matching records in the output. Rows with the same match_id are considered matching records.
The cleaned_existing_customers table becomes the primary customer data table, and incremental customer data is matched against this table.
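The way match_id ties records together can be illustrated with a small plain-Python sketch. The rows and every column other than match_id are hypothetical and only stand in for the shape of the cleaned output.

```python
from collections import defaultdict

# Hypothetical rows; only the match_id column name comes from the post.
rows = [
    {"customer_id": "c1", "name": "Jane Doe", "match_id": 1},
    {"customer_id": "c2", "name": "Jane  Doe", "match_id": 1},
    {"customer_id": "c3", "name": "John Roe", "match_id": 2},
]


def group_by_match_id(records):
    """Collect customer IDs under the match_id they share."""
    groups = defaultdict(list)
    for record in records:
        groups[record["match_id"]].append(record["customer_id"])
    return dict(groups)


groups = group_by_match_id(rows)

# Groups with more than one member are records judged to be the same customer.
duplicates = {mid: ids for mid, ids in groups.items() if len(ids) > 1}
print(duplicates)
```

Here c1 and c2 share a match_id and therefore count as one customer, while c3 stands alone.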
Run an AWS Glue ETL job for incremental matching
To perform the incremental matching, complete the following steps:
- On the AWS Glue console, choose Jobs in the navigation pane.
- Select the job IncrementalMatching.
- On the Action menu, choose Run job.
In comparison to the initial FindMatches scripts, the following changes are added to read data from the incremental customers table (lines 24 and 27) and call the incremental matching API (line 30):
[Code listing: excerpt of the IncrementalMatching script, lines 6, 22, 24, 26, 27, and 30]
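As a rough illustration of those changes, a minimal sketch of reading both tables and calling the incremental matching transform in a Glue PySpark job might look like the following. The function name, parameter names, and transformation_ctx labels are hypothetical, the line positions do not correspond to the line numbers cited above, and the database name and transform ID are left as parameters rather than assumed. The FindIncrementalMatches import only resolves inside the AWS Glue job runtime.

```python
def find_incremental_matches(glue_context, database, transform_id,
                             existing_table="cleaned_existing_customers",
                             incremental_table="incremental_prospects"):
    """Read the cleaned and incremental tables, then match the new records.

    glue_context is expected to be an awsglue.context.GlueContext created
    by the job; transform_id is the ID of the trained ML transform.
    """
    # Imported lazily: this module exists only inside the AWS Glue job runtime.
    from awsglueml.transforms import FindIncrementalMatches

    # Existing, already matched customer records.
    existing = glue_context.create_dynamic_frame.from_catalog(
        database=database,
        table_name=existing_table,
        transformation_ctx="existing_customers_source",
    )

    # Newly arrived prospect records.
    incremental = glue_context.create_dynamic_frame.from_catalog(
        database=database,
        table_name=incremental_table,
        transformation_ctx="incremental_prospects_source",
    )

    # Match the incremental records against the existing ones.
    return FindIncrementalMatches.apply(
        existingFrame=existing,
        incrementalFrame=incremental,
        transformId=transform_id,
        transformation_ctx="find_incremental_matches",
    )
```

Keeping a distinct transformation_ctx on each source matters when job bookmarks are enabled, because bookmarks use that label to track what each source has already read.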
The DynamicFrame incrementalMatchesResult contains both matched and unmatched records from the incremental prospects dataset. Matching is done both within the prospects dataset and against the existing customer dataset. In the script, the DynamicFrame incrementalMatchesResult is further processed to filter and store the unique prospects from the incremental dataset (lines 37–53).
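The filtering idea can be sketched in plain Python. This is not the script's lines 37–53. It is a simplified analogue that assumes a prospect matching an existing customer carries that customer's match_id, and that duplicate prospects share a match_id with each other. The record fields other than match_id are hypothetical.

```python
def unique_prospects(incremental_result, existing_match_ids):
    """Keep one prospect per match_id, skipping known customers."""
    seen = set()
    unique = []
    for record in incremental_result:
        match_id = record["match_id"]
        # Skip prospects that match an existing customer or an earlier prospect.
        if match_id in existing_match_ids or match_id in seen:
            continue
        seen.add(match_id)
        unique.append(record)
    return unique


# Hypothetical data: match_ids 1 and 2 belong to existing customers.
existing_ids = {1, 2}
result = [
    {"prospect_id": "p1", "match_id": 1},  # already a customer
    {"prospect_id": "p2", "match_id": 7},
    {"prospect_id": "p3", "match_id": 7},  # duplicate of p2
    {"prospect_id": "p4", "match_id": 8},
]

kept = [r["prospect_id"] for r in unique_prospects(result, existing_ids)]
print(kept)
```

In a Glue job the same logic would typically be expressed with Spark DataFrame operations rather than a Python loop.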
The job takes a few minutes to complete with 10 worker nodes. When the job is complete, you can find the matched records in the target S3 path specified in the script.
Create an AWS Glue job bookmark
Because the incremental matching targets the datasets that are received at certain intervals and joins with the existing dataset to generate output, we highly recommend you enable AWS Glue job bookmarks when you create the job. This way, when the new incremental dataset is available, you can schedule the job to run and don’t need to make any change in the ETL script.
- On the AWS Glue console, choose Jobs in the navigation pane.
- Select the job IncrementalMatching.
- On the Action menu, choose Edit job.
- Under Advanced properties, for Job bookmark, choose Enable.
- Choose Save.
When a new prospect dataset arrives, you only need to upload it to the S3 location of the incremental dataset and run the incremental matching job you created. AWS Glue job bookmarks track both the existing and incremental data that was already processed during your previous job run, so the job automatically reads the cleaned customer dataset generated by the previous job and the newly added incremental prospect dataset. The incremental matching job writes the output to the same target S3 path.
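If you prefer to start the run programmatically instead of from the console, a minimal sketch with the AWS SDK for Python could look like this. The helper names are hypothetical. The --job-bookmark-option argument with the value job-bookmark-enable is the Glue job parameter that turns bookmarks on for a run, and the client is passed in so the sketch does not assume any particular credentials setup.

```python
JOB_NAME = "IncrementalMatching"


def build_run_request(job_name=JOB_NAME):
    """Build the keyword arguments for a bookmark-enabled job run."""
    return {
        "JobName": job_name,
        "Arguments": {"--job-bookmark-option": "job-bookmark-enable"},
    }


def start_incremental_run(glue_client, job_name=JOB_NAME):
    """Start the job with bookmarks enabled and return the run ID.

    glue_client is expected to be a boto3 Glue client, for example
    boto3.client("glue", region_name="us-east-1").
    """
    response = glue_client.start_job_run(**build_run_request(job_name))
    return response["JobRunId"]
```

A scheduled trigger or an S3 event handler could call start_incremental_run whenever a new prospect file lands.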
Verify the output
To review the unique prospects identified by the IncrementalMatching job, complete the following steps:
- On the Athena console, make sure you’re in the correct Region.
- Choose AwsGlueDataCatalog as your data source and marketing_demo as the database.
- Create the following query:
- Choose Run query.
The Results window shows all the unique prospects from the incremental prospect dataset.
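The same check can be scripted. The sketch below is an assumption rather than the query used in this post: it builds a simple SELECT over the unique_prospects table and submits it through a boto3 Athena client. The helper names, the row limit, and the output_location parameter (an S3 path for query results) are hypothetical.

```python
def build_unique_prospects_query(database="marketing_demo",
                                 table="unique_prospects", limit=10):
    """Return a simple preview query; an assumed stand-in, not the post's query."""
    return f'SELECT * FROM "{database}"."{table}" LIMIT {int(limit)}'


def run_unique_prospects_query(athena_client, output_location,
                               database="marketing_demo"):
    """Submit the preview query and return the Athena query execution ID.

    athena_client is expected to be a boto3 Athena client; output_location
    is an S3 URI where Athena can write results.
    """
    response = athena_client.start_query_execution(
        QueryString=build_unique_prospects_query(database=database),
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


print(build_unique_prospects_query())
```

Athena runs queries asynchronously, so a real script would poll the execution ID for completion before fetching results.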
Pricing
In the us-east-1 Region, the total runtime is approximately 7 minutes for both jobs. We configured these jobs to run with 10 workers of the standard worker type, resulting in a total cost of $1.47. Pricing can vary by Region. For more information, see AWS Glue pricing.
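One way to arrive at a figure close to the stated total is sketched below. All of the inputs are assumptions not given in this post: a rate of $0.44 per DPU-hour, one DPU per worker, a 10-minute minimum billing duration per job run, and a hypothetical split of the 7 minutes into 3 and 4 minutes. Check the AWS Glue pricing page for the rates that apply to your Region and Glue version.

```python
# Assumed values, not stated in the post.
DPU_HOUR_RATE_USD = 0.44
MIN_BILLED_MINUTES = 10
DPUS_PER_JOB = 10  # assumes one DPU per worker, 10 workers


def job_cost(runtime_minutes, dpus=DPUS_PER_JOB):
    """Cost of one job run, applying the assumed minimum billing duration."""
    billed_minutes = max(runtime_minutes, MIN_BILLED_MINUTES)
    return dpus * (billed_minutes / 60) * DPU_HOUR_RATE_USD


# Hypothetical split of the roughly 7 minutes of total runtime.
runtimes = {"InitialMatching": 3, "IncrementalMatching": 4}

total = sum(job_cost(minutes) for minutes in runtimes.values())
print(round(total, 2))
```

Under these assumptions each run is billed for the minimum duration, so the split between the two jobs does not change the total.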
Conclusion
This post showed how you can incrementally match a new prospect dataset against an existing customer dataset using the Lake Formation FindMatches transform in order to identify unique prospects. You can use a similar process to identify duplicates and matched records from the incremental dataset, which is especially useful for use cases such as product matching and fraud detection.
To learn more, see the AWS Glue PySpark or Scala documentation. Please send any feedback to the AWS Glue Discussion Forums or through your usual AWS Support contacts.
About the Authors
Shehzad Qureshi is a Senior Software Engineer at Amazon Web Services.
Bin Pang is a software development engineer at Amazon Web Services.
Deenbandhu Prasad is a Senior Analytics Specialist at AWS, specializing in big data services. He is passionate about helping customers build modern data platforms on the AWS Cloud. He has helped customers of all sizes implement data management, data warehouse, and data lake solutions.