Practical Entity Resolution on AWS to Reconcile Data in the Real World
This post was co-written with Mamoon Chowdry, Solutions Architect, previously at AWS.
Businesses and organizations from many industries often struggle to ensure that their data is accurate. Data often has to match real-world people or things exactly, such as a customer name, an address, or a company. Matching our data is important to validate it, de-duplicate it, or link records in different systems together. Know Your Customer (KYC) regulations also mean that we must be confident about who or what our data refers to. We may have to match millions of records from different data sources, some of which may have been entered manually and contain inconsistencies.
It can often be hard to match data with the entity it is supposed to represent. For example, if a customer enters their details as “Mr. John Doe, #1a 123 Main St.” and you have a prior record in your customer database for “J. Doe, Apt 1A, 123 Main Street”, are they referring to the same or a different person?
In cases like this, we often have to manually update our data to make sure it accurately and consistently matches a real-world entity. You may want to have consistent company names across a list of business contacts. When there isn’t an exact match, we have to reconcile our data with the available facts we know about that entity. This reconciliation is commonly referred to as entity resolution (ER). This process can be labor-intensive and error-prone.
This blog will explore some of the common types of ER. We will share a basic architectural pattern for near real-time ER processing. You will see how ER using fuzzy text matching can reconcile manually entered names with reference data.
Multiple ways to do entity resolution
Entity resolution is a broad and deep topic, and a complete discussion is beyond the scope of this blog. However, at a high level, there are four common approaches to matching ambiguous fields or records to known entities.
1. Fuzzy text matching. We might normally compare two strings to see if they are identical. If they don’t exactly match, it is often helpful to find the nearest match. We do this by calculating a similarity score. For example, “John Doe” and “J Doe” may have a similarity score of 80%. A common way to compare the similarity of two strings is the Levenshtein distance, which measures the minimum number of single-character edits (insertions, deletions, or substitutions) needed to turn one string into the other (see the sketch after this list).
We may also examine more than one field. For example, we may compare a name and an address. Is “Mr. J Doe, 123 Main St” likely to be the same person as “Mr John Doe, 123 Main Street”? When we compare multiple fields in a record and analyze all of their similarity scores, this is commonly called pairwise comparison.
2. Clustering. We can plot records in an n-dimensional space based on values computed from their fields. Their similarity to other reference records is then measured by calculating how close they are to each other in that space. Records that cluster together are likely to refer to the same entity. Clustering is also an effective way to group or segment data in fields such as computer vision, astronomy, and market segmentation. An example of this method is K-means clustering.
3. Graph networks. Graph networks are commonly used to store relationships between entities, such as people who are friends with each other, or residents of a particular address. When we need to resolve an ambiguous record, we can use a graph database to identify potential relationships to other records. For example, “J Doe, 123 Main St,” may be the same as “John Doe, 123 Main St,” because they have the same address and similar names.
Graph networks are especially helpful when dealing with complex relationships over millions of entities. For example, you can build a customer profile using web server logs and other data.
4. Commercial off-the-shelf (COTS) software. Enterprises can also deploy ER software, such as the offerings in the AWS Marketplace, including Senzing entity resolution. This is helpful when companies don’t have the skill or experience to implement a solution themselves. It is also important to mention the role of Master Data Management (MDM) in ER. MDM involves having a single trusted source for your data, and tools such as Informatica can support ER through their MDM features.
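As referenced in the fuzzy text matching item above, here is a minimal sketch of a Levenshtein-based similarity score in pure Python. The normalization (edit distance divided by the longer string’s length) is our illustrative choice, not part of the original solution, and other normalizations will produce different percentages.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                current[j - 1] + 1,            # insertion
                previous[j] + 1,               # deletion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity score in [0, 1], where 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


print(f"{similarity('John Doe', 'J Doe'):.0%}")  # ~62% with this normalization
```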
Our solution (shown in Figure 1) allows us to build a low-cost, streamlined solution using AWS serverless technology. The architecture uses AWS Lambda, which allows you to run code without having to provision or manage servers. This code will be invoked through an API, which is created with Amazon API Gateway. API Gateway is a fully managed service used by developers to create, publish, maintain, monitor, and secure API operations at any scale. Finally, we will store our reference data in Amazon Simple Storage Service (S3).
Figure 1. Entity resolution solution using AWS serverless services
We initially match manually entered strings to a list of reference strings. The strings we will try to match will be names of companies.
- Our API takes a string as input
- It then invokes the ER Lambda function
- This loads the index and data files of our reference dataset
- The ER function finds the closest match in the list of real-world companies
- The closest match is returned
The reference data and index files were created in advance by running the fuzzy match algorithm over our reference dataset and exporting the resulting index.
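As a usage illustration, a client could call the deployed API as follows. The endpoint URL, resource path, and payload shape are hypothetical placeholders for this sketch; your own API Gateway deployment defines the real ones.

```python
import json
import urllib.request

# Hypothetical endpoint; replace with your deployed API Gateway stage URL
API_URL = "https://example.execute-api.us-east-1.amazonaws.com/prod/resolve"

payload = json.dumps({"companyName": "Amazonn Web Servces"}).encode("utf-8")
request = urllib.request.Request(
    API_URL, data=payload, headers={"Content-Type": "application/json"}
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read()))  # e.g. ["Amazon Web Services"]
```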
The fuzzy match algorithm in detail
The algorithm in the AWS Lambda function works by converting each string to a collection of n-grams. N-grams are short, overlapping substrings that are commonly used for analyzing free-form text; for example, the character trigrams of “Acme” are “Acm” and “cme”.
The n-grams are then converted to a simple vector. Each vector is a numerical statistic that represents the Term Frequency – Inverse Document Frequency (TF-IDF). Both TF-IDF and n-grams are used to prepare text for searching. N-grams of strings that are similar in nature tend to have similar TF-IDF vectors. If we plot these vectors, similar strings are grouped or clustered together, which helps us find them.
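Here is a minimal sketch of this preparation step, using scikit-learn’s TfidfVectorizer with character n-grams. The reference names and parameters are illustrative, not the original solution’s dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative reference data; the real solution uses a company-name dataset
org_names = ["Amazon Web Services", "Acme Corporation", "Example Industries"]

# Character trigrams capture misspellings better than whole-word tokens
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
tfidf_matrix = vectorizer.fit_transform(org_names)

# A misspelled query still lands closest to the right reference vector
query = vectorizer.transform(["Amazonn Web Servces"])
print(cosine_similarity(query, tfidf_matrix))  # highest score for the first name
```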
Comparing vectors to find similar strings can be fairly straightforward. But if you have numerous records, it can be computationally expensive and slow. To solve this, we use the NMSLIB library. This library indexes the vectors for faster similarity searching. It also gives us the degree of similarity between two strings. This is important because we may want to know the accuracy of a match we have found. For example, it can be helpful to filter out weak matches.
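The index and data files that the Lambda function loads can be produced offline in a step along these lines. This is a sketch under assumed parameters: the file name matches the one loaded in the next section, while the dataset and settings are illustrative.

```python
import nmslib
from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorize the reference company names (illustrative data, as above)
org_names = ["Amazon Web Services", "Acme Corporation", "Example Industries"]
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
tfidf_matrix = vectorizer.fit_transform(org_names)

# Build a NAPP index over the sparse TF-IDF vectors and export it,
# along with the underlying data, for the Lambda function to load
index = nmslib.init(method="napp", space="negdotprod_sparse_fast",
                    data_type=nmslib.DataType.SPARSE_VECTOR)
index.addDataPointBatch(tfidf_matrix)
index.createIndex(print_progress=True)
index.saveIndex("index_company_names.bin", save_data=True)
```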
The entity resolution Lambda
Using the NMSLIB library, which is packaged in a Lambda layer, we initialize an index with the Neighborhood APProximation (NAPP) method.
import json
import nmslib

# initialize the index
newIndex = nmslib.init(method='napp', space='negdotprod_sparse_fast',
                       data_type=nmslib.DataType.SPARSE_VECTOR)
Next, we import the index and data files that were created from our reference data.
# load the index and its data file (DATA_DIR is where the exported
# files are stored, for example after being fetched from Amazon S3)
newIndex.loadIndex(DATA_DIR + 'index_company_names.bin',
                   load_data=True)
The input parameter companyName is then used to query the index to find the approximate nearest neighbor. By using the knnQueryBatch method, we distribute the work over a thread pool, which provides faster querying.
# set the input variable and empty output list
inputString = companyName
outputList = []

# Find the nearest neighbors for our company name
# (K is the number of matches to return, set to 1)
K = 1
numThreads = 4  # size of the thread pool used for querying

# transform() expects an iterable of documents, so wrap the string in a list
newQueryMatrix = vectorizer.transform([inputString])
newNbrs = newIndex.knnQueryBatch(newQueryMatrix, k=K, num_threads=numThreads)
The best match is then returned as a JSON response.
# return the closest matches
# (orgNames holds the reference company names loaded with the data file)
for i in range(K):
    outputList.append(orgNames[newNbrs[0][0][i]])

return {
    'statusCode': 200,
    'body': json.dumps(outputList),
}
Cost estimate for solution
Our solution is a combination of Amazon API Gateway, AWS Lambda, and Amazon S3 (see each service’s pricing page for details). As an example, let’s assume that the API will receive 10 million requests per month. We can estimate the costs of running the solution as:
| Service | Description | Cost |
|---|---|---|
| AWS Lambda | 10 million requests and associated compute costs | $161.80 |
| Amazon API Gateway | HTTP API requests, average request size 34 KB, average message size 32 KB, 10 million requests/month | $10.00 |
| Amazon S3 | S3 Standard storage (including data transfer costs) | $7.61 |
| Total | | $179.41 |
Table 1. Example monthly cost estimate (USD)
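As a rough sanity check on the AWS Lambda line item, here is the arithmetic under assumed workload parameters. The 1 GB memory allocation and ~960 ms average duration are our assumptions; they are not stated in the original estimate.

```python
# Assumed workload (hypothetical; tune to your own measurements)
requests_per_month = 10_000_000
avg_duration_s = 0.96   # assumed average invocation time
memory_gb = 1.0         # assumed memory allocation

# Published Lambda prices (us-east-1) at the time of writing
price_per_request = 0.20 / 1_000_000
price_per_gb_second = 0.0000166667

request_cost = requests_per_month * price_per_request
compute_cost = (requests_per_month * avg_duration_s
                * memory_gb * price_per_gb_second)
print(f"${request_cost + compute_cost:,.2f}")  # ≈ $162, close to $161.80 above
```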
Conclusion
Using AWS services to reconcile your data with real-world entities helps make your data more accurate and consistent. You can automate a manual task that would otherwise be laborious, expensive, and error-prone.
Where can you use ER in your organization? Do you have manually entered or inaccurate data? Have you struggled to match it with real-world entities? You can experiment with this architecture to continue to improve the accuracy of your own data.