AWS Big Data Blog

Matching patient records with the AWS Lake Formation FindMatches transform

Patient matching is a major obstacle in achieving healthcare interoperability. Mismatched patient records and inability to retrieve patient history can cause significant barriers to informed clinical decision-making and result in missed diagnoses or delayed treatments. Additionally, healthcare providers often invest in patient data deduplication, especially when the number of patient records is growing rapidly in their databases. Electronic Health Records (EHRs) have significantly improved patient safety and care coordination in recent years; however, accurate patient matching remains a challenge for many healthcare organizations.

Duplicate patient records emerge for a variety of reasons, including human-generated number insertion, deletion, substitution, or transposition errors. Optical Character Recognition (OCR) software, which digitizes patient records, may also introduce errors.

Multiple types of record-matching algorithms exist to solve this problem. These include basic deterministic methods such as grouping and comparing relevant fields (such as SSN, name, or date of birth), phonetic encoding systems, and more advanced algorithms using machine learning (ML).

AWS Lake Formation is a HIPAA-eligible service that helps you build a secure data lake in a few simple steps. Lake Formation also contains FindMatches, an ML transform that enables you to match records across different datasets and identify and remove duplicate records, with little to no human intervention.

This post shows you how to use the FindMatches ML transform to identify matching patient records in a synthetically generated dataset. To use FindMatches, you don’t have to write code or know how ML works. This is useful for finding matches in your data when it doesn’t include a reliable, unique personal identifier, even if the fields don’t match exactly.

Patient dataset

Various regulations in different countries govern patient data due to its sensitive nature. As such, the availability of patient data on which to train matching algorithms is often scarce, which complicates model development. A common method to get around these challenges is synthetic data. This post generates patient data based on the Open Source Freely Extensible Biomedical Record Linkage Program (FEBRL). FEBRL uses Hidden Markov Models (HMMs) to prepare the name and address data for patient record matching. It also allows for mimicking real-life patient datasets that lead to duplicates, which may have the following mismatches:

  • Blank fields.
  • Typographical errors such as misspelling, transposing the characters, or swapping the fields.
  • Shortened middle names versus records complete middle names.
  • Various formats of mailing address and its components.
  • OCR-related errors.
  • Phonetical errors.
  • No globally unique patient or person identifier. Every healthcare provider may have a patient identifier assigned to the same person but may not have a person identifier like SSN, so they have datasets without keys.

FEBRL can generate these types of datasets based on configurable parameters to change the probabilities of each type of error, and thus incorporate a variety of scenarios leading to duplicates. The generation of the synthetic dataset is beyond the scope of this post; this post provides a pre-generated dataset for you to explore. In brief, below are the steps to generate the synthetic dataset which will be used to run FindMatches:

  1. Download and install FEBRL.
  2. Modify the parameters to create a dataset mimicking your expectations. For more information, see the dataset generation instructions of FEBRL.
  3. Cleanse the dataset (this confirms the same schema for each record, and removes single quotes and family role).

The dataset for this post uses AWS Region US East (N. Virginia).

FRBRL patient data structure

The following table shows the structure of FEBRL patient data. The data includes 40,000 records.

The original record and duplicate records are grouped together. The patient_id values are generated in a specific format: rec-<record number>-org/dup-<duplicate record number>, followed by the FEBRL data generator tool.

The following table is a preview of what you’re working to achieve with the FindMatches ML transform. Once you match datasets, the resulting table mirrors the input table’s structure and data and adds a match_id column. Matched records display the same match_id value. It is still possible to have false positives and false negatives, but the transform is still extremely beneficial.


The sample synthetic patient dataset for this post uses Region US East (N. Virginia), and thus all the steps mentioned in this post must be performed in the same AWS region i.e. us-east-1 but the steps are easily modifiable if your data is in a different region.

Solution architecture

The following diagram shows the solution’s architecture.



Solution overview

At a high level, the matching process includes the following steps:

  1. Upload raw patient dataset in csv format on Amazon S3 Bucket
  2. Crawl the uploaded patient dataset using AWS Glue crawler
  3. Catalog your patient data with the AWS Glue Data Catalog and create a FindMatches ML transform.
  4. Create a label set, either using ML transform or manually, and teach FindMatches by providing labeled examples of matching and non-matching records. Upload your labels and estimate the quality of the prediction. Add more labelsets and repeat this step as required to get the required Precision, Accuracy and Recall rate.
  5. Create and execute an AWS Glue ETL job that uses your FindMatches transform.
  6. Store the results of FindMatches transform on Amazon S3 bucket
  7. Create an AWS Glue data catalog of FindMatches ML transform results.
  8. Review the transform results with Amazon Athena.

Cataloging your data and creating a FindMatches ML transform

FindMatches operates on tables defined in the AWS Glue Data Catalog. Use AWS Glue crawlers to discover and catalog the patient data. You can use the FEBRL patient data generated for this post.

The Cloudformation stack provided below creates the resources in the AWS region - us-east-1 (US East (N. Virginia)

To create the Catalog and FindMatches ML transform in AWS Glue, launch the following stack:

This stack creates the following resources:

  • An Amazon S3 bucket that stores the ML transform results (configurable as part of the launch). You can find the name of the bucket on the AWS CloudFormation stacks console, under the Outputs This post uses the name S3BucketName.
  • An IAM role to allow AWS Glue to access other services, including S3.
  • An AWS Glue database (configurable as part of the launch).
  • An AWS Glue table that represents the Public Original Synthetic Patients Dataset (configurable as part of the launch).
  • An AWS Glue ML transform with the source as your AWS Glue table, with Accuracy set to 1 and Precision set to 0.9.

For more information, see Integrate and deduplicate datasets using AWS Lake Formation FindMatches.

Tuning the ML transform

The safety risks of false positive matches, in which clinicians believe incorrect information about the patient to be accurate, may be greater than the safety risks of false negative matches, in which clinicians lack access to the existing information about the patient. (For more information, see the related study on the NCBI website.) Therefore, moving the Recall vs. Precision slider toward Precision has a higher level of confidence to identify if the records belong to the same patient and minimizes the safety risks of false positive matches.

A higher Accuracy setting helps achieve higher recall, at the expense of a longer runtime (and thus cost) necessary to compare more records.

To achieve the comparatively better results for this particular dataset, the launch stack already created the transform for you with the Recall vs. Precision slider set to 0.9 toward Precision and the Lower Cost vs. Accuracy slider set to Accuracy. If needed, you can adjust these values later by selecting the transform and using the Tune menu.

Teaching FindMatches using labeled data

After running the launch stack successfully, you can train the transform by supplying it with matching and non-matching records using labelsets.

Creating a labeling set

You can create a labeling set yourself or allow AWS Glue to generate the labeling set based on heuristics.

AWS Glue extracts records from your source data and suggests potential matching records. The generated labelset file contains approximately 100 data samples for you to work on.

This post provides you with an AWS Glue generated labeled data file that you can use, with a fully populated label column. This file is fully ready for consumption.

If you choose to use the pre-generated labeled data file provided in this post, skip the below labeling file generation steps

To create the training set, complete the following steps:

  1. On the AWS Glue console, Under ETL, Jobs and ML transforms you will find the ML transform with name cfn-findmatches-ml-transform-demo created for you by the stack provided.
  2. Choose the ML transform cfn-findmatches-ml-transform-demo and click on Action and select Teach transform
  3. For Teach the transform using labels, choose I do not have labels.
  4. Choose Generate labeling file.
  5. Provide the S3 path to store the generated label file.
  6. Choose Next.

The following table shows the generated labeled data file with an empty label column.

You need to populate the label column by marking the records that are a real match with the same value. Each labelset should contain positive and negative match examples.

This post provides you with a labeled data file that you can use, with a fully populated label column. This file is fully ready for consumption.

The following table shows the table with a completed label column.

The labeling file has the same schema as the input data, plus two additional columns: labeling_set_id and label.

The training dataset is divided into labeling sets. Each labeling set displays a labeling_set_id value. This identification simplifies the labeling process, enabling you to focus on the match relationship of records within the same labeling set, rather than having to scan the entire file. For the preceding dataset, extract the label values from patient_id by removing the suffix -org and -dup using regular expression. But in general, you would assign these labels according to which records should match based on the attribute values.

If you specify the same label value for two or more records within a labeling set, you teach the FindMatches transform to consider these records a match. On the other hand, when two or more records have different labels within the same labeling set, FindMatches learns that these records aren’t considered a match. The transform evaluates record relationships between records within the same labeling set, not across labeling sets.

You should label a few hundred records to achieve a modest match quality, and a few thousand records to achieve a high match quality.

Uploading your labels and reviewing match quality

After you create the labeled dataset (which needs to be in .csv format), teach FindMatches where to find it. Complete the following steps:

  1. On the AWS Glue console, select the transform that you created earlier.
  2. Choose Action.
  3. Choose Teach transform.
  4. For Upload labels, choose I have labels.
  5. Choose Upload labeling file from S3.
  6. Choose Next.
  7. If you want to use the labelset provided in this blog post, download the labelset here.
  8. Create a folder with name trainingset in the same S3 bucket created by the previously launched cloudformation template above.
  9. Upload the above labelset in the trainingset folder in the same S3 bucket
  10. Choose Overwrite my existing labels.You are only using one set of labels. If adding labels iteratively, choose the Append to my existing labels option.
  11. Choose Upload.With the labels uploaded, your transform is now ready to use. Though not strictly required, check the transform match quality by reviewing the metrics of matching and non-matching records.
  12. Choose Estimate transform quality.The transform quality estimate learns using 70% of your labels. After it’s trained, the quality estimate tests how well the transform learned to identify matching records against the remaining 30%. Finally, the transform generates quality metrics by comparing the matches and non-matches predicted by the algorithm vs. your actual labels. This process may take up to several minutes.

Your results should be similar to those in the following screenshot.

Consider these metrics approximate, because the test uses only a small subset of data for estimating quality. If you’re satisfied with the metrics, proceed with creating and running a record-matching job. Or, to improve matching quality further, upload more labeled records.

Creating and running an AWS Glue ETL job

After you create a FindMatches transform and verify that it learned to identify matching records in your data, you’re ready to identify matches in your complete dataset. To create and run a record-matching job, complete the following steps:

  1. Create a transformresults folder inside the S3 bucket that the AWS CloudFormation template created when you launched the stack.This folder stores the MLTransform results from your AWS Glue job.
  2. On the AWS Glue console, under Jobs, choose Add job.
  3. Under Configure the job properties, for Name, enter a name for the job.
  4. For IAM role, choose your role from the dropdown menu.Choose the IAM role that the AWS CloudFormation stack created, called AWSGlueServiceRoleMLTransform. For more information, see Create an IAM Role for AWS Glue.
  5. Select Spark as the Type with Glue version as Spark 2.2, Python 2 (Glue version 0.9)
  6. Select job run as A proposed script generated by AWS Glue.
  7. For Choose a data source, choose the transform data source.This post uses the data source cfn_table_patient.
  8. Under Choose a transform type, choose Find matching records.
  9. For Worker type, choose G.2X.
  10. For Number of workers, enter 10.You can add more workers based on the size of the datasets by increasing this number.
  11. To review records identified as duplicate, do not select Remove duplicate records.
  12. Choose Next.
  13. Choose the transform that you created.
  14. Choose Next.
  15. For Choose a data target, choose Create tables in your data target.
  16. For Data store, choose Amazon S3.
  17. For Format, choose CSV.
  18. For Compression type, choose None.
  19. For Target path, choose a path for the job’s output.The target path is the S3 bucket that AWS CloudFormation created, along with the folder named transformresults you created previously.
  20. Choose Save job and edit script.The script is now generated for your job and ready to use. Alternatively, you can customize the script further to suit your specific ETL needs.
  21. To start identifying matches in this dataset, choose Run job as shown in the below screen.For now, leave the job parameters with the default settings, and close this page after starting the job.The following screenshot shows the proposed Python Spark script generated by AWS Glue using the ML Transform.If the execution is successful, FindMatches shows the run status as Succeeded. The execution might take a few minutes to finish.

FindMatches saves your output data as multi-part .csv files in the target path that you specified during the job creation. The resulting .csv files mirror the input table’s structure and data, but have a match_id column. Matched records display the same match_id value.

Creating a Data Catalog of your transform results

To view the output, you can either download the multi-part .csv files from the S3 bucket directly and review it via your preferred editor, or you can run SQL-like queries against the data stored on S3 using Athena. To view the data using Athena, you need to crawl the folder with the multi-part .csv files that you created as part of the output of your FindMatches ETL job.

Go to AWS Glue and create a new table using AWS Glue crawlers in the existing database for patient matching that holds the records from the output of your FindMatches ETL job with the source data as the folder of your S3 bucket containing multi-part .csv files. For this post, the source data is the folder transformresults in the bucket created by the AWS CloudFormation stack.

To create a crawler, complete the following steps:

  1. On the AWS Glue console, under Data Catalog, choose Crawlers.
  2. Click Add crawler to create a new crawler to crawl the transform results.
  3. Provide the name of the crawler and click Next.
  4. Choose Data stores as the Crawler source type and click Next.
  5. In the Add a data store section, for Choose a data store, choose S3.
  6. For Crawl data in, choose Specified path in my account.
  7. For Include path, enter the name of your path. This should be the same S3 bucket created by cloudformation previously along with the folder named transformresults you created. Verify the folder has multi-part csv files created.
  8. Choose Next.
  9. In the Choose an IAM role section, choose Choose the IAM role.
  10. For IAM role, enter the name of the crawler.
  11. Choose Next.
  12. Select Run on demand for Frequency.
  13. Configure the crawler’s output with Database set to cfn-database-patient.
  14. Set the Prefix added to tables value to be table_results_. This will help identify the table containing the transform results.
  15. Click on Finish.
  16. Select the same crawler and click on Run the crawler.After crawler execution is successful, you should see a new table created corresponding to the crawler settings in the respective database you selected during crawler configuration.
  17. From the AWS Glue console, under Databases, choose Tables.
  18. Choose Action.
  19. Choose Edit table details.
  20. For Serde Serialization lib, enter org.apache.hadoop.hive.serde2.OpenCSVSerde.
  21. Under Serde parameters, add key escapeChar, with value as \\.
  22. Add key quoteChar with value as " (double quotes).
  23. Set key field.delim with value as ,
  24. Add key separatorChar with value as ,You can set the Serde parameters as per your requirements based on the type of datasets you have.
  25. Edit the schema of the table by setting the data types of all columns to String. To edit the schema of the table, click on the table and click on the Edit schema button.

You can also choose to retain the inferred data types by crawler as per your requirements. This post sets all to the String datatype for the sake of simplicity, except for the match_id column, which is set as bigint.

Reviewing the output with Amazon Athena

To review the output with Amazon Athena, complete the following steps:

  1. From the Data Catalog, choose Tables.
  2. Choose the table name created by your crawler for the results.
  3. Choose Action.
  4. Choose View data.

    The Athena console opens.  If you are running Amazon Athena for the first time, you might have to click on Get Started. Before you run your first query, you will also need to set up a query result location in Amazon S3. Click on set up a query result location in Amazon S3 on Amazon Athena console and set the location of the query results. You can create additional folder in the same Amazon S3 bucket created previously by the cloudformation. Please make sure the S3 path ends with a /.
  5. Choose the appropriate database.For this post, choose cfn-database-patient. You might need to refresh the data source if you do not see the database in the drop down.
  6. Choose the results table that contains the FindMatches output containing the patient records with the match_id column. In this case, it will be table_results_transformresults. If you chose a different name for the results table, the below query needs to be changed to reflect the correct table name.
  7. Run the below query by choosing Run query.
    SELECT * FROM "cfn-database-patient"."table_results_transformresults" order by match_id;

The following screenshot shows your output.

Security Considerations

AWS Lake Formation helps protect your data by giving you a central location in which you can configure granular data access policies, regardless of which services you use to access it.

To centralize data access policy controls using Lake Formation, first shut down direct access to your buckets in S3 so Lake Formation manages all data access. Configure data protection and access policies using Lake Formation, which enforces those policies across all the AWS services accessing data in your lake. You can configure users and roles and define the data these roles can access, down to the table and column level.

AWS Lake Formation provides a permissions model that is based on a simple grant/revoke mechanism. Lake Formation permissions combine with IAM permissions to control access to data stored in data lakes and to the metadata that describes that data. For more information, see Security and Access Control to Metadata and Data in Lake Formation.

Lake Formation currently supports Server-Side-Encryption on S3 (SSE-S3, AES-265). Lake Formation also supports private endpoints in your VPC and records all activity in AWS CloudTrail, so you have network isolation and auditability.

AWS Lake Formation service is a HIPAA eligible service.


This post demonstrated how to find matching records in a patient database using the Lake Formation FindMatches ML transform. It allows you to find matches when the records in two datasets don’t share a common identifier or include duplicates. This method helps you find matches between dataset rows when fields don’t match exactly or attributes are missing or corrupted.

You are now ready to start building with Lake Formation and try FindMatches on your data. Please share your feedback and questions in the comments.


About the Authors

Dhawalkumar Patel is a Senior Solutions Architect at Amazon Web Services. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing, and Artificial Intelligence. He is currently focused on Machine Learning and Serverless technologies




Ujjwal Ratan is a principal machine learning specialist solution architect in the Global Healthcare and Lifesciences team at Amazon Web Services. He works on the application of machine learning and deep learning to real world industry problems like medical imaging, unstructured clinical text, genomics, precision medicine, clinical trials and quality of care improvement. He has expertise in scaling machine learning/deep learning algorithms on the AWS cloud for accelerated training and inference. In his free time, he enjoys listening to (and playing) music and taking unplanned road trips with his family.