AWS for Industries

AWS Entity Resolution for Health Data


This post shows how AWS Entity Resolution can tackle the challenge of linking and matching records in healthcare, and walks through how to set up a patient 360 view with a continuous entity resolution workflow. It also explains how AWS HealthLake can help healthcare customers securely store, transform, and manage their data. By the end of this post, you will be able to get started with entity resolution in the healthcare industry.

Healthcare organizations increasingly have to manage large volumes of data from diverse sources, each with its own format and modality. Because data entry and storage methods vary across health systems, inconsistencies can arise in entities such as patient and practitioner inputs, research subjects, study results, diagnostic reporting and testing, and claims and invoice information. Having the full picture of an entity is important to the patient’s customer experience in healthcare, where incorrect claims can lead to incorrect billing.

For example, if there’s an inconsistency in a patient’s records due to inadequate integration of information, it could lead to errors in billing. Such errors might result in a patient being incorrectly charged for a service they did not receive or not being billed for a service they did. This can cause confusion and dissatisfaction, adding to the patient’s burden and impacting their overall experience with the healthcare system. Accurate and thorough data management is important not only for effective patient care but also for ensuring a positive customer experience in healthcare, where correct billing and clear communication are key components.

Linking and matching records is important for accurate patient record matching across various data sources. An entity resolution implementation offers healthcare organizations benefits such as improved patient matching, enhanced data consistency, streamlined billing processes with fewer billing errors, bolstered system interoperability, and compliance with privacy regulations. By maintaining precise patient data, entity resolution ultimately helps healthcare payers and providers enhance patient care, optimize operational costs, and strengthen regulatory adherence in a single unified approach.

Record accuracy challenges in health data

The following are some challenges that make it hard for healthcare and life science (HCLS) organizations to link and match disparate yet related records:

Data Fragmentation: Healthcare encompasses a multitude of entities, including patients, practitioners, research subjects and studies, diagnostics reporting and testing, claims and invoices, and more. These entities generate vast amounts of data distributed across disparate systems, such as electronic health records (EHRs), billing platforms, insurance databases, and diagnostic laboratories. These diverse data sources often employ different identification methods or inconsistent data entry practices, leading to discrepancies and errors in entity records. This fragmentation hinders the compilation of comprehensive and accurate profiles for various entities.

Data Mobility: In the modern healthcare landscape, patients frequently seek care from different providers, relocate to new geographic areas, or change insurance plans. These changes make it challenging for HCLS organizations to maintain consistent and accurate patient records throughout a patient’s interactions with the healthcare system. Records may become outdated or fragmented, impacting the quality of care, coordination, and data accuracy.

Data Quality: Inaccuracies in data are a widespread challenge across healthcare organizations. Common issues such as misspellings, varying input standards, outdated information, or incomplete records can significantly impact the accuracy of billing data. These inaccuracies can lead to billing errors, such as incorrect charges or missed invoicing, causing frustration for patients and financial discrepancies for healthcare providers. Ensuring the accuracy of billing data is a critical and challenging task that most healthcare organizations face, as it directly affects financial operations and patient satisfaction.

Data Interoperability: Healthcare systems often use a diverse range of technologies, each with its own standards, which presents significant challenges in achieving interoperability and maintaining privacy. These different systems might use unique identifiers or coding systems, complicating the process of accurately cross-referencing information across various healthcare platforms and organizations. This complexity poses not only technical difficulties but also compliance and privacy challenges. Ensuring patient data remains secure and private, while also being accessible and accurate across different systems, requires careful balancing. Healthcare organizations must comply with strict data protection regulations, like HIPAA in the United States, which mandate the safeguarding of patient information. The task involves both technical solutions to ensure seamless data integration and robust privacy policies to maintain confidentiality and compliance with legal standards.

AWS Entity Resolution

AWS Entity Resolution is a HIPAA eligible service that helps companies easily match, link, and enhance related records that exist across multiple applications, systems, and data stores using flexible and configurable workflows that take only minutes to set up:

Flexible Data Preparation: The service provides flexible and customizable data preparation, reading data from Amazon Simple Storage Service (Amazon S3) represented as an AWS Glue table. The service has built-in data normalization capabilities that can cleanse and bring consistency in data across sources. Users can specify data inputs and schema mappings, making sure that the matching workflow aligns with their specific requirements.
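As a rough illustration of the schema-mapping step, the following sketch builds the `mappedInputFields` list that the boto3 `entityresolution` client accepts in `create_schema_mapping`. The column names (`patient_id`, `first_name`, and so on), the match key names, and the schema name are assumptions made for this example and are not taken from the solution in this post.

```python
# Hypothetical columns of a patient table; adjust them to your own AWS Glue table.
# Each entry: (column name, attribute type, match key or None, group name or None)
PATIENT_COLUMNS = [
    ("patient_id", "UNIQUE_ID", None, None),
    ("first_name", "NAME_FIRST", "name", "full_name"),
    ("last_name", "NAME_LAST", "name", "full_name"),
    ("phone", "PHONE", "phone", None),
    ("birth_date", "DATE", "dob", None),
]


def build_mapped_input_fields(columns):
    """Turn the column tuples into the list of dicts used for mappedInputFields."""
    fields = []
    for field_name, attribute_type, match_key, group_name in columns:
        field = {"fieldName": field_name, "type": attribute_type}
        if match_key:
            field["matchKey"] = match_key
        if group_name:
            field["groupName"] = group_name
        fields.append(field)
    return fields


def create_patient_schema(client, schema_name="patient-schema"):
    """Create the schema mapping. `client` is boto3.client("entityresolution")."""
    return client.create_schema_mapping(
        schemaName=schema_name,  # hypothetical name
        description="Patient identifiers exported from HealthLake",
        mappedInputFields=build_mapped_input_fields(PATIENT_COLUMNS),
    )
```

Keeping the column list as plain data makes it easy to review which attributes take part in matching before anything is created in your account.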

Data Protection: AWS Entity Resolution offers robust data protection features, including hashing and encryption for every data input. This helps users make sure their data remains protected during the matching process.

Data Regionalization: AWS Entity Resolution’s support for data regionalization is vital for HCLS organizations. For example, it helps make sure that sensitive genetic data is linked and matched within the same Region where it resides. This supports data sovereignty and compliance with regional health data regulations, safeguarding data integrity and privacy while facilitating secure, global collaborative genomics research.

Advanced and configurable Matching Techniques: This service offers advanced matching techniques, including rule-based, machine learning (ML)-powered, and data service provider-led matching to accurately link and enhance related sets of healthcare information, research, testing, diagnosis and procedure codes, or facility data. This flexibility and choice of matching techniques allows healthcare organizations to adapt to different data scenarios.

  • Ready-to-use Rule-Based Matching: This matching technique includes a set of ready-to-use rules in the AWS Management Console or AWS Command Line Interface (AWS CLI) to find matches, based on your input fields. Healthcare organizations can fine-tune these rules to meet their unique needs, simplify the process of finding related records based on input fields, and ensure that matching accuracy satisfies their requirements.
  • ML-Powered Matching: A pre-trained ML model is used to find matches across multiple data inputs. This is helpful for matching patient records, because it provides confidence scores for match quality.
  • Data Service Provider-led Matching: This workflow helps you link and enhance your records with datasets and IDs from trusted data service providers in a few clicks.
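To make the difference between the rule-based and ML-powered techniques more concrete, the following sketch builds the `resolutionTechniques` structure that the boto3 `entityresolution` client accepts in `create_matching_workflow`. The rule names and match keys are hypothetical; the match keys would have to exist in your own schema mapping.

```python
def rule_based_techniques(rules, matching_model="ONE_TO_ONE"):
    """Build resolutionTechniques for rule-based matching.

    `rules` is a list of (rule name, list of match keys) pairs.
    """
    return {
        "resolutionType": "RULE_MATCHING",
        "ruleBasedProperties": {
            "attributeMatchingModel": matching_model,
            "rules": [
                {"ruleName": name, "matchingKeys": list(keys)}
                for name, keys in rules
            ],
        },
    }


def ml_techniques():
    """Build resolutionTechniques for the pre-trained ML matching technique."""
    return {"resolutionType": "ML_MATCHING"}


# Hypothetical rules: a strict rule first, then a looser one.
example = rule_based_techniques([
    ("name-dob-phone", ["name", "dob", "phone"]),
    ("name-dob", ["name", "dob"]),
])
```

Either dictionary would be passed as the `resolutionTechniques` argument, alongside the input source, output source, and IAM role settings that the workflow also requires.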

Manual and Automatic Processing: Users can initiate rule-based matches either through manual bulk processing or automatic incremental processing to keep entities up-to-date as new data arrives over time.
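For manual bulk processing, a job can be started and monitored programmatically. The following is a minimal sketch that assumes a boto3 `entityresolution` client is passed in and that a matching workflow already exists; the polling interval and retry count are arbitrary choices for illustration.

```python
import time


def run_matching_job(client, workflow_name, poll_seconds=30, max_polls=120,
                     sleep=time.sleep):
    """Start a matching job and poll until it finishes.

    `client` is expected to be boto3.client("entityresolution").
    Returns a (job_id, status) tuple where status is SUCCEEDED or FAILED.
    """
    job_id = client.start_matching_job(workflowName=workflow_name)["jobId"]
    for _ in range(max_polls):
        job = client.get_matching_job(workflowName=workflow_name, jobId=job_id)
        if job["status"] in ("SUCCEEDED", "FAILED"):
            return job_id, job["status"]
        sleep(poll_seconds)  # job is still queued or running
    raise TimeoutError(f"Matching job {job_id} did not finish in time")
```

In a Lambda function you would typically avoid long polling loops and let an orchestrator, such as a state machine, handle the waiting instead.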

Near Real-Time Lookup: The service offers near real-time lookup capabilities for rule-based matching, allowing users to retrieve existing match IDs synchronously, enhancing the efficiency of data retrieval.
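A synchronous lookup could look like the following sketch, which wraps the boto3 `entityresolution` client’s `get_match_id` call. The workflow name and record field names are hypothetical; the record keys need to correspond to the fields of the schema mapping used by your rule-based workflow.

```python
def lookup_match_id(client, workflow_name, record):
    """Return the existing match ID for a record, or None if none is returned.

    `client` is expected to be boto3.client("entityresolution").
    `record` maps schema field names to string values.
    """
    response = client.get_match_id(workflowName=workflow_name, record=record)
    return response.get("matchId")


# Example call (requires AWS credentials and an existing rule-based workflow):
# import boto3
# client = boto3.client("entityresolution")
# lookup_match_id(client, "patient-rule-workflow",
#                 {"first_name": "Ana", "last_name": "Silva", "birth_date": "1980-01-01"})
```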

AWS Entity Resolution use cases with health data

AWS Entity Resolution helps healthcare and life sciences customers unlock new use cases, such as the following:

Connected patient records: AWS Entity Resolution empowers healthcare organizations to establish a unified view of patient interactions by linking events like medical appointments, lab test results, insurance claims and more to a unique match ID. This facilitates improved tracking of patient data across various healthcare providers, insurance companies, and pharmaceutical services, enhancing the overall accuracy of patient records and healthcare operations.

Accurate Longitudinal Patient Journeys: Using AWS Entity Resolution, healthcare payers and providers can build 360 degree and longitudinal maps of patients’ events and inputs. The goal is to enhance their care by linking disparate data sets. For example, data may be sourced from across member institutions of an academic medical center network. The service utilizes matching techniques to facilitate this. Consequently, the academic medical center network can create an integrated, comprehensive record for each patient. This improved record-keeping supports better diagnoses, wellness management, and care coordination. Ultimately, it enriches the overall patient journey.

Optimized clinical development and research records: New medicines and outcomes-based research rely on accurate and connected data records. Scientists leverage these to design studies, perform analyses, and extract insights, ultimately improving clinical research approaches or identifying common trends across cohorts for clinical trial recruitment. AWS Entity Resolution offers different matching techniques to help accurately link disparate data sources, fostering a unified view of research data. This aids in minimizing data discrepancies and redundancies while enhancing the reliability of research outcomes. For example, researchers and clinicians can more effectively track, analyze, and predict patient responses, contributing to the development of personalized medicine and the optimization of therapeutic strategies.

Linked pharmaceutical codes: Pharmaceutical laboratories, biotechnology companies, clinical research institutions, and their respective supply chains rely on many different classifications, identifiers, and codes to uniquely identify medications and active ingredients. These vary by region, country, and agency (for example, ATC, NDC, SNOMED, or DIN). Using AWS Entity Resolution, organizations can map and link data sets containing these identifiers into unique entities to perform analyses and research, or to optimize their supply chain.

Interoperability mandates

The US healthcare sector is navigating a transformative period, with regulations shaping the adoption of the Fast Healthcare Interoperability Resources (FHIR) data format across stakeholders, including EHR vendors, providers, and health plans. Regulatory frameworks, such as the Centers for Medicare & Medicaid Services (CMS) Interoperability and Patient Access final rule, as well as forthcoming legislation, are fostering a broader push toward standardizing health data interoperability.

This push covers not only healthcare payers but also health plans and EHR vendors, each facing their own specialized regulatory guidelines. These regulations increasingly call for access to the data elements that are crucial for patient matching, such as name, phone number, and address. In this emerging landscape, FHIR adoption is not only a technical transition but also a comprehensive shift toward streamlined, secure, and standardized data access across the healthcare ecosystem.

AWS HealthLake

AWS can enable healthcare systems to meet the required interoperability mandate with services such as AWS HealthLake. Using the HealthLake FHIR-based APIs, healthcare organizations can easily import large volumes of health data, such as medical reports or patient notes, from on-premises systems to a secure, compliant, and pay-as-you-go service in the cloud. By leveraging HealthLake, healthcare systems can not only meet healthcare mandates but also use built-in natural language processing (NLP) models to help customers understand and extract meaningful medical information to drive innovation and improve patient care in a secure and efficient manner.

Ingest Health Data with Ease: Healthcare systems can efficiently import health data, including clinical notes, lab reports, insurance claims, and more, to an S3 bucket. This bulk import capability simplifies data acquisition for downstream applications and workflows.

FHIR REST API Operations: AWS HealthLake supports the FHIR REST API operations, allowing healthcare systems to perform CRUD operations on their data stores. This includes the ability to perform FHIR searches, enabling efficient data retrieval.
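As a small illustration of a FHIR search, the following sketch builds the URL for a Patient search against a HealthLake data store endpoint. The Region, data store ID, and search values are placeholders. The sketch only constructs the URL; an actual request must also be signed with AWS Signature Version 4, which is not shown here.

```python
from urllib.parse import urlencode


def patient_search_url(region, datastore_id, **search_params):
    """Build a FHIR R4 Patient search URL for a HealthLake data store."""
    base = (
        f"https://healthlake.{region}.amazonaws.com"
        f"/datastore/{datastore_id}/r4/Patient"
    )
    # Sort parameters so the resulting URL is deterministic.
    query = urlencode(sorted(search_params.items()))
    return f"{base}?{query}" if query else base


# "abc123" is a placeholder data store ID.
url = patient_search_url("us-east-1", "abc123", family="Smith", birthdate="1980-01-01")
```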

Secure, HIPAA-Eligible Storage: AWS HealthLake stores data in the AWS Cloud in a secure, HIPAA-eligible manner. It structures data according to the FHIR R4 standard, which makes the data queryable.

Transform Unstructured Data: AWS HealthLake features integrated medical natural language processing (NLP) using Amazon Comprehend Medical. This transforms raw medical text data into structured information, extracting entities, entity relationships, and entity traits from medical text. Then, this data is organized into new resource types, enhancing data accessibility.

Case Study: FHIR patient entity resolution

In this section, we present a solution that leverages AWS Entity Resolution to perform entity resolution for patient records stored in AWS HealthLake. The implementation of entity resolution within AWS HealthLake serves as a critical foundational element that ensures data integrity across the data store. An “entity” in this context can denote a singular patient, provider, organization, or healthcare facility. Entity resolution is the pivotal process of determining whether multiple records within AWS HealthLake pertain to the same real-world object, such as a patient or provider. For example, our healthcare customers have told us that they are challenged by matching patients across data sources that originate from multiple internal systems or even multiple organizations.

Using AWS Entity Resolution, this solution addresses the challenge by employing an ML-based matching algorithm to identify and link disparate patient records, together with confidence scores. This enhances AWS HealthLake’s ability to establish comprehensive patient profiles and enables accurate and cohesive healthcare data management. This process is one of the required steps in the broader processes known as Master Data Management (MDM) or Enterprise Master Patient Index (EMPI).


The following diagram describes the architecture of this patient entity resolution solution, which uses AWS native services and aligns with the AWS Well-Architected Framework across key dimensions such as security, reliability, performance efficiency, and cost optimization.

Figure 1: Architecture diagram

The solution includes the following high-level steps and AWS native services:

  • Fetch patient identifier information from the AWS HealthLake data store using an Amazon Athena SQL query.

The Amazon Athena query runs against the AWS Lake Formation resource link database, which is automatically created when a HealthLake data store is first created. The query result dataset is saved in an S3 bucket as a CSV file. The identifier attributes of the patient FHIR resources used for the query can include the HealthLake Patient resource ID, name, address, phone number, date of birth, and gender.

  • Present the patient dataset to AWS Entity Resolution.

Once the patient dataset has been created in the previous step, we use an AWS Glue crawler to crawl the dataset and populate an AWS Glue Data Catalog table. This table is then ready for ingestion into the AWS Entity Resolution service.

  • Generate ML-driven matches with AWS Entity Resolution.

This solution creates an AWS Entity Resolution schema mapping and a matching workflow to define how to match the input patient data and where to write the match results. By default, this solution uses the pre-configured ML-based matching technique to find matches across the input patient dataset. An AWS Lambda function triggers a job of the matching workflow, which writes the results, with the AWS Entity Resolution match ID and a confidence level, to another S3 bucket. You can also use the rule-based matching technique in the matching workflow to define your own matching rules and find exact matches that meet your entity resolution requirements.

  • Insert AWS Entity Resolution match IDs into the AWS HealthLake patient FHIR resources.

Once AWS Entity Resolution has identified matching patient records, the solution uses a Lambda function to read and parse the AWS Entity Resolution results. The function then inserts the match IDs generated by AWS Entity Resolution, with the associated confidence scores, back into the patient FHIR resources as new identifier attributes. This lets you easily identify and link matching patient records across your AWS HealthLake data store.
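The last step, writing a match ID back into a Patient resource, can be sketched as a pure transformation of the FHIR JSON. This is only an illustration, not the Lambda code of this solution: the identifier `system` URL and the extension URL that carries the confidence score are hypothetical, and the solution may represent the score differently.

```python
import copy

# Hypothetical URIs used only for this illustration.
MATCH_ID_SYSTEM = "https://example.org/fhir/entity-resolution/match-id"
CONFIDENCE_EXTENSION_URL = "https://example.org/fhir/entity-resolution/confidence"


def with_match_identifier(patient, match_id, confidence=None):
    """Return a copy of a FHIR Patient dict with a match ID identifier.

    Any identifier previously written under MATCH_ID_SYSTEM is replaced,
    so repeated runs do not accumulate duplicates.
    """
    updated = copy.deepcopy(patient)
    identifiers = [
        ident for ident in updated.get("identifier", [])
        if ident.get("system") != MATCH_ID_SYSTEM
    ]
    new_identifier = {"system": MATCH_ID_SYSTEM, "value": match_id}
    if confidence is not None:
        new_identifier["extension"] = [
            {"url": CONFIDENCE_EXTENSION_URL, "valueDecimal": confidence}
        ]
    identifiers.append(new_identifier)
    updated["identifier"] = identifiers
    return updated
```

The updated resource would then be written back with a FHIR update (PUT) on the Patient resource. Returning a copy rather than mutating the input keeps the original resource available for comparison or rollback.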


Before deploying this solution, you need the following information to use as input parameters to an AWS CloudFormation template:

  • The data store ID of the AWS HealthLake data store on which you want to perform patient entity resolution.
  • The database name and the shared resource owner ID (or catalog ID) of the AWS Lake Formation database that is linked to the AWS HealthLake data store, as shown in the following figure.

Figure 2: Screenshot to locate Lake Formation database name and shared resource owner ID


To implement this solution, you can deploy this AWS CloudFormation template.

The output of this template includes an AWS Step Functions state machine, such as ahl-entity-resolution-state-machine. You can run this state machine on demand to perform patient entity resolution for your AWS HealthLake data store. The template also creates an Amazon EventBridge schedule that automatically triggers the state machine at regular intervals, such as every night at 10 o’clock. You can modify this schedule to run the solution based on your business needs.
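As a sketch of both options, the following helpers build an EventBridge cron expression for a nightly run and start the state machine on demand through the boto3 `stepfunctions` client. The state machine ARN in the usage comment is a placeholder, and the time zone in which the cron expression is evaluated depends on how the schedule is configured.

```python
import json


def nightly_cron(hour_24):
    """Build an EventBridge cron expression that fires daily at the given hour."""
    if not 0 <= hour_24 <= 23:
        raise ValueError("hour_24 must be between 0 and 23")
    # Fields: minutes hours day-of-month month day-of-week year
    return f"cron(0 {hour_24} * * ? *)"


def start_resolution_run(sfn_client, state_machine_arn, payload=None):
    """Start the state machine on demand and return the execution ARN.

    `sfn_client` is expected to be boto3.client("stepfunctions").
    """
    response = sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps(payload or {}),
    )
    return response["executionArn"]


# Example (requires AWS credentials; the ARN below is a placeholder):
# import boto3
# start_resolution_run(
#     boto3.client("stepfunctions"),
#     "arn:aws:states:us-east-1:111122223333:stateMachine:ahl-entity-resolution-state-machine",
# )
```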

Verify results

To check the matched patient records identified by this solution, you can do one of the following:

  • Go to the Amazon CloudWatch log group linked to the Step Functions state machine. The log group contains detailed information about each execution, including the input and output of each step.
  • Go to the Execution page of the state machine and check the Output of the last step. The last step generates the match results, which include the matched patient resource IDs (as source_id) and the match_id returned by AWS Entity Resolution.

Figure 3: Screenshot of Step Function execution output
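To see which patient records were linked, you can group the source_id values by match_id. The following sketch assumes the match results have already been parsed into a list of dictionaries with `source_id` and `match_id` keys; the exact shape of the state machine output may differ, and the sample rows are made up.

```python
from collections import defaultdict


def group_by_match_id(rows):
    """Group patient resource IDs by match ID, keeping only real matches.

    A match ID that maps to a single source_id means no duplicate was found
    for that record, so those groups are dropped.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row["match_id"]].append(row["source_id"])
    return {match_id: ids for match_id, ids in groups.items() if len(ids) > 1}


# Made-up sample rows for illustration.
sample_rows = [
    {"source_id": "patient-a", "match_id": "m-1"},
    {"source_id": "patient-b", "match_id": "m-2"},
    {"source_id": "patient-c", "match_id": "m-1"},
]
linked = group_by_match_id(sample_rows)
```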

Once you have identified the patient resource IDs from the AWS Entity Resolution matching output, you can go to the AWS HealthLake data store to query the patient resource by using the previously identified patient resource IDs. You can see that a new identifier attribute is created for the patient from AWS Entity Resolution with the match_id showing as the identifier attribute value.

As shown in the following figure, the match ID returned from AWS Entity Resolution remains the same for a source patient record across multiple workflow runs, unless you change the matching workflow configuration or the patient record is significantly updated. These match IDs are for correlating internal patient records within HealthLake data stores and should not be used as identifiers outside HealthLake in downstream or external systems.

Figure 4: Screenshot of HealthLake query showing entity resolution match ID.

We also built a sample Amazon QuickSight dashboard, shown in the following figure. It is based on the new identifier attribute that this solution inserts into HealthLake, and it demonstrates that multiple patient records in the HealthLake data store are matched to the same match_id returned by AWS Entity Resolution.

Figure 5: A sample QuickSight dashboard demonstrating multiple patient records matched by the same AWS Entity Resolution match ID.

This solution provides a baseline for your patient entity resolution solution in HealthLake. It is a flexible and extensible framework on which you can build your own applications and workloads. You can enhance or modify the solution to meet your specific healthcare entity resolution requirements.

Cleaning up

To avoid additional infrastructure costs associated with the example in this post, make sure to delete the CloudFormation stack and any other resources that you created manually as prerequisites.

Next Steps

To get started, explore the AWS Entity Resolution and AWS HealthLake resources, such as documentation, webinars, videos, and other posts. AWS HealthLake can also support other healthcare analytics needs, such as advancing pediatric care. The post “Advance pediatric care using Amazon HealthLake for scalable FHIR-based data analytics” walks through how it can be done. For more background, read the post “AWS Entity Resolution: Match and Link Related Records from Multiple Applications and Data Stores.” For a hands-on approach, check out the workshops for AWS Entity Resolution, AWS HealthLake, and AWS HealthLake Patient Matching.

Conclusion: AWS Entity Resolution and AWS HealthLake

AWS Entity Resolution and AWS HealthLake can be integrated to provide healthcare organizations with a comprehensive solution for managing, structuring, and accurately resolving entities within their healthcare data. This integration enhances data accuracy, improves care coordination, supports compliance with regulations, and empowers healthcare companies to use their data effectively for research, innovation, and delivering high-quality patient care.

Tyler Replogle

Tyler Replogle is a Senior Solutions Architect & Technical Databases Leader at AWS for the World-Wide Public-Sector. He enables customers and partners to run their end mission solutions on AWS. He enjoys building and has found ways to connect with his three daughters through building with Lego, Minecraft, and coding.

Kai Xu

Kai Xu is a Senior Solution Architect with AWS supporting Academic Medical Center customers. Kai has more than 15 years of experience in the healthcare industry and is passionate about information security, compliance, and cloud migration. In his free time, Kai enjoys reading, soccer games, and having fun with his kids.