AWS for Industries

Build a Unified Patient Index Using AWS Entity Resolution

Healthcare organizations and public health agencies handle vast amounts of patient data. Accurately managing and linking sensitive patient information across multiple, potentially disconnected, sources is crucial for care coordination, research, and public health efforts. Building an accurate unified patient index enables healthcare providers to access comprehensive patient histories, researchers to assemble robust datasets, and public health officials to gain insights into disease trends and outcomes.

However, inconsistencies in data entry can pose significant challenges to accurate patient identification. These issues extend beyond variations in name spellings and formatting differences to include the use of nicknames and honorifics, discrepancies in contact information across interactions, and inconsistent field and abbreviation normalizations for addresses. Such inconsistencies can lead to fragmented patient records, potentially compromising the quality of care and patient safety. These challenges can result in:

  • Incomplete patient records
  • Non data-driven decision-making in patient care
  • Difficulties in addressing population health challenges
  • Barriers to effective research
  • Inefficient care coordination
  • Increased healthcare costs
  • Reduced patient satisfaction

Building a master patient index (MPI) can help solve these challenges because an MPI serves as a centralized registry that assigns a persistent, person-based identifier to every record pertaining to an individual. Matching partial records to indexed patient records creates a unified, continuously evolving view of a patient, which enables effective healthcare coordination and research among downstream consumer applications.

Using AWS HIPAA eligible services, healthcare organizations can process patient records with AWS Entity Resolution, unify member information using Amazon Connect Customer Profiles, and deliver personalized and timely patient care with Amazon Q in Connect capabilities.

With AWS Entity Resolution, companies and organizations can match, link, and enhance related customer or healthcare records that exist across multiple applications, channels, and data stores. The service provides flexible and configurable rule-based, machine learning (ML)-powered, and data service provider matching techniques to help improve data accuracy and enhance related records based on each customer’s business needs. Customers configure the entity matching techniques themselves, which helps organizations overcome challenges associated with manual data entry and low data quality. The service improves the security posture of a healthcare data architecture by minimizing data movement. It integrates with existing healthcare data architecture patterns by using widely adopted AWS services such as Amazon Simple Storage Service (Amazon S3) and AWS Glue.

Once patient records are processed using AWS Entity Resolution, healthcare companies can proactively anticipate and service patient care needs using capabilities in Amazon Connect. Using Amazon Connect Customer Profiles, healthcare organizations can unify member and patient information from multiple sources with the necessary patient consents. By integrating Amazon Q in Connect with Amazon Connect Customer Profiles, healthcare companies can help detect patient needs in real time to deliver timely and personalized patient care.

Member engagement across the entire health care journey

In this blog, we will demonstrate how independent software vendors (ISVs), providers, and payers can use AWS Entity Resolution to identify and match related patient records using a publicly available, synthetically generated dataset.

Figure 1 – High Level Architecture Diagram

Patient dataset

For this solution example, we are using the synthetic dataset created for the Childhood Obesity Data Initiative (CODI) project. Synthea, an open-source synthetic patient generator that models the medical history of synthetic patients, was used to generate multiple split records for some individuals. In those split records, the demographic information may vary in ways that are expected in a real-world system. For example, a given name may be “John” in one instance of the record and “Johnny” in another.

Patient dataset structure

The patient data used in our example has been converted from FHIR format to CSV for analysis. The dataset contains approximately 6,300 records with columns containing personally identifiable information (PII) needed for matching patients across the dataset.

The following table outlines the structure of the patient data. It includes fields such as statename, postalcode, address, countryname, cityname, birthdate, uniqueid, firstname, middlename, surname, resourcetype and phonenumber. These fields are commonly used in the entity resolution processes to link records referring to the same individual. The size and variety of fields presented in the data make it suitable for demonstrating potential entity matching techniques.

Figure 2 – Sample data from the synthetic dataset

To run an AWS Entity Resolution workflow, the patient data was uploaded to an Amazon S3 bucket. An AWS Glue crawler then processes the file to automatically determine its schema and register the metadata as a table in the AWS Glue Data Catalog. Next, we’ll navigate to the AWS Entity Resolution console.
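
The upload and crawl steps can also be scripted. The following is a minimal sketch using boto3, not the exact procedure used for this post. The bucket name, object key, crawler name, and IAM role ARN are hypothetical placeholders; only the database name demodb comes from this walkthrough.

```python
"""Sketch: stage the patient CSV in Amazon S3 and catalog it with an AWS Glue crawler."""


def crawler_definition(bucket, prefix, database, role_arn, name="patient-data-crawler"):
    """Build keyword arguments for glue.create_crawler (all names are illustrative)."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": f"s3://{bucket}/{prefix}"}]},
    }


def upload_and_crawl(local_csv, bucket, prefix, database, role_arn):
    """Upload the file, then create and start the crawler. Requires AWS credentials."""
    import boto3  # imported lazily so the helper above works without boto3 installed

    boto3.client("s3").upload_file(local_csv, bucket, f"{prefix}patients.csv")
    glue = boto3.client("glue")
    definition = crawler_definition(bucket, prefix, database, role_arn)
    glue.create_crawler(**definition)
    glue.start_crawler(Name=definition["Name"])


# Hypothetical values; replace them with your own bucket and role.
example = crawler_definition(
    bucket="example-patient-bucket",
    prefix="input/",
    database="demodb",
    role_arn="arn:aws:iam::111122223333:role/ExampleGlueCrawlerRole",
)
print(example["Targets"]["S3Targets"][0]["Path"])
```

Keeping the request definition separate from the AWS calls makes it possible to review the configuration before anything is created in the account.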

In the AWS Entity Resolution console, select the Schema mappings option from the menu, and click Create schema mapping. Schema mapping informs the service about the source data being used for resolution, and the attributes it contains.

Figure 3 – AWS Entity Resolution Create schema mapping

Within the ‘Create schema mapping’ screen, choose the appropriate AWS Glue database and table representing the source data. For this post, we used a database named “demodb” which contains the “patientdata” table. This database was created when we ran the AWS Glue crawler on our S3 bucket containing the patient data.

Figure 4 – AWS Entity Resolution create schema mapping configuration

Next, select the Unique ID from the dropdown. The unique ID column should distinctly reference each row of the data—think of this as the primary key column in a database. In this case, it is the uniqueid in the CSV file.

Figure 5 – AWS Entity Resolution create schema mapping, uniqueID selection

Next, scroll down and select the input fields that should participate in resolution (Figure 6). In this case, columns describing the patient demographic information are chosen, such as firstname, middlename, statename, surname, countryname and homeaddress.

Figure 6 – AWS Entity Resolution schema mapping, matching columns

Additionally, any columns that are not required for resolution but are required in the final output file can be selected under the passthrough fields section. In our example we selected birthdate, cityname, contactemailaddress, contactfamilyname, contactname, gender, linkid, maritalstatus, phonenumber, postalcode and resourceid. These columns do not participate in the matching process, but appear as part of the output.

Figure 7 – AWS Entity Resolution schema mapping, passthrough columns

In the next step of creating a schema mapping, map the selected input fields to their appropriate data types and match keys. Specifying the input type (such as name, email, address, and so on) tells AWS Entity Resolution how to interpret the data in each column and, optionally, which normalization rules can be applied to that column. The match key determines which fields are similar and need to be considered as a single unit during the matching process.

Note: If fields that do not contain personally identifiable information (PII) need to be used for resolution, you may select those fields as “Input Fields”. Select Custom String as the input type and provide an appropriate match key name. Custom String is supported only by the rule-based matching technique and is ignored by machine learning-based matching.

Figure 8 – AWS Entity Resolution schema mapping, map input fields to the service input type

Click Next to proceed to create a group. A group is a set of related input fields, such as First Name, Middle Name and Last Name, combined under a single “Name” column. Grouping enables AWS Entity Resolution to compare the fields collectively, rather than individually, during matching and similarity calculations, which typically results in more accurate matches.

Figure 9 – AWS Entity Resolution schema mapping, group definition for name

Similar to grouping the name fields, also create a group for the “Address” fields and select homeaddress, statename, and countryname as the input fields (Figure 10).

Figure 10 – AWS Entity Resolution schema mapping, group definition for address

Once the group configuration is set, click Next to navigate to the Review and Create screen. Review all the configurations and click Create schema mapping to create the schema mapping.
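
For readers who prefer to script this step, the same schema mapping can be expressed as a request to the boto3 entityresolution client's create_schema_mapping operation. This is a sketch under stated assumptions rather than the configuration exported from the console: the schema name, the specific input types chosen for each column, the match key names, and the use of the STRING type for passthrough columns are all assumptions. Column and group names are taken from this walkthrough.

```python
"""Sketch: the schema mapping from this walkthrough as an API request payload."""

# Input types per column are assumptions; pick the types that fit your data.
NAME_GROUP = {"firstname": "NAME_FIRST", "middlename": "NAME_MIDDLE", "surname": "NAME_LAST"}
ADDRESS_GROUP = {
    "homeaddress": "ADDRESS_STREET1",
    "statename": "ADDRESS_STATE",
    "countryname": "ADDRESS_COUNTRY",
}
PASSTHROUGH = [
    "birthdate", "cityname", "contactemailaddress", "contactfamilyname", "contactname",
    "gender", "linkid", "maritalstatus", "phonenumber", "postalcode", "resourceid",
]


def build_schema_mapping_request(schema_name="patient-schema-mapping"):
    """Return keyword arguments for entityresolution.create_schema_mapping."""
    fields = [{"fieldName": "uniqueid", "type": "UNIQUE_ID"}]
    for group_name, group in (("Name", NAME_GROUP), ("Address", ADDRESS_GROUP)):
        for column, input_type in group.items():
            fields.append(
                {
                    "fieldName": column,
                    "type": input_type,
                    "groupName": group_name,
                    "matchKey": group_name,  # hypothetical match key name
                }
            )
    # Passthrough columns carry no matchKey, so they only appear in the output.
    fields.extend({"fieldName": column, "type": "STRING"} for column in PASSTHROUGH)
    return {
        "schemaName": schema_name,
        "description": "Schema mapping for the synthetic patient dataset",
        "mappedInputFields": fields,
    }


def create_schema_mapping(request):
    """Send the request to AWS. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3 installed

    return boto3.client("entityresolution").create_schema_mapping(**request)


request = build_schema_mapping_request()
print(len(request["mappedInputFields"]), "mapped fields")
```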

Once the schema mapping has been created, the next step is to create a matching workflow. A matching workflow defines the input data and the matching technique, either rule-based or machine learning-based, used to match and link records across the sources. To create a matching workflow, select Matching under the Workflows dropdown in the left side menu, and click the Create matching workflow button (Figure 11).

Figure 11 – AWS Entity Resolution create matching workflow

In the matching workflow screen, begin creating the workflow by giving it a name and a description. In our example we called it patient-data-matching-workflow.

Figure 12 – AWS Entity Resolution create matching workflow. Define name and description

Next, select the appropriate AWS Glue database, AWS Glue table, and the corresponding schema mapping created earlier. This step informs the AWS Entity Resolution service of the location of the source data, and how to parse and interpret it using the schema mapping definition.

Figure 13 – AWS Entity Resolution create matching workflow, define input sources

Provide AWS Entity Resolution with the necessary access permissions. If you are running the service for the first time, select “Create and use a new service role”. This option allows the service to automatically create an IAM role, granting it access to the specified Amazon S3 bucket(s) for input/output and the AWS Glue database/table as the raw input source. The service role name will be auto-generated and you can edit it if needed. More details on creation of an IAM Role can be found in our user guide.

Figure 14 – AWS Entity Resolution create matching workflow, IAM Role selection

After selecting the IAM role option best for your situation, click Next to navigate to the next page. On this page, choose the appropriate matching technique between rule-based and machine learning-based matching, to perform resolution on your source data. In this case, choose the Rule-based matching technique to deterministically identify records belonging to the same patient.

Figure 15 – AWS Entity Resolution create matching workflow, select rule-based matching technique

For Matching rules, enter a Rule name and then choose the Match keys for that rule. You can create up to 15 rules and apply up to 15 different match keys across your rules to define match criteria. For Comparison type, select the Multiple input fields option, which matches data stored in multiple input fields, regardless of whether the data is in the same or a different input field.

Click Next to navigate to the next page. On this page, configure the output Amazon S3 bucket location, where the service will write the results of the resolution. Select Normalized data as the output data format. This option normalizes output records by removing special characters and extra spaces and by formatting all values to lowercase, which simplifies downstream consumption. Optionally, you may customize the normalization library by following our Guidance for Customizing Normalization Library for AWS Entity Resolution.

Figure 16 – AWS Entity Resolution create matching workflow, output configuration
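
To give a feel for what the Normalized data option does, here is a small illustrative function that applies the three transformations described above: lowercasing, removing special characters, and collapsing extra spaces. It is not the service's normalization library, whose rules can differ by input type, and this ASCII-only version would also drop accented letters.

```python
"""Sketch: an illustration of lowercase, special-character, and whitespace normalization."""
import re


def normalize_value(value):
    """Lowercase the value, drop special characters, and collapse extra spaces."""
    lowered = value.lower()
    # Keep only ASCII letters, digits, and whitespace (a simplifying assumption).
    without_special = re.sub(r"[^a-z0-9\s]", "", lowered)
    return re.sub(r"\s+", " ", without_special).strip()


print(normalize_value("  O'Brien-Smith,   JOHN "))  # obriensmith john
print(normalize_value("123 Main St., Apt #4"))      # 123 main st apt 4
```

Two records that differ only in case, punctuation, or spacing produce the same normalized string, which is what lets a deterministic rule treat them as equal.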

As the final step prior to creating the workflow, review all the configuration settings to confirm they accurately reflect your matching requirements, and click Create and run. This will create the matching workflow and initiate the first run.
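
The console choices made so far correspond roughly to a single request to the boto3 entityresolution client's create_matching_workflow operation, followed by start_matching_job. The sketch below is an approximation under stated assumptions: the rule name, the match keys used in the rule, the schema mapping name, the output column list, and every ARN and S3 path are hypothetical. The workflow name, database, and table names come from this walkthrough.

```python
"""Sketch: the rule-based matching workflow as an API request payload."""


def build_matching_workflow_request(account_id, region, output_s3_path, role_arn):
    """Return keyword arguments for entityresolution.create_matching_workflow."""
    glue_table_arn = f"arn:aws:glue:{region}:{account_id}:table/demodb/patientdata"
    output_columns = ["uniqueid", "firstname", "middlename", "surname", "homeaddress"]
    return {
        "workflowName": "patient-data-matching-workflow",
        "description": "Rule-based matching of synthetic patient records",
        "inputSourceConfig": [
            {
                "inputSourceARN": glue_table_arn,
                "schemaName": "patient-schema-mapping",  # hypothetical schema name
                "applyNormalization": True,
            }
        ],
        "outputSourceConfig": [
            {
                "outputS3Path": output_s3_path,
                "applyNormalization": True,  # the "Normalized data" output option
                "output": [{"name": column} for column in output_columns],
            }
        ],
        "resolutionTechniques": {
            "resolutionType": "RULE_MATCHING",
            "ruleBasedProperties": {
                # MANY_TO_MANY corresponds to the "Multiple input fields" comparison type.
                "attributeMatchingModel": "MANY_TO_MANY",
                "rules": [
                    # Hypothetical rule; this post does not list its exact rules.
                    {"ruleName": "NameAndAddress", "matchingKeys": ["Name", "Address"]}
                ],
            },
        },
        "roleArn": role_arn,
    }


def create_and_run(request):
    """Create the workflow and start its first job. Requires AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3 installed

    client = boto3.client("entityresolution")
    client.create_matching_workflow(**request)
    return client.start_matching_job(workflowName=request["workflowName"])["jobId"]


request = build_matching_workflow_request(
    account_id="111122223333",
    region="us-east-1",
    output_s3_path="s3://example-patient-bucket/output/",
    role_arn="arn:aws:iam::111122223333:role/ExampleEntityResolutionRole",
)
print(request["inputSourceConfig"][0]["inputSourceARN"])
```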

After allowing some time for the job to complete (Figure 17), job metrics show the number of input records, and the number of uniquely matched IDs generated. The output is written to the configured Amazon S3 bucket. You may navigate to the specified output S3 location and download the output files to analyze the result.

Figure 17 – AWS Entity Resolution matching workflow run statistics
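
If you start jobs programmatically, the same run statistics can be read from the get_matching_job operation. The helper below is a sketch that polls until the job finishes. The polling interval and retry limit are arbitrary choices, and the client is passed in so the function can be exercised with a stand-in object as well as a real boto3 entityresolution client.

```python
"""Sketch: wait for a matching job to finish and return its details."""
import time


def wait_for_matching_job(client, workflow_name, job_id, poll_seconds=30, max_polls=120):
    """Poll get_matching_job until the job reaches SUCCEEDED or FAILED."""
    for _ in range(max_polls):
        job = client.get_matching_job(workflowName=workflow_name, jobId=job_id)
        if job["status"] in ("SUCCEEDED", "FAILED"):
            return job
        time.sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls")


# Example usage with a real client (requires boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("entityresolution")
#   job = wait_for_matching_job(client, "patient-data-matching-workflow", job_id)
#   print(job["metrics"]["inputRecords"], job["metrics"]["matchIDs"])
```

The metrics in the response include the number of input records and the number of match IDs generated, which are the same figures the console shows.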

In the output data (Figure 18), each record has the original unique ID (the uniqueid column) and a newly assigned matchid. Records that relate to the same patient share the same matchid. The matchrule field names the rule that produced a matched record set.

Figure 18 – AWS Entity Resolution Output matched data
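
Once the output files are downloaded, grouping records by matchid shows which source records were linked. The following sketch uses only the Python standard library. The sample rows are made up for illustration and are not real service output, and the header names follow the column names used in this post, so check them against your own output files.

```python
"""Sketch: group output records by matchid to find linked patient records."""
import csv
import io
from collections import defaultdict


def group_by_match_id(csv_text, id_column="uniqueid", match_column="matchid"):
    """Map each matchid to the list of original unique IDs that share it."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row[match_column]].append(row[id_column])
    return dict(groups)


def linked_groups(groups):
    """Keep only the match IDs that link two or more source records."""
    return {match_id: ids for match_id, ids in groups.items() if len(ids) > 1}


# Made-up rows for illustration only.
sample = """uniqueid,firstname,surname,matchid,matchrule
p-001,john,doe,m-1,NameAndAddress
p-002,john,doe,m-1,NameAndAddress
p-003,maria,lopez,m-2,
"""

groups = group_by_match_id(sample)
print(linked_groups(groups))
```

For a real file, read its contents with open(path, newline="").read() and pass the text to group_by_match_id.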

This matched data can be a valuable asset for healthcare organizations and public health agencies. They can load the identified matches from the entity resolution output into their healthcare systems, such as immunization information systems (IIS), disease surveillance platforms and vital record systems. The systems can then utilize the matched data to identify potential matches and present them to the user. This allows healthcare staff to review, merge, and resolve potential matches, which can improve the accuracy and completeness of a patient’s data.

By leveraging the matched data, organizations can enhance analytics to drive better interventions and improve health outcomes. For example, with data linked across disparate data sets, public health agencies could better identify risk factors for severe COVID-19 by linking immunization data, hospital discharge data and disease surveillance data.

Conclusion

AWS Entity Resolution helps address challenges such as fragmented records, non data-driven decision-making, research barriers, care coordination misalignment resulting from incorrect data and increased costs. As seen through this example, healthcare organizations and researchers can use AWS Entity Resolution to effectively link and match patient records from multiple, diverse data sources. This enables them to create a comprehensive, longitudinal view of an individual’s health history and outcomes—potentially leading to better overall care.

Contact an AWS Representative to learn how we can help accelerate your business.

Venkata Kampana

Venkata is a senior solutions architect in the Amazon Web Services (AWS) Health and Human Services team and is based in California. In this role, he helps public sector customers achieve their mission objectives with well-architected solutions on AWS.

Jim Daniel

Jim is the public health lead at Amazon Web Services (AWS). Previously, he held positions with the United States Department of Health and Human Services (HHS) for nearly a decade, including director of public health innovation and public health coordinator. Before his government service, Jim served as the chief information officer for the Massachusetts Department of Public Health.

Punit Shah

Punit is a Senior Solutions Architect at Amazon Web Services, where he is focused on helping customers build their data and analytics strategy on the cloud. In his current role, he assists customers in building a strong data foundation layer using AWS services like AWS Entity Resolution, and Amazon Connect. He has 15+ years of industry experience building large data lakes.