Philips and AWS Automate PHI De-Identification with Machine Learning

Blog guest authored by Shawn Stapleton, PhD, global Data Science and Innovation lead for Philips

Properly de-identified electronic health record (EHR) data is essential for curating data sets that yield insights into population health. Automating this manual, time-consuming process would speed up your health informatics innovations and shorten time-to-market.

Medical Record Data Sets Today

The adoption of artificial intelligence (AI) in healthcare is driving the curation of large medical record data sets. The use of medical record data to train machine learning algorithms requires mitigation of privacy risks to individuals represented in the data set.

In the United States of America, the Health Insurance Portability and Accountability Act of 1996 (HIPAA) Privacy Rule outlines the standard for de-identification of protected health information (PHI). Safe Harbor is one of the methods by which health information can be designated as de-identified as provided by the HIPAA privacy rule. It outlines the removal of specified individual identifiers that could be used alone (or in combination with other information) to identify the individual.

Introduction

The process of de-identification is manually intensive, requiring the isolation of individually identifiable data elements, and removing substantial context and structure to conceal identifiers. As a consequence, de-identification reduces the scalability and generalizability of AI solutions in healthcare.

As part of a strategic collaboration, Philips and Amazon Web Services (AWS) have developed a solution that leverages machine learning to automatically detect, classify, and conceal PHI in large medical record data sets. Deep neural network architectures, including natural language processing (NLP) models, are leveraged to detect and classify PHI in Health Level 7 (HL7) data.

The team developed AI/Machine Learning (ML) algorithms for this use case to demonstrate the savings in manual effort when de-identifying protected health information.

Figure 1 – Detection of PHI in HL7 Messages for optimal de-identification

Data Preparation

A data set of medical records consisting of 34,000 HL7 messages was curated and manually de-identified in collaboration with a US-based health institution. Under their instruction, PHI and other identifiers were concealed with synthetic phrases corresponding to semantically relevant de-identified entities, which retains the original context within and across patient medical record data. Synthesized data was generated for each PHI data element based on its entity type.

Figure 2 – Table with Data Types and Specialized Generation Method

We generated the names using lists from the IMDB persons dataset [1], combined with names available via the Python Faker API. We sampled the names until exhausted (without replacement), after which the list of names was re-initialized. The size of the message sample, and the size of the hospital from which we draw, influenced the number of unique patients.
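The sampling scheme described above (draw without replacement until the list is exhausted, then re-initialize) can be sketched as follows. This is a minimal illustration, not the team's code: the class name `ExhaustiveSampler` and the toy name lists are hypothetical stand-ins for the IMDB and Faker lists.

```python
import random


class ExhaustiveSampler:
    """Draw items without replacement; re-initialize once the pool is empty."""

    def __init__(self, items, seed=None):
        self._items = list(items)
        if not self._items:
            raise ValueError("items must not be empty")
        self._rng = random.Random(seed)
        self._pool = []

    def draw(self):
        if not self._pool:
            # First call, or every item has been used: refill and reshuffle.
            self._pool = self._items[:]
            self._rng.shuffle(self._pool)
        return self._pool.pop()


# Hypothetical toy lists standing in for the much larger real name lists.
first_names = ExhaustiveSampler(["Ana", "Ben", "Chloe"], seed=0)
last_names = ExhaustiveSampler(["Diaz", "Evans", "Fox", "Gray"], seed=1)

full_names = [f"{first_names.draw()} {last_names.draw()}" for _ in range(6)]
print(full_names)
```

Within one pass over a list no value repeats; a repeat can only occur after the list has been re-initialized.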

Since 560,000 first names and more than 11 million last names were used, it is possible that names were repeated in one of these fields. However, the probability of a full name being repeated remained very low.
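To get a feel for how low that probability is, here is a rough birthday-problem estimate. It assumes first and last names are drawn independently and uniformly, which is a simplification of the actual sampling procedure, and the patient count of 34,000 is a hypothetical figure chosen only for illustration.

```python
import math


def full_name_repeat_probability(n_patients, n_first, n_last):
    """Approximate chance that at least two patients share a full name.

    Birthday-problem approximation, assuming independent uniform draws.
    """
    n_full_names = n_first * n_last
    pairs = n_patients * (n_patients - 1) / 2
    return -math.expm1(-pairs / n_full_names)  # equals 1 - exp(-pairs / N)


N_FIRST = 560_000
N_LAST = 11_000_000  # the text says "more than 11 million"; used as a lower bound
N_PATIENTS = 34_000  # hypothetical patient count, not taken from the source

p = full_name_repeat_probability(N_PATIENTS, N_FIRST, N_LAST)
print(f"Approximate repeat probability: {p:.2e}")
```

Under these assumptions the estimate is on the order of one in ten thousand.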

The name suffix, city, state, and zip code were drawn from a list of geographically relevant values for the given hospital. Street addresses were randomly generated using the Python Faker API. Hospital IDs were replaced with random alphanumeric character sequences in the same format.

The alphanumeric values were replaced by using a formatting string with the following simple conventions:

“A” indicates an upper-case letter
“a” indicates a lower-case letter
“9” indicates a digit 0-9
All other characters are left unchanged

For example, the format “Aaa-99-99” could generate the following values: “Edd-23-45”, “Rjj-12-00”.
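The formatting convention above is simple enough to sketch in a few lines of Python. The function name `random_from_format` is hypothetical; only the "A", "a", and "9" conventions come from the text.

```python
import random
import string


def random_from_format(fmt, rng=random):
    """Generate a random value that follows the given formatting string."""
    out = []
    for ch in fmt:
        if ch == "A":
            out.append(rng.choice(string.ascii_uppercase))  # upper-case letter
        elif ch == "a":
            out.append(rng.choice(string.ascii_lowercase))  # lower-case letter
        elif ch == "9":
            out.append(rng.choice(string.digits))  # digit 0-9
        else:
            out.append(ch)  # all other characters are left unchanged
    return "".join(out)


print(random_from_format("Aaa-99-99"))
```

Because separators pass through untouched, the generated value keeps the same length and layout as the original hospital ID.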

PHI Detection using ML

The 18 Safe Harbor identifiers were grouped into five entities and used as labels for machine learning. An additional entity representing non-PHI data elements was included (Figure 3).

Figure 3 – Table with Label Name, Category and Description

The AWS Professional Services (ProServe) team trained a character-level convolutional neural network (char-CNN) to predict the entity label for each data element, which allows detected PHI to be flagged by predicted entity type. A char-CNN is a better choice for this text classification task than models with word-level embeddings because the texts we worked with span a smaller discrete space. The char-CNN model also generalizes better to out-of-vocabulary (OOV) words such as IDs and medical numbers.

The model to detect PHI was trained on entity values and the resulting output was a vector of the predicted entity probabilities. The length of the vector for each output was the number of fields (or classes) the model was trained on. For this char-CNN, the sequence of encoded characters was used as input to the model.
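As an illustration of what "a sequence of encoded characters" can look like, the sketch below maps each field value to a fixed-length list of character indices. The alphabet, the padding index, and the maximum length of 32 are all assumptions made for this example; the text does not specify the encoding the team used.

```python
import string

# Hypothetical alphabet and input length; neither is specified in the text.
ALPHABET = string.ascii_letters + string.digits + string.punctuation + " "
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}  # 0 = padding/unknown
MAX_LEN = 32


def encode(value, max_len=MAX_LEN):
    """Map a field value to a fixed-length list of character indices."""
    indices = [CHAR_TO_INDEX.get(ch, 0) for ch in value[:max_len]]
    # Right-pad short values with 0 so every input has the same length.
    return indices + [0] * (max_len - len(indices))


print(encode("Rjj-12-00")[:12])
```

Because the encoding works on characters rather than words, values such as generated IDs never fall outside the vocabulary.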

The character-level embedding used a one-dimensional convolutional neural network (1D-CNN) to find a numeric representation of the field values by probing their character-level compositions. The model uses six convolutional layers and three fully connected layers, with two dropout modules for regularization. The last layer uses a softmax activation function in a multi-class classification setting.
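The final softmax step can be illustrated without any deep learning framework. The sketch below turns raw output scores into the vector of predicted entity probabilities and picks the most likely label. Apart from "freetext" and "non-phi", the label names are hypothetical placeholders, and the scores are made up.

```python
import math

# Hypothetical label names; only "freetext" and "non-phi" appear in the text.
LABELS = ["name", "address", "date", "id", "freetext", "non-phi"]


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits, labels=LABELS):
    """Return the most probable label and the full probability vector."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs


label, probs = predict_label([0.1, 0.2, 0.3, 4.0, 0.5, 0.6])  # made-up scores
print(label, [round(p, 3) for p in probs])
```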

The model was trained for 10 epochs and evaluated using a hold-out test set. Figure 4 shows the evaluation metrics for the model trained on six labels. The accuracy and macro-average F1-score on the unseen test set are 0.9865 and 0.9868, respectively. The model performed very well across all entity types, with precision and recall greater than 0.9. The precision for the “freetext” entity type is the lowest among the six categories, with a score of 0.9015. This is because of the structure and variability in the character composition of the free-text values contained in the dataset.
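For readers who want to see how these metrics relate to each other, here is a small pure-Python sketch that computes accuracy, per-class precision and recall, and the macro-average F1-score. The labels and predictions are toy values, not the study's data.

```python
def per_class_metrics(y_true, y_pred, labels):
    """Compute precision, recall, and F1 for each label."""
    pairs = list(zip(y_true, y_pred))
    metrics = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(metrics):
    """Unweighted mean of the per-class F1 scores."""
    return sum(m["f1"] for m in metrics.values()) / len(metrics)


# Toy example with three labels.
y_true = ["phi", "phi", "non-phi", "freetext", "freetext", "non-phi"]
y_pred = ["phi", "freetext", "non-phi", "freetext", "freetext", "non-phi"]
metrics = per_class_metrics(y_true, y_pred, ["phi", "non-phi", "freetext"])
print(accuracy(y_true, y_pred), macro_f1(metrics))
```

In this toy example one “phi” value is predicted as “freetext”, which lowers the precision of “freetext” and the recall of “phi”.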

Figure 4 – Evaluation Metrics for Model Trained on Six Labels

Model Analysis and Decision Thresholding

For the binary PHI classifier that distinguishes PHI from non-PHI, the team trained the model using the same approach as the one above. The conversion of predicted probabilities to class labels is governed by a parameter called the decision threshold. The evaluation metrics using a decision threshold of 0.5 are shown below (Figure 5).

For the binary PHI classifier, test samples with probabilities greater than 0.5 are assigned the “phi” entity type, and samples with probabilities less than 0.5 are assigned “non-phi”. However, this default threshold may not be optimal with respect to our de-identification goals.
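A minimal sketch of the thresholding step is shown below. The probabilities are made up, and treating a probability exactly equal to the threshold as “phi” is an assumption chosen to err on the side of flagging.

```python
def assign_labels(phi_probabilities, threshold=0.5):
    """Convert predicted PHI probabilities to class labels.

    Ties (probability == threshold) are labeled "phi"; this is an assumption.
    """
    return ["phi" if p >= threshold else "non-phi" for p in phi_probabilities]


probabilities = [0.97, 0.42, 0.08, 0.55]  # made-up model outputs
print(assign_labels(probabilities))                 # default threshold of 0.5
print(assign_labels(probabilities, threshold=0.3))  # stricter: flags more values
```

Lowering the threshold flags the second value as PHI as well, which is the direction to move in when missed PHI is more costly than a false alarm.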

Figure 5 – Evaluation Metrics for Model Trained on Binary (Two) Labels

Machine learning prediction algorithms are generally designed to maximize the number of correct predictions. In practice, however, PHI detection applications may emphasize high sensitivity (recall), because a missed identifier is more costly than a false alarm.

For a given PHI detection method, the precision, recall, and accuracy can be modified with a classification probability threshold. For a given decision threshold, the performance is summarized using a 2×2 confusion matrix. Let TP and TN be the numbers of correct predictions for the positive and negative samples, respectively. Changing the decision threshold can result in significant changes to TP and TN, depending on the value of the threshold.
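The effect of the threshold on the confusion matrix can be made concrete with a toy example. The labels and probabilities below are invented for illustration (1 = PHI, 0 = non-PHI), and the function names are hypothetical.

```python
def confusion_counts(y_true, phi_probabilities, threshold):
    """Count TP, FP, TN, FN for a binary PHI classifier at a given threshold."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for truth, prob in zip(y_true, phi_probabilities):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1:
            counts["TP" if truth == 1 else "FP"] += 1
        else:
            counts["TN" if truth == 0 else "FN"] += 1
    return counts


def summarize(counts):
    """Derive precision, recall, and accuracy from the confusion counts."""
    tp, fp, tn, fn = counts["TP"], counts["FP"], counts["TN"], counts["FN"]
    total = tp + fp + tn + fn
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }


# Toy data: true labels and made-up predicted PHI probabilities.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
probs = [0.95, 0.60, 0.35, 0.40, 0.10, 0.55, 0.80, 0.05]

for threshold in (0.5, 0.3):
    counts = confusion_counts(y_true, probs, threshold)
    print(threshold, counts, summarize(counts))
```

In this toy data, lowering the threshold from 0.5 to 0.3 removes the one missed PHI value at the price of one more false positive.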

The plots below (Figure 6) show predictive model performance for the PHI and non-PHI binary classification task. The accuracy indicates how often our model correctly predicts whether a data element is PHI. The number of missed PHI (Class 1’s) is approximately three out of every 1,000 samples.

In our use case, it’s imperative that the number of false negatives is reduced to negligible values. The samples not classified as PHI are then reviewed manually by humans, a step that can be orchestrated with Amazon Augmented AI (Amazon A2I). Amazon A2I brings human review to all developers. It removes the undifferentiated heavy lifting associated with building human review systems or managing large numbers of human reviewers, whether or not the workload runs on AWS.

Figure 6 – Predictive Model Performance (Binary Classification – non-PHI and PHI)

For the PHI detection analysis, an additional step is needed after setting the optimal decision threshold. This optimal threshold will be set by stakeholders depending on a cost function. This function defines the expected cost of false positive errors, measured by the false positive rate (FPR), and of false negative errors, measured by the false negative rate (FNR). The total expected cost is the sum of the false positive cost and the false negative cost. Overall, use of this automated method is expected to cut the average time it takes Philips to de-identify HL7 data elements by 67%.
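One way to picture the cost-based choice of threshold is the sketch below. It weights the false positive rate and false negative rate by per-error costs and picks the candidate threshold with the lowest total. The cost values, candidate thresholds, and toy data are all hypothetical; in practice stakeholders would supply the costs.

```python
def expected_cost(y_true, probs, threshold, cost_fp, cost_fn):
    """Total expected cost = cost_fp * FPR + cost_fn * FNR at a threshold."""
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    fp = sum(1 for t, p in zip(y_true, probs) if t == 0 and p >= threshold)
    fn = sum(1 for t, p in zip(y_true, probs) if t == 1 and p < threshold)
    fpr = fp / n_neg if n_neg else 0.0
    fnr = fn / n_pos if n_pos else 0.0
    return cost_fp * fpr + cost_fn * fnr


def best_threshold(y_true, probs, cost_fp, cost_fn, candidates):
    """Return the candidate threshold with the lowest expected cost."""
    return min(
        candidates,
        key=lambda t: expected_cost(y_true, probs, t, cost_fp, cost_fn),
    )


# Toy data (1 = PHI, 0 = non-PHI) and made-up predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
probs = [0.95, 0.60, 0.35, 0.40, 0.10, 0.55, 0.80, 0.05]
candidates = [i / 10 for i in range(1, 10)]

# Hypothetical costs: a missed PHI value is 50 times worse than a false alarm.
print(best_threshold(y_true, probs, cost_fp=1, cost_fn=50, candidates=candidates))
# Equal costs for comparison.
print(best_threshold(y_true, probs, cost_fp=1, cost_fn=1, candidates=candidates))
```

When false negatives are weighted heavily, the selected threshold moves lower, which matches the goal of driving missed PHI toward zero and leaving more values for human review.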

Conclusion and Outlook

Through this use case, the team demonstrates a robust approach to automatically detect and remove PHI in healthcare medical record data using machine learning. We demonstrate the utility of our approach on HL7 messages, noting that the approach has the potential to scale to any data schema. The automated approach substantially reduced the time and effort in pinpointing PHI in large data sets while providing a configurable degree of de-identification.

Additionally, the approach overcomes poor data veracity through its ability to pinpoint PHI in unexpected or unknown locations of a medical record. The ability to automatically detect PHI and de-identify medical records will facilitate the creation of large medical record data sets and accelerate the development of robust analytics and AI solutions in healthcare.

To learn more about AWS for Health, a curated collection of AWS offerings and AWS Partner Network solutions used by thousands of healthcare and life sciences customers across the globe, visit AWS for Health and AWS Healthcare Solutions or speak with an AWS Representative.

To learn more about Philips Interoperability Solutions, an offering to streamline workflows and improve care collaboration by enabling smooth data exchange across healthcare players, visit Philips Interoperability Solutions.

References/Acknowledgements:

[1] We agree to all the terms of IMDB’s copyright/conditions of use statement and this data is used for non-commercial purposes.

Shawn Stapleton, Ph.D. is the global Data Science and Innovation lead for Philips Data Management and Interoperability Solutions. He earned his MSc and PhD in Medicine, Biology, and Physics from the University of Toronto. Dr. Stapleton has extensive experience in data science and real-world application of AI across the healthcare continuum.