Data De-Identification in Healthcare: A 360-Degree View from Apexon

By Tanmay Baxi, Practice Head, Cloud & Platform Engineering – Apexon
By Apurv Doshi, Practice Head, Labs (Innovation and R&D) – Apexon
By Venkat Gomatham, Solutions Architect – AWS

In the healthcare industry, patient data collected from doctor visits, diagnoses, and medications, along with wearable-device readings such as heart rate, blood pressure, body temperature, and oxygen saturation, provides valuable insights that could play a major role in improving our healthcare system.

Healthcare data is very complex, though, and carries sensitive information like personally identifiable information (PII) and protected health information (PHI). This makes it difficult to democratize healthcare data across certain boundaries of an enterprise and beyond.

The ability to share this data is vital to extracting insights and intelligence through analytics. However, many enterprises lack the tools to tackle this complex problem while complying with strict security and industry regulations.

Apexon’s data anonymization and de-identification solution for healthcare data uses sophisticated machine learning (ML) algorithms to solve this problem while allowing organizations to meet compliance and regulatory requirements. This solution takes away the heavy lifting from healthcare organizations so they can concentrate on what really matters most. This post illustrates how Apexon’s solution works and its benefits.

Apexon is an AWS Life Sciences Competency Partner and digital-first technology services firm that specializes in accelerating business transformation and delivering human-centric digital experiences. For over 17 years, Apexon has been meeting customers wherever they are in the digital lifecycle and helping them outperform their competition through speed and innovation.

Data Anonymization and Data De-Identification

Due to the digitalization of healthcare data, the rise of end-user concerns about privacy, an increase in information breaches, and tight industry and cyber security regulations, it becomes an organization’s job to protect patient and/or healthcare records.

Apexon uses data anonymization and data de-identification techniques to mask such information while preserving the integrity of the data, so that organizations can perform meaningful analytics and share data across enterprise boundaries while complying with regulations.

Apexon’s solution performs lexical analysis of the data to recognize PII/PHI and automatically anonymizes the data using customizable rules provided by the organization.
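To make the idea of rule-driven recognition concrete, here is a minimal sketch in Python. It is not Apexon's implementation: the rule names, the regular expressions, and the `find_pii` and `redact` helpers are all assumptions made for illustration, and a production system would pair rules like these with trained models.

```python
import re
from dataclasses import dataclass

# Hypothetical, organization-supplied rules: label -> pattern.
# These patterns are illustrative and deliberately simple.
RULES = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "US_PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


@dataclass(frozen=True)
class Finding:
    label: str
    start: int
    end: int
    text: str


def find_pii(text, rules=RULES):
    """Return every rule match in the text, ordered by position."""
    findings = []
    for label, pattern in rules.items():
        for match in pattern.finditer(text):
            findings.append(Finding(label, match.start(), match.end(), match.group()))
    return sorted(findings, key=lambda f: f.start)


def redact(text, rules=RULES):
    """Replace each finding with a bracketed label, skipping overlapping matches."""
    pieces, cursor = [], 0
    for finding in find_pii(text, rules):
        if finding.start < cursor:
            continue
        pieces.append(text[cursor:finding.start])
        pieces.append(f"[{finding.label}]")
        cursor = finding.end
    pieces.append(text[cursor:])
    return "".join(pieces)
```

Because the rules are plain data, an organization could add or remove patterns without touching the detection code, which is the property the customizable rules are meant to provide.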

Figure 1 – Apexon’s data de-identification logical diagram.

As Figure 1 shows, the solution identifies sensitive information in different types of data, such as electronic health records (EHR), electronic medical records (EMR), text, images, audio, and video files.

In the case of audio, video, and image files, data is extracted and analyzed for sensitive information. The extracted data is then transformed based on the provided business rules leveraging one of three techniques: de-identification, anonymization, or generalization.

This solution also provides a way for users to validate the data for accuracy before creating sharable datasets. In audio and video files, the solution automatically adds a beep sound where sensitive information is revealed. For video and image files, a blur is added for faces and text, redacting sensitive and personally identifiable information.

Apexon’s solution follows the safe harbor methodology. The HIPAA safe harbor provision is part of the HIPAA Privacy Rule, which limits the possible uses and disclosures of protected health information.

The safe harbor method de-identifies protected health information by following prescriptive guidance on how certain data elements must be removed or generalized. Per that guidance, the fields shown in Figure 2 are anonymized before data is shared across organizations or entities.

Figure 2 – Anonymized data fields as per safe harbor methodology.
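Three of the safe harbor generalizations are easy to show in code: ZIP codes are reduced to their first three digits (or to 000 when that three-digit area has 20,000 or fewer people), ages over 89 are grouped into a single "90 or older" category, and dates are reduced to the year. The sketch below assumes hypothetical function names, and the set of low-population ZIP prefixes is passed in by the caller because it must come from current Census data.

```python
from datetime import date


def generalize_zip(zip_code, low_population_zip3):
    """Keep the first three ZIP digits, or return "000" for sparsely populated areas.

    low_population_zip3 is a caller-supplied set of three-digit prefixes
    covering 20,000 or fewer people; no authoritative list is included here.
    """
    zip3 = zip_code.strip()[:3]
    return "000" if zip3 in low_population_zip3 else zip3


def generalize_age(age):
    """Group all ages over 89 into one category."""
    return "90+" if age >= 90 else str(age)


def generalize_date(value):
    """Keep only the year of a date tied to an individual."""
    return value.year


def apply_safe_harbor_subset(record, low_population_zip3):
    """Apply the three generalizations to a record with zip, age, and admitted fields."""
    return {
        "zip": generalize_zip(record["zip"], low_population_zip3),
        "age": generalize_age(record["age"]),
        "admitted": generalize_date(record["admitted"]),
    }
```

This covers only a subset of the identifiers in the safe harbor list; names, contact details, record numbers, and the remaining categories would need their own handling.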

Solution Architecture

Healthcare data is extremely fragmented and may reside in SQL or NoSQL databases. It may also be available as images (DICOM images), audio (recorded doctor-patient conversations), and/or video (procedural or educational videos).

EHRs and EMRs are digital versions of the paper charts in a clinician’s office. An EMR contains the medical and treatment history of the patients in one practice, while an EHR focuses on the total health of the patient, going beyond the standard clinical data collected in the provider’s office to include a broader view of a patient’s care. EHRs are real-time, patient-centered records that make information available instantly and securely to authorized users.

When the data resides in a database management system (DBMS), data changes are captured using Debezium and streamed via a managed Kafka service to an Amazon Simple Storage Service (Amazon S3) bucket (the raw data lake). If the data is available in file form (image, audio, video, or flat file of records), it is ingested into the raw data lake via Amazon Kinesis Data Streams and Amazon Data Firehose.
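As a rough illustration of the database path, the sketch below parses a Debezium change event and derives an object key for the raw data lake. It assumes Debezium's default JSON envelope (`payload` with `before`, `after`, `source`, `op`, and `ts_ms`); the function names, the `raw/` prefix, and the key layout are hypothetical and not taken from Apexon's pipeline.

```python
import json
from datetime import datetime, timezone

# Debezium operation codes: create, update, delete, snapshot read.
OPERATIONS = {"c": "create", "u": "update", "d": "delete", "r": "snapshot"}


def parse_change_event(raw):
    """Extract table, operation, timestamp, and row image from a change event."""
    event = json.loads(raw)
    # With schemas disabled in the converter, the payload is the top-level object.
    payload = event.get("payload", event)
    source = payload["source"]
    op = payload["op"]
    # Deletes carry the last row image in "before"; other operations use "after".
    row = payload["before"] if op == "d" else payload["after"]
    return {
        "table": f'{source["db"]}.{source["table"]}',
        "op": OPERATIONS.get(op, op),
        "ts_ms": payload["ts_ms"],
        "row": row,
    }


def raw_lake_key(change, prefix="raw"):
    """Build a date-partitioned object key (hypothetical layout) for the raw bucket."""
    ts = datetime.fromtimestamp(change["ts_ms"] / 1000, tz=timezone.utc)
    return f'{prefix}/{change["table"]}/{ts:%Y/%m/%d}/{change["ts_ms"]}.json'
```

Partitioning keys by table and date is one common choice because it keeps downstream batch reads cheap; the actual layout would depend on how the de-identification jobs query the bucket.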

While the data is streamed from source to destination, the required metadata is also processed. This metadata may contain the required information to reconstruct the source data from the de-identified data. The manipulation and processing of the metadata is performed via AWS Lambda functions.

As soon as raw data lands in the data lake, an event is triggered that places the new data on the Kafka queue for the de-identification process.
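One way to wire this trigger is an AWS Lambda function subscribed to S3 event notifications. The sketch below assumes the standard S3 notification shape (`Records[].s3.bucket.name` and `Records[].s3.object.key`); the `publish` callable is a hypothetical stand-in for whatever Kafka producer the pipeline uses, injected so the handler stays testable without a broker.

```python
from urllib.parse import unquote_plus


def build_messages(event):
    """Turn an S3 notification event into de-identification work items."""
    messages = []
    for record in event.get("Records", []):
        # Only newly created objects should enter the de-identification queue.
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        s3 = record["s3"]
        messages.append({
            "bucket": s3["bucket"]["name"],
            # Object keys arrive URL-encoded in S3 notifications.
            "key": unquote_plus(s3["object"]["key"]),
            "size": s3["object"].get("size"),
        })
    return messages


def make_handler(publish):
    """Create a Lambda-style handler that sends each work item to `publish`."""
    def handler(event, context):
        messages = build_messages(event)
        for message in messages:
            publish(message)
        return {"queued": len(messages)}
    return handler
```

In a deployment, `publish` would wrap a Kafka client call that sends each message to the de-identification topic; in a unit test it can simply be `list.append`.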

Custom ML models that detect PII/PHI in structured and unstructured data are trained using Amazon SageMaker and stored in Amazon S3. When new data enters the system, these models analyze it and identify PII/PHI. Based on the configuration provided by the user, further processing follows: generalization of data (for example, a dd/mm/yy date converted to mm/yy format), dropping of data (removal of a complete column), and/or de-identification of data (encrypting the field while keeping its format intact).
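The three configurable actions can be sketched as a small per-field dispatcher. Everything here is an assumption for illustration: the action names, the `apply_rules` function, and especially `mask_preserving_format`, which is a one-way keyed substitution that keeps the field's shape. It is not true format-preserving encryption and cannot be reversed, so it stands in for the idea rather than for Apexon's algorithm.

```python
import hashlib
import hmac
from datetime import datetime


def to_month_year(value):
    """Generalize a dd/mm/yy date string to mm/yy."""
    return datetime.strptime(value, "%d/%m/%y").strftime("%m/%y")


def mask_preserving_format(value, key):
    """Replace digits with digits and letters with letters, keeping punctuation.

    Deterministic for a given key, but one-way (an illustrative stand-in,
    not reversible format-preserving encryption).
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    out = []
    for i, ch in enumerate(value):
        byte = digest[i % len(digest)]
        if ch.isdigit():
            out.append(str(byte % 10))
        elif ch.isalpha():
            base = "A" if ch.isupper() else "a"
            out.append(chr(ord(base) + byte % 26))
        else:
            out.append(ch)
    return "".join(out)


def apply_rules(record, rules, key):
    """Apply the configured action to each field; unlisted fields are kept."""
    result = {}
    for field, value in record.items():
        action = rules.get(field, "keep")
        if action == "drop":
            continue
        if action == "generalize_date":
            result[field] = to_month_year(value)
        elif action == "mask":
            result[field] = mask_preserving_format(value, key)
        elif action == "keep":
            result[field] = value
        else:
            raise ValueError(f"unknown action {action!r} for field {field!r}")
    return result
```

Keeping the rules in a plain dictionary mirrors the user-provided configuration described above: changing how a field is treated means editing data, not code.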

Once the processing finishes, the de-identified data is stored in an S3 destination bucket, which triggers a Lambda function that delivers the data to the destination client system (SFTP).

Figure 3 – Apexon data de-identification solution architecture.

Advantages of Apexon’s Solution

- End-to-end solution that handles both recognition and de-identification of sensitive data.
- A custom rules engine provides a wide array of choices for different types of data fields, helping users select the most appropriate option. For example, if an address is identified in the data, users can choose to keep any of the following:
  - Country
  - State and country
  - County and state
  In the same way, if a date is identified, users can choose any of the following:
  - Month and year (discard the day)
  - Quarter and year
  - Year only
  Similar choices are offered for other fields, such as dropping the field, encrypting the field, or generalizing the field.
- The solution can work with streamed and batched data.
- Apexon’s custom authorization mechanism and proprietary algorithms allow the data to be re-identified for appropriate stakeholders.
- The entire framework can run as part of a data pipeline or in headless mode, so it can be embedded in a custom data export pipeline in the source system and a custom data ingestion pipeline in the destination system.
- The custom algorithms ensure that the context of the data (specifically, referential integrity in a relational database) remains intact after de-identification. This is important for holistic study of the data.
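The referential-integrity point can be illustrated with deterministic pseudonyms: if the same source identifier always maps to the same replacement, joins between tables still work after de-identification. The sketch below uses a keyed HMAC for this. It is an assumption about one workable technique, not a description of Apexon's proprietary algorithms, and the function names and `P` prefix are hypothetical.

```python
import hashlib
import hmac


def pseudonym(value, key, prefix="P"):
    """Map an identifier to a stable, keyed pseudonym."""
    digest = hmac.new(key, str(value).encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}{digest[:12]}"


def deidentify_tables(tables, id_column, key):
    """Replace id_column in every table with the same pseudonym per source value.

    Returns new tables; the input rows are not modified.
    """
    return {
        name: [{**row, id_column: pseudonym(row[id_column], key)} for row in rows]
        for name, rows in tables.items()
    }
```

Because the mapping depends on a secret key, only a party holding that key can recompute which pseudonym belongs to a known identifier. Controlled re-identification would additionally need a protected lookup from pseudonym back to the original value, which this sketch does not include.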

Conclusion

The exchange of data, specifically in the healthcare domain, incurs risks as it contains personally identifiable information (PII) and protected health information (PHI). At the same time, not exchanging the data can keep valuable insights hidden.

Apexon’s solution ensures the exchange of the data happens without any risk of PII/PHI being exposed. This opens many avenues to extract intelligence from the data. This solution is highly customizable to fit the needs of the organization.

Apexon – AWS Partner Spotlight

Apexon is an AWS Advanced Tier Services Partner with a Life Sciences Competency and digital-first technology services firm that specializes in accelerating business transformation and delivering human-centric digital experiences.

AWS Partner Network (APN) Blog
