Masking Patient Data with DataMasque’s template for Amazon HealthLake

Healthcare organizations are moving their data to AWS to use the latest AWS services to improve care and provide better patient and clinician experiences. However, regulations such as the United States’ Health Insurance Portability and Accountability Act (HIPAA) and Europe’s General Data Protection Regulation (GDPR) require organizations to protect sensitive patient health information and disclose how health data will be used. For healthcare customers building solutions with clinical data, this means you must provide your developers, analysts, researchers, and others with high-quality, production-realistic data to perform their jobs effectively, while also ensuring the data is secure at all times.

De-identifying Protected Health Information (PHI) is a process used to accomplish this. If not done properly, however, de-identification can lead to the proliferation of PHI into non-regulated environments and increase the likelihood of a privacy or data breach. You must manage PHI to meet regulatory and compliance requirements in a way that still enables the organization to innovate and solve its clinical challenges.

In this post, Brian, Snehanshu, and I show you how to mask healthcare data for regulatory compliance using Amazon HealthLake and DataMasque.

Why use Amazon HealthLake with DataMasque

Amazon HealthLake is a HIPAA-eligible AWS service for analyzing healthcare data at scale. It uses the Fast Healthcare Interoperability Resources (FHIR) standard and enables customers to run SQL queries, build dashboards, and create models on their clinical data. If you want to use Amazon HealthLake with de-identified data, DataMasque can help. DataMasque is an AWS Partner and an Amazon HealthLake partner with a proven data masking solution.

DataMasque’s FHIR masking solution meets HIPAA and GDPR requirements for health organizations by removing PHI and personally identifiable information (PII) from databases and S3 buckets and replacing it with synthetic or masked data. With DataMasque’s template for FHIR Patient resources, customers can start protecting the 18 identifiers that HIPAA defines as PHI right away, while still getting the clinical value from Amazon HealthLake.

The following architecture diagram shows unmasked PHI going through Amazon HealthLake and into DataMasque. DataMasque outputs production masked PHI, which can then be used in Amazon SageMaker, Amazon QuickSight, and AWS Lake Formation.

DataMasque FHIR masking solution overview

With DataMasque, customers have full control over where and what data to mask. DataMasque is deployed on a virtual machine and can run either on-premises or in the AWS Cloud.

Before sending sensitive data to Amazon HealthLake, organizations can either mask the data on-premises or mask the data in their AWS environment.

1. Mask on-premises

The following diagram shows an organization with sensitive data such as FHIR resources, HL7 CDA, insurance claims, and related information stored on its on-premises infrastructure. This organization can mask the data on-premises. The masked data can be saved on-premises in RDBMS, data warehouse, Parquet, Avro, JSON, XML, CSV, and fixed-width formats.

File types and applications that can be masked using DataMasque

This organization may also want to send its data to AWS to enhance the experiences of patients and clinicians. To ensure the privacy and security of this data, mask it on-premises before pushing it into Amazon HealthLake.

2. Mask in your AWS environment

In the following diagram, the customer loads unmasked data into Amazon HealthLake and exports it to an Amazon S3 bucket using Amazon HealthLake’s import and export functionality. Once the data is in S3, DataMasque masks the FHIR data and stores the masked data in a “clean” Amazon S3 bucket, from which it is loaded into Amazon HealthLake. This data can now be used in Amazon HealthLake and all downstream AWS services.

Architecture overview using DataMasque with Amazon HealthLake
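The export step in this flow could be scripted with the AWS SDK for Python (boto3). The following is a minimal sketch, not part of the DataMasque template: the function names and the default job name are hypothetical, and the data store ID, bucket URI, KMS key, and IAM role are values you would supply from your own account.

```python
"""Sketch: export unmasked FHIR data from Amazon HealthLake to S3."""


def build_export_job_request(datastore_id, output_s3_uri, kms_key_id,
                             role_arn, job_name="unmasked-fhir-export"):
    """Build the arguments for HealthLake's StartFHIRExportJob call.

    All argument values are placeholders supplied by the caller.
    """
    return {
        "JobName": job_name,
        "DatastoreId": datastore_id,
        "DataAccessRoleArn": role_arn,
        "OutputDataConfig": {
            "S3Configuration": {
                "S3Uri": output_s3_uri,   # bucket that DataMasque reads from
                "KmsKeyId": kms_key_id,   # key used to encrypt the export
            }
        },
    }


def start_export(healthlake_client, request):
    """Start the export job and return its job ID."""
    response = healthlake_client.start_fhir_export_job(**request)
    return response["JobId"]


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("healthlake")
#   job_id = start_export(client, build_export_job_request(...))
```

Passing the client in as a parameter keeps the request-building logic easy to check without AWS access.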

3. Use masked data in downstream AWS services

In both scenarios, once the masked data is loaded into Amazon HealthLake, it is automatically added to your AWS Glue Data Catalog, which allows you to use a range of downstream AWS services. You can now use this masked data in Amazon SageMaker, Amazon QuickSight, Amazon RDS, Amazon Aurora, AWS Lake Formation, and AWS Data Exchange. Refer to the following diagram.

Overview of using masked data with various AWS resources

By using DataMasque in this architecture, you can transfer FHIR data from any environment, on-premises or on AWS, to a common “clean” Amazon S3 bucket. From there, you can use Amazon HealthLake on that data, as well as all of the other AWS services downstream from HealthLake. In this example, we ran DataMasque on a FHIR Patient resource, and it changed the PHI while still maintaining the clinical value of the data. We then loaded that data into Amazon HealthLake, and it automatically became part of the AWS Glue Data Catalog, enabling us to use it in downstream AWS services.

Prerequisites

To start masking your FHIR data with DataMasque, you need the following AWS resources:

An AWS account
An S3 bucket with unmasked data in FHIR R4 format. (You can download sample FHIR data here).
An empty S3 bucket with public access disabled
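If you prefer to script the last prerequisite, the following sketch shows one way to disable public access on the destination bucket with boto3. The function names are hypothetical and the bucket name is a placeholder; the sketch assumes the bucket already exists.

```python
"""Sketch: block all public access on the empty destination bucket."""


def public_access_block_config():
    """Return the S3 settings that disable every form of public access."""
    return {
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    }


def block_public_access(s3_client, bucket_name):
    """Apply the public access block to an existing bucket."""
    s3_client.put_public_access_block(
        Bucket=bucket_name,
        PublicAccessBlockConfiguration=public_access_block_config(),
    )


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   block_public_access(boto3.client("s3"), "my-masked-fhir-bucket")
```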

Solution walkthrough: Masking patient data with DataMasque’s template for Amazon HealthLake

You can mask FHIR data before or after the data is processed by Amazon HealthLake. To mask FHIR patient data using DataMasque’s built-in FHIR Patient masking template, follow these steps:

1. Deploy DataMasque in your AWS environment

Sign in to your AWS account. In a browser, navigate to AWS Marketplace and search for DataMasque or follow this link: DataMasque PHI Masking. In the upper right, choose Continue to Subscribe and follow the subscription wizard.
To access the deployed DataMasque instance, in a browser, do the following:
1. Open the Amazon EC2 console.
2. In the navigation pane, choose the EC2 instance hosting the DataMasque Instance.
3. In the Details pane, copy the Public IPv4 or Private IPv4 address.
4. In a web browser tab, replace <instance-ip-address> in the following URL with the IP address copied in step 1.2.c: https://<instance-ip-address>
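The console steps above can also be approximated in code. This sketch takes the response of the EC2 DescribeInstances API and builds the DataMasque URL from it. The function names are hypothetical, and the sketch assumes the response contains exactly the instance you asked for.

```python
"""Sketch: derive the DataMasque URL from an EC2 DescribeInstances response."""


def datamasque_url(describe_instances_response, prefer_public=True):
    """Return https://<instance-ip-address> for the first instance found."""
    instance = describe_instances_response["Reservations"][0]["Instances"][0]
    keys = ["PublicIpAddress", "PrivateIpAddress"]
    if not prefer_public:
        keys.reverse()
    for key in keys:
        ip_address = instance.get(key)
        if ip_address:
            return f"https://{ip_address}"
    raise ValueError("The instance has no IPv4 address")


def lookup_datamasque_url(ec2_client, instance_id, prefer_public=True):
    """Query EC2 for one instance and build its DataMasque URL."""
    response = ec2_client.describe_instances(InstanceIds=[instance_id])
    return datamasque_url(response, prefer_public)


# Usage (requires boto3 and AWS credentials; the instance ID is a placeholder):
#   import boto3
#   print(lookup_datamasque_url(boto3.client("ec2"), "i-0123456789abcdef0"))
```

Use the private address when you reach the instance over a VPN or from inside the VPC.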

2. Prepare data source/destination and masking ruleset

In the DataMasque instance you accessed in step 1.2.d, at the top navigation, select File Masking Dashboard.
To create a source or destination connection, in the Data Sources pane or Data Destinations pane, choose the + icon. Use the following parameters:
1. Connection name: a unique name for the connection on the deployed DataMasque instance.
2. Connection type: select AWS S3 Bucket from the dropdown list.
3. Base directory: select the target folder in the selected S3 bucket.
4. Bucket name: specify the name of the target S3 bucket.
5. Use as: determines whether the connection is a Data Source, a Data Destination, or both:
  - Select Source for out-of-place masking; DataMasque reads the unmasked data from this connection. You must create a Destination connection separately if you choose this option.
  - Select Source & Destination for in-place masking; DataMasque reads from this connection and writes the masked data back to it.
  - Select Destination for out-of-place masking; DataMasque writes the masked data to this connection. You must create a Source connection separately if you choose this option. Related information: File Connections User Guide.

3. Perform a masking run on your data

Navigate back to the File Masking Dashboard. To do this, in the top navigation, choose File Masking.
To review the built-in FHIR Patient masking ruleset, in the Rulesets section, choose the pencil icon next to the fhir_patient_resource ruleset. You can modify the built-in FHIR patient masking template to capture any additional masking requirements. To save changes you make to the ruleset, select Save or Save And Exit. If you don’t make any changes, select Back to Dashboard.
Select the source connection you configured in step 2.2.
In the Rulesets section, select the fhir_patient_resource ruleset. In the Data Destinations section, select the destination connection you configured in step 2.2.
In the bottom right corner, select the Preview Run button.
On the Confirm Run page, review information on the Source connection, ruleset name and the Data Destination connection. To proceed with the masking run, choose the START RUN button. Your masking run is in progress! The masking duration is dependent on how many Amazon S3 objects DataMasque is masking.
When the masking run is completed, in the Masking Run top right corner, the Status of the masking run changes to Finished with a green background color.

You can now import your masked data into Amazon HealthLake. Find detailed instructions for creating an Amazon HealthLake data store and importing files into it in the Amazon HealthLake Developer Guide.
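As a rough illustration of the import, the following sketch starts a HealthLake import job from the “clean” bucket with boto3 and waits for it to finish. The function names, default job name, and polling interval are hypothetical, the identifiers are placeholders, and the set of in-progress statuses is an assumption to confirm against the Developer Guide.

```python
"""Sketch: import masked FHIR data from S3 into Amazon HealthLake."""
import time

# Assumed set of non-terminal job statuses; verify against the Developer Guide.
IN_PROGRESS_STATUSES = {"SUBMITTED", "QUEUED", "IN_PROGRESS"}


def build_import_job_request(datastore_id, masked_s3_uri, job_output_s3_uri,
                             kms_key_id, role_arn,
                             job_name="masked-fhir-import"):
    """Build the arguments for HealthLake's StartFHIRImportJob call."""
    return {
        "JobName": job_name,
        "DatastoreId": datastore_id,
        "DataAccessRoleArn": role_arn,
        "InputDataConfig": {"S3Uri": masked_s3_uri},  # the "clean" bucket
        "JobOutputDataConfig": {
            "S3Configuration": {
                "S3Uri": job_output_s3_uri,  # where HealthLake writes job logs
                "KmsKeyId": kms_key_id,
            }
        },
    }


def import_masked_data(healthlake_client, request, poll_seconds=30,
                       max_polls=120):
    """Start the import job and poll until it leaves the in-progress states."""
    job_id = healthlake_client.start_fhir_import_job(**request)["JobId"]
    for _ in range(max_polls):
        properties = healthlake_client.describe_fhir_import_job(
            DatastoreId=request["DatastoreId"], JobId=job_id
        )["ImportJobProperties"]
        if properties["JobStatus"] not in IN_PROGRESS_STATUSES:
            return properties["JobStatus"]
        time.sleep(poll_seconds)
    raise TimeoutError(f"Import job {job_id} did not finish in time")


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   status = import_masked_data(boto3.client("healthlake"),
#                               build_import_job_request(...))
```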

You can now view the masked FHIR data. The following image shows an unmasked and a masked example of FHIR PHI data. The unmasked example shows a data structure that includes Joe, Bloggs, male, a birthdate of 1-9-1964, a city of Haverhill, a state of MA, and a zip code of 10830. The masked example shows a data structure including Mr. John Doe, male, a birthdate of 11-12-1964, a city of Boston, a state of MA, and a zip code of 02108.

An unmasked and a masked example of FHIR PHI data

In this example, you can see that given and family names changed, the date of birth changed while still keeping the age intact, and addresses changed but are still a valid combination. Other PII and PHI were altered, but the patient’s allergies, medications, encounters, and other health data remained intact. This masked data, although altered, is clinically relevant and safe to use in both development and testing environments. Customers can use this masked data in Amazon SageMaker to build, test, and deploy models, such as predicting patient mortality within 90 days after ICU discharge, or in Amazon QuickSight to create a population health monitoring dashboard.
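To make the idea concrete, here is a toy sketch of this kind of transformation on a FHIR Patient resource. It is not DataMasque’s ruleset or algorithm: the replacement lists and the function name are invented for illustration, and a real masking tool draws on far larger datasets and many more rules. The sketch replaces names, moves the birth date within the same year, and swaps the address for a consistent city, state, and postal code combination, while leaving every other field untouched.

```python
"""Toy illustration of masking a FHIR Patient resource (not DataMasque's algorithm)."""
import copy
import random
from datetime import date, timedelta

# Illustrative replacement values only.
GIVEN_NAMES = ["John", "Maria", "Alex"]
FAMILY_NAMES = ["Doe", "Smith", "Lee"]
ADDRESSES = [
    ("Boston", "MA", "02108"),
    ("Worcester", "MA", "01608"),
    ("Springfield", "MA", "01103"),
]


def mask_patient(patient, seed=0):
    """Return a masked copy of a Patient resource; the input is not modified."""
    rng = random.Random(seed)
    masked = copy.deepcopy(patient)
    for name in masked.get("name", []):
        name["given"] = [rng.choice(GIVEN_NAMES)]
        name["family"] = rng.choice(FAMILY_NAMES)
    if "birthDate" in masked:
        # Keep the birth year and pick a new day within that year.
        year = int(masked["birthDate"][:4])
        offset = rng.randrange(0, 365)
        masked["birthDate"] = (date(year, 1, 1) + timedelta(days=offset)).isoformat()
    for address in masked.get("address", []):
        city, state, postal_code = rng.choice(ADDRESSES)
        address.update(city=city, state=state, postalCode=postal_code)
    return masked
```

Seeding the random generator makes a run repeatable, which helps when comparing masked output across test runs.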

Cleanup

Stop the DataMasque EC2 instance when data masking runs are not required.
Delete the S3 buckets used for masked and unmasked data.
For out-of-place masking, you might want to delete your Source Bucket if it is no longer required.
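The cleanup steps could be scripted along these lines. This is a sketch with a hypothetical function name and placeholder identifiers. It assumes unversioned buckets (versioned buckets also need their object versions removed), and it deletes data permanently, so review the bucket list carefully before running anything like it.

```python
"""Sketch: stop the DataMasque instance and delete the S3 buckets."""


def cleanup(ec2_client, s3_resource, instance_id, bucket_names):
    """Stop the EC2 instance, then empty and delete each listed bucket."""
    ec2_client.stop_instances(InstanceIds=[instance_id])
    for bucket_name in bucket_names:
        bucket = s3_resource.Bucket(bucket_name)
        bucket.objects.all().delete()  # a bucket must be empty before deletion
        bucket.delete()


# Usage (requires boto3 and AWS credentials; all names are placeholders):
#   import boto3
#   cleanup(boto3.client("ec2"), boto3.resource("s3"),
#           "i-0123456789abcdef0", ["my-unmasked-bucket", "my-masked-bucket"])
```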

Conclusion

In this post, Brian, Snehanshu, and I showed you how to mask healthcare data for regulatory compliance using Amazon HealthLake and DataMasque.

Adhering to regulatory de-identification requirements while ensuring the data remains useful for data consumers requires a specialized toolset. Data masking is an integral part of a healthcare organization’s data security strategy, and high-quality, de-identified data is key to building new solutions that improve care and healthcare delivery.

Next steps

Explore DataMasque’s solutions, available in AWS Marketplace.

AWS Marketplace