AWS Machine Learning Blog
Identifying and working with sensitive healthcare data with Amazon Comprehend Medical
At AWS, I regularly speak with AWS customers and AWS Partner Network (APN) partners about how they are using technology to transform human health. These companies often generate large amounts of health data that they use in a variety of applications, such as population health management and electronic health records. Developers need to find ways to use the valuable medical information in these applications while meeting their compliance obligations with regard to sensitive data, such as protected health information (PHI). Some applications where our customers and APN partners are doing this today are clinical decision support, revenue cycle management, and clinical trial management.
There are multiple methods to mask data, and each organization has its own approach based on internal risk assessments. We recommend that you consult risk assessment specialists about your organization’s specific implementation process. Typically, data is masked in two steps. First, PHI must be identified. Then, an algorithm either anonymizes or de-identifies the data, usually in accordance with the Safe Harbor or Expert Determination method. This approach lends itself to a state machine, which applies the business logic your organization requires to each step independently and passes information between states.
In this blog post, I’ll demonstrate how you can use a combination of Amazon Comprehend Medical, AWS Step Functions, and Amazon DynamoDB to identify sensitive health data and help support your compliance objectives. I’ll then discuss some potential extensions of the architecture that are patterns customers often adopt.
The architecture
This architecture uses the following services:
- Amazon Comprehend Medical to identify entities within a body of text
- AWS Step Functions and AWS Lambda to coordinate and execute the workflow
- Amazon DynamoDB to store the de-identified mapping
This architecture and the code that follows are available as an AWS CloudFormation template.
The individual components
Like many modern applications being built on AWS, the individual components within this architecture are represented as Lambda functions. In this blog post, I’ll show you how to build three Lambda functions:
- IdentifyPHI: Uses the Amazon Comprehend Medical API to detect and identify PHI entities from a body of text, such as a medical note.
- MaskEntities: Takes the entities from IdentifyPHI as input and masks them in the body of text.
- DeidentifyEntities: Takes the entities from IdentifyPHI, applies a hash to each entity, and stores the resulting mapping in DynamoDB.
Let’s walk through each in turn.
Identify PHI
The following code reads in a JSON body, extracts PHI entities from the message, and returns a list of extracted entities.
The workhorse in this Lambda function is the Amazon Comprehend Medical DetectPHI API call, which returns a list of entities that Amazon Comprehend Medical identifies. Each identified entity comes with a confidence score that indicates the level of confidence in its accuracy. You should take these confidence scores into account and review the identified entities to make sure they are correct. For more information on the returned data structure, see the DetectPHI documentation.
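A minimal sketch of what such a function might look like in Python with Boto3. The `message` input key and the shape of the returned dictionary are assumptions made for illustration, not the template's actual contract; `detect_phi` is the Boto3 method for the DetectPHI API.

```python
def extract_phi_entities(text, client):
    """Call DetectPHI and return the detected entities as a list of dicts."""
    response = client.detect_phi(Text=text)
    # Keep only the fields that later steps need; Score is the confidence value.
    return [
        {
            "Text": entity["Text"],
            "Type": entity["Type"],
            "Score": entity["Score"],
            "BeginOffset": entity["BeginOffset"],
            "EndOffset": entity["EndOffset"],
        }
        for entity in response["Entities"]
    ]


def lambda_handler(event, context):
    # boto3 is imported here so the helper above can be exercised offline.
    import boto3

    client = boto3.client("comprehendmedical")
    # Assumption: the incoming JSON carries the note under a "message" key.
    result = dict(event)
    result["entities"] = extract_phi_entities(event["message"], client)
    return result
```

Passing the client in as a parameter keeps the extraction logic testable with a stub that mimics the DetectPHI response.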
Mask entities
There are multiple approaches to masking a message. In this example, we take each entity and replace it with a series of pound signs (#) corresponding to the length of the entity. The output is the input message with each entity masked. You could choose whichever method is most meaningful and appropriate for your business. For example, if there are multiple NAME PHI entities, you could number them as NAME1, NAME2, and so on.
Here’s the Lambda function:
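A minimal sketch of this masking approach, assuming each entity carries the `BeginOffset` and `EndOffset` character offsets that DetectPHI returns. The `message`, `entities`, and `masked_message` keys are hypothetical names chosen for illustration.

```python
def mask_entities(message, entities):
    """Replace each entity span with '#' characters of the same length."""
    chars = list(message)
    for entity in entities:
        begin, end = entity["BeginOffset"], entity["EndOffset"]
        # Same-length replacement, so the other offsets stay valid.
        chars[begin:end] = "#" * (end - begin)
    return "".join(chars)


def lambda_handler(event, context):
    # Assumption: the previous state passes "message" and "entities" keys.
    return {"masked_message": mask_entities(event["message"], event["entities"])}
```

Masking by offset rather than by string search avoids accidentally masking unrelated text that happens to match an entity.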
De-identify entities
There are multiple methods for de-identification. The example described in this blog post demonstrates one way to de-identify sensitive entities so that a user with the appropriate permissions can re-identify them later. Here, we take the following steps:
- Apply a salt to the entity.
- For each entity, generate a SHA3-256 hash of the salted entity. Store the mapping from hash to entity in a dictionary.
- Replace each entity in the message with the hash generated in step 2.
- Generate a sha3-256 hash of the de-identified message.
- Store the entities in DynamoDB with the hashed message as the hash key and the entity hash as the range key.
Here is the Lambda function for this step. The EntityMap, which is a DynamoDB table, is read in as an environment variable:
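The steps above can be sketched as follows. This is one possible implementation under stated assumptions: the attribute names `MessageHash`, `EntityHash`, and `Entity`, the event keys, and the use of a random 16-byte salt per entity are all illustrative choices, and the environment variable is assumed to be named `EntityMap`.

```python
import hashlib
import os
import secrets


def sha3_hex(value):
    """Return the SHA3-256 hex digest of a string."""
    return hashlib.sha3_256(value.encode("utf-8")).hexdigest()


def deidentify(message, entities, table):
    """Swap each entity for a salted SHA3-256 hash and store the mapping."""
    entity_map = {}
    # Work from the end of the message so earlier offsets remain valid.
    for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        salt = secrets.token_hex(16)  # fresh random salt per entity
        entity_hash = sha3_hex(salt + entity["Text"])
        entity_map[entity_hash] = entity["Text"]
        message = (
            message[: entity["BeginOffset"]] + entity_hash + message[entity["EndOffset"]:]
        )
    message_hash = sha3_hex(message)
    for entity_hash, original in entity_map.items():
        # Hash key: hashed message. Range key: entity hash.
        table.put_item(
            Item={"MessageHash": message_hash, "EntityHash": entity_hash, "Entity": original}
        )
    return {"deid_message": message, "hashed_message": message_hash}


def lambda_handler(event, context):
    import boto3  # imported here so deidentify() can be tested offline

    # Assumption: the table name is supplied in the EntityMap variable.
    table = boto3.resource("dynamodb").Table(os.environ["EntityMap"])
    return deidentify(event["message"], event["entities"], table)
```

Because the salt is discarded after hashing, the stored mapping is the only way back to the original text, which is what makes access control on the table meaningful.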
Building the Boto3 Lambda layer
Next, we’ll create a Lambda layer containing Boto3. This is a common best practice when deploying Lambda functions in production.
Copy and paste the following code into a terminal. Feel free to change boto3env to a folder of your choice. The following example uses Python 3.6.
Note the LayerVersionArn in the output. We’ll use this shortly.
Building the state machine
The multiple steps within this workflow, such as data passed between steps and forking paths based on user input, are best represented as a state machine. We’ll use AWS Step Functions to define the state machine and execute the individual Lambda functions.
The state machine reads in a JSON blob containing the message text to process as well as whether to mask or de-identify the message. The overall steps are:
- Identify PHI entities using Amazon Comprehend Medical APIs.
- Determine whether to mask entities or de-identify.
- Based on the result of step 2, mask or de-identify the entities.
Here is the Amazon States Language code defining this state machine:
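As an illustration of the three steps, here is a sketch that builds an Amazon States Language definition as a Python dictionary and prints it as JSON. The state names, the placeholder Lambda ARNs, the `analysisType` input field, and the `Fail` fallback state are assumptions for this example and may differ from the definition in the CloudFormation template.

```python
import json

# Placeholder ARNs: substitute the ARNs of your own Lambda functions.
IDENTIFY_ARN = "arn:aws:lambda:REGION:ACCOUNT_ID:function:IdentifyPHI"
MASK_ARN = "arn:aws:lambda:REGION:ACCOUNT_ID:function:MaskEntities"
DEIDENTIFY_ARN = "arn:aws:lambda:REGION:ACCOUNT_ID:function:DeidentifyEntities"

definition = {
    "Comment": "Identify PHI, then mask or de-identify it",
    "StartAt": "IdentifyPHI",
    "States": {
        # Step 1: detect PHI entities.
        "IdentifyPHI": {"Type": "Task", "Resource": IDENTIFY_ARN, "Next": "MaskOrDeidentify"},
        # Step 2: branch on the (hypothetical) analysisType input field.
        "MaskOrDeidentify": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.analysisType", "StringEquals": "mask", "Next": "MaskEntities"},
                {
                    "Variable": "$.analysisType",
                    "StringEquals": "deidentify",
                    "Next": "DeidentifyEntities",
                },
            ],
            "Default": "UnknownAnalysisType",
        },
        # Step 3: act on the choice.
        "MaskEntities": {"Type": "Task", "Resource": MASK_ARN, "End": True},
        "DeidentifyEntities": {"Type": "Task", "Resource": DEIDENTIFY_ARN, "End": True},
        "UnknownAnalysisType": {
            "Type": "Fail",
            "Error": "UnknownAnalysisType",
            "Cause": "analysisType must be 'mask' or 'deidentify'",
        },
    },
}

print(json.dumps(definition, indent=2))
```

The `Default` branch makes an unrecognized choice fail loudly instead of silently leaving PHI in place.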
Testing the state machine
As mentioned in the introduction, you can deploy the entire architecture using AWS CloudFormation. Launch the CloudFormation template now:
Use the LayerVersionArn output that you noted previously as the value of the Boto3LayerArn CloudFormation parameter.
After the CloudFormation stack deploys, you should have the following resources:
- The three Lambda functions
- A DynamoDB table containing the mappings used to re-identify the entities
- A Step Functions state machine
- AWS Identity and Access Management (IAM) resources
Let’s take a fictional medical note, or rather a combination of what would be several notes, which was provided by the Amazon Comprehend Medical team. Notice that it’s filled with typos, which would present challenges for rules-based approaches for entity identification.
Stay Free Medical Center
Emergency Department
Clinical Summary
12341 W. Bohannon Rd, Grantville, GA
Phone: (770) 922-9800
PERSON INFORMATION
Name: SALAZAR, CARLOS
MRN: RQ36114734
ED Arrival Time: 11/12/2011 18:15
Sex: Male
DOB: 2/11/1961
Age: 50 Years
Visit Reason: New onset A Fib, SOB
Acuity: 2 Emergent Disposition: Home/Self-Care
Address: 186 VALETINE, NE 69201
Phone: 402 213-2221
SUBJECTIVE:
Carlos came to the ED via ambulance accompanied by son, Jorge. He is a 50 yo male who was working at Food Corp when he had sudden onset of palpitations. Carlos stated his fater, Diego, also had palpitations through his life.
Provider Contact Time: 11/12/2011 19:00
Decision to Admit: Not entered
ED Departure Time: 11/23/2011 00:07
DIAGNOSIS: Hyperthyroidism
Attending Provider:
Saanvi Sarkar, MD
Primary Nurse(s):
Jackson; Mateo
Fill New Prescriptions:
nepafenac (nepafenac 1 mg / 1mL Ophthalmic Suspension) 1 drop left eye every 12 hours 14 day(s)
zofran (Ondansetron 4 mg oral tablet) 4 mg ORAL PRN
atropine sulfate 0.05 mcg / hyopscyamine sulfate 3.1 mcg / phenobartbital 48.6 MG / scopolamine hydrobromide 0.0195 mg ( Donnata ER oral tablet) 1 table PO PRN
acetaminophen – hydrocodone ( Vicodin 5 mg – 500 mg oral tablet ) 2 tablet(s) by Mouth every 6 hours as needed for pain
docusate sodium 100 mg oral capsule 100 mg by Mouth twice daily as needed for constipation
Allergies:
penicillins
ibuprofen
bee pollen
Patient Education and Follow-up Information
Instructions:
ED, Nausea (Custom)
Follow up:
With:
Address:
When:
Return to Emergency Department
Comments:
Nausea Vomiting
Nausea persists without control from anti-nausea medications Projectile vomiting Uncontrolled , consistent nausea & vomiting Blood or “coffee grounds” appearing material in vomit Medicine not kept down because of vomiting Weakness or dizziness along with nausea/vomiting Severe stomach pain while vomiting
Pain
Severe Chest / Arm pain Severe squeezing or pressure in chest Severe sudden headache
New or uncontrolled pain New headache Chest discomfort Pounding heart Heart “flip – flop” feeling Painful Central Line site or area of “tunnel” Burning in chest or stomach Pain or burning while urinating Pain with infusion of medications or fluids into Central Line
Diarrhea
Constant or uncontrolled diarrhea New onset diarrhea Diarrhea with fever and abdominal cramping Whole pills passed in stool Greater than 5 times each day Stool which is bloody , burgundy or black Abdominal cramping
Fatigue
Unable to wake
Dizziness Fatigue is getting worse Too tired to get out of bed or walk to the bathroom Staying in bed all day
Fever / Chills
Shaking chills , temperature may be normal Temperature greater than 38.3° C or 100.9° F by mouth Fever greater than 1 degree above usual if on steroids 24 Cold symptoms ( runny nose , watery eyes , sneezing , coughing )
With:
Address:
When:
Follow up with primary care provider
Comments:
Call tomorrow to make an appointment for the next 1-2 days and to start arranging PCP follow-up
Thank you for visiting the Stay Free Medical Center.
The input to the state machine takes two values: the note, and a choice of whether to mask the note or de-identify it. In this example, we’ll de-identify the message. Here’s what that looks like:
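As a rough illustration of that input, the following sketch builds the two-value payload and saves it as JSON. The field names `message` and `analysisType`, and the allowed values `mask` and `deidentify`, are hypothetical; check the deployed state machine for the names it actually expects.

```python
import json

ALLOWED_TYPES = ("mask", "deidentify")  # hypothetical option names


def build_input(note_text, analysis_type):
    """Build the state machine input: the note plus the processing choice."""
    if analysis_type not in ALLOWED_TYPES:
        raise ValueError(f"analysis_type must be one of {ALLOWED_TYPES}")
    return {"message": note_text, "analysisType": analysis_type}


def save_input(payload, path):
    """Write the payload to disk as JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(payload, handle, indent=2)


# Example usage with a short excerpt of the note:
#   payload = build_input("Name: SALAZAR, CARLOS\nMRN: RQ36114734", "deidentify")
#   save_input(payload, "example_note.json")
```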
In the AWS CloudFormation console, navigate to the output page and note the state machine Amazon Resource Name (ARN). You will use it later to invoke a state machine execution.
You can test using the AWS CLI, your AWS SDK of choice, or the AWS Step Functions console. The following command shows how to do this with the CLI. Before you run it, copy the previous JSON and save it to example_note.json, and replace the AWS Step Functions state machine ARN with the ARN in the CloudFormation output.
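If you prefer an SDK to the CLI, a Boto3 equivalent might look like the sketch below. It assumes the input was saved to example_note.json; the ARN in the usage comment is a placeholder to replace with the one from the CloudFormation output.

```python
import json


def start_deidentification(sfn_client, state_machine_arn, input_path="example_note.json"):
    """Start a state machine execution with the JSON file as input."""
    with open(input_path, encoding="utf-8") as handle:
        payload = json.load(handle)  # fails early if the file is not valid JSON
    response = sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps(payload),
    )
    return response["executionArn"]


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   arn = "arn:aws:states:REGION:ACCOUNT_ID:stateMachine:NAME"  # placeholder
#   print(start_deidentification(boto3.client("stepfunctions"), arn))
```

The returned execution ARN can be used to look up the execution and its output in the Step Functions console.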
The overall execution should take only a couple of seconds. Let’s navigate to the AWS Step Functions console to see what happened.
When you ran the previous command, several things happened.
- A Lambda function identified potential PHI entities within the note.
- These entities were salted and the resulting combination was hashed using SHA3-256.
- The hashes replaced the original entities in the message and the updated message was then hashed.
- The mappings were stored in DynamoDB.
- The hashed message is returned as the output of the execution.
You can view the output from the steps in the AWS Step Functions console. The previous message should now look like the following (formatted for ease of reading). The de-identified message still contains valuable information that can be used, but each sensitive entity has been replaced with its hash.
Here’s what the table looks like after two runs with the same message.
Because each entity is salted, there’s no way of mapping a hash back to the original entity without using the DynamoDB mapping table. You can see the effect of salting in the table: repeated entities have different hashes. Additionally, since you can manage DynamoDB access using IAM, you can control who has access to the items in your table. You can then use AWS CloudTrail to audit reads from your table containing sensitive information.
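For a user who does have access to the table, re-identification might look like the following sketch. It uses the low-level DynamoDB `query` call and assumes hypothetical string attributes named `MessageHash` (hash key), `EntityHash` (range key), and `Entity`; the table name is passed in by the caller. A single `query` call returns at most 1 MB of data, so a message with very many entities would need pagination.

```python
import hashlib


def reidentify(deid_message, dynamodb_client, table_name):
    """Restore the original entities in a de-identified message."""
    # The table is keyed by the SHA3-256 hash of the de-identified message.
    message_hash = hashlib.sha3_256(deid_message.encode("utf-8")).hexdigest()
    response = dynamodb_client.query(
        TableName=table_name,
        KeyConditionExpression="MessageHash = :h",
        ExpressionAttributeValues={":h": {"S": message_hash}},
    )
    restored = deid_message
    for item in response["Items"]:
        # Each hash appears once in the message, so replace it with its entity.
        restored = restored.replace(item["EntityHash"]["S"], item["Entity"]["S"])
    return restored
```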
Conclusion and next steps
Protecting sensitive data is always job zero for healthcare organizations. In this blog post, I demonstrated how you can use Amazon Comprehend Medical to work with and identify protected health information. While organizations have different approaches to protect sensitive data, they follow the same architectural pattern: (1) identify the sensitive entities, and (2) apply the appropriate protection strategy for the sensitive entities as defined by your organization. A state machine is well-suited to orchestrate the two steps.
There are additional modifications you can make to this architecture to suit your needs. Here are a few ideas:
- Put the state machine behind Amazon API Gateway to add an authorization layer for processing your text, as well as a gateway to the individual Lambda functions.
- Filter by the confidence of the DetectPHI call. Amazon Comprehend Medical entities have a Score field in addition to Text. You can apply a threshold to filter the entities, depending on your business requirements.
- Use DetectPHI in conjunction with DetectEntities to help you detect and identify PHI, and also extract non-PHI entity relationships, which can be used for downstream analytics.
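The confidence filter mentioned above can be sketched in a few lines. The default threshold of 0.9 is an arbitrary illustration, not a recommendation; the right value depends on your risk assessment, since a higher threshold leaves more low-confidence PHI candidates unmasked.

```python
def filter_by_confidence(entities, threshold=0.9):
    """Keep only entities whose DetectPHI Score meets the threshold."""
    return [entity for entity in entities if entity["Score"] >= threshold]


# Example with entities shaped like the DetectPHI response:
sample = [
    {"Text": "Carlos", "Type": "NAME", "Score": 0.99},
    {"Text": "Food Corp", "Type": "ADDRESS", "Score": 0.42},
]
high_confidence = filter_by_confidence(sample)
```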
Interested in learning more about Amazon Comprehend Medical?
- Check out the documentation
- Explore related blog posts discussing Amazon Comprehend Medical
We welcome your questions and comments. We look forward to hearing from you!
About the Author
Dr. Aaron Friedman is a Healthcare and Life Sciences Partner Solutions Architect at Amazon Web Services. He works with ISVs and SIs to architect healthcare solutions on AWS, and bring the best possible experience to their customers. His passion is working at the intersection of science, big data, and software. In his spare time, he’s exploring the outdoors, learning a new thing to cook, or spending time with his wife, son, and his dog, Macaroon.