AWS Big Data Blog
How Eliza Corporation Moved Healthcare Data to the Cloud
February 2023 Update: Console access to the AWS Data Pipeline service will be removed on April 30, 2023. On this date, you will no longer be able to access AWS Data Pipeline through the console. You will continue to have access to AWS Data Pipeline through the command line interface and API. Please note that the AWS Data Pipeline service is in maintenance mode and we are not planning to expand the service to new regions. For information about migrating from AWS Data Pipeline, please refer to the AWS Data Pipeline migration documentation.
This is a guest post by Laxmikanth Malladi, Chief Architect at NorthBay. NorthBay is an AWS Advanced Consulting Partner and an AWS Big Data Competency Partner.
“Pay-for-performance” in healthcare pays providers more to keep the people under their care healthier. This is a departure from fee-for-service where payments are for each service used. Pay-for-performance arrangements provide financial incentives to hospitals, physicians, and other healthcare providers to carry out improvements and achieve optimal outcomes for patients.
Eliza Corporation, a company that focuses on health engagement management, acts on behalf of healthcare organizations such as hospitals, clinics, pharmacies, and insurance companies. This allows those organizations to engage people at the right time, with the right message, and in the right medium. By meeting people where they are in life, Eliza can capture relevant metrics and analyze the overall value provided by healthcare.
Eliza analyzes more than 200 million such outreaches per year, primarily through outbound phone calls with interactive voice responses (IVR) and other channels. For Eliza, outreach results are the questions and responses that form a decision tree, with each question and response captured as a pair:
<question, response>: <“Did you visit your physician in the last 30 days?” , “Yes”>
This type of data is characteristic of and distinctive to Eliza's business, and it poses challenges in processing and analysis. For example, you can't store it in a table with fixed columns.
The majority of data at Eliza takes the form of outreach results captured as a set of <attribute> and <attribute value> pairs. Other data sets at Eliza include structured data identifying the members to target for outreach. This data is received from various sources, including customer systems, claims data, pharmacy data, electronic medical record (EMR/EHR) data, and enrichment data. The data that Eliza handles to keep the business running varies considerably in both structure and quality.
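To make the shape of this data concrete, here is a small, hypothetical illustration (not Eliza's code) of an outreach result as a variable-length set of pairs rather than a row with fixed columns:

import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration: an outreach result is a variable-length set of
// <question, response> pairs, so no fixed-column table can hold every outreach
public class OutreachResult {

    private final Map<String, String> pairs = new LinkedHashMap<>();

    public void record(String question, String response) {
        pairs.put(question, response);
    }

    public static void main(String[] args) {
        OutreachResult result = new OutreachResult();
        result.record("Did you visit your physician in the last 30 days?", "Yes");
        System.out.println(result.pairs); // each outreach can carry a different set of questions
    }
}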
NorthBay was chosen as the big data partner to architect and implement a data infrastructure to improve the overall performance of Eliza’s process. NorthBay architected a data lake on AWS for Eliza’s use case and implemented the majority of the data lake components by following the best practice recommendations from the AWS white paper “Building a Data Lake on AWS.”
In this post, I discuss some of the practical challenges we faced during the implementation of the data lake for Eliza and how we solved them with AWS. The challenges centered on the variety of the data and the need for a common view across it.
Data transformation
This section highlights some of the transformations done to overcome the challenges related to data obfuscation, cleansing, and mapping.
The following architecture diagram depicts the flow for each of these processes.
- An Amazon S3 manifest file or a time-based event triggers an AWS Lambda function.
- The Lambda function launches an AWS Data Pipeline orchestration process, passing the relevant parameters (a sketch of such a handler follows this list).
- The Data Pipeline process creates a transient Amazon EMR resource and submits the appropriate Hadoop job.
- The Hadoop job is configured to read the relevant metadata tables from Amazon DynamoDB and to use AWS KMS for encrypt/decrypt operations.
- Using the metadata, the Hadoop job transforms the input data and writes the results to the appropriate S3 location.
- When the Hadoop job is complete, a message is published to an Amazon SNS topic to trigger further processing.
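As an illustration of the first two steps, here is a minimal sketch (not Eliza's production code) of a Lambda handler that activates a pre-defined pipeline using the AWS SDK for Java; the pipeline ID and the myInputS3Path parameter name are hypothetical:

import com.amazonaws.services.datapipeline.DataPipelineClient;
import com.amazonaws.services.datapipeline.model.ActivatePipelineRequest;
import com.amazonaws.services.datapipeline.model.ParameterValue;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.S3Event;

public class PipelineLauncher implements RequestHandler<S3Event, String> {

    private final DataPipelineClient pipelineClient = new DataPipelineClient();

    @Override
    public String handleRequest(S3Event event, Context context) {
        // Derive the manifest location from the triggering S3 event
        String bucket = event.getRecords().get(0).getS3().getBucket().getName();
        String key = event.getRecords().get(0).getS3().getObject().getKey();

        // Activate a pre-defined pipeline, passing the manifest location as a parameter
        pipelineClient.activatePipeline(new ActivatePipelineRequest()
            .withPipelineId("df-EXAMPLE1234567")            // hypothetical pipeline ID
            .withParameterValues(new ParameterValue()
                .withId("myInputS3Path")                    // hypothetical parameter name
                .withStringValue("s3://" + bucket + "/" + key)));

        return "Activated pipeline for s3://" + bucket + "/" + key;
    }
}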
Data obfuscation
To meet Eliza’s needs for protecting data privacy, the following business rule was created:
When dealing with PII (Personally Identifiable Information) and PHI (Protected Health Information) data in non-production environments, the PII and PHI must be obfuscated or masked before the data can be shared with the development teams.
Considering the volume and velocity, the obfuscation itself becomes a big data problem.
Eliza’s obfuscation strategy relies on creating an obfuscation rule for each of the 18-20 known PII data elements (such as names, dates of birth, and telephone numbers). The metadata required for the obfuscation process is stored in DynamoDB. The following table shows the sample schema and data related to this process.
Some fields are obfuscated with dummy values and others with hash values. Fields that are present in the data file but not in this metadata table are not considered sensitive and are therefore not modified. Hashing some values allows those fields to be joined across multiple data sets, because similar fields in all data sets are hashed using the same algorithm.
The mapper part of the process reads the metadata from DynamoDB and creates an obfuscated line by going through each field and applying the corresponding obfuscation. The KMS kmsKeyId value is combined with the actual field value as input to the hashing algorithm, adding a layer of complexity.
Mapper file snippet:
if (obfuscationType.equals("none")) {
    newValue = originalValue;
} else if (obfuscationType.equals("hash")) {
    newValue = ObfuscationSet.toHash(originalValue, kmsKeyId);
} else if (obfuscationType.equals("dummyvalue")) {
    newValue = dummyValue;
}
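The body of ObfuscationSet.toHash isn't shown in the original post; the following is a minimal sketch of one way to implement such a keyed hash, assuming the kmsKeyId string is used as the key of an HMAC:

import java.nio.charset.StandardCharsets;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public final class ObfuscationSet {

    private ObfuscationSet() {}

    // Hypothetical implementation: an HMAC-SHA256 keyed with the kmsKeyId string.
    // Because every data set uses the same key and algorithm, identical values
    // hash to identical tokens and remain joinable after obfuscation.
    public static String toHash(String originalValue, String kmsKeyId) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(kmsKeyId.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            byte[] digest = mac.doFinal(originalValue.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (Exception e) {
            throw new RuntimeException("Unable to hash value", e);
        }
    }
}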
The obfuscation process runs per file, and we chose to retain a one-to-one mapping between each original data file and its obfuscated output file.
Reducer file snippet:
// Coalesce to a single partition so that each input file produces exactly one obfuscated output file
input.repartition(1).saveAsTextFile(outputPath);
Data cleansing
The data received by Eliza is populated by disparate systems and can include free-form entries by consumers and customers. For example, a phone number can be entered as any of the following:
- (123) 456-7890
- 123.456.7890
- 123-456-7890
Because the data may not arrive in a standard format, an additional process has to be in place to cleanse the data and bring it to a common format.
At Eliza, most of the field formats were already known, and we were able to bring the data to a common format using the data cleansing technique described below. The following table shows a sample definition and values for the metadata created in DynamoDB.
The values in the InputRegex column define how the columns in different data sources should be treated. The schema structure allows you to apply multiple data cleansing rules to the same field and specifies the order in which the rules are applied.
Mapper snippet for the Spark job:
cleansedObject = DataCleansingUtil.INSTANCE.applyCleansingOnColumn(cleansingRules, attributes[i], i);
DataCleansingUtil snippet:
input = DataCleansingUtil.INSTANCE.getCleansingString(input,
    (String) cleansingStep.get(Constants.INPUT_REGEX),
    (String) cleansingStep.get(Constants.OUTPUT_REGEX));
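The post doesn't show the body of getCleansingString; here is a minimal sketch of a plausible implementation, assuming InputRegex holds a match pattern and OutputRegex its replacement string:

public final class DataCleansingSketch {

    // Hypothetical implementation of one cleansing step as a regex replacement
    public static String getCleansingString(String input, String inputRegex, String outputRegex) {
        if (input == null || inputRegex == null) {
            return input;
        }
        // A null OutputRegex simply strips whatever InputRegex matches
        return input.replaceAll(inputRegex, outputRegex == null ? "" : outputRegex);
    }

    public static void main(String[] args) {
        // A rule that strips every non-digit character normalizes all three
        // phone formats above to the same value: 1234567890
        System.out.println(getCleansingString("(123) 456-7890", "[^0-9]", ""));
        System.out.println(getCleansingString("123.456.7890", "[^0-9]", ""));
        System.out.println(getCleansingString("123-456-7890", "[^0-9]", ""));
    }
}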
Data mapping
Mapping data allows you to combine data from multiple data sources efficiently.
In the current implementation at Eliza, we solved the problem of providing a common view across data sources that have a known metadata or schema structure. For example, the fields Zipcode, Zip, zip-code, zip_code, zip4, etc., coming from different programs and data sources, all refer to the same piece of information, “Zip Code”. An ontology provides a process for building the common view when combining data from these different sources.
At Eliza, based on the existing processes and knowledge of the current data sets, we were able to build a data mapping to consolidate fields across data sources.
The following DynamoDB table shows a sample schema and values for storing the mapping metadata. AttributeValue in DynamoDB corresponds to the common field name that is used across multiple data sets.
This table is read once per data source, and the information is stored in the source metadata. The source metadata is, in turn, read by the data processing Hadoop jobs while consolidating and transforming the data sets.
From the given sample, MEMBERLANGUAGE, MEMBER_LANGUAGE, and PRIMARY_LANGUAGE are treated as the same attribute, “Language”. The consolidated data contains only the canonical representation of each attribute, derived from the “attributevalue” field in the mapping metadata table.
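As an illustration, here is a minimal sketch of the one-time mapping load and lookup; the table name and the attributename key are hypothetical, while attributevalue is the common field name described above:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.Table;
import java.util.HashMap;
import java.util.Map;

public class AttributeMapping {

    // Load the mapping table once per data source ("attribute_mapping" and
    // "attributename" are hypothetical names)
    public static Map<String, String> loadMapping() {
        DynamoDB dynamoDB = new DynamoDB(new AmazonDynamoDBClient());
        Table table = dynamoDB.getTable("attribute_mapping");
        Map<String, String> mapping = new HashMap<>();
        for (Item item : table.scan()) {
            // e.g. "MEMBER_LANGUAGE" -> "Language"
            mapping.put(item.getString("attributename").toUpperCase(),
                        item.getString("attributevalue"));
        }
        return mapping;
    }

    // Canonicalize one source field name; unmapped fields pass through unchanged
    public static String canonicalize(Map<String, String> mapping, String sourceField) {
        return mapping.getOrDefault(sourceField.toUpperCase(), sourceField);
    }
}

With this mapping loaded, MEMBERLANGUAGE, MEMBER_LANGUAGE, and PRIMARY_LANGUAGE all resolve to the canonical attribute "Language".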
Conclusion
An S3-based data lake is an architectural pattern well suited to enterprises whose data has high variety and velocity and must serve multiple consumption patterns.
The sensitive, highly regulated healthcare information that Eliza handles made this a demanding real-life data lake implementation. In the course of building the AWS-based data platform for Eliza, we encountered the practical challenges described above and found effective ways to address them with AWS. While the implementation details are specific to healthcare, the high-level design was purposely kept generic so that it can be applied across industries and enterprises.
If you have questions or suggestions, please comment below.
The content of this blog post will be included in an AWS Partner Webinar on Tuesday, October 18, 2016, featuring NorthBay and AWS. To register, click here.
Related
Readmission Prediction Through Patient Risk Stratification Using Amazon Machine Learning