AWS for Industries

Reduce Lab Code Matching Efforts with hc1 on AWS

Blog guest authored by Charlie Clarke, Yanni Pandelidis, and Bill Robinson from hc1

A back-office challenge between provider networks and diagnostic testing organizations is the sharing of common service, test, and procedure names and codes. Historically, different organizations have had their own naming and coding catalog, also referred to as a compendium. The inconsistent representations of the same tests, services, or procedures are a challenge to organizations in providing services to each other. These differences create inaccurate interpretations of requests that can impact patient care and drive up costs for both organizations.

This article presents a specific use case, compendium matching between a diagnostic lab and its customers, along with the solution hc1 developed to address it. The hc1 solution uses machine learning to develop models that accurately and quickly match an inbound set of diagnostic test names and codes to a lab’s set of testing procedures. This process can be used to prepare responses to requests for proposals (RFPs) and later to map incoming requests to the correct diagnostic test procedure. hc1’s Compendium Management solution has reduced the manual effort associated with mapping hospital and lab network-specific lab codes to internal lab coding by 60–70 percent, while also improving the accuracy of code matching.

hc1 Compendium Management solution


Diagnostic labs use matching against enterprise compendia to drive many activities, such as pricing proposals for new diagnostic lab customers, onboarding new lab customers to their internal systems, and the day-to-day operations of fulfilling new service requests. Hospital systems and laboratories often perform these matching activities manually using valuable subject matter experts. Furthermore, these experts are often armed with incomplete or outdated versions of compendia, spreadsheets with previous matches, and online test catalogs to find all the information necessary to determine a match. The result is that the business-critical task of matching test compendia is time-consuming and has a high potential for inaccuracies, which impact many aspects of the diagnostic lab’s activities.

An hc1 customer faced this challenge in pricing operations. They needed to know how to quickly, accurately, and efficiently respond to diagnostic lab customer RFPs given limited information about the prospect’s current testing catalog. Utilizing Amazon Web Services (AWS), hc1 implemented proprietary artificial intelligence/machine learning models to address the issues of inconsistent and limited data to provide accurate matching. By normalizing the inputs, then enriching the input data, and applying natural language processing (NLP) techniques, hc1 created a streamlined tool for matching customer inputs to appropriate tests. Using this solution, a lab can accurately and efficiently process the largest requests in a matter of minutes.

hc1 Compendium Management solution high level data flow


A diagnostic test has a name and consists of one or more observations or analytes. These analytes can have names, units, reference ranges, and codes. The diagnostic test is typically performed on a single specimen using a specific methodology. The test is also assigned a procedure code for billing. With so many features to compare, matching tests from one compendium to another would seem to be straightforward.

However, in the example of an RFP, a customer will request pricing for their outsourced diagnostic testing. As input, they will provide:

  • A test ordering code
  • The current name of the diagnostic test
  • Possibly Current Procedural Terminology (CPT) code(s)
  • Monthly volume

Details like specimen, methodology, and related analytes are not necessarily available. The names given might be oddly abbreviated, truncated, or even misspelled. The challenge created by these inputs is how to maximize what can be inferred from the test names and other input values.

The test names need to be normalized first so they can be compared with each other more accurately. To overcome these challenges, hc1 leveraged a diverse set of common abbreviations mined from hundreds of thousands of tests extracted from over a billion lab orders. The hc1 team assembled lists of common abbreviations and acronyms to drive the normalization task. The team also used sources such as Logical Observation Identifiers Names and Codes (LOINC) and CPT to add to their vocabularies.
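The normalization step can be pictured with a minimal sketch. The abbreviation table and the function name below are invented for illustration; hc1's actual vocabularies are proprietary, far larger, and not shown in this article.

```python
import re

# Hypothetical abbreviation table. The real lists are mined from lab orders
# and supplemented with LOINC and CPT vocabularies.
ABBREVIATIONS = {
    "hgb": "hemoglobin",
    "ur": "urine",
    "w": "with",
    "diff": "differential",
    "cbc": "complete blood count",
}


def normalize_test_name(name: str) -> str:
    """Lowercase, split on non-alphanumeric characters, expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    expanded = [ABBREVIATIONS.get(token, token) for token in tokens]
    return " ".join(expanded)


print(normalize_test_name("CBC w/ Diff"))  # complete blood count with differential
```

Once two compendia have both been passed through the same normalization, differences in punctuation, casing, and abbreviation style no longer count against a match.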

Additional challenges exist when the test name includes other data features like specimen and methodology. Often a test name will include a specimen like “urine” or “serum.” The solution hc1 provides pulls that out of the name and uses it as a separate feature. The same is true of methodology. Terms like polymerase chain reaction (PCR) and liquid chromatography-mass spectrometry (LC-MS) are removed from the test name and treated as a new feature. With test name, specimen, methodology, and CPT treated as separate features, hc1 was able to start matching tests between different compendiums more accurately than on the test name alone.
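As a rough illustration of pulling specimen and methodology out of the name, the sketch below uses small keyword sets. The sets, the function name, and the output field names are assumptions made for this example, not hc1's implementation.

```python
# Illustrative keyword sets only; a production list would be much longer.
SPECIMEN_TERMS = {"urine", "serum", "plasma", "blood"}
METHOD_TERMS = {"pcr", "lc-ms", "elisa"}


def split_features(normalized_name: str) -> dict:
    """Move specimen and methodology keywords out of the name into their own fields.

    If several specimen or methodology keywords appear, the last one wins.
    """
    name_tokens = []
    specimen = None
    methodology = None
    for token in normalized_name.lower().split():
        if token in SPECIMEN_TERMS:
            specimen = token
        elif token in METHOD_TERMS:
            methodology = token
        else:
            name_tokens.append(token)
    return {
        "name": " ".join(name_tokens),
        "specimen": specimen,
        "methodology": methodology,
    }


print(split_features("cortisol urine lc-ms"))
```

Each field can then be compared on its own terms, so a specimen mismatch is no longer hidden inside an otherwise similar name string.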

During the modeling process, challenges associated with specific test types were uncovered. Features that only differed by small amounts, such as a single letter, were scored similarly and created inaccuracies in the results. For example, the difference between “Almond Food IgE” and “Almond Food IgG” is a single letter to the model. Even expanding to “Immunoglobulin E” vs. “Immunoglobulin G” was just a one letter difference, and therefore the model scored them as similar. Recognizing that certain keywords like “IgE” and “IgG” were very different in context, hc1 addressed this in the preprocessing. The team created lists of context-aware terms and enhanced the model to weigh terms appropriately. The same steps were followed for other tests, such as “Vitamin A” vs. “Vitamin C” or “Factor V” vs. “Factor X,” where the names were similar, but the one difference was significant.
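One way to picture context-aware weighting is a token-level similarity in which designated terms carry extra weight, so that a mismatch on one of them dominates the score. The sketch below uses a weighted Jaccard measure; the term list, the weight of 5.0, and the choice of Jaccard are all assumptions for illustration and are not taken from the hc1 model.

```python
# Hypothetical list of terms whose mismatch should be penalized heavily.
CRITICAL_TERMS = {"ige", "igg", "iga", "igm"}


def weighted_jaccard(name_a: str, name_b: str, critical_weight: float = 5.0) -> float:
    """Jaccard similarity over tokens, with critical terms counted more heavily."""
    tokens_a = set(name_a.lower().split())
    tokens_b = set(name_b.lower().split())

    def weight(token: str) -> float:
        return critical_weight if token in CRITICAL_TERMS else 1.0

    shared = sum(weight(t) for t in tokens_a & tokens_b)
    total = sum(weight(t) for t in tokens_a | tokens_b)
    return shared / total if total else 0.0


plain = weighted_jaccard("almond food ige", "almond food igg", critical_weight=1.0)
aware = weighted_jaccard("almond food ige", "almond food igg")
print(plain, aware)
```

With every token weighted equally, the two allergen tests share two of four distinct tokens and score 0.5. With the critical weight applied, the mismatched immunoglobulin classes pull the score down to about 0.17, which better reflects that these are different tests.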

Even CPT codes posed a particular challenge. A CPT code is a procedure code standard used for billing. However, it is not one-to-one with tests. The hc1 team found that a CPT used to bill for hundreds of different tests, such as allergens, was not highly valuable, but a CPT used for only one test is highly valuable. They created a dynamic value that can boost a matching score based on the number of tests with the same CPT in the target compendium. This helped prevent over-valuing common CPT codes while helping those codes with just one or two related tests to be valued accurately.
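The article does not give the formula for the dynamic CPT value. A minimal sketch, assuming a simple inverse-frequency boost, might look like the following. The record layout, the `max_boost` parameter, and the code values are placeholders chosen for the example.

```python
from collections import Counter


def cpt_boost_table(target_compendium, max_boost=0.3):
    """Map each CPT code to a boost that shrinks as more target tests share the code."""
    counts = Counter(
        cpt for test in target_compendium for cpt in test["cpt_codes"]
    )
    return {cpt: max_boost / n for cpt, n in counts.items()}


def boosted_score(base_score, input_cpts, target_test, boosts):
    """Add the largest applicable CPT boost to a base similarity score, capped at 1.0."""
    shared = set(input_cpts) & set(target_test["cpt_codes"])
    boost = max((boosts[c] for c in shared), default=0.0)
    return min(1.0, base_score + boost)


# Placeholder compendium: one code shared by three tests, one code used once.
target = [
    {"id": "T1", "cpt_codes": ["CPT-A"]},
    {"id": "T2", "cpt_codes": ["CPT-A"]},
    {"id": "T3", "cpt_codes": ["CPT-A"]},
    {"id": "T4", "cpt_codes": ["CPT-B"]},
]
boosts = cpt_boost_table(target)
print(boosted_score(0.5, ["CPT-B"], target[3], boosts))  # unique code, full boost
print(boosted_score(0.5, ["CPT-A"], target[0], boosts))  # shared code, smaller boost
```

Because the boost is computed from the target compendium itself, the same CPT code can be highly informative against one lab's catalog and nearly uninformative against another's.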

Not every matching project starts with the limited information in an RFP. Clients also need to perform matching for customer onboarding and lab operations where observation details are provided. For diagnostic testing, the observations include analytes, measurements, volumes, and examinations of a specimen. A test has one or more of these observations, and the set of observations impacts the accuracy of a match.

One standard that works quite well is LOINC for codifying the observation values. LOINC considers the component, type of property, interval of time, specimen type, scale, and methodology. Many labs do assign LOINC codes to their observation records, but it is still not universally adopted. Even the labs that do use LOINC codes will not always have the same set of observations associated at the test level.

Taking many of the concepts used for test name processing, hc1 applies the same style of rules to the analyte names: normalize where possible, move keywords into their own feature values, and use a standard list of analyte units. If the source and target both provide LOINC codes, the match is treated as exact. The analytes from the input compendium are then matched against all analytes in the target compendium to find the best matches. The set of analytes can then be compared to the sets of analytes on other tests. This level of detail allows hc1 to be confident that the test is a structural match at the analyte level.
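The analyte-level comparison can be sketched as follows. This is a simplified reading of the description above: when both analytes carry a LOINC code, the codes decide the match; otherwise the example falls back to comparing normalized names. The record layout, the greedy pairing, and the code values (deliberately not real LOINC codes) are assumptions.

```python
def analyte_match(a: dict, b: dict) -> bool:
    """Use LOINC codes when both sides have one; otherwise compare normalized names."""
    if a.get("loinc") and b.get("loinc"):
        return a["loinc"] == b["loinc"]
    return a["name"].strip().lower() == b["name"].strip().lower()


def analyte_set_similarity(input_analytes, target_analytes) -> float:
    """Share of analytes paired one-to-one, relative to the larger of the two sets."""
    remaining = list(target_analytes)
    matched = 0
    for a in input_analytes:
        for i, b in enumerate(remaining):
            if analyte_match(a, b):
                matched += 1
                del remaining[i]  # each target analyte can be used only once
                break
    larger = max(len(input_analytes), len(target_analytes))
    return matched / larger if larger else 0.0


input_test = [
    {"name": "Hemoglobin", "loinc": "X1"},
    {"name": "Hematocrit", "loinc": None},
]
target_test = [
    {"name": "HGB", "loinc": "X1"},
    {"name": "hematocrit ", "loinc": "X2"},
    {"name": "Platelets", "loinc": "X3"},
]
print(analyte_set_similarity(input_test, target_test))
```

Dividing by the larger set means a target test with extra analytes scores lower than one whose analyte set lines up exactly, which is the structural signal the paragraph above describes.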


The hc1 matching solution created to address the challenges in compendia matching consists of sequential processes that define sets of operations on the input and reference compendium files. These subprocesses are: preprocessing, vectorization, distance computation, and matching.

Matching steps

In preprocessing, the input and compendium files go through a sequence of NLP operations to clean and standardize the text columns. The NLP operations include string-to-token transformation, spelling correction, stop-word elimination, alternative label to preferred label transformation, and named entity recognition. These operations leverage multicore processing using PySpark, a Python interface to the Apache Spark open-source platform for multicore computing. Many advanced NLP operations in Spark are implemented in a transformation pipeline through the Spark NLP library, which is a state-of-the-art open-source Spark-based platform. The transition from single-core computing based on Python to multicore computing based on PySpark leads to a dramatic improvement in speed, in some cases from hours to minutes.
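To make the sequence of operations concrete, here is a single-process sketch in plain Python using only the standard library. It is not the PySpark or Spark NLP pipeline described above, it omits named entity recognition, and the word lists and similarity cutoff are invented for the example.

```python
import re
from difflib import get_close_matches

# Illustrative resources only.
STOP_WORDS = {"and", "the", "of", "for"}
VOCABULARY = ["glucose", "hemoglobin", "potassium", "culture", "urine"]
PREFERRED_LABELS = {"sugar": "glucose", "k": "potassium"}


def preprocess(text: str) -> list:
    """Tokenize, drop stop words, map alternative labels, and correct near-miss spellings."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    cleaned = []
    for token in tokens:
        if token in STOP_WORDS:
            continue
        token = PREFERRED_LABELS.get(token, token)
        # Only attempt spelling correction on longer, unknown tokens.
        if token not in VOCABULARY and len(token) > 3:
            close = get_close_matches(token, VOCABULARY, n=1, cutoff=0.8)
            if close:
                token = close[0]
        cleaned.append(token)
    return cleaned


print(preprocess("Hemoglobn and Glucos of the Urine"))
```

In a Spark setting, each of these steps would typically become a stage in a transformation pipeline applied to a whole column at once rather than a loop over tokens.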

The vectorization subprocess transforms every text string into an individual set of text tokens, then applies a mathematical embedding that transforms the multicolumn tokens into a multidimensional vector representation. Following vectorization, a dynamic weighting algorithm computes distance measures that relate each unlabeled input entity to each labeled reference compendium entity in terms of their similarity. The parameters and hyperparameters of the algorithm depend on the columns actually available in the particular matching process and on the presence or absence of null values, and they also reflect the importance and reliability of each column to the matching process.
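The article does not specify the embedding or the weighting scheme. As a stand-in, the sketch below vectorizes each column with character trigram counts, measures cosine similarity per column, and renormalizes hypothetical column weights over only those columns that are present on both sides, which is one simple way to handle null values.

```python
import math
from collections import Counter

# Hypothetical per-column importance; real values would come from tuning.
COLUMN_WEIGHTS = {"name": 0.6, "specimen": 0.2, "methodology": 0.2}


def trigram_vector(text: str) -> Counter:
    """Represent a string as counts of its character trigrams (with padding)."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine_similarity(u: Counter, v: Counter) -> float:
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def weighted_distance(a: dict, b: dict, weights=COLUMN_WEIGHTS) -> float:
    """Distance in [0, 1], using only columns that are non-null in both records."""
    usable = [col for col in weights if a.get(col) and b.get(col)]
    total_weight = sum(weights[col] for col in usable)
    if total_weight == 0:
        return 1.0  # nothing to compare, treat as maximally distant
    similarity = sum(
        weights[col] * cosine_similarity(trigram_vector(a[col]), trigram_vector(b[col]))
        for col in usable
    ) / total_weight
    return 1.0 - similarity


rfp_row = {"name": "glucose", "specimen": None}
lab_row = {"name": "glucose", "specimen": "serum", "methodology": "enzymatic"}
print(weighted_distance(rfp_row, lab_row))
```

Here the RFP row has no specimen or methodology, so the distance is computed on the name alone instead of being penalized for missing data.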

Given the vectorized entities and the corresponding distance measures, a matching algorithm is employed to produce an ordered list of matches for each input entity based on the distance measure. Implementation of the matching algorithm uses serverless AWS Lambda functions for scalability and performance enhancement.
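The ranking step can be sketched as a top-k selection over precomputed distances, wrapped in a function with the `(event, context)` signature that AWS Lambda uses for Python handlers. The event payload shape (`distances`, `top_k`) is hypothetical and chosen only for this example.

```python
import heapq


def rank_matches(distances: dict, top_k: int = 3) -> dict:
    """For each input entity, return its top_k nearest reference entities.

    distances has the form {input_id: {target_id: distance}}.
    """
    return {
        input_id: heapq.nsmallest(top_k, row.items(), key=lambda pair: pair[1])
        for input_id, row in distances.items()
    }


def lambda_handler(event, context):
    """Lambda-style entry point; the event layout is an assumption for this sketch."""
    ranked = rank_matches(event["distances"], event.get("top_k", 3))
    # Convert tuples to lists so the result serializes cleanly to JSON.
    return {
        "matches": {
            input_id: [list(pair) for pair in pairs]
            for input_id, pairs in ranked.items()
        }
    }


event = {"distances": {"IN1": {"T1": 0.4, "T2": 0.1, "T3": 0.9}}, "top_k": 2}
print(lambda_handler(event, None))
```

Because each input entity is ranked independently, the work can be split into batches and handed to separate function invocations, which is the kind of scaling a serverless approach makes straightforward.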

Process development includes a phase that optimizes the hyperparameters of the training process based on accuracy across a large and varied set of input and reference files. The expected performance of the matching process depends not just on matching process algorithms and parameters, but also on the quality and richness of the input files. For example, files that contain only lab order name information have a lower expectation of accuracy than files that include additional columns, such as analytes, specimen type, methodology, and CPT codes. Similarly, matching between business units is expected to have a higher performance than matching to external data sets. Continuous improvement is achieved through periodic hyperparameter optimization and algorithmic process adaptation based on close collaboration with the business and med-tech domain experts. Historical trusted mappings and marked feedback files, obtained through this collaboration, are critical sources of increasing knowledge and optimization for the matching solution.
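A minimal sketch of accuracy-driven tuning against trusted mappings is shown below, assuming an exhaustive grid search and top-1 accuracy as the metric. The article does not state which search strategy or metric hc1 uses, and the toy scoring function, record layout, and parameter name are hypothetical.

```python
from itertools import product


def top1_accuracy(score_fn, inputs, targets, trusted) -> float:
    """Share of inputs whose best-scoring target equals the trusted mapping."""
    hits = 0
    for item in inputs:
        best = max(targets, key=lambda target: score_fn(item, target))
        hits += best["id"] == trusted[item["id"]]
    return hits / len(inputs)


def grid_search(make_score_fn, grid, inputs, targets, trusted):
    """Try every parameter combination and keep the most accurate one."""
    best_params, best_accuracy = None, -1.0
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        params = dict(zip(keys, values))
        accuracy = top1_accuracy(make_score_fn(**params), inputs, targets, trusted)
        if accuracy > best_accuracy:
            best_params, best_accuracy = params, accuracy
    return best_params, best_accuracy


def make_score_fn(name_weight):
    """Toy scorer: weighted vote between an exact name match and an exact CPT match."""
    def score(item, target):
        return (name_weight * (item["name"] == target["name"])
                + (1 - name_weight) * (item["cpt"] == target["cpt"]))
    return score


inputs = [{"id": "a", "name": "glucose", "cpt": "C1"}]
targets = [
    {"id": "t1", "name": "glucose", "cpt": "C2"},
    {"id": "t2", "name": "insulin", "cpt": "C1"},
]
trusted = {"a": "t1"}  # historical mapping confirmed by a domain expert
print(grid_search(make_score_fn, {"name_weight": [0.2, 0.8]}, inputs, targets, trusted))
```

In this toy data the trusted mapping favors the name over the CPT code, so the search selects the higher name weight. Each new batch of expert-marked feedback extends the trusted set and can shift which parameters win.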

Process development includes algorithm and model development by the data science team working in the Amazon SageMaker Studio integrated development environment. Once ready for deployment, production runs are deployed through an automation process and interact with a user interface, which allows for the direct submission of business domain expert user files.

The hc1 general data solution utilizes multiple Amazon Virtual Private Clouds to host the data ingestion, database, machine learning, application, and management components of the hc1 solutions, including compendia management. Amazon Relational Database Service and Amazon Simple Storage Service are used to store the hundreds of terabytes of clinical data ingested and processed at hc1 on behalf of their customers. Amazon Elastic Compute Cloud and containers on AWS drive a dynamic and highly scalable compute infrastructure supporting the insights delivered to hc1’s customers for both solutions. These services, coupled with the suite of AWS Cloud Security services, allow hc1 to provide solutions safely and reliably to all of their customers.


The hc1 Compendium Management solution built on AWS enables accurate and efficient compendia matching for use cases including pricing, customer onboarding, and operations. By combining hc1’s industry expertise with its hundreds of terabytes of clinical data and machine learning skills, the Compendium Management solution automates the heavy lifting associated with critical compendia matching tasks.

The hc1 Compendium Management solution has reduced the manual effort associated with mapping hospital and lab network-specific lab codes to internal lab coding by 60–70 percent, while also improving the accuracy of code matching.

To understand how hc1 can apply its lab management and operation solutions to address your use cases, please contact hc1 to start the discussion around your needs.

About hc1

hc1 is a leader in critical insight, analytics, and solutions for precision health. The hc1 Precision Health Cloud organizes volumes of live data, including lab results, genomics, and medications, to deliver solutions that ensure that the right patient gets the right test and the right prescription. Today, hc1 powers solutions that optimize diagnostic testing and prescribing for millions of patients nationally. To learn more about hc1’s proven approach to personalizing care while eliminating waste for thousands of health systems, diagnostic laboratories, and health plans, visit the hc1 website and follow hc1 on Twitter, Facebook, and LinkedIn.

Charlie Clarke serves as the Sr. VP of Technology for hc1. He’s worked in the health IT world for over 25 years, helping to build technology that improves lives. When not solving complex healthcare challenges, Charlie enjoys the outdoors. You’ll often find him fishing the small lakes and streams of Indiana from his kayak.

Yanni Pandelidis, Ph.D. is a data scientist for hc1. He has AWS Machine Learning Expert certification and, in the past, has taught AI, machine learning, and engineering. His experience in the health domain spans over 25 years, where he has served in various leadership and technical roles. He loves playing basketball, dancing salsa, and performing as a singer and guitarist.

Bill Robinson is the VP of Data Engineering at hc1. He’s worked with clinical laboratory data and systems for over 18 years. He enjoys the challenge of finding ways to improve patient care with data-driven solutions. Outside of work, you will find Bill and his family visiting as many National Parks as possible. He also spends many hours a week with his sons practicing Taekwondo at a local martial arts school.

Harvey Ruback

Harvey Ruback is a Senior Partner Solution Architect on the Healthcare and Life Sciences team at Amazon Web Services. He has over 25 years of professional software development and architecture experience in a range of industries including speech recognition, aerospace, healthcare, and life sciences. When not working with customers, he enjoys spending time with his family and friends, exploring his new home state of New York, and working on his wife’s never-ending list of home projects.