AWS Machine Learning Blog

Map clinical notes to the OMOP Common Data Model and healthcare ontologies using Amazon Comprehend Medical

Being able to describe the health of patients with observational data is an important aspect of our modern healthcare system. The amount of quantifiable personal health information is vast and constantly growing as new healthcare methods, metrics, and devices are introduced. All of this data allows clinicians and researchers to understand how the health of a patient changes over time and identify opportunities for precision treatment. The aggregate of this data informs epidemiologists about the health of populations and enables the identification of cause and effect patterns.

Unstructured text, commonly in the form of clinical notes, is a rich source of patient observational health data. Often, there is important information written by clinicians into notes that isn’t encoded into a patient’s structured health record. Clinical notes can also be used to help validate the quality of structured health record data. Historically, the challenge with clinical notes has been that they require manual review to extract the contained medical insights, which is time consuming and expensive.

Amazon Comprehend Medical is a natural language processing (NLP) service that uses machine learning to extract insights like medical condition, medication, dosage, and strength quickly and accurately. Customers can use Amazon Comprehend Medical in a pay-as-you-go model and immediately begin extracting insights from medical text without having to develop or train a complex machine learning model themselves.

The Observational Medical Outcomes Partnership (OMOP) Common Data Model, maintained by the Observational Health Data Science and Informatics (OHDSI) community, is an industry standard, open source data model used for health data. OMOP uses standardized medical ontologies, or “vocabularies,” like SNOMED, to store observational health data. From the OHDSI website:

“The OMOP Common Data Model allows for the systematic analysis of disparate observational databases. The concept behind this approach is to transform data contained within those databases into a common format (data model) as well as a common representation (terminologies, vocabularies, coding schemes), and then perform systematic analyses using a library of standard analytic routines that have been written based on the common format.”

A feature of OMOP is the ability to store clinical notes captured from disparate health data sources. In the data model, these notes are linked to an individual patient and a visit, giving context for the note. OMOP also has the ability to store insights inferred from notes by a natural language processing (NLP) engine. In this blog post we’ll explore how you can use Amazon Comprehend Medical to read notes from OMOP, extract medical insights, and write them back into OMOP using SNOMED ontological codes to enhance patient and population observational health data.

OMOP note processing architecture

This example works with clinical notes in the larger context of a full OHDSI architecture. You can learn more about our automated architecture for OHDSI on AWS by visiting this GitHub repository. The code given here specifically reads notes from and writes insights to the OMOP common data model. However, the same general process can be used with other structured data models for observational data that support clinical notes.

A primary feature of OMOP is the ability to consolidate data from many sources, like Electronic Health Record (EHR) systems and administrative claims data, from many geographic regions, into a common data model. In OMOP, the analysis of observational data from disparate sources is enabled by mapping dozens of source vocabularies (sometimes called dictionaries or ontologies) like SNOMED, RxNorm, ICD10, Read, LOINC, and others to an OMOP standard vocabulary. This enables OMOP to act as a kind of Rosetta stone to allow the interpretation of observational data between different sources and from different global regions.

One important aspect of the approach demonstrated here is mapping the entities detected by Amazon Comprehend Medical to the SNOMED ontology. Because observations must be expressed using standard codes in the OMOP data model, we must first map the insights detected by Amazon Comprehend Medical to a standard ontology before we can write them to OMOP. This is accomplished by sending entity text to the SNOMED CT Browser service to receive a SNOMED code for each entity. The following diagram shows the architecture for this process.


Using R code for notes processing

You can download the R Code demonstrating this process from GitHub.

This R code is demonstrated inside of the OHDSI on AWS architecture running on an Amazon EC2 instance running RStudio Server Open Source Edition instance. RStudio is a web-based, integrated development environment (IDE) for working with R. It can be licensed either commercially or under AGPLv3.

The code provided here can also be used from an environment outside of AWS to call Amazon Comprehend Medical. If you want to use it outside of AWS, you’ll need to configure a credentials file that allows this code to call the Amazon Comprehend Medical service. You can find instructions for doing that here.

At the very beginning of the notebook, you’ll see the following statements that define your OMOP CDM connection details and schema name. If you are using the OHDSI on AWS architecture, these are provided for you in a file called connectionDetails.R in your home directory. If outside of AWS, populate these lines with the details specific to your architecture.

#Source the DatabaseConnector::connect() call for my OMOP database
	connectionDetails <- DatabaseConnector::createConnectionDetails(
	dbms = "redshift",
	server = "",
	user = "master",
	password = "password")
cdmDatabaseSchema <- "CMSDESynPUF1k"

Amazon Comprehend Medical returns a confidence score for every entity, trait, and attribute that it detects. Next in the code, you’ll see a variable that defines a minimum confidence score that detections must meet in order to be written to the NOTE_NLP table. This minimum confidence score can be altered as appropriate for your use case.

#Set the minimum confidence score an inference must meet from Amazon Comprehend Medical to be added to the NOTE_NLP table.
min_score <- 0.80

The actual calls to Amazon Comprehend Medical are implemented using the AWS SDK for Python (boto3). An R library called “reticulate” is used to execute this Python code from within R, pass in the note text, and receive back the entity data detected by Amazon Comprehend Medical.

#Reticulate is used to run Python code from within R
#source the Python code to call Amazon Comprehend Medical
e <- environment()
reticulate::source_python('', envir = e)

Every note found in the OMOP NOTE table is read using a SQL query and processed. For each entity detected in each note by Amazon Comprehend Medical, an appropriate SNOMED code is determined by calling the SNOMED CT Browser.

#Pass the detected medical entity to the SNOMED REST interface to get the matching SNOMED code.
snomed_call <- paste(base,endpoint,"?","query","=", entity$Text,"&limit=1&searchMode=partialMatching&lang=english&statusFilter=activeOnly&skipTo=0 &returnLimit=1&normalize=true", sep="")
get_snomed <- GET(snomed_call, type = "basic")

As each note is processed you’ll see output to the R console that shows each record written to the NOTE_NLP table.

[1] "Writing note_nlp record, 441... Standard Concept ID:313217, Lexical Variant: atrial fibrillation, Term Modifiers: CATEGORY: MEDICAL_CONDITION; TYPE: DX_NAME; TRAIT: DIAGNOSIS;"
  |=======================================================================================================================================| 100%
Executing SQL took 0.287 secs
[1] "Writing note_nlp record, 442... Standard Concept ID:4300877, Lexical Variant: left, Term Modifiers: CATEGORY: ANATOMY; TYPE: DIRECTION;"
  |=======================================================================================================================================| 100%
Executing SQL took 0.258 secs
[1] "Writing note_nlp record, 443... Standard Concept ID:4298444, Lexical Variant: breast, Term Modifiers: CATEGORY: ANATOMY; TYPE: SYSTEM_ORGAN_SITE;"
  |=======================================================================================================================================| 100%
Executing SQL took 0.259 secs
[1] "Writing note_nlp record, 444... Standard Concept ID:4112853, Lexical Variant: breast cancer, Term Modifiers: CATEGORY: MEDICAL_CONDITION; TYPE: DX_NAME; TRAIT: DIAGNOSIS;"
  |=======================================================================================================================================| 100%
Executing SQL took 0.45 secs
[1] "Writing note_nlp record, 445... Standard Concept ID:4080761, Lexical Variant: right, Term Modifiers: CATEGORY: ANATOMY; TYPE: DIRECTION;"
  |=======================================================================================================================================| 100%

After the processing is complete, you’ll have a NOTE_NLP table filled with records that represent medical insights extracted from the notes in your NOTE table. Each of these records has an OMOP Standard Concept ID (the NOTE_NLP_CONCEPT_ID field), mapped from a SNOMED code, that represents the primary entity detected. The TERM_MODIFIERS field in the NOTE_NLP table is used to capture the category, type, and any attribute or trait data that Amazon Comprehend Medical detected about each entity.

The following image shows a comparison of some original portions of note text with entities detected by Amazon Comprehend Medical and the corresponding NOTE_NLP record written to OMOP.


This example shows how Amazon Comprehend Medical makes natural language processing for medical text easily accessible. The insights detected by Amazon Comprehend Medical enable data scientists, epidemiologists, and medical researchers to extend patient and population structured observational health records with information that is locked away in unstructured notes. This additional data provides important insight for precision medicine, longitudinal studies, clinical trial candidate assessment, and population health studies. The code given here can help you get started processing notes into the OMOP Common Data Model today, or it can be used as a pattern to process clinical notes into any observational health data model.

About the Author

James Wiggins is a senior healthcare solutions architect at AWS. He is passionate about using technology to help organizations positively impact world health. He also loves spending time with his wife and three children.