AWS Machine Learning Blog
Clinical text mining using the Amazon Comprehend Medical new SNOMED CT API
Mining medical concepts from written clinical text, such as patient encounters, plays an important role in clinical analytics and decision-making applications, such as population analytics for providers, pre-authorization for payers, and adverse-event detection for pharma companies. Medical concepts include medical conditions, medications, procedures, and other clinical events. Extracting medical concepts is a complicated process due to the specialist knowledge required and the broad use of synonyms in the medical field. Furthermore, to make detected concepts useful for large-scale analytics and decision-making applications, they have to be codified. This is a process where a specialist looks up matching codes from a medical ontology, which often contains tens to hundreds of thousands of concepts.
To solve these problems, Amazon Comprehend Medical provides a fast and accurate way to automatically extract medical concepts from the written text found in clinical documents. You can now also use a new feature to automatically standardize and link detected concepts to the SNOMED CT (Systematized Nomenclature of Medicine—Clinical Terms) ontology. SNOMED CT provides a comprehensive clinical healthcare terminology and accompanying clinical hierarchy, and is used to encode medical conditions, procedures, and other medical concepts to enable big data applications.
This post details how to use the new SNOMED CT API to link SNOMED CT codes to medical concepts (or entities) in natural written text that can then be used to accelerate research and clinical application building. After reading this post, you will be able to detect and extract medical terms from unstructured clinical text, map them to the SNOMED CT ontology (US edition), retrieve and manipulate information from a clinical database, including electronic health record (EHR) systems, and map SNOMED CT concepts to other ontologies using the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) if your EHR system uses an ontology other than SNOMED CT.
Solution overview
Amazon Comprehend Medical is a HIPAA-eligible natural language processing (NLP) service that uses machine learning (ML) to extract clinical data from unstructured medical text, with no ML experience required, and automatically map it to the SNOMED CT, ICD10, or RxNorm ontologies with a simple API call. You can then add the ontology codes to your EHR database to augment patient data or link to other ontologies as desired through OMOP CDM. For this post, we demonstrate the solution workflow as shown in the following diagram with code based on the example sentence “Patient X was diagnosed with insomnia.”
To use clinical concept codes based on a text input, we detect and extract clinical terms, connect to the clinical database, transform the SNOMED CT code to an OMOP CDM concept ID, and use it within our records.
For this post, we use the OMOP CDM as an example database schema. Historically, healthcare institutions in different regions and countries have used their own terminologies and classifications for their own purposes, which prevents interoperability between systems. While SNOMED CT standardizes medical concepts with a clinical hierarchy, the OMOP CDM provides a standardization mechanism to move from one ontology to another, with an accompanying data model. The OMOP CDM standardizes the format and content of observational data so that standardized applications, tools, and methods can be applied across different datasets. In addition, the OMOP CDM makes it easier to convert codes from one vocabulary to another by providing maps between medical concepts in different hierarchical ontologies and vocabularies. The ontology hierarchy is set such that descendants are more specific than ancestors. For example, non-small cell lung cancer is a descendant of malignant neoplastic disease. This allows querying and retrieving concepts and all their hierarchical descendants, and also enables interoperability between ontologies.
We demonstrate implementing this solution with the following steps:
- Extract concepts with Amazon Comprehend Medical SNOMED CT and link them to the SNOMED CT (US edition) ontology.
- Connect to the OMOP CDM.
- Map the SNOMED CT code to OMOP CDM concept IDs.
- Use the structured information to perform the following actions:
  - Retrieve the number of patients with the disease.
  - Traverse the ontology.
  - Map to other ontologies.
Prerequisites
Before you get started, make sure you have the following:
- Access to an AWS account.
- Permissions to create an AWS CloudFormation stack.
- Permissions to call Amazon Comprehend Medical from Amazon SageMaker.
- Permissions to query Amazon Redshift from SageMaker.
- A SNOMED CT license. SNOMED International is a member-owned and member-driven organization, and use of SNOMED CT is free within each member’s territory. Members manage the release, distribution, and sub-licensing of SNOMED CT and other products of the association within their territory.
This post assumes that you have an OMOP CDM database set up in Amazon Redshift. See Create data science environments on AWS for health analysis using OHDSI to set up a sample OMOP CDM in your AWS account using CloudFormation templates.
Extract concepts with Amazon Comprehend Medical SNOMED CT
You can extract SNOMED CT codes using Amazon Comprehend Medical with two lines of code. Assume you have a document, paragraph, or sentence:
First, we instantiate the Amazon Comprehend Medical client in boto3. Then, we simply call Amazon Comprehend Medical’s SNOMED CT API:
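A minimal sketch of these two steps, assuming boto3 is installed and that AWS credentials and a Region are already configured. `infer_snomedct` is the boto3 operation for the SNOMED CT API; the helper name `link_snomed_concepts` and its optional `client` argument are illustrative additions, not part of the service.

```python
# Example sentence used throughout this post.
clinical_note = "Patient X was diagnosed with insomnia."


def link_snomed_concepts(text, client=None):
    """Send text to the Amazon Comprehend Medical SNOMED CT API.

    `client` is an illustrative hook for passing a preconfigured (or
    stubbed) client; when omitted, a boto3 client is created.
    """
    if client is None:
        # Imported lazily so the helper can also be used with a stub client.
        import boto3

        client = boto3.client("comprehendmedical")
    # The API call itself: a single request with the raw text.
    return client.infer_snomedct(Text=text)
```

Calling `response = link_snomed_concepts(clinical_note)` would return the response dictionary, provided the caller has permission to call Amazon Comprehend Medical.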
Done! In our example, the response is as follows:
The response contains the following:
- Characters – Total number of characters. In this case, we have 38 characters.
- Entities – List of detected medical concepts, or entities, from Amazon Comprehend Medical. The main elements in each entity are:
  - Text – Original text from the input data.
  - BeginOffset and EndOffset – The beginning and ending location of the text in the input note, respectively.
  - Category – Category of the detected entity. For example, MEDICAL_CONDITION for medical condition.
  - SNOMEDCTConcepts – Top five predicted SNOMED CT concept codes with the model’s confidence scores (in descending order). Each linked concept code has the following:
    - Code – SNOMED CT concept code.
    - Description – SNOMED CT concept description.
    - Score – Confidence score of the linked SNOMED CT concept.
- ModelVersion – Version of the model used for the inference.
- ResponseMetadata – API call metadata.
- SNOMEDCTDetails – Edition, language, and date of the SNOMED CT version used.
For more information, refer to the Amazon Comprehend Medical Developer Guide. By default, the API links detected entities to the SNOMED CT US edition. To request support for your edition, for example the UK edition, contact us via AWS Support or the Amazon Comprehend Medical forum.
In our example, Amazon Comprehend Medical identifies “insomnia” as a clinical term and provides five ordered SNOMED CT concepts and codes that we might be referring to in the sentence. In this example, Amazon Comprehend Medical correctly identifies the clinical term as the most likely option. Therefore, the next step is to extract the linked concept from the response. See the following code:
The content of pred_snomed is as follows, with its predicted SNOMED concept code, concept description, and prediction score (probability):
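A minimal sketch of this extraction step, assuming the response layout described above (an `Entities` list whose items carry `Text` and `SNOMEDCTConcepts`). The function name and the shape of the returned records are illustrative choices.

```python
def top_snomed_concepts(response):
    """Return the highest-scoring SNOMED CT concept for each detected entity."""
    results = []
    for entity in response.get("Entities", []):
        concepts = entity.get("SNOMEDCTConcepts", [])
        if not concepts:
            continue  # entity was detected but not linked to any concept
        # Pick the best score explicitly instead of relying on list order.
        best = max(concepts, key=lambda concept: concept["Score"])
        results.append(
            {
                "text": entity["Text"],
                "code": best["Code"],
                "description": best["Description"],
                "score": best["Score"],
            }
        )
    return results
```

With a response in hand, `pred_snomed = top_snomed_concepts(response)` gives one record per linked entity.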
We have identified clinical terms in our text and linked them to SNOMED CT concepts. We can now use SNOMED CT’s hierarchical structure and relations to other ontologies to accelerate clinical analytics and decision-making application development.
Before we access the database, let’s define some utility functions that are helpful in our operations. First, we must import the necessary Python packages:
The following code is a function to connect to the Amazon Redshift database:
The following code is a function to run a given query on the Amazon Redshift database:
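The connection and query helpers described above could be sketched as follows. This assumes the psycopg2 PostgreSQL driver, which is one common way to reach Amazon Redshift from Python; any DB-API 2.0 driver should work similarly. The function names `connect_to_db` and `run_query` are illustrative.

```python
def connect_to_db(host, port, dbname, user, password):
    """Open a connection to Amazon Redshift (assumes the psycopg2 driver)."""
    # Assumption: psycopg2 is installed; imported here so that run_query
    # below stays usable with any other DB-API connection.
    import psycopg2

    return psycopg2.connect(
        host=host, port=port, dbname=dbname, user=user, password=password
    )


def run_query(conn, query):
    """Run a SQL query and return (column_names, rows)."""
    cursor = conn.cursor()
    try:
        cursor.execute(query)
        # cursor.description holds one tuple per column; item 0 is the name.
        columns = [column[0] for column in cursor.description]
        rows = cursor.fetchall()
    finally:
        cursor.close()
    return columns, rows
```

Returning plain column names and rows keeps the helper independent of any data frame library; wrap the result in one if that suits your analysis.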
In the next sections, we connect to the database and run our queries.
Connect to the OMOP CDM
EHRs are often stored in databases using a specific ontology. In our case, we use the OMOP CDM, which contains a large number of ontologies (SNOMED, ICD10, RxNorm, and more), but you can extend the solution to other data models by modifying the queries. The first step is to connect to Amazon Redshift where the EHR data is stored.
Let’s define the variables used to connect to the database. You must substitute the placeholder values in the following code with your actual values based on your Amazon Redshift database:
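One way to define these variables is sketched below. The environment variable names and the angle-bracket placeholders are hypothetical; 5439 is the default Amazon Redshift port.

```python
import os

# Hypothetical environment variable names; replace the placeholders with
# the values for your own Amazon Redshift cluster.
redshift_config = {
    "host": os.environ.get("REDSHIFT_HOST", "<your-cluster-endpoint>"),
    "port": int(os.environ.get("REDSHIFT_PORT", "5439")),
    "dbname": os.environ.get("REDSHIFT_DBNAME", "<your-database>"),
    "user": os.environ.get("REDSHIFT_USER", "<your-user>"),
    "password": os.environ.get("REDSHIFT_PASSWORD", "<your-password>"),
}
```

Reading credentials from the environment keeps them out of the notebook itself.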
Map the SNOMED CT code to OMOP CDM concept IDs
The OMOP CDM uses its own concept IDs as data model identifiers across ontologies. Those differ from specific ontology codes such as SNOMED CT’s codes, but you can retrieve them from SNOMED CT codes using pre-built OMOP CDM maps. To retrieve the concept_id of SNOMED CT code 193462001, we use the following query:
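A sketch of such a lookup, assuming the standard OMOP CDM `concept` table, in which SNOMED CT concepts carry the `vocabulary_id` value `SNOMED`. The `schema` parameter stands for whatever schema holds the OMOP tables in your deployment, and the function name is illustrative.

```python
import re


def build_concept_id_query(snomed_code, schema):
    """Build SQL that looks up the OMOP concept_id for a SNOMED CT code."""
    # Both values are interpolated into the SQL text, so validate them first.
    if not re.fullmatch(r"[0-9]+", str(snomed_code)):
        raise ValueError("SNOMED CT codes are numeric")
    if not schema.isidentifier():
        raise ValueError("schema must be a plain SQL identifier")
    return (
        "SELECT concept_id "
        f"FROM {schema}.concept "
        "WHERE vocabulary_id = 'SNOMED' "
        f"AND concept_code = '{snomed_code}'"
    )
```

The resulting string can be passed to whichever query helper you use against Amazon Redshift.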
The output OMOP CDM concept_id is 436962. The concept ID uniquely identifies a given medical concept in the OMOP CDM database and is used as a primary key in the concept table. This enables linking of each code with patient information in other tables.
Use the structured information mapped from the SNOMED CT code to the OMOP CDM concept ID
Now that we have OMOP’s concept_id, we can run many queries against the database. When we find the particular concept, we can use it for different use cases. For example, we can use it to query population statistics with a given condition, traverse ontologies to bridge interoperability gaps, and extract the unique hierarchical structure of concepts to build the right queries. In this section, we walk you through a few examples.
Retrieve the number of patients with a disease
The first example is retrieving the total number of patients with the insomnia condition that we linked to its appropriate ontology concept using Amazon Comprehend Medical. The following code formulates and runs the corresponding SQL query:
In our sample records described in the prerequisites section, the total number of patients in the database that have been diagnosed with insomnia is 26,528.
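A possible form of this query, assuming the standard OMOP CDM `condition_occurrence` table with its `person_id` and `condition_concept_id` columns. The function name and `schema` parameter are illustrative.

```python
def build_patient_count_query(concept_id, schema):
    """Build SQL counting distinct patients recorded with a condition concept."""
    # Coerce to int so that only a number is interpolated into the SQL text.
    concept_id = int(concept_id)
    if not schema.isidentifier():
        raise ValueError("schema must be a plain SQL identifier")
    return (
        "SELECT COUNT(DISTINCT person_id) AS patient_count "
        f"FROM {schema}.condition_occurrence "
        f"WHERE condition_concept_id = {concept_id}"
    )
```

Counting distinct `person_id` values avoids double-counting patients who have several recorded occurrences of the same condition.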
Traverse the ontology
One of the advantages of using SNOMED CT is that we can exploit its hierarchical taxonomy. Let’s illustrate how via some examples.
Ancestors: Going up the hierarchy
First, let’s find the immediate ancestors and descendants of the concept insomnia. We use the concept_ancestor and concept tables to get the parent (ancestor) and children (descendants) of the given concept code. The following code is the SQL statement to output the parent information:
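A sketch of the parent lookup, assuming the standard OMOP CDM `concept_ancestor` columns (`ancestor_concept_id`, `descendant_concept_id`, `max_levels_of_separation`). The function name, the `schema` parameter, and the `levels` parameter are illustrative.

```python
def build_parents_query(concept_id, schema, levels=1):
    """Build SQL listing the ancestors of an OMOP concept.

    `levels` is compared with max_levels_of_separation; 1 selects the
    immediate parents.
    """
    # Coerce to int so that only numbers are interpolated into the SQL text.
    concept_id, levels = int(concept_id), int(levels)
    if not schema.isidentifier():
        raise ValueError("schema must be a plain SQL identifier")
    return (
        "SELECT c.concept_code, c.concept_name "
        f"FROM {schema}.concept_ancestor ca "
        f"JOIN {schema}.concept c ON c.concept_id = ca.ancestor_concept_id "
        f"WHERE ca.descendant_concept_id = {concept_id} "
        f"AND ca.max_levels_of_separation = {levels}"
    )
```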
In the preceding example, we used max_levels_of_separation=1 to limit the results to concept codes that are immediate ancestors. You can increase the number to reach further up the hierarchy. The following table summarizes our results.
concept_code | concept_name |
44186003 | Dyssomnia |
194437008 | Disorders of initiating and maintaining sleep |
SNOMED CT offers a polyhierarchical classification, which means a concept can have more than one parent. Such a hierarchy forms a directed acyclic graph (DAG).
Descendants: Going down the hierarchy
We can use similar logic to retrieve the children of the code insomnia:
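A sketch of the child lookup, again assuming the standard OMOP CDM `concept_ancestor` table; here the given concept plays the ancestor role. The function name and parameters are illustrative, and the `levels=None` option (all descendants at any depth) is an added convenience.

```python
def build_descendants_query(concept_id, schema, levels=1):
    """Build SQL listing the descendants of an OMOP concept.

    levels=1 selects immediate children; levels=None selects every
    descendant at any depth, excluding the concept itself.
    """
    concept_id = int(concept_id)
    if not schema.isidentifier():
        raise ValueError("schema must be a plain SQL identifier")
    if levels is None:
        # The self-referencing row has zero levels of separation; skip it.
        depth_filter = "ca.min_levels_of_separation > 0"
    else:
        depth_filter = f"ca.max_levels_of_separation = {int(levels)}"
    return (
        "SELECT c.concept_code, c.concept_name "
        f"FROM {schema}.concept_ancestor ca "
        f"JOIN {schema}.concept c ON c.concept_id = ca.descendant_concept_id "
        f"WHERE ca.ancestor_concept_id = {concept_id} "
        f"AND {depth_filter}"
    )
```

Selecting all descendants is useful when a patient count should include every more specific form of a condition.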
As a result, we get 26 descendant codes; the following table shows the first 10 rows.
concept_code | concept_name |
24121004 | Insomnia disorder related to another mental disorder |
191997003 | Persistent insomnia |
198437004 | Menopausal sleeplessness |
88982005 | Rebound insomnia |
90361000119105 | Behavioral insomnia of childhood |
41975002 | Insomnia with sleep apnea |
268652009 | Transient insomnia |
81608000 | Insomnia disorder related to known organic factor |
162204000 | Late insomnia |
248256006 | Not getting enough sleep |
We can then use these codes to query a broader set of patients (parent concept) or a more specific one (child concept).
Finding the concept at the appropriate hierarchy level is important, because if not accounted for appropriately, you might get wrong statistical answers from your queries. For example, in the preceding use case, let’s say that you want to find the number of patients with insomnia that is only related to not getting enough sleep. Using the parent concept for general insomnia gives you a different answer than specifying only the descendant concept code for not getting enough sleep.
Map to other ontologies
We can also map the SNOMED concept code to other ontologies such as ICD10CM for conditions and RxNorm for medications. Because insomnia is a condition, let’s find the corresponding ICD10 concept codes for the SNOMED concept code for insomnia. The following code is the SQL statement and function to find the ICD10 concept codes:
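One way to express this mapping is sketched below. It assumes the standard OMOP vocabulary tables, in which a source concept (here ICD10CM) is linked to its standard concept (here SNOMED) by a `Maps to` row in `concept_relationship`. The function name and `schema` parameter are illustrative.

```python
def build_icd10cm_mapping_query(concept_id, schema):
    """Build SQL listing ICD10CM concepts that map to a standard OMOP concept."""
    # Coerce to int so that only a number is interpolated into the SQL text.
    concept_id = int(concept_id)
    if not schema.isidentifier():
        raise ValueError("schema must be a plain SQL identifier")
    return (
        "SELECT c.concept_code, c.concept_name, c.vocabulary_id "
        f"FROM {schema}.concept_relationship cr "
        # concept_id_1 is the source (ICD10CM) side of a 'Maps to' row.
        f"JOIN {schema}.concept c ON c.concept_id = cr.concept_id_1 "
        f"WHERE cr.concept_id_2 = {concept_id} "
        "AND cr.relationship_id = 'Maps to' "
        "AND c.vocabulary_id = 'ICD10CM' "
        "ORDER BY c.concept_code"
    )
```

Changing the `vocabulary_id` filter would target a different source vocabulary, provided the corresponding maps are loaded in your OMOP CDM instance.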
The following table lists the corresponding ICD10 concept codes with their descriptions.
concept_code | concept_name | vocabulary_id |
G47.0 | Insomnia | ICD10CM |
G47.00 | Insomnia, unspecified | ICD10CM |
G47.09 | Other insomnia | ICD10CM |
When we’re done running SQL queries, let’s close the connection to the database:
Conclusion
Now that you have reviewed this example, you’re ready to apply Amazon Comprehend Medical to your clinical text to extract and link SNOMED CT concepts. We also provided concrete examples of how to use this information with your medical records using an OMOP CDM database to run SQL queries and get patient information related to the medical concepts. Finally, we showed how to extract the different hierarchies of medical concepts and convert SNOMED CT concepts to other standardized vocabularies such as ICD10CM.
The Amazon ML Solutions Lab pairs your team with ML experts to help you identify and implement your organization’s highest value ML opportunities. If you’d like help accelerating your use of ML in your products and processes, please contact the Amazon ML Solutions Lab.
About the Authors
Tesfagabir Meharizghi is a Data Scientist at the Amazon ML Solutions Lab where he helps customers across different industries accelerate their use of machine learning and AWS Cloud services to solve their business challenges.
Miguel Romero Calvo is an Applied Scientist at the Amazon ML Solutions Lab where he partners with AWS internal teams and strategic customers to accelerate their business through ML and cloud adoption.
Lin Lee Cheong is a Senior Scientist and Manager with the Amazon ML Solutions Lab team at Amazon Web Services. She works with strategic AWS customers to explore and apply artificial intelligence and machine learning to discover new insights and solve complex problems.