Can AI Reason Like a Clinician? An Exploration with Arterial Blood Gas Analyses

Using Generative AI to Interpret Arterial Blood Gas Results – Background

The year 2023 was an exciting one for artificial intelligence (AI) in healthcare. Large language models (LLMs), a type of foundation model (FM), have been especially fascinating in medical applications. This project explores how LLMs like Anthropic Claude v2 can assist clinicians in interpreting a complex blood test in clinical settings. Anyone who has experimented with LLMs will likely agree that their effectiveness depends on the method of prompting, or asking questions. Prompt engineering is a technique in which a model’s input or query is designed to produce the optimal output. Prompt sculpting is a term coined here to emphasize the added room for creativity and experimentation through trial and error.

Additionally, different LLMs yield varied outputs for the same prompt, and the outputs can even change when an LLM is repeatedly prompted with the same question. The degree to which a model returns consistent outputs for the same prompt is known as reproducibility. For healthcare applications, where accuracy is paramount, selecting the right model is critical. The model needs to have extensive training on medical data and demonstrate a reliable degree of reproducibility. Choosing the appropriate LLM for a specific task is akin to selecting a candidate for a job. A suggested approach is to use a tailored set of questions to assess its suitability for the task at hand. In this use case, the goal was to identify an LLM reasonably trained on medical data that, with appropriate prompting, demonstrated some degree of reproducibility and accuracy for interpreting arterial blood gas (ABG) results.

a) Understanding LLMs’ Strengths

Large language models are trained on extensive knowledge bases. The more they are specifically trained for a task, the more successful they are likely to be in that domain. Typically, the raw text from which a training corpus is drawn can run into several terabytes. For example, the Common Crawl data used for GPT-3 amounted to an estimated 45 terabytes of compressed text before filtering, which was reduced to roughly 570 gigabytes after cleaning.

The size of a model is often indicated by the parameter count suffixed to its name. For example, Llama 70B (70 billion parameters) is a larger and more complex model than Llama 7B (7 billion parameters). However, running a 70-billion-parameter model requires more computing power. It is crucial to understand a model’s strengths and weaknesses before selection. One can start by interviewing models with relevant questions, such as asking about the causes of respiratory acidosis. One might also present the results of a simple arterial blood gas test for interpretation and ask each LLM to explain its reasoning. This approach helps gauge each model’s capabilities and limitations for specific tasks. After an initial evaluation of models including OpenAI ChatGPT, Anthropic Claude, Meta Llama, Amazon Titan, and Google Bard, Anthropic Claude appeared to be the most effective at interpreting ABG data using an appropriate medical lexicon. That is, its interpretation was closest to the way a clinician would reason.

b) Understanding LLMs’ Weaknesses

Perhaps the biggest problem with LLMs is that they sometimes hallucinate, which means they return answers that are untrue or fabricated. Models generate responses based on learned patterns, which makes it difficult for them to distinguish factual information from fabricated information. As a result, they cannot be relied on to always provide accurate facts.

Another weakness of LLMs is their poor ability to perform mathematical calculations. Even simple operations, like checking whether a number falls within a specific range, can be challenging for them. This is particularly problematic in medical settings, where many important values are given as ranges. If an LLM cannot accurately determine whether a number is within a certain range, it could make errors in interpreting medical data. For example, an LLM might incorrectly conclude that 2.7 is in the range 1.5 to 2.5. A workaround might therefore be required for such situations.
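To make the range problem concrete, here is a minimal Python sketch of the kind of deterministic check that code can perform reliably. The function name `in_range` is hypothetical, and treating both bounds as inclusive is an assumption made for illustration.

```python
def in_range(value: float, low: float, high: float) -> bool:
    """Return True when value lies within [low, high], bounds inclusive (an assumption)."""
    return low <= value <= high


# The example from the text: 2.7 is outside the range 1.5 to 2.5.
print(in_range(2.7, 1.5, 2.5))  # False
print(in_range(2.0, 1.5, 2.5))  # True
```

Handing a comparison like this to code, instead of asking the model to do it, removes one source of error from the interpretation.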

Solving the Challenge of Clinicians Needing to Interpret ABGs

Interpreting ABG tests is an important part of the job for clinicians, especially intensivists, hospitalists, nephrologists, anesthesiologists, and emergency physicians. This test is used daily or weekly to help diagnose medical conditions, assess the efficacy of treatments, and determine the next course of action or treatment. ABGs help identify complex conditions such as mixed acid-base disturbances, which can be challenging to diagnose, for example in cases of poisoning. Complex ABGs can take 5 to 10 minutes or longer to interpret, depending on a clinician’s skillset and experience.

In summary, this project explores how LLMs like Claude v2 can assist in interpreting ABGs and diagnosing acid-base disorders. Such a tool could also serve as a valuable teaching aid for clinician trainees. This new technology has the potential to offer a quick and accurate way to analyze these tests, and the project also provides insights for others building LLM-based decision support models for clinical use cases.

Using Amazon Bedrock to Assist in ABG Interpretation

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models accessible via a single application programming interface (API). This enables users to easily build and scale generative AI-based applications and allows for quick prototyping for a wide range of use cases.
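As a rough illustration of what calling a model through that single API can look like, the sketch below sends one question to Claude v2 using the AWS SDK for Python (boto3). The helper names, the region, the token limit, and the temperature of 0 (chosen here to favor repeatable answers) are all assumptions, not details taken from this project. Running `ask_claude` requires AWS credentials and access to the model in the account.

```python
import json

MODEL_ID = "anthropic.claude-v2"  # Claude v2 identifier on Amazon Bedrock


def build_claude_request(question: str, max_tokens: int = 500) -> str:
    """Build the JSON body for Claude v2's text-completion format."""
    body = {
        # Claude v2 expects the Human/Assistant turn markers in the prompt.
        "prompt": f"\n\nHuman: {question}\n\nAssistant:",
        "max_tokens_to_sample": max_tokens,
        "temperature": 0.0,  # assumption: low temperature for repeatability
    }
    return json.dumps(body)


def ask_claude(question: str, region: str = "us-east-1") -> str:
    """Send the question to Bedrock and return the completion text."""
    import boto3  # imported here so the request builder works without the SDK

    client = boto3.client("bedrock-runtime", region_name=region)
    response = client.invoke_model(
        modelId=MODEL_ID,
        body=build_claude_request(question),
        contentType="application/json",
        accept="application/json",
    )
    return json.loads(response["body"].read())["completion"]
```

A call such as `ask_claude("What are common causes of respiratory acidosis?")` would return the model's answer as a string.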

Another key advantage of Amazon Bedrock is its integration with Amazon SageMaker, which allows for an effortless setup of compute instances. This feature, along with its user-friendly interface, makes Amazon Bedrock an excellent choice for managing and deploying LLM projects.

Initial Input and Initial Output

For this ABG use case, a detailed collection of ABG data from various clinical cases was first created. Each case included a primary and a secondary abnormality, covering a wide range of medical scenarios.

Next, ABGs were input into an LLM, specifically Claude v2, to see how the LLM interprets ABG values. With Claude v2, the initial results showed less than 50% accuracy in analyzing ABGs.

There are various methods that could potentially enhance an LLM’s performance. The methods described below were employed with an ultimate goal of improving the model’s accuracy in interpreting ABGs.

Improvement with Prompt Sculpting

Several small, more manageable steps were used to guide the LLM. The goal was to direct the LLM and make its responses more reproducible. For instance, when asking an LLM to interpret the partial pressure of carbon dioxide (pCO2), the prompt begins by stating the normal range, which is 35-45 mmHg in this case. Sometimes, it is necessary to further clarify how ranges need to be interpreted. For example, the prompt can state that a value is normal if it falls within the stated normal range: “If pCO2 is between 35-45 mmHg, the pCO2 is ‘Normal.’”

Sequencing the steps numerically (1, 2, 3, etc.) helps an LLM understand the order of operations, similar to a flowchart. Finally, including an example of an interpreted ABG in the prompt can be useful for an LLM. Providing such examples is known as one-shot or few-shot prompting, depending on the number of examples given.
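The numbered steps and the worked example described above can be assembled into a prompt programmatically. The following is a minimal sketch, not the prompt used in this project: only the first rule comes from the text, while the "Low" and "High" rules, the example pair, and the function name `build_prompt` are hypothetical additions for illustration.

```python
STEPS = [
    "If pCO2 is between 35-45 mmHg, the pCO2 is 'Normal.'",  # from the text
    "If pCO2 is below 35 mmHg, the pCO2 is 'Low.'",          # assumed rule
    "If pCO2 is above 45 mmHg, the pCO2 is 'High.'",         # assumed rule
]

# A single worked example makes this a one-shot prompt (hypothetical content).
EXAMPLE = ("pCO2: 50 mmHg", "The pCO2 is 'High.'")


def build_prompt(steps, example, query):
    """Number the steps, append one worked example, then the new query."""
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    example_input, example_output = example
    return (
        "Follow these steps in order:\n"
        f"{numbered}\n\n"
        "Example\n"
        f"Input: {example_input}\n"
        f"Output: {example_output}\n\n"
        f"Input: {query}\n"
        "Output:"
    )


print(build_prompt(STEPS, EXAMPLE, "pCO2: 40 mmHg"))
```

Adding more example pairs to the prompt would turn the same structure into a few-shot prompt.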

Improvement with Retrieval-Augmented Generation

Retrieval-augmented generation (RAG) is a technique that yields more relevant responses to queries. When prompted, the RAG process searches for relevant information within a user-established database. The technique then provides this chunked information to the large language model as context to assist in answering a question. Such an approach ensures that the LLM does not rely solely on its pre-existing knowledge. Responses are hence more pertinent when the context information is sourced from the user’s database. This is especially applicable when using general purpose LLMs for a specific use case like a medical query.

Imagine posing a question on respiratory acidosis as noted in an arterial blood gas test. If there is an established database with PDF documents on ABG interpretation, the RAG system would first retrieve information related to the query from this database. Here, it will gather information specifically pertaining to respiratory acidosis. The system then uses this information to generate a well-informed answer. This method enhances the reliability and depth of knowledge in the LLM’s answers, especially in relation to specialized information.
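The retrieve-then-generate flow can be sketched in a few lines of Python. This is a deliberately simplified illustration: it ranks chunks by shared words, whereas production RAG systems typically use embeddings and a vector store. The chunk texts are placeholders standing in for passages extracted from a user's documents, and the function names are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, chunks, top_k=2):
    """Rank chunks by the number of words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(chunks, key=lambda c: len(query_words & tokenize(c)), reverse=True)
    return ranked[:top_k]


def build_rag_prompt(query, chunks, top_k=2):
    """Place the retrieved chunks ahead of the question as context."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, top_k))
    return f"Use the context below to answer.\n\nContext:\n{context}\n\nQuestion: {query}"


# Placeholder chunks (not from the project's database).
CHUNKS = [
    "Respiratory acidosis is characterized by a low pH and an elevated pCO2.",
    "Metabolic acidosis is characterized by a low pH and a low HCO3.",
    "Arterial samples are usually drawn from the radial artery.",
]

print(build_rag_prompt("What is respiratory acidosis?", CHUNKS, top_k=1))
```

The resulting prompt carries the retrieved passage to the LLM, so the answer can draw on the database content as well as the model's pre-existing knowledge.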

Improvement with a Math Scratchpad

To work around the difficulty with mathematical calculations, some LLM-based tools use a creative approach: the LLM writes Python code, which is then executed to perform the calculations. The results of executing this code are used to provide more accurate answers. This method effectively bypasses the LLM’s limitations in direct mathematical computations.

Similarly, a math scratchpad tool can be implemented, for example in Python, to reliably carry out the required domain-specific calculations. In this setup, a user first prompts the LLM to organize the input text data into a format that Python can easily read, and the structured data is written to a text file. For instance, JSON-formatted data is easily readable for computational purposes. JSON pairs keys with values, for example, {"pH": "7.2", "pCO2": "36", "HCO3": "17"}. It is crucial to make the data manageable for the subsequent computational steps. Alternatively, user-entered input data can be captured from the start in a format that allows easy computation. For example, a separate entry box can be used for each of the pH, pCO2, and HCO3 values.
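The first half of this setup, turning the LLM's JSON output into numbers and saving them, might look like the sketch below. The function names, the required-key check, and the scratchpad file name are assumptions for illustration; the sample values are the ones from the JSON example in the text.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = ("pH", "pCO2", "HCO3")


def parse_abg(llm_output: str) -> dict:
    """Parse the LLM's JSON text and convert each value to a float."""
    data = json.loads(llm_output)
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return {key: float(data[key]) for key in REQUIRED_KEYS}


def write_scratchpad(values: dict, path: Path) -> None:
    """Write the numeric values to the scratchpad file as JSON."""
    path.write_text(json.dumps(values))


llm_output = '{"pH": "7.2", "pCO2": "36", "HCO3": "17"}'
values = parse_abg(llm_output)

# Hypothetical scratchpad location in the system's temporary directory.
scratchpad = Path(tempfile.gettempdir()) / "abg_scratchpad.json"
write_scratchpad(values, scratchpad)
print(values)  # {'pH': 7.2, 'pCO2': 36.0, 'HCO3': 17.0}
```

Validating the keys before any calculation catches cases where the LLM omits a value or returns malformed JSON.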

Once the data is prepared and written to the math scratchpad, Python code can read the data and perform the necessary calculations. The results are then used to prepare a new prompt for the LLM. This enables the LLM to generate reliable answers based on the calculations performed by the Python script, thus overcoming its inherent mathematical limitations.
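The second half of the process, reading the scratchpad, computing in Python, and preparing a new prompt, can be sketched as follows. The pCO2 range of 35-45 mmHg is the one stated earlier in this post; the pH and HCO3 ranges are commonly cited reference values assumed here for illustration, and the function names and prompt wording are hypothetical.

```python
import json

# Reference ranges as (low, high). pH and HCO3 ranges are assumptions.
REFERENCE_RANGES = {
    "pH": (7.35, 7.45),
    "pCO2": (35.0, 45.0),   # mmHg
    "HCO3": (22.0, 26.0),   # mEq/L
}


def label(value: float, low: float, high: float) -> str:
    """Classify a value as Low, Normal, or High against its range."""
    if value < low:
        return "Low"
    if value > high:
        return "High"
    return "Normal"


def run_scratchpad(scratchpad_json: str) -> dict:
    """Read the scratchpad data and label each value in code."""
    values = json.loads(scratchpad_json)
    return {
        name: label(float(values[name]), *REFERENCE_RANGES[name])
        for name in REFERENCE_RANGES
    }


def build_followup_prompt(scratchpad_json: str) -> str:
    """Prepare the second prompt, passing the computed labels to the LLM."""
    values = json.loads(scratchpad_json)
    labels = run_scratchpad(scratchpad_json)
    lines = [f"{name} = {values[name]} ({labels[name]})" for name in REFERENCE_RANGES]
    return (
        "These ABG values were already compared with their normal ranges:\n"
        + "\n".join(lines)
        + "\nUsing these labels, interpret the acid-base status."
    )


data = '{"pH": 7.2, "pCO2": 36.0, "HCO3": 17.0}'
print(run_scratchpad(data))  # {'pH': 'Low', 'pCO2': 'Normal', 'HCO3': 'Low'}
print(build_followup_prompt(data))
```

With this split, the LLM no longer has to decide whether each number is in range; it only reasons over labels that were computed deterministically.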

This two-step process takes more time and consumes more tokens (i.e., the units in which the length of text sent to and returned by an LLM is measured). However, as LLMs improve, the supported input length is growing and the cost per token is decreasing over time.

Figure 1 shows a diagram that illustrates the overall process.

Figure 1. Workflow for using an LLM to analyze ABG results

Results

After integrating a retrieval-augmented generation database and refining prompt design, the LLM showed improved accuracy: together, these modifications boosted accuracy from under 50% to over 85% on medical queries. Figure 2 shows an example of the LLM’s successful interpretation of a compensated acid-base disorder when given the corresponding ABG values. The next phase of the project will further optimize prompts and leverage the knowledge database with a goal to exceed 90% accuracy. Testing will determine if additional prompt adjustment can improve overall performance.

Figure 2. Example input to and output from the LLM called “ABG CONSULT BOT”

Conclusion

We found the process of experimenting with large language models for medical applications to be both fun and challenging. The accuracy of this promising technology can potentially be improved by leveraging techniques like retrieval-augmented generation, custom-written Python for mathematical capabilities (e.g., a math scratchpad), and creative prompt engineering (e.g., prompt sculpting).

The key points are: 1) LLMs show promise for being able to assist clinicians but need improved accuracy, 2) leveraging supporting techniques like RAG and prompt engineering can improve accuracy, and 3) exploring relevant medical use cases is likely to uncover additional ways to assist clinicians.
