
AI for Medical Diagnosis: How Large Language Models Could Improve Clinical Reasoning

In This Article

  • Clinical reasoning can be a challenging process, where asking the right questions and performing the correct tests may determine how quickly and accurately a patient is diagnosed
  • Mass General Brigham researchers are investigating how artificial intelligence (AI) could improve the diagnostic process
  • Their research indicates that large language models (LLMs), a class of conversational AI tools, currently come closest to modeling human clinical reasoning
  • They suggest LLMs could aid the efficiency and accuracy of medical diagnosis by establishing pretest probability, "double-checking" a clinician's reasoning, providing alternative diagnoses, and more

Medical diagnosis is one of the more challenging aspects of clinical practice. According to the National Academies of Sciences, Engineering, and Medicine, diagnostic errors represent a major public health problem that most people will experience at least once in their lifetime, some with devastating consequences.

Although much has changed in medicine over the centuries, the core basis of clinical reasoning remains the same: a patient presents with a concern, and the clinician, by gathering more information and utilizing their existing knowledge, attempts to determine what is causing the problem.

"If you imagine the patient is an iceberg, you start from the tip and work your way down below the surface to find the needle that's causing trouble for the patient," says Raja-Elie Abdulnour, MD, a physician-researcher in the Division of Pulmonary and Critical Care Medicine at Brigham and Women's Hospital. "You can see why it's very, very difficult."

Large language models (LLMs) like GPT-4, which powers ChatGPT, are conversational artificial intelligence (AI) tools particularly poised to aid clinical reasoning. Dr. Abdulnour and Daniel Restrepo, MD, a hospitalist at Massachusetts General Hospital, are studying the potential of LLMs to improve clinical diagnostic reasoning.

To Err Is Human: Current Gaps in Clinical Reasoning

Gaps in the clinical reasoning process could cause delayed or incorrect diagnoses for patients.

Dr. Abdulnour and Dr. Restrepo emphasize that there is room for error at every step of the diagnostic process, but the clinician's interaction with and understanding of the patient is an especially critical one.

"There's a lot of room for error when the patient sees a clinician," Dr. Abdulnour says. "They may have difficulty communicating complaints due to a language barrier. The clinician may be tired, have their own biases, or isn't asking the right questions."

How medical students are taught to diagnose contributes another gap to the clinician-patient interaction. Dr. Restrepo notes that students finish their preclinical years with an immense amount of information but little sense of how to apply it.

"We're not taught to reason as much as we're taught to memorize," says Dr. Restrepo. "Education rewards instant associations rather than the mental gymnastics it takes to identify atypical presentations of common diseases, which are more likely than perfect presentations of uncommon disease."

This reliance on teaching knowledge rather than process invites clinicians to lean on mental shortcuts, potentially overlooking important aspects of the evidence. A further challenge is that the volume of medical knowledge is now estimated to double every 73 days, an overwhelming amount of information for clinicians to absorb and put into practice.

Researching How AI Could Improve Medical Diagnosis

Dr. Abdulnour and Dr. Restrepo say that GPT-4 and other LLMs are currently the closest models to human clinical reasoning and have the potential to improve the diagnostic process.

"A key barrier to excellence in human clinical reasoning is the vast amount of medical knowledge one must learn and maintain, and the need to recall a specific portion of that knowledge when seeing a patient," says Dr. Abdulnour. "LLMs model the human ability to memorize and process information so well that they may be the tool that helps us break that barrier to improve diagnosis and patient care."

One potential benefit, the researchers say, is helping clinicians consider additional options by offering a second opinion. If an initial hypothesis doesn't pan out, the clinician could ask an LLM to double-check their reasoning and see what they might have missed; it may suggest a diagnosis the clinician would not typically consider.

Another element of support AI could provide involves pretest probability—determining how likely a specific disease may be before testing for it.

"AI mines our local information, so it can give us a granular sense of the prevalence of disease in a specific setting," Dr. Restrepo says. "I could ask, 'At the Mass General Emergency Department, what were the diagnoses of the last 600 patients who presented with fever?' With that information, I'd have a much better sense of a diagnostic starting point."

Through their research, Dr. Restrepo and Dr. Abdulnour are focused on two objectives. The first is learning the best way to interact with an LLM so that it provides the safest, most accurate answers. Developing appropriate prompts is particularly important, as they posit that an LLM's outputs are only as good as the input it receives.
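To make the idea of prompt design concrete, the following minimal sketch shows one way a "double-check" request might be phrased programmatically using the OpenAI Python SDK. The model name, system prompt, and case details are assumptions for illustration only; they are not the prompts the researchers are evaluating.

```python
# Minimal, illustrative sketch of asking an LLM to double-check a differential
# diagnosis. The prompt wording, model name, and case details are hypothetical;
# real clinical use would require validation, de-identified data, and
# institutional approval.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

case_summary = (
    "62-year-old with fever, new hypoxemia, and diffuse bilateral "
    "infiltrates; working diagnosis is community-acquired pneumonia."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "system",
            "content": (
                "You are assisting a physician with diagnostic reasoning. "
                "List alternative diagnoses the working diagnosis could be "
                "missing, with the key findings that would support or "
                "refute each. Do not give treatment advice."
            ),
        },
        {"role": "user", "content": case_summary},
    ],
)

print(response.choices[0].message.content)
```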

The second objective is understanding response quality and potential bias so that clinicians can properly interpret an LLM's responses. Current models have real limitations, largely because of the data most are trained on: the entirety of publicly available written text.

"An LLM can 'hallucinate' false information, possibly because it learned incorrect data," Dr. Restrepo notes. "But also, because it's taught to make something sound like a human said it. It will sound really convincing, but it may have misread a paper or invented a citation. So there are still real concerns as we think about rolling it out for clinical care."

However, he emphasizes that the technology is rapidly evolving; updated versions of GPT-4, for example, have been shown to hallucinate less than previous models. Software companies are placing more emphasis on how the LLM is trained by keeping the data more focused, and models specific to healthcare are in development.

The Path Forward: Establishing an AI Code of Conduct

Placing guardrails around the use of AI and LLMs, particularly in clinical settings, will help ensure the technology is used in a safe and effective manner across the board. Perhaps most notably, the National Academy of Medicine is spearheading a Health Care Artificial Intelligence Code of Conduct effort with multiple institutions, including some software companies.

"We need to demonstrate that it's safe, effective, and efficient before implementing into clinical care," says Dr. Abdulnour.

Their research remains dynamic and multi-institutional as the technology continues to evolve. But with all the potential AI holds in the clinical space, Dr. Restrepo and Dr. Abdulnour stress that it will never replace clinicians but only serve to augment them.

"Data collection is still very much within our human hands. We still need to be empathetic about how we teach and how we practice," says Dr. Restrepo. "But it is impressive, it is coming, and if we are proactive about it, AI will be incredibly helpful to us."

Learn more about research at Mass General Brigham

Explore artificial intelligence research at Mass General
