
ChatGPT Performs Well in Answering Common Patient Questions About Colonoscopy

Key findings

  • ChatGPT is a large language model (LLM) artificial intelligence (AI), based on natural language processing technology, that provides a conversational written response to a question
  • This study examined the quality of ChatGPT-generated answers to common questions about colonoscopy, comparing them with answers on the websites of three top-tier hospitals
  • Despite little overlap in text between AI and non-AI answers, gastroenterologists rated AI answers as superior or comparable to hospital websites in ease of understanding, scientific adequacy and overall satisfaction
  • Conversational AI programs have a potential role in optimizing communication between patients and health care providers, especially for high-volume procedures such as colonoscopy

ChatGPT (OpenAI, San Francisco, CA), a large language model (LLM) artificial intelligence (AI), was released in November 2022. ChatGPT is based on natural language processing technology and provides a conversational written response to a question.

Intriguingly, Massachusetts General Hospital researchers have shown that ChatGPT can generate credible medical information in response to common patient questions about colonoscopy. Tsung-Chun Lee, MD, PhD, of the Division of Gastroenterology and Hepatology at Taipei Medical University in Taiwan, Braden Kuo, MD, director of the Mass General Center for Neurointestinal Health, and colleagues report the findings in Gastroenterology.

Methods

The team compared ChatGPT with the websites of three hospitals randomly selected from U.S. News & World Report's top 20 list for gastroenterology and GI surgery. They retrieved eight common questions about colonoscopy from the websites and used them as prompts for ChatGPT (January 30, 2023 version):

  • Q1: What is a colonoscopy?
  • Q2: Why is a colonoscopy performed?
  • Q3: How to prepare for a colonoscopy?
  • Q4: What to expect during the colonoscopy procedure?
  • Q5: What to expect after the colonoscopy procedure?
  • Q6: What to do after a negative colonoscopy result?
  • Q7: What to do after a positive colonoscopy result?
  • Q8: What to expect about complications?

To evaluate the consistency of answers, the researchers entered each prompt twice on the same day and recorded the answers as AI1 and AI2.

Text Similarity

Using CopyLeaks (ordinarily used as plagiarism detection software), the team compared the overlap between answers:

  • The similarity of ChatGPT responses to answers on a hospital website was 0% except in two cases, where it was still extremely low (3% and 16%)
  • Similarity between AI1 and AI2 ranged from 28% to 77%, except for answers to Q7 (0%)
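As a rough illustration of what a percent-overlap figure between two answers means, the sketch below compares two texts word by word. It is not the CopyLeaks method, which the article does not describe; it swaps in Python's standard-library `difflib.SequenceMatcher`. The function names `tokenize` and `overlap_percent` are hypothetical, and the sample sentences are invented for the example.

```python
import re
from difflib import SequenceMatcher


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def overlap_percent(answer_a: str, answer_b: str) -> float:
    """Word-level similarity of two answers, as a percentage.

    Assumption: this stands in for a plagiarism checker's overlap score.
    SequenceMatcher.ratio() is 2 * M / T, where M is the number of matched
    tokens and T is the total number of tokens in both answers.
    """
    a, b = tokenize(answer_a), tokenize(answer_b)
    if not a or not b:
        return 0.0  # nothing to compare
    return 100.0 * SequenceMatcher(None, a, b, autojunk=False).ratio()


if __name__ == "__main__":
    ai1 = "A colonoscopy examines the colon."
    ai2 = "A colonoscopy examines the large intestine."
    print(f"AI1 vs AI2 overlap: {overlap_percent(ai1, ai2):.0f}%")
```

A score near 0% means the two answers share almost no wording, which is how the comparison between ChatGPT and the hospital websites came out. Two answers can still convey the same medical content with entirely different wording.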

Gastroenterologist Ratings

Four gastroenterologists (two senior physicians, two fellows) rated the quality of all answers, blinded to their sources, on seven-point Likert scales:

  • Ease of understanding: For all eight questions, the mean ChatGPT score was statistically similar to the mean score for the hospital websites considered together, although it was numerically higher for AI than non-AI on every question
  • Scientific adequacy: AI and non-AI answers received similar ratings except in response to Q8, where the AI answer was rated 6.5 vs. 5.4 for non-AI
  • Satisfaction with the answer: AI and non-AI answers received similar ratings except in response to Q8, where the AI answer was rated 6.3 vs. 4.8 for non-AI

Recognizing AI

Overall, the raters were only 48% accurate in identifying AI-generated answers, although one of the fellows was 81% accurate.

That physician explained, "ChatGPT answers tended to be lengthy, used many colons (':') in the long list of possibilities it gave, and tended to be more of a list rather than a narrative paragraph in response." Answers from hospital webpages were "more like verbal responses to a patient as opposed to something more encyclopedic."

Readability

The reading level of all answers was evaluated with the Flesch-Kincaid Grade Level and the Gunning Fog Index. AI-generated answers were written at grade 13 or grade 16 level, depending on the index used. The hospital websites, considered together, were written at grade 9 or grade 11 level, a statistically significant difference.

However, even the websites were written at a significantly higher level than the eighth-grade reading level often recommended for patient information.
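Both readability indices named above are simple formulas over sentence, word and syllable counts, which is why the two can assign different grade levels to the same text. The sketch below applies the standard published formulas. The syllable counter is a crude vowel-group heuristic, so its scores will differ somewhat from those of dedicated readability tools, and it is not the tool the researchers used. All function names and the sample texts are hypothetical.

```python
import re


def count_syllables(word: str) -> int:
    """Estimate syllables by counting vowel groups (a rough heuristic)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing "e" as silent ("safe"), but keep "-le" endings ("table").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def text_stats(text: str) -> tuple[int, int, int, int]:
    """Return (sentences, words, syllables, complex words) for the text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllable_counts = [count_syllables(w) for w in words]
    # Gunning Fog counts words of three or more syllables as "complex".
    complex_words = sum(1 for n in syllable_counts if n >= 3)
    return len(sentences), len(words), sum(syllable_counts), complex_words


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences, words, syllables, _ = text_stats(text)
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def gunning_fog(text: str) -> float:
    """Gunning Fog Index: 0.4 * ((words/sentences) + 100 * (complex words/words))."""
    sentences, words, _, complex_words = text_stats(text)
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))


if __name__ == "__main__":
    plain = "The doctor looks at the colon. You will sleep. It is safe."
    dense = ("Colonoscopy facilitates comprehensive visualization of "
             "gastrointestinal abnormalities, including precancerous "
             "adenomatous polyps requiring immediate endoscopic intervention.")
    for label, sample in (("plain", plain), ("dense", dense)):
        print(label, round(flesch_kincaid_grade(sample), 1), round(gunning_fog(sample), 1))
```

Both formulas reward short sentences and short words, so answers built from long enumerated lists of polysyllabic medical terms, as the raters described the ChatGPT answers, will tend to score at a higher grade level.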

Conclusions

ChatGPT-generated medical information is built not from clinical evidence but from diverse internet material, refined through reinforcement learning from human feedback. Other potential pitfalls: LLM outputs may be sensitive to subtle changes in prompts, and their performance is probably not consistent over time.

Still, ChatGPT and other LLMs, such as BioGPT and Bard, may be a transformative innovation. Patients are contacting providers through electronic patient portals at an exponentially higher rate, a heavy burden for clinicians and staff. With appropriate physician oversight and periodic surveillance, AI-generated medical information could free providers for more cognitively intensive patient communications.
