Artificial Intelligence Promising As a Decision Support Tool When Treating Bipolar Depression

Key findings

  • This study evaluated the ability of two large language models (a form of artificial intelligence) to serve as decision support tools: one augmented with evidence-based guidelines for pharmacologic treatment of bipolar depression and one unaugmented
  • Three experts in mood disorders were presented with 50 clinical vignettes and were asked to identify and rank for each vignette the five best next-step treatments and the five worst or contraindicated next-step treatments
  • The level of agreement between the augmented model and expert opinion was fair (Cohen's κ, 0.31), and for 51% of vignettes the augmented model selected what the experts considered the optimal next-step medication
  • The performance of the augmented model compared favorably with that of the unaugmented model and with that of a sample of 27 community clinicians experienced in treating bipolar disorder
  • Randomized trials are needed to determine whether application of the augmented model can improve clinical outcomes without increasing risk

In a preliminary study available as a preprint, Roy H. Perlis, MD, MSc, director of the Center for Quantitative Health at Massachusetts General Hospital, demonstrated that a large language model (LLM, a form of artificial intelligence) could approximate the performance of clinicians in selecting an antidepressant for patients with major depression. However, for 48% of vignettes, the model selected a poor or contraindicated option.

As part of a new proof-of-concept study, Dr. Perlis and colleagues "augmented" the LLM with a prompt incorporating a summary of evidence-based guidelines on the pharmacologic treatment of bipolar depression. This approach, which does not require the model to retrieve information, is a rapidly evolving strategy in AI research and adds flexibility compared with purely algorithmic prescribing.

The research team reports in Neuropsychopharmacology that the augmented model's choices about treatment for bipolar depression agreed only modestly with those of a group of experts. However, on average it performed significantly better than both the unaugmented model and a sample of community clinicians.

Model Design

The augmented LLM had a prompt comprising three sections: the context and task; the knowledge to be used in selecting treatment; and a clinical vignette. The knowledge section was an excerpt from the U.S. Department of Veterans Affairs 2023 guidelines on bipolar disorder relating to pharmacologic management of depression. The prompt asked the LLM to return a ranked list of the five best next-step interventions.

The researchers also studied an unaugmented model that had a shortened version of the prompt, without the guideline. Both models used GPT-4 Turbo (gpt-4-1106-preview).

Study Methods

The research team used electronic health records at Mass General Brigham to generate 50 vignettes for individuals with bipolar disorder type 1 or 2 who were experiencing a major depressive episode.

Opinions about optimal pharmacologic options for each vignette were collected from three experts, each with more than 20 years of mood disorder practice and experience leading mood disorder clinics. They were presented with all 50 vignettes and were asked to identify and rank for each vignette the five best next-step treatments to consider and the five worst or contraindicated next-step treatments.

Rankings From the Models

In the primary analysis, results from the augmented model were compared with expert opinion:

  • The level of agreement was fair (Cohen's κ, 0.31)
  • The model identified the optimal treatment for 51% of vignettes
  • On average, 3.7 of the model's medications appeared among the experts' top five
  • For 12% of vignettes the model selected a medication considered by the experts to be a poor or contraindicated choice
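The κ values reported throughout are Cohen's kappa, a chance-corrected measure of agreement: κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each rater's label frequencies. A minimal sketch of the statistic for two raters follows; this is illustrative only, not the study's analysis code, which also had to handle ranked five-item lists.

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa: chance-corrected agreement between two raters.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    proportion of agreement and p_e is the agreement expected by
    chance from each rater's marginal label frequencies.
    (Undefined when p_e == 1, i.e., both raters always give one label.)
    """
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from the product of marginal frequencies
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

By convention, values around 0.21 to 0.40 (such as the 0.31 above) are described as "fair" agreement, and values at or below about 0.20 (such as the 0.09 and 0.07 reported below) as "poor" or "slight."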

When the analyses were repeated with the unaugmented ("base") model:

  • The level of agreement with experts was poor (κ, 0.09)
  • The model identified the optimal treatment for only 23% of vignettes (P<0.001 vs. the augmented model)
  • On average, 2.8 of the model's medications appeared among the experts' top five
  • For 11% of vignettes the model selected a poor or contraindicated choice

Rankings From Community Clinicians

For further comparison, 27 community-based clinicians (10 psychiatrists, 12 psychiatric nurse practitioners, 2 non-psychiatric nurse practitioners, 2 physician assistants and 1 primary care physician) who regularly treat bipolar disorder received the same instructions as the experts and evaluated 20 vignettes drawn from the 50.

Their ability to match expert selections was worse than that of the augmented model:

  • Their level of agreement with experts was poor (κ, 0.07)
  • They identified the optimal treatment for only 23% of vignettes
  • On average, 2.2 of their medication selections appeared among the experts' top five
  • For 22% of vignettes they selected a poor or contraindicated choice

The results were similar when only vignettes scored by psychiatrists were analyzed.

Next Steps

The poor agreement between community clinicians and experts emphasizes the lack of consensus among prescribers about treatment of bipolar depression. Randomized trials are needed to determine whether the augmented model can improve clinical outcomes without increasing risk.

More broadly, the results suggest the potential utility of using LLMs to provide a guideline-based standard of care in other clinical settings. Furthermore, to approximate clinical practice more closely an LLM can readily incorporate additional information, such as statements of clinician preferences, standards of care in a specific health system or a list of adverse effects acceptable to an individual patient.

Learn more about the Center for Quantitative Health

Learn more about the Department of Psychiatry

Related

Maurizio Fava, MD, and colleagues created an artificial intelligence–assisted method for predicting each individual's propensity to respond to placebo in randomized, controlled trials of interventions for major depressive disorder, which can make treatment arms more comparable and boost the efficacy signal.

Related

Aiming to publish research on large language models (LLMs) as rapidly as the technology is evolving—while insisting on customary standards such as validity, fairness, and transparency—the editors of JAMA Network Open have developed criteria to guide authors submitting LLM-related reports.