Artificial Intelligence Promising As a Decision Support Tool When Treating Bipolar Depression
Key findings
- This study evaluated the ability of two large language models (a form of artificial intelligence) to serve as decision support tools: one augmented with evidence-based guidelines for pharmacologic treatment of bipolar depression and one unaugmented
- Three experts in mood disorders were presented with 50 clinical vignettes and were asked to identify and rank for each vignette the five best next-step treatments and the five worst or contraindicated next-step treatments
- The level of agreement between the augmented model and expert opinion was fair (Cohen's κ, 0.31), and for 51% of vignettes the augmented model selected what the experts considered the optimal next-step medication
- The performance of the augmented model compared favorably with that of the unaugmented model and with that of a sample of 27 community clinicians experienced in treating bipolar disorder
- Randomized trials are needed to determine whether application of the augmented model can improve clinical outcomes without increasing risk
In a preliminary study available as a preprint, Roy H. Perlis, MD, MSc, director of the Center for Quantitative Health at Massachusetts General Hospital, demonstrated that a large language model (LLM, a form of artificial intelligence) could approximate the performance of clinicians in selecting an antidepressant for patients with major depression. However, for 48% of vignettes, the model selected a poor or contraindicated option.
As part of a new proof-of-concept study, Dr. Perlis and colleagues "augmented" the LLM with a prompt incorporating a summary of evidence-based guidelines on the pharmacologic treatment of bipolar depression. This approach, which does not require the model to retrieve information, is a rapidly evolving strategy in AI research and adds flexibility compared with purely algorithmic prescribing.
The research team reports in Neuropsychopharmacology that the augmented model's choices about treatment for bipolar depression agreed only modestly with those of a group of experts. However, on average it performed significantly better than the unaugmented model—and better than a sample of community clinicians.
Model Design
The augmented LLM had a prompt comprising three sections: the context and task, the knowledge to be used in selecting treatment and a clinical vignette. The knowledge section was an excerpt from the U.S. Department of Veterans Affairs 2023 guidelines on bipolar disorder relating to pharmacologic management of depression. The prompt asked the LLM to return a ranked list of the five best next-step interventions.
The researchers also studied an unaugmented model that had a shortened version of the prompt, without the guideline. Both models used GPT-4 Turbo (gpt-4-1106-preview).
Study Methods
The research team used electronic health records at Mass General Brigham to generate 50 vignettes for individuals with bipolar disorder type 1 or 2 who were experiencing a major depressive episode.
Opinions about optimal pharmacologic options for each vignette were collected from three experts, each with more than 20 years of mood disorder practice and experience leading mood disorder clinics. They were presented with all 50 vignettes and were asked to identify and rank for each vignette the five best next-step treatments to consider and the five worst or contraindicated next-step treatments.
Rankings From the Models
In the primary analysis, results from the augmented model were compared with expert opinion:
- The level of agreement was fair (Cohen's κ, 0.31)
- The model identified the optimal treatment for 51% of vignettes
- On average, 3.7 of the model's medications appeared among the experts' top five
- For 12% of vignettes the model selected a medication considered by the experts to be a poor or contraindicated choice
When the analyses were repeated with the unaugmented ("base") model:
- The level of agreement with experts was poor (κ, 0.09)
- The model identified the optimal treatment for only 23% of vignettes (P<0.001 vs. the augmented model)
- On average, 2.8 of the model's medications appeared among the experts' top five
- For 11% of vignettes the model selected a poor or contraindicated choice
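For readers unfamiliar with the agreement statistic reported above, Cohen's κ corrects the observed rate of agreement between two raters for the agreement expected by chance alone, so κ = 0 means chance-level agreement and κ = 1 means perfect agreement. The following is a minimal sketch of the calculation; the medication labels are hypothetical illustrations, not data from the study:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items on which the raters match
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: computed from each rater's marginal label frequencies
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical top-choice selections for 10 vignettes (illustrative only)
model  = ["quetiapine", "lurasidone", "lithium", "quetiapine", "cariprazine",
          "lurasidone", "lithium", "quetiapine", "lamotrigine", "lurasidone"]
expert = ["quetiapine", "lurasidone", "lamotrigine", "quetiapine", "lurasidone",
          "lurasidone", "lithium", "cariprazine", "lamotrigine", "quetiapine"]
print(round(cohens_kappa(model, expert), 2))  # prints 0.48
```

On this scale, the augmented model's κ of 0.31 against expert opinion is conventionally read as "fair" agreement, while the base model's 0.09 and the community clinicians' 0.07 fall in the "poor" (near-chance) range.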
Rankings From Community Clinicians
For further comparison, 27 community-based clinicians (10 psychiatrists, 12 psychiatric nurse practitioners, 2 non-psychiatric nurse practitioners, 2 physician assistants and 1 primary care physician) who regularly treat bipolar disorder received the same instructions as the experts and evaluated 20 vignettes drawn from the 50.
Their ability to match expert selections was worse than that of the augmented model:
- Their level of agreement with experts was poor (κ, 0.07)
- They identified the optimal treatment for only 23% of vignettes
- On average, 2.2 of their medication selections appeared among the experts' top five
- For 22% of vignettes they selected a poor or contraindicated choice
The results were similar when only vignettes scored by psychiatrists were analyzed.
Next Steps
The poor agreement between community clinicians and experts underscores the lack of consensus among prescribers about treatment of bipolar depression. Randomized trials are needed to determine whether the augmented model can improve clinical outcomes without increasing risk.
More broadly, the results suggest the potential utility of using LLMs to provide a guideline-based standard of care in other clinical settings. Furthermore, to approximate clinical practice more closely an LLM can readily incorporate additional information, such as statements of clinician preferences, standards of care in a specific health system or a list of adverse effects acceptable to an individual patient.
View original journal article (subscription may be required)
Learn more about the Center for Quantitative Health
Learn more about the Department of Psychiatry