JAMA Network Open Publishes Criteria for Manuscripts Reporting Clinical Use of AI
Key findings
- Each month, JAMA Network Open receives dozens of manuscripts evaluating clinical applications of large language models (LLMs, also called chatbots); to assist authors, the journal's editors have developed criteria for submission
- The guidance addresses clinical relevance, replication, investigation of alternative prompts and incorrect results, fairness, and confidentiality, and it encourages authors to compare LLM results with an established expert source
- The editors recommend against submitting comparisons of LLM versions, as such reports rarely interest nonspecialists
Large language models (LLMs), better known as artificial intelligence (AI) chatbots, are being implemented at breakneck speed for multiple purposes in medicine, including responding to patients' questions, generating notes from clinical encounters, creating and answering test questions, assisting with diagnoses, and guiding therapeutic interactions.
Each month, JAMA Network Open receives dozens of manuscripts evaluating various applications of LLMs, and its editors recognized the need to develop criteria for submission.
In the October 2, 2023, issue of JAMA Network Open, two editors of the journal, Roy Perlis, MD, MSc, a psychiatrist who directs the Center for Quantitative Health at Massachusetts General Hospital, and Stephan D. Fihn, MD, MPH, of the University of Washington, present the following guidance for authors:
- The model under evaluation must be clinically meaningful. Ideally, researchers should address a range of related experimental conditions relevant to clinical settings, not a single, narrow area.
- Describe experimental conditions with sufficient detail to allow replication. For example, specify the version of the model since it could change. In addition, report tunable parameters (e.g., temperature) of the LLM that affect its output.
- The same prompt may yield quite different responses from run to run, so repeat the same request multiple times, estimate effects or associations, and provide a measure of variability. Consider the extent to which alternate prompts change outputs; such sensitivity analyses help establish the robustness of results (a minimal sketch of this repeated-run protocol appears after this list).
- The journal does not publish comparisons between LLM versions, as these rarely interest nonspecialists. Studies evaluating the most current versions of an LLM are of greatest interest.
- Go beyond reporting basic indices of accuracy to describe characteristics of incorrect or flawed responses. For example, a response to a case presentation that recommends a suboptimal but reasonable treatment would have very different consequences from a response that recommends a contraindicated treatment. Attempt to understand how and when models report wrong results and give examples of such responses.
- Assess bias and fairness, as LLMs trained on the breadth of the internet are susceptible to providing biased responses. Consider whether modifying prompts to reflect individual patient characteristics yields different results (a second sketch after this list illustrates this kind of probe).
- If the manuscript applies LLMs to clinical data, report how confidentiality was maintained (e.g., by hosting LLMs within an institutional firewall). Among the most promising applications of LLMs may be summarization and interpretation of clinical data, but simply uploading such data to a third party is a breach of confidentiality.
- As Alan Turing advised in Mind nearly 75 years ago, whenever possible, compare LLM results with an established expert reference rather than simply describing the model's output. Authors are strongly encouraged to make these reference standards available as part of publication. If a standard is derived from clinicians or other human participants, consider carefully whether review by a human participants committee is required; in many cases, the individuals establishing a reference standard may themselves be research participants, a determination that rests with the institutional review board.
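To make the replication and variability recommendations concrete, here is a minimal Python sketch of the kind of protocol the editors describe: fix and report the model version and temperature, submit the same prompt many times, and summarize agreement with a reference answer. The `query_model` stub, model name, temperature, and vignette below are illustrative assumptions, not part of the editorial; a real study would wire this to the LLM under evaluation.

```python
# Sketch of a repeated-run evaluation protocol (illustrative only).
# The model version and temperature are hypothetical placeholders;
# the point is that both should be fixed and reported for replication.
import random
import statistics

MODEL_VERSION = "example-llm-2023-06-13"  # report the exact version evaluated
TEMPERATURE = 0.7                          # report tunable parameters that affect output


def query_model(prompt: str, temperature: float = TEMPERATURE) -> str:
    """Hypothetical stand-in for an LLM API call.

    Simulates run-to-run variation; replace with a real call to the
    model under evaluation.
    """
    return random.choice(["lisinopril", "amlodipine", "lisinopril"])


def evaluate_prompt(prompt: str, reference: str, n_runs: int = 20) -> dict:
    """Submit the same prompt n_runs times; summarize agreement with a reference."""
    scores = [
        float(query_model(prompt).strip().lower() == reference.strip().lower())
        for _ in range(n_runs)
    ]
    return {
        "model_version": MODEL_VERSION,   # enables replication
        "temperature": TEMPERATURE,
        "n_runs": n_runs,
        "accuracy": statistics.mean(scores),
        "sd": statistics.stdev(scores),   # a simple measure of run-to-run variability
    }


if __name__ == "__main__":
    print(evaluate_prompt(
        "First-line antihypertensive for a 55-year-old with diabetes? One word.",
        reference="lisinopril",
    ))
```

Reporting the exact version matters because hosted models can change silently between a study and any attempt to replicate it.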
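A second sketch, under the same caveats, illustrates the prompt-modification probe for bias: hold a clinical vignette constant while varying only a patient characteristic, then inspect whether responses differ systematically. The template and attributes are invented for illustration and are not a validated fairness instrument.

```python
# Sketch of a fairness probe: vary one patient attribute at a time and
# compare model responses across variants (illustrative template only).
from itertools import product

VIGNETTE = (
    "A {age}-year-old {sex} patient reports two weeks of low mood and "
    "poor sleep. What initial management do you recommend?"
)


def probe_fairness(query_model, ages=(30, 70), sexes=("male", "female")) -> dict:
    """Collect responses across demographic variants of one vignette."""
    responses = {}
    for age, sex in product(ages, sexes):
        responses[(age, sex)] = query_model(VIGNETTE.format(age=age, sex=sex))
    # Systematic differences across variants (e.g., by sex alone) would
    # warrant the kind of bias reporting the editors call for.
    return responses
```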
The authors say they expect these standards to evolve as the strengths and limitations of generative AI become better understood.