
Assessing the Accuracy and Reliability of Large Language Models in Psychiatry Using Standardized Multiple-Choice Questions: Cross-Sectional Study.

Publication: Journal Article
Hanss, K; Sarma, KV; Glowinski, AL; Krystal, A; Saunders, R; Halls, A; Gorrell, S; Reilly, E
Published in: J Med Internet Res
May 20, 2025

BACKGROUND: Large language models (LLMs), such as OpenAI's GPT-3.5, GPT-4, and GPT-4o, have garnered early and significant enthusiasm for their potential applications within mental health, ranging from documentation support to chatbot therapy. Understanding the accuracy and reliability of the psychiatric "knowledge" stored within the parameters of these models and developing measures of confidence in their responses (ie, the likelihood that an LLM response is accurate) are crucial for the safe and effective integration of these tools into mental health settings.

OBJECTIVE: This study aimed to assess the accuracy, reliability, and predictors of accuracy of GPT-3.5 (175 billion parameters), GPT-4 (approximately 1.8 trillion parameters), and GPT-4o (an optimized version of GPT-4 with an undisclosed parameter count) on standardized psychiatry multiple-choice questions (MCQs).

METHODS: A cross-sectional study was conducted in which 3 commonly available, commercial LLMs (GPT-3.5, GPT-4, and GPT-4o) were tested for their ability to answer single-answer MCQs (N=150) extracted from the Psychiatry Test Preparation and Review Manual. Each model generated answers to every MCQ 10 times. We evaluated the accuracy and reliability of the answers and sought predictors of answer accuracy. Our primary outcome was the proportion of questions answered correctly by each LLM (accuracy). Secondary measures were (1) response consistency to MCQs across the 10 trials (reliability), (2) the correlation between MCQ answer accuracy and response consistency, and (3) the correlation between MCQ answer accuracy and model self-reported confidence.

RESULTS: On the first attempt, GPT-3.5 answered 58.0% (87/150) of MCQs correctly, while GPT-4 and GPT-4o answered 84.0% (126/150) and 87.3% (131/150) correctly, respectively. GPT-4 and GPT-4o showed no difference in performance (P=.51), but both significantly outperformed GPT-3.5 (P<.001). GPT-3.5 exhibited less response consistency on average than the other models (P<.001). MCQ response consistency was positively correlated with MCQ accuracy across all models (r=0.340, 0.682, and 0.590 for GPT-3.5, GPT-4, and GPT-4o, respectively; all P<.001), whereas model self-reported confidence showed no correlation with accuracy, except for GPT-3.5, where self-reported confidence was weakly inversely correlated with accuracy (P<.001).

CONCLUSIONS: To our knowledge, this is the first comprehensive evaluation of the general psychiatric knowledge encoded in commercially available LLMs and the first study to assess their reliability and identify predictors of response accuracy within medical domains. The findings suggest that GPT-4 and GPT-4o encode accurate and reliable general psychiatric knowledge and that methods such as repeated prompting may provide a measure of LLM response confidence. This work supports the potential of LLMs in mental health settings and motivates further research to assess their performance in more open-ended clinical contexts.
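The repeated-prompting protocol described in the abstract (posing each MCQ to a model 10 times, scoring first-attempt accuracy, and treating agreement across trials as a consistency signal) can be sketched roughly as follows. This is a minimal illustration based only on the abstract, not the authors' code: the ask_model helper, the mcqs data structure, and the scoring conventions are assumptions.

```python
from collections import Counter

N_TRIALS = 10  # each MCQ was posed to each model 10 times in the study


def ask_model(model_name: str, stem: str, options: list[str]) -> str:
    """Placeholder for one LLM API call returning a single answer letter, e.g. 'B'."""
    raise NotImplementedError("Swap in a real chat-completion call here.")


def evaluate(model_name: str, mcqs: list[dict]) -> dict:
    """Score first-attempt accuracy and per-question response consistency."""
    first_attempt_correct = 0
    consistencies = []
    for q in mcqs:  # each item assumed to look like {"stem": ..., "options": [...], "answer": "C"}
        answers = [ask_model(model_name, q["stem"], q["options"]) for _ in range(N_TRIALS)]
        if answers[0] == q["answer"]:
            first_attempt_correct += 1
        # Consistency: fraction of the 10 trials that agree with the modal answer.
        modal_count = Counter(answers).most_common(1)[0][1]
        consistencies.append(modal_count / N_TRIALS)
    return {
        "accuracy": first_attempt_correct / len(mcqs),
        "mean_consistency": sum(consistencies) / len(consistencies),
    }
```

From there, correlating per-question consistency with correctness (for example, with a point-biserial correlation) would mirror the paper's secondary analysis; the exact statistical procedure used by the authors is not specified in the abstract.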


Published In

J Med Internet Res

DOI

10.2196/69910

EISSN

1438-8871

Publication Date

May 20, 2025

Volume

27

Start / End Page

e69910

Location

Canada

Related Subject Headings

  • Surveys and Questionnaires
  • Reproducibility of Results
  • Psychiatry
  • Medical Informatics
  • Large Language Models
  • Humans
  • Cross-Sectional Studies
  • 4203 Health services and systems
  • 17 Psychology and Cognitive Sciences
  • 11 Medical and Health Sciences
 

Citation

APA: Hanss, K., Sarma, K. V., Glowinski, A. L., Krystal, A., Saunders, R., Halls, A., … Reilly, E. (2025). Assessing the Accuracy and Reliability of Large Language Models in Psychiatry Using Standardized Multiple-Choice Questions: Cross-Sectional Study. J Med Internet Res, 27, e69910. https://doi.org/10.2196/69910

Chicago: Hanss, Kaitlin, Karthik V. Sarma, Anne L. Glowinski, Andrew Krystal, Ramotse Saunders, Andrew Halls, Sasha Gorrell, and Erin Reilly. “Assessing the Accuracy and Reliability of Large Language Models in Psychiatry Using Standardized Multiple-Choice Questions: Cross-Sectional Study.” J Med Internet Res 27 (May 20, 2025): e69910. https://doi.org/10.2196/69910.

ICMJE: Hanss K, Sarma KV, Glowinski AL, Krystal A, Saunders R, Halls A, et al. Assessing the Accuracy and Reliability of Large Language Models in Psychiatry Using Standardized Multiple-Choice Questions: Cross-Sectional Study. J Med Internet Res. 2025 May 20;27:e69910.

MLA: Hanss, Kaitlin, et al. “Assessing the Accuracy and Reliability of Large Language Models in Psychiatry Using Standardized Multiple-Choice Questions: Cross-Sectional Study.” J Med Internet Res, vol. 27, May 2025, p. e69910. PubMed, doi:10.2196/69910.

NLM: Hanss K, Sarma KV, Glowinski AL, Krystal A, Saunders R, Halls A, Gorrell S, Reilly E. Assessing the Accuracy and Reliability of Large Language Models in Psychiatry Using Standardized Multiple-Choice Questions: Cross-Sectional Study. J Med Internet Res. 2025 May 20;27:e69910.
