Automated analyses of risk of bias and critical appraisal of systematic reviews (ROBIS and AMSTAR 2): a comparison of the performance of 4 large language models.

Publication, Journal Article
Forero, DA; Abreu, SE; Tovar, BE; Oermann, MH
Published in: Journal of the American Medical Informatics Association : JAMIA
September 2025

To explore the performance of 4 large language model (LLM) chatbots for the analysis of 2 of the most commonly used tools for the advanced analysis of systematic reviews (SRs) and meta-analyses. We explored the performance of 4 LLM chatbots (ChatGPT, Gemini, DeepSeek, and QWEN) for the analysis of ROBIS and AMSTAR 2 tools (sample sizes: 20 SRs), in comparison with assessments by human experts. Gemini showed the best agreement with human experts for both ROBIS and AMSTAR 2 (accuracy: 58% and 70%). The second-best LLM chatbots were ChatGPT and QWEN, for ROBIS and AMSTAR 2, respectively. Some LLM chatbots underestimated the risk of bias or overestimated the confidence of the results in published SRs, which is compatible with recent articles for other tools. This is one of the first studies comparing the performance of several LLM chatbots for the automated analyses of ROBIS and AMSTAR 2.
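The abstract reports accuracy as the measure of agreement between each chatbot and the human experts. As a minimal sketch of how such a figure could be computed, assuming accuracy here means the simple proportion of matching ratings, the Python snippet below compares two lists of labels. The function name `percent_agreement`, the label vocabulary, and the toy ratings are all hypothetical illustrations; they are not the study's data or the authors' code.

```python
from typing import Sequence


def percent_agreement(llm: Sequence[str], human: Sequence[str]) -> float:
    """Return the fraction of items on which the two raters give the same label."""
    if len(llm) != len(human):
        raise ValueError("Both raters must assess the same number of items.")
    if not llm:
        raise ValueError("At least one rating is required.")
    matches = sum(a == b for a, b in zip(llm, human))
    return matches / len(llm)


# Toy, made-up ratings for five reviews (NOT data from the study).
human = ["high", "low", "high", "unclear", "low"]
llm = ["low", "low", "high", "low", "low"]

accuracy = percent_agreement(llm, human)
print(f"Agreement with human experts: {accuracy:.0%}")
```

Raw agreement of this kind does not correct for matches expected by chance, so a chance-corrected statistic such as Cohen's kappa may be a useful complement when the label distribution is skewed.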

Duke Scholars

Published In

Journal of the American Medical Informatics Association : JAMIA

DOI

10.1093/jamia/ocaf117

EISSN

1527-974X

ISSN

1067-5027

Publication Date

September 2025

Volume

32

Issue

9

Start / End Page

1471 / 1476

Related Subject Headings

  • Systematic Reviews as Topic
  • Natural Language Processing
  • Meta-Analysis as Topic
  • Medical Informatics
  • Large Language Models
  • Language
  • Humans
  • Bias
  • 46 Information and computing sciences
  • 42 Health sciences
 

Citation

APA: Forero, D. A., Abreu, S. E., Tovar, B. E., & Oermann, M. H. (2025). Automated analyses of risk of bias and critical appraisal of systematic reviews (ROBIS and AMSTAR 2): a comparison of the performance of 4 large language models. Journal of the American Medical Informatics Association : JAMIA, 32(9), 1471–1476. https://doi.org/10.1093/jamia/ocaf117
Chicago: Forero, Diego A., Sandra E. Abreu, Blanca E. Tovar, and Marilyn H. Oermann. “Automated analyses of risk of bias and critical appraisal of systematic reviews (ROBIS and AMSTAR 2): a comparison of the performance of 4 large language models.” Journal of the American Medical Informatics Association : JAMIA 32, no. 9 (September 2025): 1471–76. https://doi.org/10.1093/jamia/ocaf117.
ICMJE: Forero DA, Abreu SE, Tovar BE, Oermann MH. Automated analyses of risk of bias and critical appraisal of systematic reviews (ROBIS and AMSTAR 2): a comparison of the performance of 4 large language models. Journal of the American Medical Informatics Association : JAMIA. 2025 Sep;32(9):1471–6.
MLA: Forero, Diego A., et al. “Automated analyses of risk of bias and critical appraisal of systematic reviews (ROBIS and AMSTAR 2): a comparison of the performance of 4 large language models.” Journal of the American Medical Informatics Association : JAMIA, vol. 32, no. 9, Sept. 2025, pp. 1471–76. Epmc, doi:10.1093/jamia/ocaf117.
NLM: Forero DA, Abreu SE, Tovar BE, Oermann MH. Automated analyses of risk of bias and critical appraisal of systematic reviews (ROBIS and AMSTAR 2): a comparison of the performance of 4 large language models. Journal of the American Medical Informatics Association : JAMIA. 2025 Sep;32(9):1471–1476.