Accuracy of ChatGPT-3.5, ChatGPT-4o, Copilot, Gemini, Claude, and Perplexity in advising on lumbosacral radicular pain against clinical practice guidelines: cross-sectional study.

Publication: Journal Article
Rossettini, G; Bargeri, S; Cook, C; Guida, S; Palese, A; Rodeghiero, L; Pillastrini, P; Turolla, A; Castellini, G; Gianola, S
Published in: Front Digit Health
2025

INTRODUCTION: Artificial intelligence (AI) chatbots, which generate human-like responses from extensive training data, are becoming important tools in healthcare, acting as virtual assistants that provide information on health conditions, treatments, and preventive measures. However, how well they align with clinical practice guidelines (CPGs) when answering complex clinical questions on lumbosacral radicular pain remains unclear. We aimed to evaluate AI chatbots' performance against CPG recommendations for diagnosing and treating lumbosacral radicular pain.

METHODS: We performed a cross-sectional study assessing AI chatbots' responses against CPG recommendations for diagnosing and treating lumbosacral radicular pain. Clinical questions based on these CPGs were posed to the latest versions (updated in 2024) of six AI chatbots: ChatGPT-3.5, ChatGPT-4o, Microsoft Copilot, Google Gemini, Claude, and Perplexity. The chatbots' responses were evaluated for (a) consistency of text responses using Plagiarism Checker X, (b) intra- and inter-rater reliability using Fleiss' kappa, and (c) match rate with CPGs. Statistical analyses were performed with STATA/MP 16.1.

RESULTS: We found high variability in the text consistency of AI chatbot responses (median range 26%-68%). Intra-rater reliability ranged from "almost perfect" to "substantial," while inter-rater reliability varied from "almost perfect" to "moderate." Perplexity had the highest match rate at 67%, followed by Google Gemini at 63% and Microsoft Copilot at 44%. ChatGPT-3.5, ChatGPT-4o, and Claude showed the lowest performance, each with a 33% match rate.

CONCLUSIONS: Despite variable internal consistency and good intra- and inter-rater reliability, the AI chatbots' recommendations often did not align with CPG recommendations for diagnosing and treating lumbosacral radicular pain. Clinicians and patients should exercise caution when relying on these AI models, since, depending on the chatbot, one-third to two-thirds of the recommendations provided may be inappropriate or misleading.
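To make the two headline metrics concrete, the sketch below shows how inter-rater reliability (Fleiss' kappa) and a guideline match rate could be computed. This is an illustration only, not the paper's analysis: the study used STATA/MP 16.1, while this sketch uses Python with statsmodels, and the ratings array and the majority-vote consensus rule are hypothetical assumptions.

```python
# Illustrative sketch only: hypothetical ratings; the study's actual
# analyses were run in STATA/MP 16.1, not with this code.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical data: 9 clinical questions, 3 raters; each rater scores a
# chatbot's answer as 1 = matches the CPG recommendation, 0 = does not.
ratings = np.array([
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 0],
    [0, 1, 0],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
])

# Inter-rater reliability: convert raw ratings into a subjects-by-categories
# count table, then compute Fleiss' kappa.
table, _ = aggregate_raters(ratings)
kappa = fleiss_kappa(table, method="fleiss")
print(f"Fleiss' kappa: {kappa:.2f}")

# Match rate: share of questions whose consensus rating agrees with the
# guideline (here, an assumed majority vote of 2 out of 3 raters).
consensus = (ratings.sum(axis=1) >= 2).astype(int)
match_rate = consensus.mean()
print(f"Match rate vs. CPGs: {match_rate:.0%}")
```

With these made-up ratings the sketch prints a kappa of about 0.70 ("substantial" on the conventional Landis-Koch scale) and a match rate of about 56%, in the same spirit as the per-chatbot figures reported in the abstract.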


Published In

Front Digit Health

DOI

10.3389/fdgth.2025.1574287

EISSN

2673-253X

Publication Date

2025

Volume

7

Start / End Page

1574287

Location

Switzerland

Related Subject Headings

  • 4203 Health services and systems
 

Citation

APA
Rossettini, G., Bargeri, S., Cook, C., Guida, S., Palese, A., Rodeghiero, L., … Gianola, S. (2025). Accuracy of ChatGPT-3.5, ChatGPT-4o, Copilot, Gemini, Claude, and Perplexity in advising on lumbosacral radicular pain against clinical practice guidelines: cross-sectional study. Front Digit Health, 7, 1574287. https://doi.org/10.3389/fdgth.2025.1574287

Chicago
Rossettini, Giacomo, Silvia Bargeri, Chad Cook, Stefania Guida, Alvisa Palese, Lia Rodeghiero, Paolo Pillastrini, Andrea Turolla, Greta Castellini, and Silvia Gianola. “Accuracy of ChatGPT-3.5, ChatGPT-4o, Copilot, Gemini, Claude, and Perplexity in advising on lumbosacral radicular pain against clinical practice guidelines: cross-sectional study.” Front Digit Health 7 (2025): 1574287. https://doi.org/10.3389/fdgth.2025.1574287.

MLA
Rossettini, Giacomo, et al. “Accuracy of ChatGPT-3.5, ChatGPT-4o, Copilot, Gemini, Claude, and Perplexity in advising on lumbosacral radicular pain against clinical practice guidelines: cross-sectional study.” Front Digit Health, vol. 7, 2025, p. 1574287. Pubmed, doi:10.3389/fdgth.2025.1574287.

NLM
Rossettini G, Bargeri S, Cook C, Guida S, Palese A, Rodeghiero L, Pillastrini P, Turolla A, Castellini G, Gianola S. Accuracy of ChatGPT-3.5, ChatGPT-4o, Copilot, Gemini, Claude, and Perplexity in advising on lumbosacral radicular pain against clinical practice guidelines: cross-sectional study. Front Digit Health. 2025;7:1574287.
