Analyzing Large Language Models’ Responses to Common Lumbar Spine Fusion Surgery Questions: A Comparison Between ChatGPT and Bard

Publication, Journal Article
Lang, SP; Yoseph, ET; Gonzalez-Suarez, AD; Kim, R; Fatemi, P; Wagner, K; Maldaner, N; Stienen, MN; Zygourakis, CC
Published in: Neurospine
June 1, 2024

Objective: In the digital age, patients turn to online sources for information about lumbar spine fusion, necessitating careful study of large language models (LLMs) such as Chat Generative Pre-trained Transformer (ChatGPT) for patient education.

Methods: This study assesses the quality of responses from OpenAI's ChatGPT 3.5 and Google's Bard to patient questions about lumbar spine fusion surgery. We identified 10 critical questions from 158 frequently asked ones found via Google search and presented them to both chatbots. Five blinded spine surgeons rated the responses on a 4-point scale from 'unsatisfactory' to 'excellent.' The clarity and professionalism of the answers were also evaluated on a 5-point Likert scale.

Results: Across the 10 questions, 97% of responses from ChatGPT 3.5 and Bard were rated excellent or satisfactory. Specifically, ChatGPT's responses were 62% excellent and 32% minimally clarifying, with only 6% needing moderate or substantial clarification. Bard's responses were 66% excellent and 24% minimally clarifying, with 10% requiring more clarification. No significant difference was found in the overall rating distribution between the 2 models. Both struggled with 3 specific questions regarding surgical risks, success rates, and selection of surgical approaches (Q3, Q4, and Q5). Interrater reliability was low for both models (ChatGPT: κ = 0.041, p = 0.622; Bard: κ = -0.040, p = 0.601). While both models scored well on understanding and empathy, Bard received marginally lower ratings in empathy and professionalism.

Conclusion: ChatGPT 3.5 and Bard effectively answered lumbar spine fusion FAQs, but further training and research are needed to solidify LLMs' role in medical education and healthcare communication.
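The interrater reliability reported above uses the kappa statistic, which corrects raw rater agreement for the agreement expected by chance alone (values near 0, as found here, indicate agreement no better than chance). As a minimal illustrative sketch — not the study's actual analysis code, and using made-up ratings — Cohen's kappa for two raters can be computed as:

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa: chance-corrected agreement between two raters.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    proportion of agreement and p_e is the agreement expected by
    chance from each rater's marginal category frequencies.
    """
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    # Observed agreement: fraction of items both raters scored identically.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of each rater's marginal probabilities,
    # summed over categories.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    categories = set(ratings_a) | set(ratings_b)
    p_e = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)
    if p_e == 1.0:  # both raters used a single category throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Hypothetical ratings on the study's 4-point scale (labels assumed
# for illustration only).
rater1 = ["excellent", "excellent", "satisfactory", "unsatisfactory"]
rater2 = ["excellent", "satisfactory", "satisfactory", "excellent"]
print(cohens_kappa(rater1, rater2))
```

Multi-rater designs like the five-surgeon panel here typically use a generalization such as Fleiss' kappa, but the chance-correction idea is the same.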

Published In

Neurospine

DOI

10.14245/ns.2448098.049
ISSN

2586-6583

Publication Date

June 1, 2024

Volume

21

Issue

2

Start / End Page

633 / 641

Citation

APA: Lang, S. P., Yoseph, E. T., Gonzalez-Suarez, A. D., Kim, R., Fatemi, P., Wagner, K., … Zygourakis, C. C. (2024). Analyzing Large Language Models’ Responses to Common Lumbar Spine Fusion Surgery Questions: A Comparison Between ChatGPT and Bard. Neurospine, 21(2), 633–641. https://doi.org/10.14245/ns.2448098.049

Chicago: Lang, S. P., E. T. Yoseph, A. D. Gonzalez-Suarez, R. Kim, P. Fatemi, K. Wagner, N. Maldaner, M. N. Stienen, and C. C. Zygourakis. “Analyzing Large Language Models’ Responses to Common Lumbar Spine Fusion Surgery Questions: A Comparison Between ChatGPT and Bard.” Neurospine 21, no. 2 (June 1, 2024): 633–41. https://doi.org/10.14245/ns.2448098.049.

ICMJE: Lang SP, Yoseph ET, Gonzalez-Suarez AD, Kim R, Fatemi P, Wagner K, et al. Analyzing Large Language Models’ Responses to Common Lumbar Spine Fusion Surgery Questions: A Comparison Between ChatGPT and Bard. Neurospine. 2024 Jun 1;21(2):633–41.

MLA: Lang, S. P., et al. “Analyzing Large Language Models’ Responses to Common Lumbar Spine Fusion Surgery Questions: A Comparison Between ChatGPT and Bard.” Neurospine, vol. 21, no. 2, June 2024, pp. 633–41. Scopus, doi:10.14245/ns.2448098.049.

NLM: Lang SP, Yoseph ET, Gonzalez-Suarez AD, Kim R, Fatemi P, Wagner K, Maldaner N, Stienen MN, Zygourakis CC. Analyzing Large Language Models’ Responses to Common Lumbar Spine Fusion Surgery Questions: A Comparison Between ChatGPT and Bard. Neurospine. 2024 Jun 1;21(2):633–641.
