
Prompting is All You Need: How to Make LLMs More Helpful for Clinical Decision Support.

Publication, Journal Article
Dymm, B; Goldenholz, DM
Published in: medRxiv
February 24, 2026

IMPORTANCE: Large language models (LLMs) offer potential decision support, but their accuracy varies. Prompt engineering can enhance LLM behavior, yet best practices have not been formally explored in realistic clinical contexts for neurology.

OBJECTIVE: To evaluate the impact of structured prompting versus naive prompting on the performance of five LLMs (two closed-source: OpenAI GPT-4o and OpenAI o3; three open-source: Meta Llama-4-Scout-17B-16E-Instruct, Llama-3.3-70B-Instruct-Turbo, and the reasoning model r1-1776) for thrombolytic clinical decision support (CDS) in acute stroke.

DESIGN: Models responded to three novel ischemic stroke vignettes using either a naive question ("Should this patient be offered thrombolytics?") or a five-step structured prompt (CARDS) guiding information extraction, timing analysis, contraindication checking, decision-process explanation, and risk-benefit discussion. Outputs were assessed across seven domains: guideline adherence, unsafe recommendations, risk recognition, guideline grading accuracy, inclusion of conversational explanation, clarity, and overall helpfulness.

RESULTS: Structured prompts significantly enhanced performance across most domains, with effects varying between model families. For the closed-source models (GPT-4o, o3), CARDS-style prompts improved guideline adherence from 83.3% to 100%, eliminated unsafe recommendations (16.7% to 0%), and increased specific guideline grading accuracy from 0% to 100%. The open-source reasoning model r1-1776 achieved the same top-tier outcomes with structured prompts (100% adherence, 0% unsafe recommendations, 100% grading accuracy, 100% conversational explanation), with grading and conversational explanation improving from 0%. In contrast, the other open-source models (Llama-4-Scout, Llama-3.3-70B) showed more modest gains: risk recognition improved from 83.3% to 100% and guideline grading accuracy increased from 0% to 66.7%, while guideline adherence remained at 66.7% and unsafe recommendations persisted at 33.3%. Overall, structured prompting yielded the largest improvements in guideline grading accuracy and conversational reasoning across multiple models.

CONCLUSION AND RELEVANCE: Structured prompting substantially enhances LLM performance for acute stroke thrombolysis CDS. Notably, some models, including the proprietary GPT-4o and o3 and the open-source reasoning model r1-1776, achieved excellent safety and adherence with structured prompts. For clinical deployment of any LLM, structured prompts are crucial, and vigilant human oversight remains essential.
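To make the contrast between the two prompting conditions concrete, here is a minimal sketch of how a naive question and a CARDS-style five-step structured prompt might be assembled. The five steps are taken from the abstract (information extraction, timing analysis, contraindication checking, decision-process explanation, risk-benefit discussion); the exact wording of each step is hypothetical and is not the authors' verbatim prompt.

```python
# Illustrative sketch of naive vs. CARDS-style structured prompting.
# Step wording is an assumption based on the five steps named in the abstract.

NAIVE_QUESTION = "Should this patient be offered thrombolytics?"

CARDS_STEPS = [
    "Extract the clinically relevant information from the vignette.",
    "Analyze symptom onset and elapsed time against the thrombolysis window.",
    "Check for contraindications to thrombolytic therapy.",
    "Explain your decision process step by step.",
    "Discuss the risks and benefits of thrombolysis for this patient.",
]


def build_naive_prompt(vignette: str) -> str:
    """Naive condition: the vignette followed by a single unstructured question."""
    return f"{vignette}\n\n{NAIVE_QUESTION}"


def build_cards_prompt(vignette: str) -> str:
    """Structured condition: the vignette plus five numbered reasoning steps."""
    steps = "\n".join(f"{i}. {step}" for i, step in enumerate(CARDS_STEPS, 1))
    return (
        f"{vignette}\n\n"
        "Work through the following steps before giving a recommendation:\n"
        f"{steps}\n\n"
        "Then state whether this patient should be offered thrombolytics, "
        "citing the relevant guideline grade."
    )


if __name__ == "__main__":
    vignette = "72-year-old with right-sided weakness, symptom onset 2 hours ago..."
    print(build_cards_prompt(vignette))
```

Either prompt string would then be sent to the model under test; in this framing, the study's intervention is purely a change in the prompt text, with the vignette held constant.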


Published In

medRxiv

DOI

10.64898/2026.02.12.26346005
Publication Date

February 24, 2026

Location

United States
 

Citation

Dymm, B., & Goldenholz, D. M. (2026). Prompting is All You Need: How to Make LLMs More Helpful for Clinical Decision Support. MedRxiv. https://doi.org/10.64898/2026.02.12.26346005
Dymm, Braydon, and Daniel M. Goldenholz. “Prompting is All You Need: How to Make LLMs More Helpful for Clinical Decision Support.” MedRxiv, February 24, 2026. https://doi.org/10.64898/2026.02.12.26346005.
Dymm, Braydon, and Daniel M. Goldenholz. “Prompting is All You Need: How to Make LLMs More Helpful for Clinical Decision Support.” MedRxiv, Feb. 2026. PubMed, doi:10.64898/2026.02.12.26346005.
