Deficiencies in clinical reasoning of LLMs in low back pain management and remediation via prompt engineering: from performance evaluation to error diagnosis.
Jia-Hui Luo, Yi-Lin Wang, Min-Jun Zhao, Jian-Li Yin, Dao-Fang Ding, Xu-Bo Wu
BACKGROUND: Large language models (LLMs) show promise in medical tasks, but their systematic error patterns in high-stakes clinical settings remain poorly understood, which limits safe deployment.
METHODS: A three-phase simulation study was conducted. In Phase 1, researchers selected 103 multiple-choice questions and 30 clinical scenario questions derived from a low back pain (LBP) examination question bank and clinical guidelines, and systematically evaluated five mainstream LLMs (GPT-5, GPT-4o, GPT-o3, Deepseek-V2.5, and Grok-4) across six dimensions: accuracy, completeness, practicality, readability, safety, and output stability. In Phase 2, two clinical coders independently performed qualitative content analysis on low-scoring responses from Phase 1 (≤ 3 on any dimension) and classified error types; inter-rater reliability was calculated (Cohen's κ = 0.84), and consensus was reached through discussion. In Phase 3, targeted safety-oriented prompts were designed for the high-risk error categories identified in Phase 2, and a separate linear mixed model was fitted for each of the five evaluation dimensions.
RESULTS: All five models achieved accuracy rates exceeding 90% on the general LBP knowledge test, demonstrating solid foundational knowledge. GPT-4o had the highest overall clinical quality score and output stability. Error attribution revealed that lower-performing models, particularly Deepseek-V2.5, produced more safety-critical errors, including factual hallucinations and omissions of safety warnings. Targeted prompt engineering produced significant improvements for Deepseek-V2.5 across all five evaluation dimensions.
CONCLUSION: Although all models demonstrated strong foundational medical knowledge, their ability to translate this knowledge into reliable clinical guidance varied substantially. Critical concerns extend beyond factual accuracy to safety and completeness in real-world clinical contexts. Human oversight remains indispensable.
Clinicians should recognize the distinct strengths and limitations of different models and select tools according to their specific clinical use cases. Structured prompt design and systematic fact-checking represent the most practical and scalable approaches to enhancing safety, particularly in resource-limited settings. This study contributes to a more nuanced understanding of LLMs' capabilities and risks in chronic disease management and provides a replicable methodological foundation for future clinical AI evaluation.
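The inter-rater agreement statistic reported in the Methods (Cohen's κ) can be illustrated with a short calculation. The following is a minimal sketch, not the authors' analysis code: the function name `cohens_kappa`, the error-type labels, and the two coders' label lists are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items (unweighted)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items")

    n = len(rater_a)

    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each category, the product of the two raters'
    # marginal proportions, summed over all categories.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    categories = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)

    # Degenerate case: both raters used one identical category throughout.
    if expected == 1.0:
        return 1.0

    return (observed - expected) / (1.0 - expected)


# Hypothetical error-type labels from two coders (illustrative data only).
coder_1 = ["hallucination", "omission", "omission", "hallucination", "other", "omission"]
coder_2 = ["hallucination", "omission", "hallucination", "hallucination", "other", "omission"]

kappa = cohens_kappa(coder_1, coder_2)
print(f"Cohen's kappa: {kappa:.2f}")
```

κ corrects raw agreement for the agreement expected by chance, so it gives a stricter measure than percent agreement when a few error categories dominate.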