Evaluating large language models for pharmacotherapy simulations: a mixed-methods study.

Ahmed N Farrag, Amany El-Zeiny, Amani M Ali

Simulation-based learning is essential in clinical pharmacy education but requires substantial faculty resources that limit scalability. Large language models (LLMs) offer promise for generating scalable simulations, yet their pedagogical rigor and clinical reliability remain unclear. In a mixed-methods, counterbalanced evaluation study, PharmD students (n = 104) engaged with acute myeloid leukemia (AML) or chronic myeloid leukemia (CML) cases, conditions requiring complex longitudinal management yet sharing semantic similarity, generated by four LLMs using expert-guided meta-prompts. Expert panels evaluated sessions across clinical authenticity, instructional design, and clinical reasoning; students completed satisfaction surveys. Of 103 sessions, 53 (51.5%) met passing criteria across all domains. Clinical accuracy and safety emerged as the limiting domain (58.3%) compared to clinical reasoning (81.6%) and instructional design (82.5%). CML sessions outperformed AML sessions (62.3% vs 40.0%; p = 0.031). Platform success rates ranged from 34.5% to 62.1%. Error analysis revealed guideline misalignment, pharmacotherapeutic inaccuracies, fabricated evidence, and cross-condition therapeutic recommendations occurring exclusively in AML sessions. Students favored LLMs over traditional methods (49.8% vs 30.0%); however, we did not detect statistically significant alignment between student satisfaction and expert-assessed quality. Sessions more frequently met criteria for instructional design and clinical reasoning than for pharmacotherapeutic accuracy and guideline alignment. Expert oversight with platform-specific and disease-specific validation remains essential for safe educational deployment, and effectiveness trials assessing objective learning outcomes represent necessary subsequent work.
