Education Research: Quality of Narrative Feedback Generated by a Large Language Model Compared With Expert Faculty for Case-Based Learning in Neurology Education.

Hannah Fruitman, Sasha Severin, Atikul Miah, Christina Gao, Haelynn Gim, Carolyn Qian, Sang-O Park, Kelly Hou, Edward L Kong, Benjamin Cook, Jasmin Le, Brandon Stretton, John Maddison, Liam G McCoy, Luke Collins, Andrew Vanlint, Rudy Goh, Matthew Arnold, Aye Thant, Rani Priyanka Vasireddy, Doris Kung, Ashley M Paul, Haatem Reda, Tamara B Kaplan, Adam Karp, Galina Gheihman

BACKGROUND AND OBJECTIVES: Neurology learners often receive limited feedback in clinical settings because of workflow constraints, variability in supervision, and competing clinical demands. Artificial intelligence, including large language models (LLMs), may help address these gaps by giving clinical learners real-time, case-specific formative feedback during neurology case-based learning (CBL). The aim of this study was to examine how the quality of LLM-generated feedback compares with human expert-generated feedback in neurology CBL.

METHODS: In this exploratory quantitative study, student participants undertook LLM-enabled interactive cases on the TEACHABLE platform, which included history gathering, physical examination elements, and diagnostic test ordering. Participants were clinical-level students recruited from 2 medical institutions. Case transcripts were recorded, and feedback on each transcript was provided by an LLM and by human experts in 2 components: history taking/physical examination elements (H&P) and assessment and plan (A&P). Feedback characteristics, including sentence count, word count, and reference to case key learning points, were summarized and compared. Feedback quality was scored by blinded experts using the QuAL and EFeCT instruments. Results were compared for the H&P and A&P components of the case interactions.

RESULTS: Four student participants completed 5 interactive cases each, generating 20 total transcripts for feedback. Word and sentence counts were similar between LLM-generated and expert-generated feedback, except for a greater word count in expert-generated A&P feedback. For H&P, the LLM commented on the key learning points in 20/20 (100%) cases, compared with 39/60 (65%) for the human experts. For A&P, the LLM feedback discussed key points in 20/20 (100%) cases, compared with 39/40 (97.5%) for the human experts. The LLM feedback had no medical inaccuracies. QuAL and EFeCT scores were significantly higher for the LLM than for human experts for the H&P component, but not significantly different for the A&P component.

DISCUSSION: LLMs provided with key learning points can generate timely, quality feedback on case-based interactions in a manner comparable with human experts. A hybrid framework combining LLM-generated feedback with faculty input may offer high-quality and equitably accessible formative feedback at scale. These pilot findings are limited by a small sample size and an experimental setting.
