Examiner stratification reveals clinically relevant variability in large language model answers to endodontic patient questions.
Saeed S Alqahtani, Hmoud Ali Algarni, Meshal Aber Alonazi, Azhar Iqbal, Osama Khattak, Ravi Jothish, Mohmed Isaqali Karobari
INTRODUCTION: Large language models (LLMs) are increasingly used by patients seeking endodontic information, yet their clinical reliability and safety in patient-centred communication remain uncertain. METHODS: This study evaluated the clinical reliability and safety of three contemporary LLMs (ChatGPT GPT-4o, Claude Sonnet 4.5, and Gemini 3 Flash) using 50 patient-centred endodontic questions (35 frequently asked questions and 15 scenario-based prompts). Each question was submitted six times per model in independent sessions. Responses were anonymised and independently assessed by four examiners using a structured Clinical Reliability and Safety Framework. Because inter-examiner agreement was poor, analyses were stratified by examiner. Reproducibility was assessed using word count variability, embedding-based semantic similarity, and lexical distance metrics. RESULTS: Statistically significant differences in clinical reliability between models were observed in every examiner's ratings. ChatGPT consistently received the lowest scores, whereas Gemini most frequently achieved the highest ratings. Model differentiation was clearer for structured frequently asked questions and selected clinical domains than for scenario-based prompts. All models demonstrated stable response lengths across repeated runs. Gemini showed the highest semantic consistency despite greater surface-level rewording. DISCUSSION: Contemporary LLMs demonstrate clinically meaningful variability beyond factual accuracy, particularly in safety framing and clinical actionability. Reliability is influenced by question structure and clinical context. Multidimensional, examiner-aware evaluation frameworks are necessary to assess safety meaningfully and to support responsible integration of LLMs into endodontic patient communication.
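The reproducibility measures named in METHODS can be illustrated with a small sketch. The abstract does not state which formulas were used, so everything here is an assumption made for illustration: word count variability is taken as the coefficient of variation, lexical distance as the Jaccard distance between lowercase word sets, and the function names are hypothetical. Embedding-based semantic similarity is left out because it depends on an embedding model the abstract does not name.

```python
from itertools import combinations
from statistics import mean, pstdev


def word_count_cv(responses):
    """Coefficient of variation of word counts across repeated runs.

    Assumed metric: population standard deviation divided by the mean.
    """
    counts = [len(r.split()) for r in responses]
    avg = mean(counts)
    return pstdev(counts) / avg if avg else 0.0


def jaccard_distance(a, b):
    """Assumed lexical distance: 1 - |A & B| / |A | B| over word sets."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    if not union:
        return 0.0  # two empty responses are treated as identical
    return 1.0 - len(set_a & set_b) / len(union)


def mean_pairwise_distance(responses):
    """Average lexical distance over all pairs of repeated responses."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 0.0
    return mean(jaccard_distance(a, b) for a, b in pairs)
```

Under these assumptions, each question would yield one list of six responses per model. A low `word_count_cv` alongside a high `mean_pairwise_distance` would correspond to the pattern the abstract describes as stable length with surface-level rewording.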