Large Language Model Responses to Common Otolaryngological Questions: Evaluating Accuracy, Appropriateness, Readability, and Hallucinations.

Mitali Sakharkar, Parth Jalihal, Sarah Chang, Prachi Patel, Melissa Chavez, Jessica Levi

OBJECTIVES: Artificial intelligence (AI) is increasingly integrated into medicine, including otolaryngology. However, concerns remain regarding the accuracy of generated content and the tendency of large language models (LLMs) to fabricate references. This study evaluates the accuracy, appropriateness, readability, and reference hallucination of two widely used LLMs, ChatGPT and Claude, in response to common otolaryngological questions. STUDY DESIGN: Prospective observational study. SETTING: Academic tertiary care center. METHODS: Thirty-six otolaryngologic questions were individually entered into ChatGPT 4.0 Plus and Claude in separate sessions, with explicit instructions to avoid using previous memories. To assess reproducibility, each query was submitted twice. Two otolaryngologists independently rated the accuracy of responses. Readability was evaluated using the Flesch Reading Ease (FRE) score. Reference hallucinations were assessed by analyzing the validity and relevance of each reference. RESULTS: ChatGPT and Claude had FRE scores of 47 and 25.2 out of 100, respectively. For patient readability, ChatGPT scored 3.60 and Claude scored 4.68 out of 5. Claude scored slightly higher on accuracy, with 4.42 out of 5 compared with 3.81 for ChatGPT. Both models hallucinated at least half of their references, and some citations were irrelevant or incorrectly formatted. Thematic analysis revealed frequent vagueness, poor clinical prioritization, and excessive jargon across both models. CONCLUSION: Both ChatGPT and Claude often produced partially inaccurate, jargon-filled responses and failed to consistently provide valid references when answering common otolaryngologic patient questions. Our results highlight the need for better understanding and regulation of LLM limitations in clinical and patient-facing applications.
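For readers unfamiliar with the FRE score used in the Methods, the standard Flesch formula is FRE = 206.835 - 1.015 × (words / sentences) - 84.6 × (syllables / words), with higher scores indicating easier text. The Python sketch below is illustrative only: the abstract does not say which tool the authors used, and the function names, the tokenization rules, and the vowel-group syllable counter are all assumptions made here, so its output may differ from the scores reported above.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: count vowel groups (assumed heuristic, not exact)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing "e" as silent, except in "-le" endings such as "table".
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def fre_from_counts(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch Reading Ease formula applied to raw counts."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_reading_ease(text: str) -> float:
    """Estimate FRE for a text using naive sentence and word splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return fre_from_counts(len(words), len(sentences), syllables)


if __name__ == "__main__":
    # 100 words, 5 sentences, 150 syllables gives a score of about 59.6.
    print(round(fre_from_counts(100, 5, 150), 3))
```

Dedicated readability tools typically use dictionary-based or more refined syllable counting, so a heuristic like this is best suited to rough comparisons between texts rather than to reproducing published scores.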
