Performance of three large language models in answering parent-focused questions on rickets: a dual pediatric-orthopedic specialist evaluation.
Ahmet Murat Çörekci, Belen Ateş, Ahmet Alperen Öztürk, Mustafa Özdemir
BACKGROUND: Parents increasingly rely on large language models (LLMs) to obtain pediatric health information; however, the accuracy, clinical appropriateness, and readability of AI-generated responses remain variable. This concern is particularly relevant for rickets, a preventable metabolic bone disease in which delayed recognition or inappropriate guidance may result in adverse outcomes. This study aimed to compare the content quality, clinical appropriateness, and readability of responses generated by contemporary LLMs to parent-oriented questions about rickets using structured, multidisciplinary expert evaluation.
METHODS: Twenty-two frequently asked parent-oriented questions regarding rickets were identified from authoritative patient education resources and categorized into four thematic domains. Each question was posed to three LLMs (GPT-5.1, DeepSeek V3.2, and Gemini 3 Pro) using a standardized parent-focused prompt. Responses were collected as single-turn outputs between November 16 and 20, 2025. All responses were anonymized and independently evaluated by four clinicians (two pediatricians and two orthopedic surgeons). Content quality was assessed using a modified Artificial Intelligence Evaluation Score for Common Patient Questions (AIES-CPQ; range 5–25) and the Global Quality Scale (GQS; range 1–5). Readability was analyzed using five established indices. Inter-model differences were assessed using the Friedman test with Bonferroni-adjusted Wilcoxon signed-rank post-hoc comparisons, and inter-rater reliability was evaluated using intraclass correlation coefficients.
RESULTS: Significant differences were observed among models for both AIES-CPQ and GQS scores (
CONCLUSIONS: LLM-generated responses to parent-oriented questions about rickets vary substantially in quality, clinical appropriateness, and readability. While newer-generation models provide higher-quality information, none demonstrate uniformly reliable performance across all domains. Structured, disease-specific evaluation frameworks combined with multidisciplinary expert oversight are essential before AI-generated content can be safely integrated into parent-facing pediatric education.
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12887-026-06851-1.
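The inter-model comparison named in the methods (a Friedman test across three models, followed by Bonferroni-adjusted pairwise comparisons) can be illustrated with a minimal sketch. This is not the authors' analysis code, and the software they used is not stated in the abstract. The function names (`average_ranks`, `friedman_statistic`, `chi2_sf_df2`, `bonferroni`) and the example scores are hypothetical and are not study data. The sketch omits the tie correction and the Wilcoxon signed-rank tests themselves; it shows only the Friedman statistic and the Bonferroni adjustment step.

```python
import math


def average_ranks(values):
    """Rank values ascending (1 = lowest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(rows):
    """Friedman chi-square statistic without tie correction.

    rows: one list per question (block), holding one score per model.
    Returns (statistic, degrees_of_freedom).
    """
    n = len(rows)      # number of questions
    k = len(rows[0])   # number of models
    rank_sums = [0.0] * k
    for row in rows:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    q = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    return q, k - 1


def chi2_sf_df2(x):
    """Chi-square survival function for exactly 2 degrees of freedom."""
    return math.exp(-x / 2.0)


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical AIES-CPQ-style scores: 4 questions x 3 models (made-up numbers).
example_scores = [
    [20, 15, 22],
    [18, 14, 21],
    [19, 16, 23],
    [21, 17, 24],
]

q, df = friedman_statistic(example_scores)
p_overall = chi2_sf_df2(q)  # valid here because three models give df = 2
print(q, df, p_overall)
```

With three models there are three pairwise comparisons, so `bonferroni` multiplies each pairwise p-value by 3. The closed-form `chi2_sf_df2` applies only when exactly three groups are compared; a general analysis would normally rely on a statistics library instead.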