A Comparative Analysis of AI-Language Models' MCQ Performance versus Medical Students Across Different Pediatric Topics.

Olena Bolgova, Volodymyr Mavrych, Eyad Almidani, Turki Alshareef, Sabri Kemahlı

BACKGROUND: Large Language Models (LLMs) are increasingly used in medical education, yet their performance in specialized fields like pediatrics remains understudied. OBJECTIVE: To evaluate and compare the performance of four leading LLMs on a standard set of pediatric multiple-choice questions (MCQs) and benchmark their results against those of medical students. METHODS: We assessed four LLMs: Copilot (Microsoft), Claude (Anthropic), ChatGPT (GPT-4o, OpenAI), and Gemini (Google), on 120 MCQs from six pediatric topics: Pulmonology, Developmental Diseases, Infectious Diseases, General Pediatrics, Neonatology, and Nephrology. Each LLM attempted the questions three times to evaluate consistency. Medical student performance on the same questions served as a benchmark, and random responses provided a baseline control. RESULTS: The LLMs demonstrated an average accuracy of 80.4±10.9%, outperforming medical students (72.1±13.1%) by 8.3 percentage points and reaching nearly four times the accuracy of random responses (20.5±6.2%). Copilot (84.5±8.3%) and Claude (84.2±9.9%) achieved the highest accuracy, followed by GPT-4o (80.9±9.1%) and Gemini (72.0±18.9%). Among all LLM-to-student comparisons, only GPT-4o showed a statistically significant difference in accuracy (χ CONCLUSION: Modern LLMs demonstrate proficiency in pediatric knowledge that generally exceeds medical student performance, though with varying consistency across topics. These findings suggest that LLMs may serve as valuable supplementary tools in medical education, while highlighting the need for further improvements in specialized medical domains such as nephrology and neonatology.
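
The headline comparisons in the RESULTS can be re-derived from the reported means alone. The short Python sketch below is only an arithmetic check on the figures quoted in the abstract, not the authors' analysis; the function and variable names are hypothetical, and it assumes the overall LLM accuracy is the unweighted mean of the four per-model means.

```python
# Mean accuracies (percent) copied from the abstract's RESULTS.
llm_accuracy = {"Copilot": 84.5, "Claude": 84.2, "GPT-4o": 80.9, "Gemini": 72.0}
student_accuracy = 72.1
random_accuracy = 20.5


def summarize(llm, students, random_baseline):
    """Re-derive the headline figures from per-model means.

    Assumption: the overall LLM accuracy is the unweighted mean of the
    per-model means (the abstract does not state how it was pooled).
    """
    mean_llm = sum(llm.values()) / len(llm)
    return {
        "mean_llm": round(mean_llm, 1),
        # Difference of two percentages, so the unit is percentage points.
        "gap_vs_students_pp": round(mean_llm - students, 1),
        # Ratio of LLM accuracy to the random-response baseline.
        "ratio_vs_random": round(mean_llm / random_baseline, 2),
    }


summary = summarize(llm_accuracy, student_accuracy, random_accuracy)
print(summary)
```

Under that assumption the per-model means average to the reported 80.4%, the gap to students is 8.3 percentage points, and the ratio to the random baseline is about 3.9, which matches "nearly four times".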
