Artificial Intelligence in Medical Assessment: Reliability and Performance of Multimodal Large Language Models in a High-Stakes Licensing Examination.

Ibrahim Güler, Gerrit Grieb, Armin Kraus, Philipp Moog, Uzay Cambaz, Ezgi Yavasca, Henrik Stelling

Artificial intelligence (AI) is increasingly integrated into assessment contexts, yet evidence on the reliability and measurement properties of large language models (LLMs) in high-stakes evaluation settings remains limited. This study examines the performance and reproducibility of contemporary multimodal LLMs in a structured medical assessment environment. A cross-sectional, dual-setup design was applied to a complete national medical licensing examination (240 multiple-choice items, including image-based questions). Setup 1 evaluated ten models in a single run to characterize overall performance. Setup 2 assessed six models across five independent runs each to quantify measurement stability. The analysis covered accuracy with 95% confidence intervals, inter-run agreement (Cohen's kappa), and paired comparisons between models (McNemar's test). Accuracy across models ranged from 72.08% to 92.92%. All models demonstrated near-perfect inter-run agreement (mean κ ≥ 0.96) with minimal variability. After correction for multiple comparisons, only a small number of pairwise differences remained significant, indicating convergence among leading systems. In an exploratory subanalysis, performance on the small set of image-based items was comparable to or slightly higher than performance on text-only items. These findings demonstrate that multimodal LLMs achieve high accuracy and high inter-run reproducibility on a large-scale assessment, supporting their use as objects of AI-based assessment research while leaving questions of cognitive equivalence with human examinees beyond the scope of accuracy-based evaluation.
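The three statistics named in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' analysis code, and several choices are assumptions: the abstract does not say which interval method was used (a Wilson score interval is shown here), whether McNemar's test was exact or chi-square based (the exact binomial form is shown), or how the mean κ was aggregated (an average over all run pairs is assumed). All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def wilson_ci(correct, n, z=1.96):
    """Wilson score interval for a proportion (assumed method; z=1.96 for ~95%)."""
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def cohen_kappa(run_a, run_b):
    """Cohen's kappa between the answer choices of two runs over the same items."""
    n = len(run_a)
    observed = sum(a == b for a, b in zip(run_a, run_b)) / n
    count_a, count_b = Counter(run_a), Counter(run_b)
    # Chance agreement from the marginal answer frequencies of each run.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both runs gave the same single answer everywhere
    return (observed - expected) / (1 - expected)


def mean_pairwise_kappa(runs):
    """Average kappa over all pairs of runs (assumed aggregation for 'mean kappa')."""
    pairs = list(combinations(runs, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value from per-item correctness of two models."""
    b = sum(x and not y for x, y in zip(correct_a, correct_b))  # only A correct
    c = sum(y and not x for x, y in zip(correct_a, correct_b))  # only B correct
    n = b + c
    if n == 0:
        return 1.0  # no discordant items, no evidence of a difference
    tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

On a 240-item exam, the reported accuracy bounds of 72.08% and 92.92% correspond to 173 and 223 correct answers, so `wilson_ci(223, 240)` gives an interval for the best-performing model under the Wilson assumption. Only the discordant items (one model correct, the other wrong) enter McNemar's test, which is why two models with high and similar accuracy can easily fail to differ significantly.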
