Performance of large language models on the radiation and cancer biology practice exam.
Jessica Bertschmann, Yang Xu, Conrad Bayley, Ahmad Abdellatif, Sangjune Laurence Lee
BACKGROUND/OBJECTIVES: Large language models (LLMs) are increasingly used in medicine for tasks ranging from patient communication to exam preparation. This study aimed to evaluate the feasibility of using a domain-specific, out-of-training-data radiation and cancer biology examination as a benchmarking framework for LLMs, and to compare the accuracy and consistency of commonly used LLMs available at the time of data collection. METHODS: GPT-3.5, GPT-4, and Llama-2 were queried with 335 multiple-choice questions (MCQs) from the 2023 American Society for Radiation Oncology (ASTRO) Radiation and Cancer Biology Exam Study Guide, excluding image-based items. Each model answered all questions five times over three months to evaluate consistency. Model responses were scored against the official answer key and analyzed using one-way ANOVA with Bonferroni correction to test for statistically significant differences in accuracy. RESULTS: GPT-4 achieved the highest accuracy, correctly answering 81% of questions, significantly outperforming GPT-3.5 (62%) and Llama-2 (51%) (p < 0.001). All models performed worse on questions requiring calculations, though these differences were not statistically significant. In terms of reliability, GPT-4 and Llama-2 provided consistent responses more frequently than GPT-3.5. Despite stable overall scores, all models exhibited variability in individual responses across repeated trials. GPT-4 produced the longest explanations, averaging 183 words per answer. CONCLUSIONS: This study demonstrates the feasibility of using a domain-specific, out-of-training-data examination to benchmark LLM knowledge in radiation and cancer biology. While performance differences were observed among models, variability and limitations, particularly in calculation-based questions, highlight the importance of methodological benchmarking and cautious interpretation when considering medical educational applications.
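The scoring and consistency analysis described in METHODS can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the definition of consistency (the same answer in every repeated trial), the 0.05 significance level, and the toy answers are all assumptions made for illustration. The abstract does not specify how consistency was measured or how the ANOVA was implemented.

```python
from statistics import mean

def trial_accuracy(responses, key):
    """Fraction of questions answered correctly in one trial."""
    return sum(r == k for r, k in zip(responses, key)) / len(key)

def consistency(trials):
    """Fraction of questions given the same answer in every trial
    (an assumed definition, not taken from the paper)."""
    per_question = list(zip(*trials))
    same = sum(len(set(answers)) == 1 for answers in per_question)
    return same / len(per_question)

def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA; each group holds one model's
    per-trial accuracies."""
    values = [v for g in groups for v in g]
    grand = mean(values)
    k, n = len(groups), len(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance threshold after Bonferroni correction."""
    return alpha / n_comparisons

# Hypothetical toy data: a 4-question answer key and two repeated trials.
key = ["A", "B", "C", "D"]
trials = [["A", "B", "C", "A"], ["A", "B", "D", "A"]]

accuracies = [trial_accuracy(t, key) for t in trials]
agreement = consistency(trials)
```

With three models there are three pairwise comparisons, so under an assumed overall significance level of 0.05 each comparison would be tested against `bonferroni_alpha(0.05, 3)`. In practice, a statistics library would supply the p-values; the hand-rolled F statistic here only shows what the test compares.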