Evaluation of large language models in a pulmonology outpatient clinic using structured clinical data and chest radiographs: a single-center prospective observational study.

Hayriye Bektaş Aksoy, Selda Günaydın, Şaban Melih Şimşek, Ruhsel Cörüt

INTRODUCTION: Large language models (LLMs) may support clinical reasoning, yet real-world outpatient studies integrating structured clinical data with chest radiographs (CXRs) remain limited. We compared three LLMs for pulmonary differential diagnosis in routine clinical practice.

METHODS: In this prospective, single-center observational study, consecutive adult outpatients presenting with respiratory complaints between 06 October and 31 December 2025 were enrolled. For each case, a standardized structured clinical form and de-identified CXRs were provided to three LLMs (ChatGPT-5.2, Google Gemini 3 Flash, and Microsoft Copilot) and to three blinded pulmonologists. The primary diagnosis was assigned by the examining pulmonologist, and the reference diagnosis was defined by agreement of at least two blinded pulmonologists. Concordance and Cohen's kappa were assessed.

RESULTS: A total of 120 patients were included. Agreement among the blinded pulmonologists was high, and agreement between the primary and reference diagnoses was excellent. Compared with the reference diagnosis, ChatGPT-5.2 and Microsoft Copilot showed higher concordance than Google Gemini 3 Flash, with both demonstrating moderate overall agreement. Concordance did not differ by age or sex. Across diagnostic categories, performance was highest for pneumonia/upper respiratory tract infection and asthma.

DISCUSSION: In this real-world pulmonology outpatient cohort, ChatGPT-5.2 and Microsoft Copilot showed better diagnostic concordance than Google Gemini 3 Flash when structured clinical data and CXRs were evaluated together. These findings support the potential role of LLMs as adjunctive decision-support tools in pulmonology, while also indicating that performance remains diagnosis-dependent and insufficient to replace expert clinical judgment.
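
The Methods name two agreement measures, concordance and Cohen's kappa. As a reading aid, the following is a minimal sketch of how these are conventionally computed for two sets of categorical diagnoses: concordance as the proportion of matching labels (p_o), and kappa as (p_o - p_e) / (1 - p_e), where p_e is the agreement expected by chance from each rater's label frequencies. The function names and the six example labels are hypothetical illustrations, not the study's code or data, and the abstract does not state which software the authors used.

```python
from collections import Counter


def concordance(rater_a, rater_b):
    """Proportion of cases where the two raters give the same label (p_o)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters need the same non-zero number of cases")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters with nominal categories."""
    n = len(rater_a)
    p_o = concordance(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of the two raters' marginal proportions,
    # summed over every category used by either rater.
    categories = counts_a.keys() | counts_b.keys()
    p_e = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical category; kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical labels for six cases (illustration only, not study data).
reference = ["asthma", "asthma", "COPD", "pneumonia", "COPD", "asthma"]
model = ["asthma", "COPD", "COPD", "pneumonia", "asthma", "asthma"]

print(f"concordance = {concordance(reference, model):.3f}")
print(f"kappa       = {cohens_kappa(reference, model):.3f}")
```

In this toy example the raters match on 4 of 6 cases (concordance 0.667), while chance agreement is 14/36, giving kappa = 5/11, about 0.455. The gap between the two numbers shows why kappa is reported alongside raw concordance: it discounts agreement that would arise from label frequencies alone.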
