Rapid Evaluation of Artificial Intelligence Technology Used for Ambient Dictation in Primary Care: Comparing the Quality of Documentation of Artificial Intelligence-Generated and Human-Produced Clinical Notes.
Ashok Reddy, Eric Gunnink, Chelle L Wheat, Scott Pawlikowski, Chína M Payne, Scott Wiltz, Terrence L Hubert, Susan Kirsh, Evan Carey, Donna Hill, Karin M Nelson
BACKGROUND: Ambient artificial intelligence (AI) scribes can reduce the burden of administrative documentation. Prior evaluations have been vendor specific and not focused on measures of documentation quality.
OBJECTIVE: To compare the quality of AI-generated clinical notes with that of human-produced notes.
DESIGN: Cross-sectional evaluation of notes generated from standardized primary care clinical cases.
SETTING: Veterans Health Administration (VHA).
PARTICIPANTS: 11 AI scribe tools, 18 human note takers, and 30 human raters.
INTERVENTION: Five standardized primary care cases were audio recorded using standardized patients (for example, new patient, back pain, chest pain, pharmacy, and nurse care manager). Vendors and human clinicians generated encounter notes from the audio files.
MEASUREMENTS: Blinded raters assessed all notes using the modified Physician Documentation Quality Instrument (PDQI-9), which measures 10 domains of note quality on a 5-point Likert scale (maximum score 50).
RESULTS: Across all 5 clinical cases, human-generated notes received higher overall modified PDQI-9 scores than AI-generated notes. The largest difference was seen in the acute low back pain case (human: 43.8 [95% CI, 37.4 to 50.3] vs. AI: 20.3 [CI, 15.4 to 25.2]; difference -23.5 [CI, -29.2 to -17.9]). Pooled domain analysis showed lower AI scores across all 10 domains, with the largest deficits in domains related to being thorough (-1.23 [CI, -1.82 to -0.65]), organized (-1.06 [CI, -1.65 to -0.47]), and useful (-1.03 [CI, -1.61 to -0.44]).
LIMITATION: Cases were simulated; human-generated notes were not generated under real-world constraints.
CONCLUSION: Notes generated by AI had lower-quality scores than human-generated notes across 5 standardized care cases. Although ambient AI scribes hold promise for reducing clinician burden, independent, vendor-neutral evaluations of note quality are essential before large-scale clinical deployment.
PRIMARY FUNDING SOURCE: VHA.