Assessing the Accuracy of Artificial Intelligence-Generated Clinical Summaries From Ambulatory Glaucoma Subspecialty Clinical Encounters.
Publication/Presentation Date
1-5-2026
Abstract
PURPOSE: The purpose of this study was to evaluate the accuracy of the large language model (LLM) LLaMA 2-70B in summarizing glaucoma clinic notes into patient-friendly language and in generating educational material.
METHODS: A random sample of 147 clinic notes from unique patients seen in the Glaucoma Service at a tertiary center was analyzed. LLaMA 2 generated paragraph and bullet-point summaries covering five subjects: (1) glaucoma diagnosis and type, (2) disease progression, (3) treatment plan, (4) treatment changes, and (5) surgical/laser interventions. Two ophthalmologists rated each response for accuracy as "correct," "partially correct," or "incorrect," and a glaucoma specialist adjudicated discrepancies. For comparison, the same prompts were run with ChatGPT-4 on a subset of 50 notes.
RESULTS: LLaMA 2 correctly summarized 97 notes (66%) in paragraph format and 103 (70%) in bullet format. Another 44 (30%) and 41 (28%), respectively, were partially correct. Paragraph summaries were more accurate and complete for glaucoma suspects than for patients with diagnosed glaucoma (82% vs. 53%, P < 0.001). For targeted clinical questions, LLaMA 2 accurately identified glaucoma diagnosis in 118 notes (80%), disease stability/progression in 129 (88%), treatment plans in 127 (87%), treatment changes in 134 (91%), and surgical/laser interventions in 124 (84%). On the 50-note subset, ChatGPT-4 produced correct paragraph summaries in 46% and correct bullet summaries in 50%, with accuracies of 96%, 88%, 64%, 78%, and 82%, respectively, on the targeted questions.
CONCLUSIONS: Although LLaMA 2 is not yet reliable as a standalone clinical tool, it shows promise for improving clinical communication.
TRANSLATIONAL RELEVANCE: LLMs may enhance patient experience and health literacy by standardizing the use of patient-friendly language in clinical care.
Volume
15
Issue
1
First Page
22
Last Page
22
ISSN
2164-2591
Published In/Presented At
Zhang, Y., Shi, M., Chung, I. Y., Liebman, D. L., Barna, L. E., Pasquale, L. R., Friedman, D. S., Boland, M. V., Shen, L. Q., & Wang, M. (2026). Assessing the Accuracy of Artificial Intelligence-Generated Clinical Summaries From Ambulatory Glaucoma Subspecialty Clinical Encounters. Translational Vision Science & Technology, 15(1), 22. https://doi.org/10.1167/tvst.15.1.22
Disciplines
Medicine and Health Sciences
PubMedID
41532689
Department(s)
Department of Medicine
Document Type
Article