Department of Medicine

Assessing the Accuracy of Artificial Intelligence-Generated Clinical Summaries From Ambulatory Glaucoma Subspecialty Clinical Encounters.

Yapei Zhang
Min Shi
In Young Chung
Daniel L Liebman
Laura E Barna
Louis R Pasquale
David S Friedman
Michael V Boland
Lucy Q Shen
Mengyu Wang

Publication/Presentation Date

1-5-2026

Abstract

PURPOSE: This study evaluated the accuracy of the large language model (LLM) LLaMA 2-70B in summarizing glaucoma clinic notes into patient-friendly language and generating educational material.

METHODS: A random sample of 147 clinic notes from unique patients who visited the Glaucoma Service at a tertiary center was analyzed. LLaMA 2 generated paragraph and bullet-point summaries covering five topics: (1) glaucoma diagnosis and type, (2) disease progression, (3) treatment plan, (4) treatment changes, and (5) surgical/laser interventions. Two ophthalmologists reviewed the responses for accuracy and categorized each as "correct," "partially correct," or "incorrect." Discrepancies were adjudicated by a glaucoma specialist. ChatGPT-4 was evaluated on a subset of notes (n = 50) using identical prompts for comparison.

RESULTS: LLaMA 2 correctly summarized 97 notes (66%) in paragraph format and 103 (70%) in bullet-point format. Another 44 (30%) and 41 (28%) were partially correct, respectively. Paragraph summaries were more accurate and complete for glaucoma suspects than for diagnosed patients (82% vs. 53%, P < 0.001). For targeted clinical questions, LLaMA 2 accurately identified glaucoma diagnosis in 118 notes (80%), disease stability/progression in 129 (88%), treatment plans in 127 (87%), treatment changes in 134 (91%), and surgical/laser interventions in 124 (84%). ChatGPT-4 achieved 46% correct paragraph summaries, 50% correct bullet-point summaries, and accuracies of 96%, 88%, 64%, 78%, and 82%, respectively, on the same five targeted questions.

CONCLUSIONS: Although LLaMA 2 is not yet reliable as a standalone clinical tool, it shows promise for improving clinical communication.

TRANSLATIONAL RELEVANCE: LLMs may enhance patient experience and health literacy by standardizing patient-friendly language in clinical care.

ISSN

2164-2591

Published In/Presented At

Zhang, Y., Shi, M., Chung, I. Y., Liebman, D. L., Barna, L. E., Pasquale, L. R., Friedman, D. S., Boland, M. V., Shen, L. Q., & Wang, M. (2026). Assessing the Accuracy of Artificial Intelligence-Generated Clinical Summaries From Ambulatory Glaucoma Subspecialty Clinical Encounters. Translational vision science & technology, 15(1), 22. https://doi.org/10.1167/tvst.15.1.22

Disciplines

Medicine and Health Sciences

PubMedID

41532689

Department(s)

Department of Medicine

Document Type

Article
