USF-LVHN SELECT

AI-Powered Workflow for Constructing Organic Materials Databases from the Literature: Integrating Large Language Models.

Publication/Presentation Date

10-28-2025

Abstract

We developed an end-to-end workflow to automate the construction of materials science databases from published literature, addressing a traditionally manual, time-intensive, and labor-intensive process. The work systematically evaluates and compares different machine learning (ML) methods to optimize each task. For identifying relevant publications, we tested various ML techniques and concluded that a combination of large language model (LLM)-based embeddings, clustering, and direct LLM queries is most effective. In the subsequent data extraction phase, we employed OpenAI's GPT-4 to extract materials and their properties, achieving accuracy comparable to manually curated data sets. Additionally, we integrated AI/ML methods to automatically generate SMILES from chemical structure images, expanding the workflow's applicability to organic materials. To validate the workflow, we applied it to studying organic donor materials in organic photovoltaic devices and benchmarked its performance against a manually curated data set derived from 503 papers. The results demonstrate the workflow's efficiency and accuracy. Finally, based on our findings, we provide recommendations for selecting the best ML methods for each task and propose further improvements for future tool development. This workflow represents a major advancement in accelerating the development of materials science databases and enables data science applications in a broader range of research topics that were historically infeasible due to the lack of available data sets.

Volume

10

Issue

42

First Page

49545

Last Page

49556

ISSN

2470-1343

Disciplines

Medical Education | Medicine and Health Sciences

PubMedID

41179185

Department(s)

USF-LVHN SELECT Program, USF-LVHN SELECT Program Students

Document Type

Article
