Supporting postgraduate exam preparation with large language models: implications for traditional Chinese medicine education

Source: Frontiers in Medicine

Original: https://www.frontiersin.org/articles/10.3389/fmed.2025.1667104...

Published: 2026-01-09T00:00:00Z

The study evaluated four large language models (Ernie Bot, ChatGLM, SparkDesk, and GPT-4) on the 2023 Chinese Postgraduate Examination in Traditional Chinese Medicine (CPE-TCM). Ernie Bot achieved an accuracy of 50.30% and ChatGLM 46.67%, both exceeding the passing threshold. Model performance differed significantly across subjects, with the highest scores in the medical humanistic spirit module. ChatGLM and GPT-4 provided logical explanations for all of their responses, Ernie Bot for 98.2%, and SparkDesk for 43.6%. ChatGLM and GPT-4 always drew on internal information, while SparkDesk rarely did. More than 60% of the responses from Ernie Bot, ChatGLM, and GPT-4 also contained external information, with no significant difference between correct and incorrect responses. For SparkDesk, by contrast, the presence of internal or external information was significantly associated with response correctness (P < 0.001). The authors conclude that the models demonstrated a solid level of TCM knowledge, logical reasoning, and the ability to integrate background knowledge, indicating their potential to support TCM education.
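
To illustrate what an association between "information present" and "response correct" looks like numerically, the following is a minimal sketch of a Pearson chi-square test on a 2x2 table. The summary does not state which statistical test the authors used, and the counts below are hypothetical placeholders, not data from the study; the function name `chi_square_2x2` is likewise an assumption made for illustration.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for a 2x2 table.

    Rows:    information present / information absent
    Columns: correct response    / incorrect response
    Returns (chi2 statistic, p-value) with 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts, NOT taken from the study:
# info present: 40 correct, 10 incorrect; info absent: 20 correct, 30 incorrect.
chi2, p_value = chi_square_2x2(40, 10, 20, 30)
print(f"chi2 = {chi2:.2f}, p < 0.001: {p_value < 0.001}")
```

With these placeholder counts, correct responses cluster in the "information present" row, so the test reports a significant association. A table in which correctness is spread evenly across both rows would yield a statistic near zero, which is the pattern the summary describes for the other three models.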