Abstract:
Background: Large language models (LLMs) are becoming increasingly familiar to the public and are increasingly adopted in healthcare contexts. Thyroid cancer is a common malignancy in China, and patients report substantial unmet needs for evidence-based disease information. Nevertheless, no studies have assessed the quality and readability of LLM-generated responses about thyroid cancer in the Chinese context.

Objective: To evaluate and compare the quality and readability of responses to thyroid cancer-related queries generated by domestic (Chinese) LLMs.

Methods: The Douyin Index was used to identify 25 thyroid cancer-related questions. Responses were generated with DeepSeek (DeepSeek-R1-0120), Qwen (qwen-max-2025-01-25), and GLM (GLM-4-Plus). Cosine similarity between texts generated at different time points was used to assess the stability of each model. Information quality was assessed with the modified Health Information Quality Assessment Tool (mDISCERN), and readability was evaluated with the Chinese Readability Formula. Differences in information quality and stability across models were examined with cluster heatmaps, principal component analysis (PCA), Friedman tests, and signed-rank tests; Pearson correlation analysis was used to examine the relationship between information quality and readability.

Results: In the text-similarity evaluation, 12% of DeepSeek's repeated responses were moderately similar and 88% were highly similar, whereas 100% of the paired responses from Qwen and GLM were highly similar. Information quality and readability differed significantly across the three models (P<0.001). DeepSeek showed the highest information quality (Z=35.396, P<0.001) but comparatively lower readability (R=7.525±1.006). Qwen and GLM showed comparable information quality overall, with GLM performing better on question clusters 2 and 3 and Qwen performing better on question cluster 1. Information quality and readability were negatively correlated overall (r=-0.370, P=0.010).

Conclusion: Domestic LLMs show considerable potential for delivering essential health education to patients with thyroid cancer; however, concerns remain about inaccuracies in the generated content and AI hallucinations. When patients use LLMs to obtain health information, they should weigh the responses from different platforms together with their physicians' advice. On the model side, developers need to balance professional rigor with accessibility and establish a medical-content safety-review mechanism to ensure the accuracy and professionalism of the information.
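To make the analysis pipeline described in the Methods concrete, the sketch below shows, under stated assumptions, how the main steps could be computed in Python: character n-gram TF-IDF cosine similarity for response stability, a Friedman test with a Wilcoxon signed-rank post-hoc comparison for between-model differences, and a Pearson correlation between quality and readability. This is not the authors' code; all texts and scores are hypothetical placeholders, and the scikit-learn/SciPy functions (TfidfVectorizer, friedmanchisquare, wilcoxon, pearsonr) are standard tools assumed here for illustration.

```python
# Illustrative sketch (not the study's code): similarity and statistical
# comparisons analogous to those described in the Methods.
# All inputs below are hypothetical placeholders, not study data.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.stats import friedmanchisquare, wilcoxon, pearsonr

def response_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between two responses to the same question,
    using character n-gram TF-IDF (avoids a Chinese word-segmentation step)."""
    vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))
    tfidf = vec.fit_transform([text_a, text_b])
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])

# Hypothetical mDISCERN quality scores for 25 questions x 3 models.
rng = np.random.default_rng(0)
deepseek = rng.integers(3, 6, 25)   # placeholder scores
qwen = rng.integers(2, 5, 25)
glm = rng.integers(2, 5, 25)

# Friedman test: do the three models differ in quality across questions?
fried_stat, fried_p = friedmanchisquare(deepseek, qwen, glm)

# Post-hoc pairwise comparison with the Wilcoxon signed-rank test.
w_stat, w_p = wilcoxon(deepseek, qwen)

# Pearson correlation between information quality and readability
# (readability values here are also placeholders).
readability = rng.normal(7.5, 1.0, 25)
r, r_p = pearsonr(deepseek.astype(float), readability)

# Example with two hypothetical Chinese queries about thyroid cancer.
sim = response_similarity("甲状腺癌如何治疗?", "甲状腺癌的治疗方式有哪些?")
print(f"similarity example: {sim:.2f}")
print(f"Friedman: stat={fried_stat:.2f}, p={fried_p:.4f}; "
      f"Wilcoxon p={w_p:.4f}; Pearson r={r:.2f}")
```

In such a pipeline, similarity scores could then be binned into "moderately similar" and "highly similar" categories against chosen thresholds, mirroring the proportions reported in the Results.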