When AI Stumbles on Medical Summaries: Why ChatGPT Beats Specialized Models at Understanding Health Research

A team of researchers at the University of Split discovered something counterintuitive: when it comes to understanding the certainty level of medical research summaries, a general-purpose AI chatbot outperforms specialized deep learning models built specifically for the task. The study, which tested two transformer-based language models against ChatGPT, found that the general-purpose model achieved 74.2% accuracy compared to 56.6% and 60.9% for the specialized alternatives.

What Are Transformer Models and Why Do They Matter for Medical Text?

Transformer models are a type of artificial intelligence architecture that excels at understanding language by analyzing patterns in text. Researchers selected two specific variants for this study: SciBERT, a transformer model trained on 1.14 million scientific papers primarily from the health sciences, and Longformer, a transformer designed to process lengthy documents efficiently. The team fine-tuned these models, a process in which a pre-trained model is adjusted to specialize in a particular task, using a dataset of 4,405 Cochrane plain language summaries.
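The paper does not publish its training code, but fine-tuning of this kind essentially attaches a small classification head to the pre-trained encoder and updates it on labeled summaries. As a rough sketch in PyTorch (which the study used), with random tensors standing in for encoder outputs and all hyperparameters chosen purely for illustration:

```python
import torch
import torch.nn as nn

# The study's three certainty classes.
LABELS = ["conclusive", "inconclusive", "unclear"]

# Illustrative stand-in: a 3-way classification head of the kind placed on
# top of a pre-trained encoder (such as SciBERT or Longformer) during
# fine-tuning. Hidden size 768 matches BERT-base-style encoders.
head = nn.Linear(768, len(LABELS))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(head.parameters(), lr=2e-5)

# Random tensors stand in for pooled encoder outputs of a batch of summaries.
pooled = torch.randn(8, 768)          # batch of 8 summary embeddings
targets = torch.randint(0, 3, (8,))   # gold certainty labels

logits = head(pooled)                 # shape: (8, 3), one score per class
loss = loss_fn(logits, targets)
loss.backward()                       # gradients flow into the head
optimizer.step()                      # one fine-tuning update
```

In full fine-tuning, the encoder's own weights are usually updated alongside the head; this sketch shows only the classification layer for brevity.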

Cochrane plain language summaries are stand-alone documents that translate complex systematic reviews into language accessible to patients, caregivers, and policymakers without medical training. These summaries should be written at or below a sixth-grade reading level to ensure broad comprehension. The challenge researchers tackled was whether AI could automatically classify these summaries based on how conclusive their findings are, which matters because patients rely on clear conclusions to make informed health decisions.

Why Did the Specialized Models Underperform?

The results were surprising and somewhat disappointing for the specialized approach. SciBERT achieved a balanced accuracy of 56.6% and Longformer 60.9%, both falling well short of ChatGPT's 74.2%. When the researchers tested the fine-tuned models on newly published summaries they hadn't seen during training, performance dropped to near-chance levels, suggesting the models failed to generalize beyond their training data.
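Balanced accuracy is the average of per-class recall, so for a three-class task "chance level" sits around 33% rather than 50%. A small scikit-learn illustration (the library the authors report using for metrics), with toy labels in place of real data:

```python
from sklearn.metrics import balanced_accuracy_score

# Toy gold labels for the three certainty classes (illustrative data only).
y_true = ["conclusive"] * 10 + ["inconclusive"] * 10 + ["unclear"] * 10

# A degenerate model that predicts "conclusive" for everything scores at
# chance level, because it has perfect recall on one class and zero on two.
y_always = ["conclusive"] * 30
print(balanced_accuracy_score(y_true, y_always))  # ~0.333, chance for 3 classes

# A model that gets 8 of 10 right in every class scores ~0.8.
y_better = (["conclusive"] * 8 + ["inconclusive"] * 2
            + ["inconclusive"] * 8 + ["unclear"] * 2
            + ["unclear"] * 8 + ["conclusive"] * 2)
print(balanced_accuracy_score(y_true, y_better))  # ~0.8
```

On this scale, SciBERT's 56.6% and Longformer's 60.9% sit well above chance on the original test split, which makes the drop to near-chance on newly published summaries all the more telling.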

The researchers attributed this underperformance to semantic overlap and subtle linguistic differences in how medical conclusions are expressed. Medical summaries often contain nuanced language where the difference between "conclusive," "inconclusive," and "unclear" findings can be expressed in ways that are difficult for specialized models to distinguish. Additionally, the fine-tuned models struggled with the complexity of real-world medical language in ways that the broader, more general ChatGPT model did not.

How Do These Models Classify Medical Summaries?

The classification task involved three categories reflecting different levels of certainty in medical findings:

  • Conclusive: Statements indicating a large effect size with high certainty of evidence, such as "Intervention causes a large reduction in outcome."
  • Inconclusive: Statements suggesting mixed or moderate evidence where the effect is unclear or requires further research.
  • Unclear: Statements indicating very low certainty of evidence, such as "It is unclear if intervention has an effect on outcome."
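The study does not publish the exact wording used to elicit classifications from ChatGPT, but a general-purpose model is typically given the three categories in a plain-language instruction. A hypothetical prompt template for this three-way task, built as a plain string with no API call shown:

```python
# Hypothetical prompt template for three-way certainty classification with a
# general-purpose chat model; the wording is illustrative, not the study's
# actual prompt.
LABELS = ("conclusive", "inconclusive", "unclear")

def build_prompt(summary_text: str) -> str:
    return (
        "Classify the certainty of the conclusion in the following Cochrane "
        "plain language summary.\n"
        f"Answer with exactly one word: {', '.join(LABELS)}.\n\n"
        f"Summary:\n{summary_text}"
    )

prompt = build_prompt("It is unclear if intervention has an effect on outcome.")
print(prompt)
```

Constraining the answer to a single word from a fixed label set is a common way to make a chat model's free-text output easy to score against gold labels.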

The distinction matters because patients who receive conclusive health information rely less on healthcare professionals to make treatment decisions, potentially improving their autonomy and decision-making quality. However, previous research found that 50% to 80% of Cochrane summaries enabled readers to reach a relevant conclusion, while many had unclear or missing conclusions regarding intervention efficacy and safety.

Steps to Improve AI Classification of Medical Summaries

  • Leverage General-Purpose Models: Consider using established large language models like ChatGPT for classification tasks rather than assuming domain-specific fine-tuning will outperform them, as this study demonstrates.
  • Combine Human and AI Review: Use AI classification as a first pass to flag summaries for human verification, rather than relying entirely on automated systems for critical medical content.
  • Expand Training Data Quality: Ensure training datasets contain diverse examples of how medical conclusions are expressed across different research areas and writing styles.
  • Test on New Data: Always evaluate models on recently published summaries they haven't encountered during training to ensure they generalize to real-world applications.
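The last step, evaluating on summaries the model has never seen, can be sketched with a temporal split: train only on older documents and score only on newer ones. The sketch below uses scikit-learn with a simple TF-IDF + logistic regression classifier as a stand-in for the study's transformer models, and entirely made-up toy records:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

# Toy corpus of (year, summary sentence, certainty label); illustrative only.
records = [
    (2018, "Intervention causes a large reduction in outcome.", "conclusive"),
    (2018, "It is unclear if intervention has an effect on outcome.", "unclear"),
    (2019, "The evidence is mixed and more research is needed.", "inconclusive"),
    (2019, "Intervention probably reduces outcome to a large degree.", "conclusive"),
    (2020, "We are very uncertain whether intervention changes outcome.", "unclear"),
    (2023, "Intervention leads to a large reduction in outcome.", "conclusive"),
    (2023, "It is unclear whether intervention affects outcome.", "unclear"),
]

# Temporal split: fit only on pre-2021 summaries, evaluate on newer ones,
# so the test set mimics "newly published" documents.
train = [r for r in records if r[0] < 2021]
test = [r for r in records if r[0] >= 2021]

vec = TfidfVectorizer()
X_train = vec.fit_transform([text for _, text, _ in train])
X_test = vec.transform([text for _, text, _ in test])

clf = LogisticRegression(max_iter=1000).fit(X_train, [y for _, _, y in train])
preds = clf.predict(X_test)
print(balanced_accuracy_score([y for _, _, y in test], preds))
```

A random shuffle split would let near-duplicate phrasing leak from training into evaluation; splitting by publication date is what exposes the generalization failure the study observed.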

The research team developed their models in the Python programming language with the PyTorch framework, computing evaluation metrics through the scikit-learn machine learning library. They measured model performance using the area under the receiver operating characteristic curve (AUROC), a standard metric that balances sensitivity and specificity. For SciBERT, this metric reached 0.91 for conclusive statements, 0.67 for inconclusive, and 0.75 for unclear statements; Longformer achieved 0.86, 0.67, and 0.72, respectively.
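Per-class AUROC figures like these are computed one-vs-rest: each class in turn is treated as the positive label and scored against the model's predicted probability for that class. A small scikit-learn sketch with toy probabilities (not the study's data):

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

LABELS = ["conclusive", "inconclusive", "unclear"]

# Toy gold labels and predicted class probabilities (illustrative only);
# each row of y_prob sums to 1 across the three classes.
y_true = ["conclusive", "unclear", "inconclusive",
          "conclusive", "unclear", "inconclusive"]
y_prob = np.array([
    [0.8, 0.1, 0.1],
    [0.1, 0.2, 0.7],
    [0.2, 0.6, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.3, 0.5],
    [0.3, 0.5, 0.2],
])

# One-vs-rest: binarize the labels, then score each class column separately.
y_bin = label_binarize(y_true, classes=LABELS)
for i, label in enumerate(LABELS):
    print(label, roc_auc_score(y_bin[:, i], y_prob[:, i]))
```

An AUROC of 0.5 for a class means the model ranks that class's positives no better than chance, while 1.0 means every true member of the class receives a higher score than every non-member.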

What Do These Findings Mean for Medical AI Applications?

The study suggests that general-purpose large language models like ChatGPT may currently offer more reliable results for practical classification tasks in biomedical applications compared to specialized fine-tuned models. This finding challenges a common assumption in AI development: that building domain-specific models always produces better results than using general-purpose systems.

The implications extend beyond medical summaries. If general-purpose models outperform specialized ones in understanding nuanced medical language, this could reshape how healthcare organizations approach AI implementation. Rather than investing heavily in custom model development, organizations might achieve better results by leveraging existing general-purpose AI systems and focusing resources on data preparation and human oversight.

The researchers tested their models on a separate set of 213 plain language summaries, comparing predictions against manual human verification and ChatGPT outputs. This validation step was crucial because it revealed the real-world performance gap between the specialized models and the general-purpose alternative. The findings underscore the importance of testing AI systems on data they haven't encountered during development, as this reveals whether models truly understand the task or merely memorize patterns from training data.