GPT-4 and Claude 3.5 Outperform Newer AI Models in Medical Diagnosis: What This Means for Patient Care
Older AI models, when customized for specific medical tasks, are outperforming their newer counterparts, according to a new study published in Nature. Researchers evaluated how well large language models (LLMs), which are AI systems trained on vast amounts of text, could answer questions about thyroid eye disease (TED), a condition that often goes undiagnosed because patients lack awareness of its symptoms. The results challenge the assumption that the newest AI models are always the best for specialized work.
Which AI Models Performed Best at Medical Questions?
The research team tested a range of AI models on multiple-choice medical questions related to thyroid eye disease. GPT-4, OpenAI's flagship model, and Anthropic's Claude 3.5 emerged as the strongest performers, with baseline accuracies of 76.2% and 83.2%, respectively. The real gains, however, came when researchers customized these models specifically for thyroid disease and added Chain-of-Thought (CoT) prompting, a technique that encourages AI systems to reason through problems step by step.
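To make the technique concrete, here is a minimal sketch of Chain-of-Thought prompting using the OpenAI Python client. The model name, system prompt, and sample question are illustrative assumptions, not the study's exact materials.

```python
# Minimal Chain-of-Thought prompting sketch (illustrative; not the study's exact setup).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = (
    "A patient with Graves' disease reports eyelid retraction and double vision.\n"
    "Which referral is most appropriate?\n"
    "A) Dermatology  B) Ophthalmology  C) Nephrology  D) Orthopedics"
)

response = client.chat.completions.create(
    model="gpt-4",  # an older model, per the study's finding
    messages=[
        {"role": "system",
         "content": "You are a patient-education assistant for thyroid eye disease."},
        # The CoT instruction: ask the model to reason before committing to an answer.
        {"role": "user",
         "content": question + "\n\nReason through the symptoms step by step, "
                               "then give your final answer as a single letter."},
    ],
)
print(response.choices[0].message.content)
```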
The customized versions showed dramatic improvements. TED-Claude, the thyroid-specialized version of Claude 3.5, achieved 89.1% accuracy on multiple-choice questions, and TED-GPT, the customized version of GPT-4, reached 86.1%. Adding Chain-of-Thought reasoning produced similarly strong results: CoT-Claude scored 87.1% and CoT-GPT 86.1%. These customized older models significantly outperformed newer AI systems, including OpenAI's o1 and o3, Gemini 2.0 Flash, Gemini 2.5 Pro, and Claude 3.7.
How Can Doctors Use These Findings to Improve Patient Education?
The study demonstrates that clinicians and medical institutions don't need to wait for the latest AI model to deploy effective AI-powered patient education tools. Instead, they can take practical steps to adapt existing, proven models for their specific needs:
- Customization Modules: Medical teams can use simple customization features built into platforms like OpenAI's GPT builder and Anthropic's Claude Projects to train models on disease-specific information and medical guidelines without requiring advanced technical expertise.
- Chain-of-Thought Prompting: Doctors can instruct AI systems to explain their reasoning step-by-step when answering patient questions, which improves both accuracy and patient understanding of complex medical concepts.
- Multi-Question Evaluation: Healthcare providers should test AI models not just on multiple-choice questions but also on short-answer and case-based scenarios, using frameworks like QUEST, which evaluates the quality, reasoning, expression, safety, and trustworthiness of responses (a minimal harness sketch follows this list).
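As a rough illustration of that last point, the sketch below collects model responses across the three question formats and leaves a blank QUEST-style score sheet for human raters. The `ask_model` helper and the sample questions are hypothetical placeholders, not part of the study.

```python
# Sketch of a multi-format evaluation harness. ask_model and the sample questions
# are placeholders; QUEST scoring is done by human raters, so scores start empty.
from dataclasses import dataclass, field

QUEST_DIMENSIONS = ["quality", "reasoning", "expression", "safety", "trustworthiness"]

@dataclass
class EvalItem:
    fmt: str        # "multiple-choice", "short-answer", or "case-based"
    question: str
    response: str = ""
    scores: dict = field(default_factory=lambda: {d: None for d in QUEST_DIMENSIONS})

def ask_model(question: str) -> str:
    # Placeholder: call your customized model here (e.g., a TED-specific GPT).
    return "(model response goes here)"

items = [
    EvalItem("multiple-choice", "Which eye finding is most characteristic of TED? ..."),
    EvalItem("short-answer", "Explain why thyroid eye disease often goes undiagnosed."),
    EvalItem("case-based", "A 45-year-old with Graves' disease develops double vision. ..."),
]

for item in items:
    item.response = ask_model(item.question)
    # Human raters then fill in item.scores for each QUEST dimension.
```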
When evaluated on more complex short-answer and case-based questions, TED-Claude and TED-GPT continued to outperform their original versions. TED-Claude showed the best overall performance across accuracy, readability, comprehensiveness, likelihood of harm, and reasoning ability.
The implications extend beyond thyroid disease. The research suggests that any medical specialty, from cardiology to oncology, could follow this same approach to create AI tools tailored to their patient education needs. As the researchers noted, these methods are "simple and universal," meaning they can be applied across different medical domains without requiring specialized AI development teams.
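One way to picture that universality: a single helper that assembles a disease-specific system prompt for any specialty. The template wording and parameters below are illustrative assumptions, not the researchers' materials.

```python
# Illustrative sketch: one customization recipe reused across specialties.
# The template wording is an assumption, not taken from the study.

def build_specialty_prompt(specialty: str, condition: str, guidelines: list[str]) -> str:
    """Compose a disease-specific system prompt from reusable pieces."""
    guideline_text = "\n".join(f"- {g}" for g in guidelines)
    return (
        f"You are a patient-education assistant specializing in {specialty}.\n"
        f"Focus on {condition}. Ground every answer in these guidelines:\n"
        f"{guideline_text}\n"
        "Reason step by step before giving your final answer."  # Chain-of-Thought
    )

# The same function serves ophthalmology today, cardiology or oncology tomorrow:
print(build_specialty_prompt(
    "ophthalmology",
    "thyroid eye disease",
    ["Screen Graves' disease patients for eye symptoms.",
     "Refer suspected TED cases promptly to an eye specialist."],
))
```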
This finding arrives at a critical moment for AI in healthcare. While newer, more powerful models like GPT-5 are in development, hospitals and clinics often lack the resources or expertise to implement cutting-edge AI systems. The study shows that proven, accessible models like GPT-4 and Claude 3.5, when properly customized, can deliver medical-grade performance for patient-facing applications. The key is not always having the newest technology, but rather knowing how to adapt existing tools effectively for your specific use case.