OpenAI's o1 and o3 Models Show Promise in Medical Diagnosis, But Customization Still Wins

OpenAI's latest reasoning models, o1 and o3, failed to outperform customized versions of older models such as GPT-4 and Claude 3.5 when answering medical questions about thyroid eye disease. Researchers at a major medical institution evaluated how well different large language models (LLMs), AI systems trained on vast amounts of text, could answer questions about this eye condition, which affects thousands of patients annually.

Why Did Newer Models Underperform in Medical Tasks?

The study tested multiple AI models on their ability to answer questions about thyroid eye disease, a condition that often goes undiagnosed because patients and doctors lack awareness of it. Researchers evaluated GPT-4 and Claude 3.5 alongside newer models with native chain-of-thought (CoT) capabilities, a reasoning technique in which the AI works through a problem step by step. The newer models tested were OpenAI o1, OpenAI o3, Gemini-2.0-Flash, Gemini-2.5-Pro, and Claude 3.7.

On multiple-choice questions about thyroid eye disease, the results were striking. GPT-4 achieved 76.2% accuracy, while Claude 3.5 scored 83.2%. When researchers added chain-of-thought prompting and customized these older models specifically for thyroid eye disease, accuracy improved markedly: CoT-GPT reached 86.1%, CoT-Claude 87.1%, TED-GPT (a version customized for thyroid eye disease) 86.1%, and TED-Claude 89.1%. Notably, every one of these customized older models outperformed the newer reasoning models with native CoT capabilities.
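The size of those gains is easy to verify from the reported figures. A quick sketch of the arithmetic, using only the accuracies stated above (the dictionary layout is just for illustration):

```python
# Reported multiple-choice accuracies (%) from the study.
base = {"GPT-4": 76.2, "Claude 3.5": 83.2}
customized = {
    "CoT-GPT": ("GPT-4", 86.1),
    "CoT-Claude": ("Claude 3.5", 87.1),
    "TED-GPT": ("GPT-4", 86.1),
    "TED-Claude": ("Claude 3.5", 89.1),
}

# Improvement in percentage points over each model's own base version.
gains = {name: round(acc - base[parent], 1)
         for name, (parent, acc) in customized.items()}
print(gains)
# → {'CoT-GPT': 9.9, 'CoT-Claude': 3.9, 'TED-GPT': 9.9, 'TED-Claude': 5.9}
```

Customization lifted GPT-4 by nearly ten percentage points and Claude 3.5 by up to about six.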

How Can Clinicians Build Better AI Tools for Their Specialty?

The research demonstrates that doctors and medical institutions don't need to wait for the latest AI models to improve patient care. Instead, they can use practical customization techniques to enhance existing models for their specific medical domains.

  • Chain-of-Thought Prompting: This technique instructs AI models to explain their reasoning step-by-step before arriving at an answer, similar to how a doctor might walk through a diagnosis. This simple method improved accuracy across all tested models without requiring expensive retraining.
  • Domain-Specific Customization: Creating specialized versions of general-purpose models by training them on disease-specific information and medical literature relevant to a particular condition. TED-Claude, customized for thyroid eye disease, demonstrated the best overall performance in accuracy, readability, comprehensiveness, and reasoning quality.
  • Combination Approaches: Using both chain-of-thought prompting and customization together produced the strongest results, suggesting that these techniques complement each other when applied to medical question-answering tasks.
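A minimal sketch of how the first two techniques might be combined in practice. The prompt wording, the `TED_CONTEXT` snippet, and the `build_prompt` helper are illustrative assumptions, not the study's actual prompts or customization material:

```python
# Hypothetical domain context a clinician might supply for thyroid eye
# disease (TED); the study's actual customization content is not shown here.
TED_CONTEXT = (
    "Thyroid eye disease (TED) is an autoimmune condition causing "
    "inflammation of the orbital tissues, often associated with "
    "Graves' disease."
)

# Chain-of-thought instruction: ask the model to reason step by step
# before committing to an answer.
COT_INSTRUCTION = (
    "Think through the question step by step, explaining your clinical "
    "reasoning, before giving a final answer."
)

def build_prompt(question: str, choices: list[str]) -> str:
    """Combine domain customization and CoT prompting into one prompt."""
    options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    return f"{TED_CONTEXT}\n\n{COT_INSTRUCTION}\n\nQuestion: {question}\n{options}"

prompt = build_prompt(
    "Which antibody is most commonly associated with TED?",
    ["TSH receptor antibody", "Anti-dsDNA", "Rheumatoid factor"],
)
print(prompt)
```

The resulting string would then be sent to whichever model is being evaluated; the point is that both techniques are plain prompt engineering, requiring no retraining.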

For case-based and short-answer questions, the pattern held: TED-Claude and TED-GPT consistently outperformed their original, non-customized versions across multiple evaluation metrics. The researchers assessed responses with a framework called QUEST, which covers quality of information, understanding and reasoning, expression, safety and harm, and trust.
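The QUEST dimensions can be thought of as a per-response scorecard. A minimal sketch, where the 1–5 scale and the equal-weight average are assumptions and may differ from the study's actual scoring protocol:

```python
from dataclasses import dataclass, fields

@dataclass
class QuestScore:
    """One rating per QUEST dimension, on an assumed 1-5 scale."""
    quality: int                  # Quality of information
    understanding_reasoning: int  # Understanding and reasoning
    expression: int               # Expression style
    safety_harm: int              # Safety and harm
    trust: int                    # Trust and confidence

    def mean(self) -> float:
        """Equal-weight average across the five dimensions (an assumption)."""
        vals = [getattr(self, f.name) for f in fields(self)]
        return sum(vals) / len(vals)

score = QuestScore(quality=5, understanding_reasoning=4,
                   expression=4, safety_harm=5, trust=4)
print(score.mean())  # → 4.4
```

A scorecard like this makes it straightforward to compare models dimension by dimension rather than on a single aggregate number.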

This finding has significant implications for healthcare institutions. Rather than assuming that the newest AI models will automatically perform better, clinicians can leverage customization modules and reasoning techniques to adapt existing models to their specific needs. The study suggests that these universal methods are accessible to medical professionals without requiring deep technical expertise or massive computational resources.

The research also highlights an important gap in how advanced reasoning models are being developed. While OpenAI's o1 and o3 models represent significant engineering achievements in general reasoning capability, those gains may not translate automatically to superior performance in specialized domains like medicine. This suggests that the path to trustworthy medical AI may require a hybrid approach, combining newer reasoning capabilities with domain-specific customization rather than relying solely on frontier models.

As healthcare systems increasingly explore AI for patient education and clinical decision support, this research provides a practical roadmap. The findings indicate that clinicians can construct LLMs suitable for specific medical domains using simple, well-established techniques, potentially improving patient outcomes by ensuring AI systems provide accurate, comprehensive, and trustworthy medical information.