A new paradigm challenges the dominant approach of building ever-larger single AI models for diagnosis. Instead of collapsing multiple perspectives into one consensus answer, researchers at Karolinska Institutet propose MEDLEY (Medical Ensemble Diagnostic system with Leveraged diversitY), a framework that deliberately preserves disagreements between AI models and treats diversity of outputs as a clinical resource rather than a problem to solve.

Why Is One AI Answer Actually Dangerous?

When a single AI system presents a confident diagnosis with a polished explanation, clinicians are more likely to accept it without questioning, a phenomenon called automation bias. "A single model that presents a confident answer with a polished explanation can actually undermine human judgment by encouraging automation bias, the tendency to over-trust automated systems," explains Farhad Abtahi, researcher and manager of the SMAILE (Stockholm Medical AI and Learning Environments) core facility at Karolinska Institutet.

The problem runs deeper than overconfidence. Large language models like ChatGPT and Claude are remarkably skilled at generating explanations that sound clinically plausible and authoritative, even when they're wrong. A recent study from New York Institute of Technology tested five advanced multimodal large language models (GPT-5, Gemini 3 Pro, Llama 4 Maverick, Grok 4, and Claude Opus 4.5 Extended) on the same CT brain scan showing a clear stroke. The results revealed a 20% rate of fundamental diagnostic error across the models. One model even misclassified an ischemic stroke as a hemorrhage on the opposite side of the brain, an error that could lead to completely wrong treatment in a real clinical setting, since these two stroke types require different therapies.

What's particularly concerning is that even when models reached the correct diagnosis, their explanations differed greatly. Some disagreed on when the stroke occurred, others on alternative diagnoses and which brain regions were affected. When researchers asked the models to grade each other's explanations, additional inconsistencies emerged. One model systematically penalized others' responses because it interpreted the findings as chronic brain abnormalities rather than an acute stroke.

How Does MEDLEY Keep Doctors in Control?

Rather than relying on any single model's ability to explain itself, MEDLEY seeks reliability through the structured interplay of convergent and divergent perspectives across multiple models. The framework addresses a critical design challenge: presenting too much information at once can overwhelm clinicians and impair decision-making.

MEDLEY solves this through progressive disclosure, a tiered approach to information presentation. The default clinical view shows only the consensus finding with a summary uncertainty indicator. Alternative and minority diagnoses are available, but only when the clinician chooses to expand them, typically for complex or ambiguous cases. The system also uses threshold-based activation: for routine high-consensus cases, MEDLEY presents streamlined output, reserving the full ensemble plurality for cases where disagreement genuinely adds diagnostic value. Visual encodings like confidence bands and divergence indicators convey ensemble-level patterns without requiring the clinician to process each model's output individually.
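To make the tiered logic concrete, here is a minimal Python sketch of how threshold-based activation with progressive disclosure might work. It is an illustration only: the `ModelOutput` fields, the 0.80 consensus threshold, the majority-vote notion of consensus, and the view structure are assumptions, not details taken from the MEDLEY demonstrator.

```python
from collections import Counter
from dataclasses import dataclass

# Assumed threshold: cases at or above this consensus rate get the
# streamlined view; below it, the full ensemble plurality is offered.
CONSENSUS_THRESHOLD = 0.80

@dataclass
class ModelOutput:
    diagnosis: str
    model_name: str
    provenance: str  # e.g., training region/institution, for transparency

def build_clinical_view(outputs: list[ModelOutput]) -> dict:
    """Aggregate ensemble outputs without discarding minority diagnoses."""
    counts = Counter(o.diagnosis for o in outputs)
    top_diagnosis, top_votes = counts.most_common(1)[0]
    consensus_rate = top_votes / len(outputs)

    # Default view: consensus finding plus a summary uncertainty indicator.
    view = {
        "consensus": top_diagnosis,
        "consensus_rate": round(consensus_rate, 2),
    }

    # Threshold-based activation: surface minority views, each with its
    # provenance, only when disagreement may add diagnostic value.
    if consensus_rate < CONSENSUS_THRESHOLD:
        view["minority_views"] = [
            {"diagnosis": o.diagnosis, "model": o.model_name,
             "provenance": o.provenance}
            for o in outputs if o.diagnosis != top_diagnosis
        ]
    return view

# Hypothetical ensemble of three models with different training origins.
outputs = [
    ModelOutput("viral meningitis", "model-a", "trained on EU hospital data"),
    ModelOutput("viral meningitis", "model-b", "trained on US clinical data"),
    ModelOutput("familial Mediterranean fever", "model-c",
                "trained on Eastern Mediterranean data"),
]
print(build_clinical_view(outputs))
```

In a real deployment, expanding the minority views would be clinician-initiated, in line with progressive disclosure; the threshold only decides whether the expansion is offered at all.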
When Does AI Disagreement Actually Matter Clinically?

In a proof-of-concept demonstrator using over 30 large language models with diverse geographic, architectural, and temporal origins, researchers found that consensus rates varied widely across synthetic cases: from around 48% for complex conditions to over 90% for well-established diagnoses. Cases with lower consensus were enriched for rare or region-specific conditions, precisely where multiple perspectives add the most value.

In one striking example, a single model trained on data from the Eastern Mediterranean region flagged a genetic condition that all other models missed. That minority output was presented with transparent provenance, allowing the clinician to decide whether to investigate further. This demonstrates that the approach is technically feasible and that meaningful patterns of agreement and disagreement emerge in realistic diagnostic scenarios.

The framework also applies beyond differential diagnosis. In medical imaging, MEDLEY could make visible where different segmentation models disagree on tumor boundaries. In radiation therapy planning, those disagreements are not noise; they can be vital for avoiding radiation exposure to sensitive structures. Traditional ensemble methods aggregate these outputs into a single boundary and hide the discrepancies. MEDLEY preserves them for the oncologist to evaluate.
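The same principle can be rendered as an image. The sketch below, assuming each segmentation model emits a binary tumor mask on a shared voxel grid, shows one way a per-voxel disagreement map could be computed with NumPy; the function name, toy masks, and review threshold are illustrative rather than part of the published framework.

```python
import numpy as np

def disagreement_map(masks: list[np.ndarray]) -> np.ndarray:
    """Per-voxel fraction of models that dissent from the majority vote.

    masks: binary arrays of identical shape, one per segmentation model.
    Returns values in [0, 0.5]: 0 means unanimity, values near 0.5 mean
    the ensemble is split roughly in half at that voxel.
    """
    stack = np.stack(masks).astype(float)   # shape: (n_models, H, W)
    vote = stack.mean(axis=0)               # fraction voting "tumor"
    return np.minimum(vote, 1.0 - vote)     # distance from unanimity

# Toy example: three 2D masks that agree on the core but not the edges.
a = np.zeros((5, 5), dtype=int); a[1:4, 1:4] = 1
b = np.zeros((5, 5), dtype=int); b[1:4, 1:5] = 1   # extends one column right
c = np.zeros((5, 5), dtype=int); c[2:4, 1:4] = 1   # misses the top row

dmap = disagreement_map([a, b, c])
# Voxels where roughly a third or more of the models dissent are flagged
# for review instead of being silently averaged into a single contour.
flagged = np.argwhere(dmap >= 0.3)
print(flagged)
```

Rather than fusing the masks into one boundary, a planning interface could overlay this map so that contested voxels near sensitive structures receive explicit human review.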
How to Implement AI Disagreement in Clinical Practice

- Progressive Disclosure Design: Present consensus findings by default, but allow clinicians to expand alternative diagnoses and minority views only when needed for complex cases, reducing cognitive overload while preserving access to diverse perspectives.
- Transparent Model Provenance: Document each AI model's training background, geographic origin, and known biases so clinicians understand why disagreements exist and can weigh them in clinical context.
- Visual Divergence Indicators: Use confidence bands and divergence visualizations to convey ensemble-level patterns at a glance, rather than forcing clinicians to process individual model outputs.
- Threshold-Based Activation: Reserve full ensemble plurality for cases where disagreement genuinely adds diagnostic value, while streamlining output for routine high-consensus cases.

Why Bias in AI Models Isn't Always Bad

Conventionally, bias in artificial intelligence is treated as a defect. MEDLEY takes a different view: bias reflects the data a model was trained on: which populations, which institutions, which clinical practices. Rather than treating this as purely negative, MEDLEY documents bias as a form of specialization. A model trained predominantly on data from East Asian populations may recognize certain conditions better than one trained in Northern Europe, and vice versa. The key is making these differences transparent so clinicians can weigh them in context.

This does not mean all bias is acceptable. The framework draws clear ethical boundaries: bias that reinforces stereotypes, encodes discriminatory proxies, or substitutes statistical correlation for clinical causation is never acceptable. But a biased model contributing to a transparency-preserving ensemble is ethically distinct from the same model deployed as a standalone decision-maker.

What's the Difference Between Task-Specific AI and General Language Models?

Most successful medical AI tools are task-specific algorithms trained on large datasets of labeled medical images and validated for very specific diagnostic tasks. Large language models, by contrast, were never designed with diagnosis in mind. "Our research highlights a critical distinction in the AI landscape," says Milan Toma, Ph.D., Associate Professor at the New York Institute of Technology College of Osteopathic Medicine. "However, large language models are not optimized for diagnostics; they are built for linguistics and conversation. Accordingly, they generate explanations that sound authoritative, even when their underlying interpretation is wrong or inconsistent."

The future of healthcare AI will likely combine both specialized diagnostic systems and language models. While large language models may be useful for clinical documentation, summarizing reports, or communicating with patients, oversight from a medical expert remains non-negotiable for all diagnostic interpretations.

The paradigm shift proposed by MEDLEY reflects a fundamental truth: the goal of medical AI should not be to replace human judgment, but to enhance it. By preserving disagreement, documenting bias, and keeping clinicians actively engaged in reasoning, this framework treats artificial intelligence as a tool for structured consultation rather than an oracle delivering final answers.