Stanford's 2026 AI Index Reveals a Troubling Gap: Models Excel at Expert Questions But Fail at Reading Clocks

Artificial intelligence models are advancing at breathtaking speed on complex benchmarks, yet they're stumbling on tasks most humans find trivial. According to Stanford University's 2026 AI Index report, the latest frontier models now correctly answer roughly 50 percent of questions from "Humanity's Last Exam," a benchmark featuring the toughest problems from subject-matter experts across multiple fields. That's a stunning jump from just 8.8 percent accuracy a year ago. But the same models that ace expert-level reasoning often fail at something far simpler: reading an analog clock.

Why Are AI Models Brilliant at Some Tasks But Terrible at Others?

The clock-reading paradox points to a deeper issue in how modern AI systems process information. OpenAI's GPT-5.4 achieved the best performance on ClockBench, a benchmark that measures how well multimodal large language models (LLMs) read analog clocks, yet it still got the answer right only about 50 percent of the time. Anthropic's Claude Opus 4.6, which scored at the top of Humanity's Last Exam, managed just 8.9 percent accuracy on clock reading.

This disconnect isn't random. Researchers have identified a surprising pattern: when AI systems are asked questions combining language with images, audio, or other non-text information, the language component tends to dominate the decision-making process, sometimes to the point of completely ignoring the visual or audio information.

"There is a research thread that shows that when systems are asked questions about combinations of language with other modalities, for example images or audio as in tone of voice, the language component carries a surprisingly large part of the burden, even to the extent of ignoring non-language information completely," noted Ray Perrault, co-director of the AI Index steering committee at Stanford University.

The implications are significant. While LLMs will rarely be asked to read clocks in real-world applications, this weakness suggests that AI systems may struggle with any task requiring genuine integration of visual and textual information. In fields like medical imaging, legal document review, or scientific research, this limitation could prove costly.

How Should We Understand AI's Current Capabilities and Limitations?

  • Benchmark Performance vs. Real-World Use: High scores on academic benchmarks don't always translate to practical effectiveness. A model scoring 75 percent on a legal reasoning benchmark tells us little about how well it would actually function in a law practice's daily operations.
  • Multimodal Weakness: AI systems struggle when required to process and integrate information from multiple sources simultaneously, such as combining text with images or audio cues, even when each modality is processed well individually.
  • Task-Specific Variability: The same model can excel at reasoning about complex expert-level questions while failing at basic visual recognition tasks, suggesting AI capabilities are highly specialized rather than broadly intelligent.

What's Driving the Rapid Progress on Complex Benchmarks?

The dramatic improvements in expert-level reasoning reflect broader trends in AI development. Agentic AI, which refers to AI systems that can autonomously plan and execute tasks, has experienced the most extreme gains. Models are rapidly improving on benchmarks like OSWorld, which measures autonomous computer use, and SWE-Bench Verified, which evaluates autonomous coding capabilities.

The investment fueling this progress is staggering. In 2025, the AI industry attracted over $581 billion in investment, more than double the $253 billion spent in 2024. The United States alone received over $344 billion of that total, cementing its lead in AI development. This capital influx has enabled companies like OpenAI, Anthropic, and Google to train increasingly powerful models on massive datasets.

Compute capacity has also exploded. According to data from Epoch AI, the world's total AI compute capacity has increased more than threefold every year since 2022, growing 30-fold since 2021. Nvidia's graphics processing units (GPUs) account for over 60 percent of all AI compute capacity globally, making the company the primary beneficiary of the AI build-out.
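The compounding behind these figures can be checked with simple arithmetic. The sketch below uses only the growth rate and time window quoted above to show how a rate slightly above threefold per year produces a roughly 30-fold cumulative increase:

```python
# Illustrative arithmetic only: compound the annual compute-growth rate
# quoted above over the window the report describes.
annual_growth = 3.0   # "more than threefold every year"
years = 2025 - 2022   # three years of compounding since 2022

cumulative = annual_growth ** years
print(f"{annual_growth:.0f}x per year over {years} years -> {cumulative:.0f}x cumulative")
# prints: 3x per year over 3 years -> 27x cumulative
```

A rate modestly above 3x per year over a slightly longer window is consistent with the roughly 30-fold increase since 2021 cited above.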

Industry dominance in AI development has become overwhelming. In 2025, industry released 87 notable AI models, compared with just seven from academic and government institutions combined. That is a dramatic shift from 2015, when industry accounted for just under 50 percent of notable model releases.

What Are the Hidden Costs of Training Frontier AI Models?

The rapid advancement comes with environmental consequences that are growing harder to ignore. Training the latest frontier large language models generates enormous carbon emissions. xAI's Grok 4, one of the newest frontier models, is estimated to have generated over 72,000 tons of carbon-dioxide-equivalent emissions during training. That's a dramatic increase from OpenAI's GPT-4, estimated at 5,184 tons, and Meta's Llama 3.1 405B, estimated at 8,930 tons.

However, these figures come with significant uncertainty. Ray Perrault cautioned that the Grok 4 estimate relies heavily on inferred inputs from public reporting and non-verifiable sources. Epoch AI independently estimates Grok 4's emissions at approximately 140,000 tons of carbon dioxide, nearly double the Stanford estimate.

Emissions from running trained models, known as inference, also continue to increase. The efficiency gap between models is striking: DeepSeek's V3 models consume around 23 watt-hours when responding to a medium-length prompt, while Anthropic's Claude 4 Opus consumes about 5 watt-hours for the same task. Across the full range of models measured, the least efficient consume over 10 times as much energy as the most efficient.
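The gap between the two named models can be quantified directly from the figures quoted above (illustrative arithmetic only, with per-prompt energy values as reported in the text):

```python
# Compare the per-prompt energy figures quoted above (units as reported).
deepseek_v3 = 23.0    # DeepSeek V3, medium-length prompt
claude_4_opus = 5.0   # Anthropic Claude 4 Opus, same task

ratio = deepseek_v3 / claude_4_opus
print(f"DeepSeek V3 uses about {ratio:.1f}x the energy of Claude 4 Opus per prompt")
# prints: DeepSeek V3 uses about 4.6x the energy of Claude 4 Opus per prompt
```

Note that these two models are single points within a wider spread; the more-than-tenfold figure refers to the extremes across all models measured.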

Where Is AI Research Actually Happening Now?

The shift toward industry-led AI development has reshaped the research landscape. The United States released 50 "notable" AI models in 2025, maintaining its lead, but China's output is beginning to close the gap. In robotics deployment, meanwhile, China has already established a commanding lead: in 2024 it installed 295,000 industrial robots, compared with roughly 44,500 in Japan and 34,200 in the United States.

Beyond corporate labs, grassroots enthusiasm for AI is surging on GitHub, the platform where developers share code. The number of AI-related projects has rocketed to 5.58 million through 2025, representing a roughly fivefold increase since 2020 and a 23.7 percent increase from 2024 alone.

AI adoption is accelerating fastest in medicine and drug discovery. The number of publications on AI use for drug discovery has more than doubled over the past two years. Publications on multimodal biomedical AI, which examines medical images alongside text, have increased 2.7 times over the same period. These trends suggest that despite concerns about AI's limitations, researchers and practitioners are finding genuine value in deploying these systems for specialized tasks.

The 2026 AI Index paints a picture of a technology advancing rapidly in some dimensions while remaining surprisingly brittle in others. Models can now reason through expert-level problems with impressive accuracy, yet they struggle with visual tasks that children master effortlessly. As AI systems move from research labs into hospitals, law offices, and engineering firms, understanding these gaps becomes increasingly critical. The challenge ahead isn't just building smarter models, but building systems that truly integrate different types of information and perform reliably in the real world.