A new study shows that bigger AI models like Claude aren't automatically better at fixing software problems, and they still make up package versions that don't exist about 6% of the time. Sonatype, a software supply chain security company, tested leading models from Anthropic, Google, and OpenAI on a critical but unglamorous task: recommending safe upgrades for open source software dependencies. The findings challenge a common assumption in AI development: that larger, newer models always produce better results.

What Did the Study Actually Test?

Sonatype evaluated roughly 37,000 unique package-version pairs and around 258,000 recommendations across seven frontier models. The test included Claude Sonnet 3.7 and 4.5, Claude Opus 4.6, Google's Gemini 2.5 Pro and 3 Pro, and OpenAI's GPT-5 and GPT-5.2. Each model received identical prompts asking it to recommend safe upgrade paths for software packages from Maven Central, npm, PyPI, and NuGet, the main repositories where developers download open source code.

The researchers then validated every recommendation by checking it against Sonatype's package registry. If a model suggested a version that didn't actually exist, it was classified as a hallucination. If it recommended keeping the current version unchanged, that counted as inaction rather than a helpful suggestion.

Why Does This Matter More Than You'd Think?

Software dependency management might sound like a niche technical problem, but it affects every company that uses open source code, which is essentially all of them. When developers need to update a library or framework to patch a security vulnerability, they rely on tools and AI to suggest safe upgrade paths. If those suggestions are wrong, developers waste time checking the recommendations, discarding invalid advice, and verifying whether suggested versions actually exist. This cleanup work increases token usage costs and creates friction in the development process.
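The study's validation step — checking each suggested version against what actually exists in a package registry — amounts to a simple classification rule. A minimal sketch in Python, where the function name and the sample registry snapshot are illustrative rather than taken from Sonatype's tooling:

```python
# Sketch of the validation logic described in the study: a recommended
# version is checked against the set of versions that really exist in a
# package registry. classify_recommendation and the sample data below are
# hypothetical names, not Sonatype's actual implementation.

def classify_recommendation(current: str, suggested: str, known_versions: set[str]) -> str:
    """Label a model's upgrade suggestion for one package."""
    if suggested not in known_versions:
        return "hallucination"   # version does not exist in the registry
    if suggested == current:
        return "inaction"        # model recommended no change
    return "candidate_upgrade"   # exists and differs; still needs a security check

# Hypothetical registry snapshot for one package
versions = {"1.0.0", "1.1.0", "2.0.0"}

print(classify_recommendation("1.0.0", "2.0.1", versions))  # hallucination
print(classify_recommendation("1.0.0", "1.0.0", versions))  # inaction
print(classify_recommendation("1.0.0", "2.0.0", versions))  # candidate_upgrade
```

The point of the third label is that existence alone isn't enough: a real version can still carry known vulnerabilities, which is the risk the study's "inaction" finding highlights.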
The stakes are higher than inconvenience. Leaving vulnerable code in place because an AI model was too cautious can expose companies to real security risks. Conversely, following a hallucinated recommendation can break a software build or introduce unexpected dependencies.

What Did the Research Actually Find?

The central finding was striking: larger or newer models did not consistently produce the safest results when working without live software context. Even the strongest ungrounded systems still fabricated roughly one in 16 dependency recommendations. That means a model could confidently suggest a package version that doesn't exist in any repository.

The study also found that newer models had become more cautious. Rather than risk hallucinating a bad recommendation, they increasingly suggested no change to an open source component. While this reduced hallucinations, it often left significant risk in place. Even the most cautious models still carried roughly 800 to 900 Critical and High severity vulnerabilities in their recommendations.

How Much Better Is Real-Time Data Than Just Using a Bigger Model?

Sonatype compared standalone frontier models with what it called a hybrid approach: using real-time software intelligence to select the most secure available upgrade path. The difference was dramatic. Across Maven Central, npm, and PyPI, the hybrid approach delivered mean security score improvements of 269% to 309%, compared with 24% to 68% for the best-performing large language model in each ecosystem.

Cost mattered too. In one test, a smaller grounded model produced 19 fewer Critical and 38 fewer High severity vulnerabilities than Claude Opus 4.6, while running at up to 71 times lower per-token inference cost. That's a significant advantage for enterprises managing thousands of dependencies.
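The hybrid idea — letting live registry and vulnerability data, not the model's memory, pick the upgrade — can be sketched in a few lines. Everything here is an illustrative assumption: the function names, the naive dotted-version parser, and the per-version vulnerability counts stand in for whatever real feed a tool like Sonatype's would consume.

```python
# Sketch of grounded upgrade selection: choose from versions known to exist,
# ranked by their count of known Critical/High vulnerabilities. All names and
# data structures are hypothetical, not a vendor API.

def version_key(v: str) -> tuple[int, ...]:
    # Naive numeric ordering for dotted versions; real tools use proper
    # parsers (e.g. packaging.version for Python, semver for npm).
    return tuple(int(part) for part in v.split("."))

def safest_upgrade(current: str, vulns_by_version: dict[str, int]) -> str:
    """Return the version at or above `current` with the fewest known
    Critical/High vulnerabilities, preferring newer releases on ties."""
    candidates = [v for v in vulns_by_version if version_key(v) >= version_key(current)]
    # Sort key: fewest vulnerabilities first, then newest version first.
    return min(candidates, key=lambda v: (vulns_by_version[v], [-p for p in version_key(v)]))

# Hypothetical vulnerability counts per available version of one package
vulns = {"1.0.0": 4, "1.2.0": 1, "2.0.0": 0, "2.1.0": 0}
print(safest_upgrade("1.0.0", vulns))  # 2.1.0
```

Because the candidate set comes from data rather than generation, a hallucinated version simply cannot be returned — which is the structural advantage the study attributes to grounding.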
Steps to Improve AI-Assisted Dependency Management

- Ground AI in Real-Time Data: Connect AI models to live package registries, current vulnerability databases, and your organization's specific policies. This prevents hallucinations based on outdated or incomplete information.
- Validate Every Recommendation: Implement automated checks that verify suggested package versions actually exist in your repositories before developers apply them.
- Combine Model Reasoning with Data Context: Use smaller, specialized models paired with real-time software intelligence rather than relying solely on frontier models without context.
- Monitor for Caution Bias: Track whether your AI system recommends "no change" too often, which can leave known vulnerabilities unpatched.

What Do Experts Say About This Problem?

"Larger models may be improving at reasoning, but dependency management is not a reasoning problem alone; it is a data problem. If a model does not know your actual environment, current vulnerability data, and the policies you operate under, it is just making educated guesses," said Brian Fox, co-founder and CTO of Sonatype.

Fox's point cuts to the heart of why this study matters beyond software security. Many organizations are adopting AI to automate routine technical tasks, but they often treat AI models as general-purpose problem solvers. The Sonatype research suggests that approach fails when the task depends on fast-changing external data, like software package registries and vulnerability databases that update constantly.

What About Claude's New "Auto Mode"?

Separately, Anthropic has introduced an "auto mode" for Claude Code that allows the AI to make permission-related decisions independently, reducing the need for constant human approval during coding tasks. The auto mode uses an AI-powered classifier to assess risk levels before executing commands.
It blocks potentially harmful actions like mass file deletion, malicious code execution, and sensitive data leaks, while automatically processing safe actions. If Claude Code encounters a risky task, it halts and seeks user permission. The feature is currently available as a research preview for Claude Team users, with rollout planned for API and Enterprise users. It supports Claude Opus 4.6 and Sonnet 4.6. Anthropic has advised users to enable auto mode only in sandboxed or isolated environments to minimize risk.

The Bigger Picture: Why Model Size Isn't Everything

The Sonatype study adds to a growing debate over whether general-purpose AI models can be trusted to make low-level software maintenance decisions without access to live operational data. For businesses using AI to support developers, the research suggests reliability may depend less on choosing the biggest model and more on tying systems to current software supply chain information.

This finding has implications beyond dependency management. Any AI application that depends on frequently updated external information, whether it's medical guidelines, regulatory requirements, or market data, may face similar challenges. The lesson is clear: in the real world, current data often beats raw model size.