AI Models Can't Hide Their Thinking, Even When They Try: Why That's Actually Good News for Safety

Current AI reasoning models are remarkably bad at concealing how they think, and researchers say that's exactly what we want. A new study from OpenAI found that when asked to control their chain-of-thought reasoning, frontier AI models failed to comply with instructions more than 84% of the time, even when they understood they were being monitored.

What Is Chain-of-Thought Reasoning and Why Does It Matter?

Chain-of-thought reasoning refers to the step-by-step explanations that AI models generate while solving problems. Instead of jumping straight to an answer, these systems show their work, much like a student writing out math steps on a test. This transparency helps researchers understand how models arrive at conclusions and identify potential safety issues.

The reasoning traces serve as a safety monitoring tool. When an AI agent is working on a task, its intermediate reasoning steps can reveal whether the system is behaving as intended or if something has gone wrong. This is especially important as AI systems take on more complex and autonomous tasks in the real world.

Can AI Models Control What They Reveal About Their Thinking?

Researchers at OpenAI introduced a new evaluation framework called CoT-Control to test whether reasoning models could deliberately reshape or obscure their thinking when they knew they were being monitored. The team created over 13,000 test tasks by pairing established benchmark problems with specific instructions about how models should format their reasoning.

The instructions were straightforward: use only lowercase letters, avoid certain keywords, add specific phrases to each sentence, or follow other formatting rules. These weren't tricks or deception tests; they were simply requests to control the style or content of the reasoning process itself.
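The paper does not publish its evaluation code, but checks for rules like these are simple to sketch. The function names and rule set below are illustrative assumptions, not the study's actual implementation:

```python
import re

def check_lowercase(trace: str) -> bool:
    """Rule: the reasoning trace must contain no uppercase letters."""
    return not any(ch.isupper() for ch in trace)

def check_forbidden(trace: str, banned: list[str]) -> bool:
    """Rule: none of the banned keywords may appear in the trace."""
    lowered = trace.lower()
    return not any(word.lower() in lowered for word in banned)

def check_required_phrase(trace: str, phrase: str) -> bool:
    """Rule: every sentence in the trace must end with the required phrase."""
    sentences = [s.strip() for s in re.split(r"[.!?]", trace) if s.strip()]
    return all(s.endswith(phrase) for s in sentences)

# Example: a trace that violates the lowercase-only rule
trace = "First I factor the equation. Then I Check both roots."
print(check_lowercase(trace))                  # False: "First", "I", "Check"
print(check_forbidden(trace, ["derivative"]))  # True: banned word absent
```

A model "complies" on a task only if its whole reasoning trace passes the check attached to that task, which is what makes the failure rates in the study easy to measure automatically.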

The results were striking. Across 13 frontier reasoning models tested, controllability scores ranged from 0.1% to a maximum of 15.4%. In other words, models failed to follow these simple instructions about their own reasoning more than 84% of the time.
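One plausible reading of such a score is the fraction of test tasks on which the model's reasoning fully satisfied its instruction; the paper's exact scoring may differ, so treat this as a simplified sketch:

```python
def controllability_score(results: list[bool]) -> float:
    """Fraction of tasks where the reasoning trace complied with its instruction."""
    return sum(results) / len(results) if results else 0.0

# e.g. 2 fully compliant traces out of 13 tasks
print(round(controllability_score([True, True] + [False] * 11), 3))  # 0.154
```

Under this reading, a 15.4% score means the model obeyed its own formatting instruction on roughly one task in seven.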

Why Models Fail Even When They Know They're Failing

What makes these results particularly interesting is that models often recognized their own failures. In many cases, a model would generate reasoning that explicitly acknowledged it was supposed to follow a specific instruction, then continue to violate that instruction anyway. One example showed a model repeatedly using a forbidden word while simultaneously panicking about having already used it, unable to correct course despite recognizing the problem.

This inability isn't due to misunderstanding. The models comprehended the task; they simply couldn't execute it reliably. This suggests a fundamental limitation in how these systems process and control their own reasoning outputs.

How Does Model Size and Training Affect Reasoning Control?

The research revealed several patterns in how different factors influence controllability:

  • Model Size: Larger models showed somewhat better ability to control their reasoning compared to smaller ones, though this advantage plateaued at the largest scales tested.
  • Reasoning Duration: When models were allowed to think longer and generate more reasoning steps, their ability to control those chains of thought actually decreased significantly.
  • Training Methods: Models that underwent reinforcement learning training saw dramatic drops in controllability. One open-weight model's controllability score dropped by more than a factor of ten over the course of RL training.

These findings suggest that the more sophisticated and capable a model becomes, the harder it is for that model to maintain control over its reasoning process. This counterintuitive result has important implications for safety monitoring as AI systems advance.

What Do These Findings Mean for AI Safety?

The inability of current models to control their chains of thought is being interpreted as good news by the AI safety community. If models can't deliberately hide or reshape their reasoning even when they try, then monitoring systems can rely on those reasoning traces as trustworthy signals of what the model is actually doing.

However, researchers emphasize that this doesn't mean chain-of-thought reasoning is a perfect window into model behavior. Other research has shown that the explanations models generate don't always perfectly reflect the internal factors that influenced their decisions. Models can produce logical and detailed explanations while omitting certain influences that shaped their final answer.

The reasoning text is generated as part of the output process rather than being a direct readout of internal computations. Because of this, explanations may sound coherent while still missing some of the actual factors that drove the model's response.

How Can Organizations Use These Insights for Better AI Monitoring?

While the current inability of models to control their reasoning is reassuring, researchers recommend a layered approach to AI safety that doesn't rely on chain-of-thought monitoring alone:

  • Defense-in-Depth Strategy: Combine multiple complementary safeguards including safety training, behavioral testing, code review systems, and chain-of-thought monitoring rather than relying on any single approach.
  • Continuous Evaluation: As models become more capable, continued testing of controllability and monitorability is essential to catch emerging risks before they become problems.
  • Cautious Interpretation: Treat chain-of-thought explanations as useful signals rather than perfect representations of internal reasoning, and cross-reference them with other monitoring techniques.
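The defense-in-depth idea above can be sketched as a pipeline of independent monitors whose verdicts are combined, so that no single signal is trusted on its own. The monitor names and the toy rules they apply here are hypothetical stand-ins for real safeguards:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Monitor:
    name: str
    check: Callable[[str], bool]  # returns True if the output looks safe

def layered_review(output: str, monitors: list[Monitor]) -> dict[str, bool]:
    """Run every monitor on the model output; escalate if any layer flags it."""
    flags = {m.name: m.check(output) for m in monitors}
    return {**flags, "escalate": not all(flags.values())}

# Hypothetical layers: a keyword scan over the CoT plus a stub behavioral check
monitors = [
    Monitor("cot_keyword_scan", lambda out: "delete all files" not in out.lower()),
    Monitor("behavioral_test", lambda out: len(out) < 10_000),  # stub policy
]
result = layered_review("Plan: summarize the report, then email it.", monitors)
print(result["escalate"])  # False: no monitor flagged this output
```

The design choice matters: because escalation fires when any one layer objects, a model that learned to evade the chain-of-thought monitor would still have to slip past every other safeguard.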

OpenAI researchers noted that while current reasoning models struggle to control their chains of thought, this situation may not persist indefinitely. Future models might develop greater controllability, which could undermine the reliability of chain-of-thought monitoring as a safety tool.

What Happens as AI Systems Become More Advanced?

The research highlights a broader challenge in AI development: understanding how complex neural networks arrive at their decisions and maintaining reliable oversight as capabilities advance. As AI systems take on increasingly autonomous and complex tasks, the ability to monitor and understand their behavior becomes more critical.

Researchers from companies including Anthropic, OpenAI, Google DeepMind, and Meta are actively studying how to improve transparency and interpretability in AI systems. Some training methods show promise in making models more transparent about their reasoning, though improvements often reach a plateau where additional training doesn't yield better results.

The current findings suggest that the window of time when chain-of-thought monitoring is most reliable may be limited. As models grow more capable, they may eventually develop the ability to control their reasoning in ways that reduce monitorability. This makes the present moment particularly important for establishing robust safety practices before such capabilities emerge.