What Actually Counts as a World Model? Researchers Say Video Generators Like Sora Don't Qualify
Researchers from Peking University, Kuaishou Technology, and other institutions have drawn a sharp line in the sand: popular text-to-video models like OpenAI's Sora are not world models, despite claims from tech leaders suggesting otherwise. The international team has proposed the first standardized definition of what actually counts as a world model in artificial intelligence, along with an open-source framework called OpenWorldLib to help researchers build and test these systems.
Why Are Researchers Rejecting Video Generators as World Models?
The debate centers on a fundamental distinction between passive generation and active understanding. When OpenAI released Sora, many observers called it a "world simulator." Google DeepMind's CEO Demis Hassabis made similar claims about Google's Veo video model, positioning it as a step toward true world models. The new research challenges this framing directly.
According to the researchers' definition, a genuine world model must demonstrate three core capabilities: perceiving its environment through real-world data, interacting with that environment through feedback loops, and retaining memory across multiple interactions. Text-to-video generators fail on all three counts. A model that only generates videos from text prompts doesn't perceive anything; it doesn't interact with the world; and it has no way to learn from real-world feedback. The paper states that text-to-video generation therefore falls "outside the core tasks of world models".
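The three capabilities can be made concrete with a minimal sketch. The class and method names below are illustrative assumptions, not an API from the paper or OpenWorldLib; the point is that perception, interaction, and memory form one closed loop, which a prompt-in, video-out generator never enters:

```python
from dataclasses import dataclass, field

@dataclass
class ToyWorldModel:
    """Toy sketch of the paper's three requirements: perceive,
    interact via feedback, and retain memory. All names here are
    hypothetical, chosen only for illustration."""
    memory: list = field(default_factory=list)  # retained across interactions

    def perceive(self, observation: dict) -> dict:
        """Ingest real-world data and record it in memory."""
        state = {"last_obs": observation, "step": len(self.memory)}
        self.memory.append(state)
        return state

    def interact(self, action: str) -> dict:
        """Feedback loop: act on the environment, then perceive the result."""
        feedback = {"action": action, "result": f"env response to {action}"}
        return self.perceive(feedback)

model = ToyWorldModel()
model.perceive({"camera": "frame_0"})
model.interact("move_forward")
print(len(model.memory))  # → 2: both interactions persist in memory
```

A text-to-video generator would implement none of these methods: it maps a prompt to frames once, with no observation coming back in and nothing carried forward.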
This aligns with the perspective of Yann LeCun, a prominent AI researcher at Meta, who has expressed similar skepticism about video generators being true world models. While these systems show some grasp of physical relationships, they lack the crucial feedback loop that distinguishes genuine world understanding from sophisticated pattern matching.
What Three Tasks Define a Real World Model?
Rather than passive media generation, the research team identified three specific task areas that represent genuine world modeling capabilities:
- Interactive Video Generation: A model predicts the next video frame based on previous frames and user input, reacting to control commands or camera movements rather than simply generating content from text.
- Multimodal Reasoning: The ability to understand spatial, temporal, and causal relationships from images, videos, and audio, such as determining where an object is located or why something happened.
- Vision-Language-Action: Converting visual input and voice instructions into specific movement commands for robotic arms or self-driving vehicles, creating a direct link between perception and real-world action.
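The difference between the first task and plain text-to-video generation comes down to the function signature: what the model is conditioned on. The toy functions below are hypothetical illustrations, not code from the paper; frames are modeled as simple lists of pixel values:

```python
# Illustrative signatures only; neither function comes from the paper.
def text_to_video(prompt: str, n_frames: int = 4) -> list[list[int]]:
    """Passive generation: text in, frames out, no feedback loop.
    (Deterministic dummy frames stand in for a real generator.)"""
    return [[(i + j) % 256 for j in range(8)] for i in range(n_frames)]

def interactive_step(prev_frames: list[list[int]], control: str) -> list[int]:
    """Interactive generation: the next frame is conditioned on prior
    frames AND a user control command, closing the feedback loop."""
    last = prev_frames[-1]
    shift = {"pan_left": -1, "pan_right": 1}.get(control, 0)
    return last[shift:] + last[:shift] if shift else last[:]

clip = text_to_video("a red ball")
next_frame = interactive_step(clip, "pan_right")  # responds to user input
```

In the interactive case, changing `control` changes the output given the same history, which is exactly the reactivity the researchers require and text-to-video lacks.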
The researchers also emphasized that 3D reconstruction and physics simulators are essential building blocks for world models. These provide testable environments where physical rules can be strictly enforced, unlike plain video prediction, which only offers a visual guess at the future without guaranteeing physical consistency.
How Does OpenWorldLib Help Researchers Build Better World Models?
The OpenWorldLib software project packages world model capabilities in a modular setup designed to standardize how researchers develop and evaluate these systems:
- Operator Module: Converts all kinds of inputs, including text, images, and sensor data, into a standardized format that other modules can process.
- Synthesis Module: Generates images, videos, audio, and control commands based on the model's understanding of the environment.
- Reasoning Module: Handles spatial, visual, and acoustic context to understand relationships between different elements in a scene.
- Representation Module: Builds 3D reconstructions and simulation environments where physical rules can be tested and verified.
- Memory Module: Stores interaction sequences so the system maintains consistency across multiple steps and learns from past interactions.
A top-level pipeline orchestrates all the modules and exposes a standardized interface, allowing researchers to compare different models and methods within the same framework rather than building custom infrastructure each time.
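The orchestration idea can be sketched as follows. This is a hypothetical reduction of the five-module design to callables wired through one `step` method, not OpenWorldLib's actual interface; the real project's signatures will differ:

```python
from typing import Any

class Pipeline:
    """Toy top-level pipeline in the spirit of the five-module design.
    Each module is any callable, so different models can be swapped in
    and compared behind the same interface."""

    def __init__(self, operator, reasoner, representation, memory, synthesizer):
        self.operator = operator              # normalizes heterogeneous inputs
        self.reasoner = reasoner              # spatial/temporal/causal context
        self.representation = representation  # 3D / simulation state
        self.memory = memory                  # interaction history
        self.synthesizer = synthesizer        # produces frames or commands

    def step(self, raw_input: Any) -> Any:
        tokens = self.operator(raw_input)
        context = self.reasoner(tokens, self.memory)
        state = self.representation(context)
        output = self.synthesizer(state)
        self.memory.append((tokens, output))  # consistency across steps
        return output

# String-returning stubs stand in for real models:
pipe = Pipeline(
    operator=lambda x: f"norm({x})",
    reasoner=lambda t, m: f"ctx({t},hist={len(m)})",
    representation=lambda c: f"3d({c})",
    memory=[],
    synthesizer=lambda s: f"frame[{s}]",
)
print(pipe.step("camera frame"))
```

Because each stage is a plug-in callable, comparing two interactive video generators means swapping the `synthesizer` while the rest of the pipeline, and the evaluation around it, stays fixed.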
When the researchers tested existing models within their framework using Nvidia's A800 and H200 graphics processing units (GPUs), they found that Hunyuan-WorldPlay achieved the highest visual quality in interactive video generation for navigation scenes. Nvidia's Cosmos performed best in complex interactive scenarios requiring the model to handle a wide range of user inputs. However, models like VGGT and InfiniteVGGT showed clear weaknesses in 3D scene reconstruction, with significant camera movement leading to geometric inconsistencies and blurry textures.
What Hardware Limitations Are Slowing World Model Development?
The research team also identified a critical mismatch between current computer chip designs and what world models actually need. Modern processors are built to handle individual tokens, so even when a model needs to predict entire video frames, the data still gets processed token by token internally. This approach is wildly inefficient for the data-heavy perception that world models demand. The researchers argue that new chip architectures are needed, and possibly a move away from the Transformer architecture, which currently powers nearly every large AI model.
As a practical interim solution, the authors point to current vision-language models like Bagel, which handles both multimodal reasoning and image generation on the Qwen architecture. In their view, this demonstrates that language models pre-trained on internet data can in principle deliver all the necessary capabilities, even if building a complete world model remains a long way off. OpenWorldLib is available as an open-source project on GitHub, allowing the broader research community to contribute to world model development.
The implications of this research extend beyond academic definitions. By establishing clear criteria for what constitutes a world model, the team provides a roadmap for distinguishing genuine progress in AI understanding from impressive but ultimately limited content generation systems. This distinction matters for researchers, investors, and policymakers trying to assess where AI technology is actually headed.