Sora and Veo Aren't World Models, Researchers Say. Here's What Actually Counts as One.

A new framework from researchers at Peking University, Kuaishou Technology, and other institutions draws a sharp line between video generation and true world models. The key difference: real world models must perceive their environment, interact with it, and retain memory over time, while text-to-video systems like Sora and Veo lack the crucial feedback loops needed to understand how the physical world actually behaves.

The research team has released OpenWorldLib, an open-source project designed to standardize how world models are built and evaluated. This matters because the term "world model" has been used loosely across AI research for years, with companies like OpenAI and Google claiming their video generators were stepping stones toward world models. The new framework challenges that narrative directly.

Why Don't Text-to-Video Models Qualify as World Models?

When OpenAI launched Sora, many observers called it a "world simulator." Google DeepMind CEO Demis Hassabis made similar claims about the company's Veo video model. But the researchers behind OpenWorldLib disagree fundamentally. While video generation shows some understanding of physical relationships, it's missing something essential: the model never perceives its environment or interacts with it in real time.

A text-to-video system takes a written prompt and generates video frames. That's passive generation, not active learning. A true world model, by contrast, needs to test its predictions against reality, adjust its understanding, and improve based on what actually happens. Without that feedback loop, the researchers argue that text-to-video falls "outside the core tasks of world models." They also exclude code generation, web search, and avatar video generation from their definition, since these lack grounding in physical understanding.
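The structural difference can be made concrete with a toy sketch. Everything below is hypothetical, a minimal illustration rather than anything from OpenWorldLib: the "world" is just a number that increments each tick, and the model corrects its internal estimate from the feedback it receives.

```python
# Hypothetical sketch of the two paradigms. All names are invented for
# illustration; none of them come from OpenWorldLib or any real model.

def passive_text_to_video(generate, prompt):
    """Text-to-video: one-shot generation. No feedback ever returns."""
    return generate(prompt)

class ToyEnv:
    """A trivial 'world': its state grows by a fixed step each tick."""
    def __init__(self, step=2):
        self.step, self.state = step, 0
    def reset(self):
        self.state = 0
        return self.state
    def tick(self):
        self.state += self.step
        return self.state

class ToyWorldModel:
    """Predicts the next state and corrects its estimate from feedback."""
    def __init__(self):
        self.estimated_step = 0  # starts out wrong
    def predict_next(self, obs):
        return obs + self.estimated_step
    def update(self, prediction, actual, previous):
        # The feedback loop: adjust the internal estimate toward reality.
        self.estimated_step = actual - previous

def world_model_loop(model, env, steps):
    """Predict, observe what actually happened, update, repeat."""
    obs = env.reset()
    for _ in range(steps):
        prediction = model.predict_next(obs)
        new_obs = env.tick()                    # ground truth arrives
        model.update(prediction, new_obs, obs)  # close the loop
        obs = new_obs
    return model.estimated_step

print(world_model_loop(ToyWorldModel(), ToyEnv(step=2), steps=3))  # → 2
```

The point of the contrast: `passive_text_to_video` never sees the environment again after emitting its output, while the loop version converges on the world's dynamics precisely because its errors come back to it.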

What Three Capabilities Define a Real World Model?

The research team identified three core task areas that distinguish genuine world models from passive media generators. These capabilities form the foundation of OpenWorldLib's framework and represent what researchers believe AI systems must master to truly understand how the world works.

  • Interactive Video Generation: The model predicts the next video frame based on previous frames and user input, reacting to control commands or camera movements rather than simply generating content from text.
  • Multimodal Reasoning: The system figures out spatial, temporal, and causal relationships from images, videos, and audio, understanding where objects are located and why events occur.
  • Vision-Language-Action: The model converts visual input and voice instructions into specific movement commands for robotic arms or self-driving vehicles, bridging perception and physical action.
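The three task areas above can be summarized as interface signatures. This is a hypothetical sketch of what such interfaces might look like, not OpenWorldLib's actual API; all type and method names are invented.

```python
# Illustrative interface signatures for the three core task areas.
# These names are assumptions for the sketch, not OpenWorldLib's API.
from typing import Protocol, Sequence

class Frame: ...   # placeholder for an image tensor
class Action: ...  # placeholder for a control command

class InteractiveVideoGeneration(Protocol):
    """Next frame depends on history plus a user's control input."""
    def next_frame(self, history: Sequence[Frame], control: Action) -> Frame: ...

class MultimodalReasoning(Protocol):
    """Infers spatial, temporal, and causal relations from mixed inputs."""
    def answer(self, frames: Sequence[Frame], audio: bytes, question: str) -> str: ...

class VisionLanguageAction(Protocol):
    """Maps perception and an instruction to a concrete movement command."""
    def act(self, frames: Sequence[Frame], instruction: str) -> Action: ...
```

Note how each signature takes an observation of the world as input; a plain text-to-video generator, whose only input is a prompt, would not satisfy any of them.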

Beyond these three areas, the researchers view 3D reconstruction and simulators as essential building blocks. These provide testable environments where physical rules can be strictly enforced, unlike plain video prediction, which only offers a visual guess at the future without guaranteeing physical consistency.

How OpenWorldLib Structures World Model Development

The OpenWorldLib software project packages these capabilities in a modular setup that researchers can use to build and compare different approaches. The framework includes five key modules that work together to support world model development.

  • Input Processing Module: Converts all kinds of inputs, including text, images, and sensor data, into a standardized format that the system can work with consistently.
  • Synthesis Module: Generates images, videos, audio, and control commands based on the model's understanding of the world.
  • Reasoning Module: Handles spatial, visual, and acoustic context to understand relationships between different elements in a scene.
  • Representation Module: Builds 3D reconstructions and simulation environments where physical rules can be tested and validated.
  • Memory Module: Stores interaction sequences so the system maintains consistency across multiple steps and learns from past experiences.

A top-level pipeline orchestrates all the modules and exposes a standardized interface. This approach lets researchers compare different models and methods within the same framework instead of building custom infrastructure for each evaluation.
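The orchestration idea can be sketched in a few lines. This is a hypothetical composition of the five modules described above; the class names, method names, and wiring are assumptions for illustration, not OpenWorldLib's real interface.

```python
# Hypothetical top-level pipeline over the five modules. All names and
# the exact data flow are invented for illustration.

class Memory:
    """Stores interaction history so later steps stay consistent."""
    def __init__(self):
        self.history = []
    def recall(self):
        return list(self.history)
    def store(self, inp, out):
        self.history.append((inp, out))

class Pipeline:
    def __init__(self, input_proc, reasoning, representation, memory, synthesis):
        self.input_proc = input_proc          # normalizes text/images/sensors
        self.reasoning = reasoning            # spatial/visual/acoustic context
        self.representation = representation  # 3D reconstruction / simulation
        self.memory = memory                  # interaction history
        self.synthesis = synthesis            # frames, audio, control commands

    def step(self, raw_input):
        x = self.input_proc(raw_input)                 # standardized format
        context = self.reasoning(x, self.memory.recall())
        state = self.representation(context)           # testable world state
        output = self.synthesis(state)
        self.memory.store(x, output)                   # consistency over steps
        return output

# Toy usage: each "module" is a stand-in callable.
pipe = Pipeline(
    input_proc=str.lower,
    reasoning=lambda x, past: (x, len(past)),
    representation=lambda ctx: ctx,
    memory=Memory(),
    synthesis=lambda s: f"frame for {s[0]} (step {s[1]})",
)
print(pipe.step("Move Left"))  # → "frame for move left (step 0)"
```

Because every module sits behind the same `step` interface, swapping one reasoning or synthesis implementation for another leaves the rest of the pipeline untouched, which is what makes side-by-side comparison practical.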

When the team tested existing models inside their framework using Nvidia's A800 and H200 graphics processing units (GPUs), they found that Hunyuan-WorldPlay achieved the highest visual quality in interactive video generation for navigation scenes. Nvidia's Cosmos performed best in complex interactive scenarios where the model had to handle a wide range of user inputs. Older approaches like Matrix-Game-2 were faster but showed noticeable color drift in longer sequences.

Models like VGGT and InfiniteVGGT revealed clear weaknesses in 3D scene reconstruction. Significant camera movement led to geometric inconsistencies and blurry textures. Despite these limitations, the researchers consider 3D generation essential to the future of world models.

Could Hardware Be Holding World Models Back?

The research team also critiques current chip design, arguing that modern processors are fundamentally mismatched with what world models need. Today's chips are built to handle individual tokens, which are small units of data. Even when a model needs to predict entire video frames, the data still gets processed token by token internally. For the perception-heavy tasks that world models require, this approach is wildly inefficient.
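A back-of-the-envelope count shows the scale of the mismatch. The 16-pixel patch size and 24 fps below are illustrative assumptions based on common vision-transformer setups, not figures from the paper.

```python
# Rough token count for video under a typical ViT-style patching scheme.
# Patch size and frame rate are illustrative assumptions.
width, height = 1280, 720   # one 720p frame
patch = 16                  # pixels per patch side
fps = 24

tokens_per_frame = (width // patch) * (height // patch)
tokens_per_second = tokens_per_frame * fps

print(tokens_per_frame)   # 3600 tokens for a single frame
print(tokens_per_second)  # 86400 tokens per second of video
```

Under these assumptions, even one second of modest-resolution video means tens of thousands of tokens streamed through hardware optimized for emitting them one at a time.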

The authors suggest that new chip architectures are needed, and possibly a move away from the Transformer architecture, which currently powers nearly every large AI model. As a practical stopgap, they point to current vision-language models such as Bagel, which handles both multimodal reasoning and image generation on top of the Qwen architecture. This demonstrates that language models pre-trained on internet data can in principle deliver all the necessary capabilities, even if building a complete world model remains a long way off.

OpenWorldLib is available as an open-source project on GitHub, allowing researchers worldwide to contribute to the development of true world models and move beyond the limitations of text-to-video generation systems.