Three competing architectural approaches are emerging to give AI systems genuine understanding of physics and causality, addressing a fundamental limitation of today's language models.

As companies deploy artificial intelligence (AI) beyond text generation and into robots, vehicles, and factories, the weaknesses of large language models (LLMs) are becoming impossible to ignore. While LLMs excel at pattern-matching text, they lack a grounded understanding of how the physical world actually works. This gap is now driving multibillion-dollar investments into "world models," specialized AI architectures designed to let machines reason about and act within physical environments.

The scale of investment underscores how critical this shift has become. AMI Labs recently closed a $1.03 billion seed round, while Fei-Fei Li's World Labs raised $1 billion, signaling that world models have become a strategic priority for robotics, autonomous driving, manufacturing, healthcare, and spatial computing. Yet "world model" is not a single technology. Instead, three distinct architectural families are emerging, each making different tradeoffs among real-time control, spatial fidelity, and computational cost.

Why Do Today's AI Models Struggle With Physical Reasoning?

The problem is fundamental to how LLMs work. These models succeed by predicting the next token, or word fragment, in a sequence of text. That mechanism is powerful for capturing the abstract knowledge encoded in language, but it does not force the model to build an internal model of physics or causality. As a result, LLMs and even vision-language models (VLMs) struggle when actions must have reliable real-world consequences, from moving a robot arm to navigating an intersection.

Turing Award winner Richard Sutton summarized this limitation in a conversation with podcaster Dwarkesh Patel, explaining that LLMs essentially mimic what people say rather than modeling the world itself.
That means they have limited capacity to truly learn from embodied experience or to respond robustly to changes that were not in their training data.

Google DeepMind CEO Demis Hassabis has described the result as "jagged intelligence": today's models can solve math olympiad problems yet fail at basic physical reasoning. They can talk about friction and momentum, but they do not reliably predict what happens if a robot pushes a box on a wet floor.

These gaps become visible in production systems. VLM-based agents can be brittle, failing under minor lighting changes or small shifts in camera angle. For robotics, automotive, and safety-critical workflows, that fragility is untenable.

How Are Three Different World Model Architectures Solving This Problem?

World models aim to address this by giving AI systems an internal simulator they can query: a structured representation of how entities, geometry, and forces evolve over time. But there are competing visions of what this simulator should look like and how tightly it should be coupled to perception and control.

- JEPA-Style Latent World Models: Used by AMI Labs, these models learn compact latent representations that capture the underlying dynamics of a scene without trying to predict every pixel. They focus on the variables that matter for prediction and control, much as humans intuitively process the world.
- Gaussian Splat Spatial Generative Models: World Labs epitomizes this approach, using generative models that construct full 3D environments from images or text descriptions. The output is a navigable 3D scene that can be imported directly into 3D and physics engines such as Unreal Engine.
- End-to-End Generative Simulators: A third approach fuses world modeling, rendering, and physics into a single generative system, combining elements of the two previous approaches into a unified architecture.

The first major approach, used by AMI Labs, is built around the Joint Embedding Predictive Architecture (JEPA).
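To make the latent-prediction idea concrete, here is a minimal numpy sketch of a JEPA-style setup. The dimensions, the fixed random projections (which stand in for learned encoder and predictor networks), and the variable names are all illustrative assumptions, not AMI's implementation; the point is only that the training signal is computed between compact latent states, never between pixels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 64x64 grayscale frame, a 16-dim latent, a 4-dim action.
OBS_DIM, LATENT_DIM, ACTION_DIM = 64 * 64, 16, 4

# Encoder: maps a high-dimensional observation to a compact latent state.
# (In a real JEPA this is a trained network; a fixed projection stands in here.)
W_enc = rng.normal(scale=OBS_DIM ** -0.5, size=(OBS_DIM, LATENT_DIM))

# Predictor: given the current latent and an action, predicts the NEXT latent.
W_pred = rng.normal(scale=(LATENT_DIM + ACTION_DIM) ** -0.5,
                    size=(LATENT_DIM + ACTION_DIM, LATENT_DIM))

def encode(obs):
    return obs @ W_enc

def predict_next_latent(z, action):
    return np.concatenate([z, action]) @ W_pred

# Two consecutive frames and the action taken between them (dummy data).
obs_t = rng.normal(size=OBS_DIM)
obs_t1 = rng.normal(size=OBS_DIM)
action = np.zeros(ACTION_DIM)

z_t, z_t1 = encode(obs_t), encode(obs_t1)
z_t1_hat = predict_next_latent(z_t, action)

# The loss compares predicted and actual *latents*: 16 numbers, not 4,096
# pixels. This is what makes the model indifferent to background detail.
latent_loss = np.mean((z_t1_hat - z_t1) ** 2)
print(z_t1_hat.shape, latent_loss)
```

The design choice to measure error in a 16-dimensional latent space rather than a 4,096-dimensional pixel space is what buys the robustness and latency properties described above: backgrounds can change arbitrarily without affecting the state the model actually predicts.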
Instead of predicting pixels, JEPA-style world models learn compact latent representations that encode the underlying dynamics of a scene. The design mirrors how humans intuitively process the world: when watching a car drive down a street, we track its speed, direction, and likely trajectory, not the exact pattern of highlights on each leaf in the background. We focus on the variables that matter for prediction and control.

For practitioners, JEPA models yield several concrete properties. Because the model does not try to reconstruct every pixel, small visual or background changes are less likely to break behavior. Operating in a low-dimensional latent space reduces both training-data needs and inference latency. Low latency and a compact state make JEPA architectures attractive for robotics, autonomous driving, and other domains where real-time responsiveness is critical.

AMI is already exploring these properties in high-pressure operational settings. In partnership with healthcare company Nabla, AMI is applying JEPA-style models to simulate operational complexity and reduce cognitive load in fast-paced clinical environments, a scenario where systems must react quickly and reliably to changing conditions.

"JEPA-based world models are designed to be controllable: they can be given goals and, by construction, are oriented toward accomplishing those goals," explained Yann LeCun, a key architect of JEPA and co-founder of AMI Labs.

The tradeoff is that JEPA prioritizes abstract dynamics over visual and spatial richness. If your primary need is fine-grained, photorealistic environment construction, other approaches may fit better. But if you care about control performance and latency, JEPA-like architectures are emerging as a compelling default.

The second architectural family focuses less on fast control and more on spatial fidelity.
World Labs, founded by Fei-Fei Li, epitomizes this track with generative models that construct full 3D environments using Gaussian splats. A Gaussian splat represents a 3D scene as a large collection of tiny, parameterized particles. Each particle encodes a position, a shape, a color, and how it interacts with light. Rendered together, these particles form detailed 3D scenes that can be viewed from arbitrary viewpoints.

World Labs' systems take an initial prompt, an image or a textual description, and generate a 3D Gaussian splat representation. Unlike conventional video generation, the output is a navigable 3D scene that can be imported directly into physics and 3D engines such as Unreal Engine. Human users or AI agents can then move through the environment, interact with objects, and run simulations from any camera angle.

The core benefit is a drastic reduction in the time and cost of generating complex, interactive 3D environments. This directly targets what Fei-Fei Li has called the "wordsmiths in the dark" problem: LLMs have rich linguistic capabilities but lack spatial experience and physical context. World Labs' Marble model is designed to give AI agents that missing spatial awareness by embedding them in generated 3D worlds.

For technical teams, this approach suggests several use cases. Spatial computing and extended reality (XR) applications can quickly generate immersive 3D environments for headsets or spatial interfaces. Interactive entertainment studios can build game-like worlds and simulations that agents and players can explore. Industrial and architectural design teams can create rich digital twins or concept spaces far faster than with manual modeling. And robotics teams can produce diverse virtual environments in which embodied agents can be trained offline.

This approach is less focused on split-second real-time control than on producing high-fidelity, navigable worlds.
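The particle representation described above can be sketched in a few lines. This is a deliberately naive illustration, not World Labs' renderer: the `Splat` fields mirror the attributes named in the text (position, shape, color, light interaction via opacity), and `blend` does a simple weighted color mix at a 3D point, skipping the camera projection and depth sorting a real splat rasterizer performs.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Splat:
    mean: np.ndarray    # 3D position of the particle's center
    cov: np.ndarray     # 3x3 covariance: the particle's size and orientation
    color: np.ndarray   # RGB in [0, 1]
    opacity: float      # how strongly the particle occludes light

def density(splat: Splat, point: np.ndarray) -> float:
    """Unnormalized Gaussian falloff of one splat at a 3D point."""
    d = point - splat.mean
    return splat.opacity * float(np.exp(-0.5 * d @ np.linalg.inv(splat.cov) @ d))

def blend(splats: list, point: np.ndarray) -> np.ndarray:
    """Naive weighted blend of splat colors at a point (no projection/sorting)."""
    weights = np.array([density(s, point) for s in splats])
    colors = np.stack([s.color for s in splats])
    return (weights[:, None] * colors).sum(axis=0) / max(weights.sum(), 1e-9)

# Two particles: a red one at the origin, a blue one offset to (1, 1, 1).
red = Splat(np.zeros(3), np.eye(3) * 0.1, np.array([1.0, 0.0, 0.0]), 0.9)
blue = Splat(np.ones(3), np.eye(3) * 0.1, np.array([0.0, 0.0, 1.0]), 0.9)

# Sampling at the origin: the red splat dominates, the distant blue one
# contributes almost nothing because its Gaussian has decayed.
print(blend([red, blue], np.zeros(3)))
```

Because every particle is a smooth, differentiable function of its parameters, millions of them can be fit to images by gradient descent and then re-rendered from any viewpoint, which is what makes the representation suitable both for generation and for import into interactive engines.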
That focus on high-fidelity, navigable worlds aligns with enterprise backers such as Autodesk, which has invested heavily in World Labs to bring these models into industrial design workflows. For design and simulation teams, this kind of "world factory" offers a way to iterate on environments rapidly before physical prototyping.

What Makes Each Approach Different From the Others?

The three architectural families represent fundamentally different engineering philosophies. JEPA-style models prioritize speed and controllability by working in abstract, low-dimensional spaces. Gaussian splat models prioritize visual and spatial fidelity by generating detailed 3D scenes. The third approach, end-to-end generative simulators, attempts to fuse these strengths into a single system that can both render a scene and simulate its physics in one pass.

Each tradeoff reflects different use cases and priorities. A robotics team deploying a manipulator arm in a factory needs fast, reliable control; JEPA-style models fit that need. A game studio or architectural firm building immersive environments needs visual richness and navigability; Gaussian splat models excel there. A research team exploring how to combine perception, simulation, and physics might benefit from an end-to-end approach that unifies all three.

The competition between these approaches is not zero-sum. Industry observers expect hybrid stacks to emerge, combining JEPA-style latent representations for control with Gaussian splat rendering for visual fidelity, all coordinated by LLMs that provide high-level reasoning and planning. This convergence could create a new generation of AI systems that combine the abstract reasoning power of language models with the physical grounding of world models.

The multibillion-dollar funding rounds and strategic partnerships underscore that world models are no longer a research curiosity. They are becoming essential infrastructure for deploying AI into the physical world.
As robotics, autonomous vehicles, and manufacturing systems become more sophisticated, the ability to reason about physics and causality will separate systems that work reliably from those that fail when conditions change. The three competing architectures represent different paths to solving that problem, and the next few years will reveal which approaches scale, which combinations work best, and how quickly these technologies can move from research labs into production systems.