World Models Are Becoming the Missing Piece in Physical AI, and China Is Racing to Build Them
World models represent a fundamental breakthrough for physical AI because they teach machines to understand and predict how the real world actually works, rather than relying solely on language understanding. Unlike large language models (LLMs) trained primarily on text, world models are built on video and real-world physical scenarios, enabling robots and autonomous systems to perceive their surroundings and interact with them autonomously. Alibaba Cloud's $290 million investment in ShengShu, the startup behind the AI video generation tool Vidu, underscores how seriously the industry is taking this shift.
Why Are World Models Different From ChatGPT-Style AI?
The distinction matters enormously for robotics. Large language models like ChatGPT excel at understanding and generating text, but they lack something crucial: an intuitive grasp of how physical objects move, interact, and respond to forces. A robot picking up a fragile object needs to understand pressure, balance, and momentum. A self-driving car needs to predict how pedestrians will move. These tasks require a different kind of intelligence altogether.
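To make this concrete, here is a minimal, hypothetical sketch of the abstraction at the heart of a world model: a learned dynamics function that predicts the next physical state from the current state and a candidate action, so an agent can roll predictions forward and anticipate outcomes before acting. The module, names, and dimensions below are illustrative assumptions, not any vendor's architecture.

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Toy world model: predicts the next physical state from (state, action).

    'State' here is a flat vector (e.g. object pose, velocity, contact
    forces); real systems learn such representations from video and
    sensor streams rather than receiving them directly.
    """
    def __init__(self, state_dim: int = 16, action_dim: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 128),
            nn.ReLU(),
            nn.Linear(128, state_dim),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))

# "Imagine before you act": roll the model forward over a candidate
# action sequence and inspect the predicted trajectory.
model = DynamicsModel()
state = torch.zeros(1, 16)                    # current observed state
plan = [torch.randn(1, 4) for _ in range(5)]  # candidate actions
for action in plan:
    state = model(state, action)              # predicted next state
print(state.shape)  # torch.Size([1, 16])
```

A text-only LLM has no analogue of this forward prediction step, which is why manipulation and driving tasks call for a different kind of model.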
ShengShu, founded three years ago, is building what it calls a "general world model" that bridges two currently separate domains: the digital world of games and AI-generated video, and the physical world of autonomous driving and robots. The company's latest Vidu Q3 Pro model, released in January, ranks among the top 10 AI models for generating videos from text and images, according to Artificial Analysis.
"ShengShu believes that a general world model, built on multimodal data such as vision, audio, and touch, more naturally captures how the physical world works than large language models," the startup stated in announcing the funding round.
The funding round, announced Friday, included participation from TAL Education and Baidu Ventures alongside Alibaba Cloud. This comes just two months after ShengShu raised 600 million yuan from Qiming Venture Partners and other backers, signaling accelerating investor confidence in the world model space.
What Makes World Models Essential for Robots?
Kevin Kelly, co-founder of the tech magazine Wired, recently outlined why world models are critical infrastructure for advanced AI. To replicate human intelligence, AI will ultimately need three capabilities: reasoning, an understanding of the physical world, and continuous learning. While LLM-powered chatbots have delivered the knowledge element, world models remain the key area requiring a breakthrough.
Zhu Jun, founder of ShengShu, explained the company's vision for connecting perception and action. "We aim to connect perception and action," he stated, "allowing AI systems to better model and predict real-world behavior consistently." This capability is foundational for embodied AI systems like humanoid robots that must operate in human environments.
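One generic way research systems connect perception and action is model-predictive control: encode the current observation into a compact state, use the world model to simulate candidate action sequences, and execute the first action of the cheapest imagined rollout. The sketch below illustrates that loop with stand-in functions; `encode`, `predict`, and `cost` are placeholders for learned components, not ShengShu's method.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(observation: np.ndarray) -> np.ndarray:
    """Perception: map raw sensor data to a compact state (stand-in)."""
    return observation[:8]

def predict(state: np.ndarray, action: np.ndarray) -> np.ndarray:
    """World model: predicted next state (stand-in for a learned net)."""
    return state + 0.1 * np.tanh(action)

def cost(state: np.ndarray, goal: np.ndarray) -> float:
    """Task objective, e.g. distance of a gripper from its target."""
    return float(np.sum((state - goal) ** 2))

def plan_action(state, goal, horizon=5, n_candidates=64):
    """Random-shooting MPC: simulate candidate action sequences with the
    world model and return the first action of the cheapest rollout."""
    best_action, best_cost = None, np.inf
    for _ in range(n_candidates):
        seq = rng.normal(size=(horizon, 8))
        s = state
        for a in seq:          # imagine the rollout before acting
            s = predict(s, a)
        c = cost(s, goal)
        if c < best_cost:
            best_cost, best_action = c, seq[0]
    return best_action

observation = rng.normal(size=32)
goal = np.ones(8)
action = plan_action(encode(observation), goal)
print(action.shape)  # (8,)
```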
The market opportunity is substantial. According to Counterpoint Research, cumulative physical AI device shipments, including vehicles, robots, and drones, will reach 145 million units between 2025 and 2035. Within robotics specifically, service robots will account for the largest shipment volumes, driven by expanding use cases across logistics, warehouses, hospitality, healthcare, cleaning, security, and agriculture.
How Is the Broader Ecosystem Responding to World Models?
- Alibaba's Multi-Front Investment Strategy: Beyond ShengShu, Alibaba led a $50 million investment in Tripo AI, a platform that uses AI to generate 3D models from photographs, and a $60 million investment in PixVerse, which released an AI world model allowing users to direct how videos unfold during generation.
- Strategic Partnerships With Embodied AI Companies: ShengShu has announced strategic partnerships with companies developing embodied AI systems such as humanoid robots for use across industrial, commercial, and home settings.
- Open-Source Development: Alibaba has released free, open-source AI models for video generation and, in February, launched one specifically for powering robots, democratizing access to world model technology.
The humanoid robot segment is expected to be the fastest-growing category by shipment volume, with cumulative installations projected to exceed 100,000 units by 2028, a sevenfold increase compared with 2025. However, the industry faces a significant challenge: crossing what researchers call the "chasm" from autonomous machine intelligence to embodied artificial general intelligence (AGI).
"While there are advancements in the 'form', the 'mind' is something that is ripe for innovation," noted Neil Shah, vice-president at Counterpoint Research, emphasizing that advances in generative AI, computer vision systems, and motion control are bringing the industry closer to general-purpose robots that can operate in human environments.
Commercial drones, excluding consumer and defense drones, are emerging as the earliest large-scale deployment of physical AI, with rapid adoption across logistics, surveillance, and enterprise use cases driving high-volume growth. Autonomous vehicles with Level 4 autonomy and above are expected to see slower initial volumes, but the expansion of robotaxis and autonomous personal vehicles could significantly scale adoption over time.
What Does This Mean for the Physical AI Ecosystem?
As physical AI systems scale across industries, collaboration across original equipment manufacturers (OEMs), semiconductor makers, connectivity providers, and software companies will be critical to unlock their full potential. Companies that can build strong platforms and partnerships across the value chain will be best positioned to capture this emerging opportunity.
The rise of vision-language models and vision-action models will unify multimodal perception, language understanding, reasoning, and executable control within a single sequence modeling framework, representing what researchers call "a critical inflection point" for the field. This convergence suggests that the next generation of physical AI will combine the language understanding capabilities of modern LLMs with the real-world modeling capabilities of world models, creating systems that can both reason about the world and act within it effectively.
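To illustrate what "a single sequence modeling framework" can mean in practice, the hypothetical sketch below projects image patches, text tokens, and robot actions into one shared token space and lets a standard Transformer attend across all of them before decoding the next action. Production vision-language-action models differ in many details; every size, layer, and name here is an assumption.

```python
import torch
import torch.nn as nn

D = 64  # shared embedding width for all modalities (illustrative)

# Per-modality encoders projecting into one token space
patch_embed  = nn.Linear(16 * 16 * 3, D)  # flattened image patches
text_embed   = nn.Embedding(1000, D)      # text token ids
action_embed = nn.Linear(7, D)            # e.g. 7-DoF arm commands
action_head  = nn.Linear(D, 7)            # decode the next action

backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True),
    num_layers=2,
)

# One interleaved sequence: [vision tokens | instruction | past actions]
patches = patch_embed(torch.randn(1, 36, 16 * 16 * 3))  # 6x6 patch grid
words   = text_embed(torch.randint(0, 1000, (1, 8)))    # "pick up the cup"
actions = action_embed(torch.randn(1, 4, 7))            # recent actions

sequence = torch.cat([patches, words, actions], dim=1)
features = backbone(sequence)

# Predict the next action from the final token's features
next_action = action_head(features[:, -1])
print(next_action.shape)  # torch.Size([1, 7])
```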
Beyond device makers, compute players will benefit by powering the "brains" of these systems. Telecom operators will gain from increased data traffic, connectivity, and edge services. Meanwhile, software and services providers will see recurring revenue opportunities through data analytics, lifecycle management, fleet services, and cloud infrastructure.