Three ways AI is learning to understand the physical world

Mar 20, 2026 | Technology

Large language models are running into limits in domains that require an understanding of the physical world, from robotics to autonomous driving to manufacturing. That constraint is pushing investors toward world models, with AMI Labs raising a $1.03 billion seed round shortly after World Labs secured $1 billion.

Large language models (LLMs) excel at processing abstract knowledge through next-token prediction, but they fundamentally lack grounding in physical causality: they cannot reliably predict the physical consequences of real-world actions. AI researchers and thought leaders are increasingly vocal about these limitations as the industry tries to push AI out of web browsers and into physical spaces. In an interview with podcaster Dwarkesh Patel, Turing Award recipient Richard Sutton warned that LLMs merely mimic what people say instead of modeling the world, which limits their capacity to learn from experience and adapt to changes in their environment.

This is why models built on LLMs, including vision-language models (VLMs), can behave brittly and break under very small changes to their inputs. Google DeepMind CEO Demis Hassabis echoed this sentiment in another interview, pointing out that today’s AI models suffer from “jagged intelligence”: they can solve olympiad-level math problems yet fail at basic physics, because they lack critical capabilities for reasoning about real-world dynamics.

To address this gap, researchers are shifting their focus to world models, which act as internal simulators and allow AI systems to safely test hypotheses before taking physical action. “World models,” however, is an umbrella term, and three distinct architectural approaches have emerged under it, each with different tradeoffs.

JEPA: built for real-time

The first approach focuses on learning latent representations instead of predicting the world’s dynamics at the pixel level. Championed by AMI Labs, this method is heavily based on the Joint Embedding Predictive Architecture (JEPA). JEPA models try to mimic how humans understand the world: when we observe a scene, we do not memorize every single pixel or irrelevant detail. For example, if you watc …
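To make the latent-prediction idea concrete, here is a minimal, hypothetical PyTorch sketch of JEPA-style training: a predictor learns to match the latent embedding of a masked or future "target" view produced by a separate target encoder, so the loss lives entirely in latent space rather than in pixels. All module names, layer sizes, and the EMA coefficient below are illustrative assumptions, not details of AMI Labs' actual system.

```python
# Minimal JEPA-style training step (illustrative sketch, not a real implementation).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps an input view to a compact latent vector."""
    def __init__(self, in_dim=1024, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, latent_dim),
        )

    def forward(self, x):
        return self.net(x)

context_encoder = Encoder()      # sees the visible "context" view
target_encoder = Encoder()       # sees the masked/future "target" view
predictor = nn.Linear(128, 128)  # predicts the target latent from the context latent

# The target encoder is typically updated as an exponential moving average (EMA)
# of the context encoder rather than by gradient descent, which helps prevent
# representation collapse.
for p in target_encoder.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(
    list(context_encoder.parameters()) + list(predictor.parameters()), lr=1e-4
)

def training_step(context_view, target_view):
    z_context = context_encoder(context_view)
    with torch.no_grad():
        z_target = target_encoder(target_view)  # no gradients flow to the target
    z_pred = predictor(z_context)
    # The loss is computed in latent space: no pixel reconstruction anywhere.
    loss = nn.functional.mse_loss(z_pred, z_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # EMA update of the target encoder from the context encoder (0.996 is an
    # assumed decay rate).
    with torch.no_grad():
        for pt, pc in zip(target_encoder.parameters(), context_encoder.parameters()):
            pt.mul_(0.996).add_(pc, alpha=1 - 0.996)
    return loss.item()

# Usage with random stand-in data (real JEPA models operate on image or video patches):
loss = training_step(torch.randn(8, 1024), torch.randn(8, 1024))
```

Because the model only has to predict abstract features rather than every pixel, this style of architecture can be far cheaper to run, which is what makes it attractive for real-time settings such as robotics.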
