Spatial Intelligence & World Models
Why current LLMs may not be sufficient for human-level intelligence, and what others have been exploring instead.
Ever since I visited Manycore in Hangzhou last summer, I’ve (lowkey) been obsessed with the idea of spatial intelligence and world models.
I’ve spent some time trying to understand why that’s the case, and I’ve come to the conclusion that perhaps it’s because I fundamentally believe that LLMs in their current state will never be enough to achieve human-level intelligence (and are fundamentally flawed for this purpose; I won’t go into too much detail here about why).
Don’t get me wrong: they are and will continue to be useful for many things, just not for reaching human-level intelligence.
Intuitively, perhaps the reason the concept of a world model sounds so appealing is that it has (in theory) an accurate internal representation of our world (physical reality), kind of like a physics engine that understands causation (a topic which I’ve also had the pleasure of exploring more deeply this semester with NHS2088: The Baby, the philosopher and the cognitive scientist 😇).
For a while now, frontier labs have been trying to achieve this through different approaches, with the most popular currently being Vision-Language Models (VLMs) and Multimodal Large Language Models (MLLMs).
(ByteDance Seed has a paper on how spatial abilities branch out in MLLMs here.)
Some of these labs include:
- SpatialVerse under Manycore with their release of SpatialLM and SpatialGen
- Fei-Fei Li’s World Labs (which has an insane valuation of $5B) with Marble, which is a generative model that uses Gaussian splatting to create interactive 3D environments
- Google DeepMind with their Genie family of autoregressive models
- Runway with their general world model GWM-1, which is an autoregressive diffusion model built on top of Gen-4.5
- NVIDIA with Cosmos, their world foundation model platform
- General Intuition, which is a frontier research lab dedicated to building foundation models for environments that require deep spatial and temporal reasoning
Yann LeCun has also proposed abstract latent-space representations and energy-based models through the Joint-Embedding Predictive Architecture (JEPA). Notably, his new startup AMI Labs (which raised $1B in seed funding not too long ago) has been brewing for a while, continuing the exploration of JEPA-based models.
If you wanna have a better understanding of his philosophy, feel free to check out these links:
- LeWorldModel: Stable End-to-End JEPA training
- What Is Yann LeCun Cooking? JEPA Explained Simply
- Stanford CS25: Transformers United V6 | From Representation Learning to World Modeling
- Yann LeCun | Self-Supervised Learning, JEPA, World Models, and the future of AI
TLDR: Babies take in way more information in their first few years through their visual cortex and sensory interaction than LLMs do through text scraped from the web. JEPA learns to predict abstract representations of the world, which allows for more intuitive physics and common sense.
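To make the latent-prediction idea a bit more concrete, here’s a minimal sketch of a JEPA-style training step in PyTorch. It’s illustrative only: the encoder sizes, the two-view setup, and the EMA coefficient are assumptions on my part, not taken from any of the papers or talks linked above. The point is just that the loss lives in representation space rather than pixel space.

```python
import torch
import torch.nn as nn

# Minimal JEPA-style training step (illustrative sketch, not from the papers above).
# Core idea: predict the *latent representation* of a target view,
# not its raw pixels, so the model is free to ignore unpredictable detail.

class Encoder(nn.Module):
    def __init__(self, dim_in=784, dim_latent=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_in, 256), nn.ReLU(), nn.Linear(256, dim_latent)
        )

    def forward(self, x):
        return self.net(x)

context_encoder = Encoder()       # sees the visible part of the input
target_encoder = Encoder()        # sees the target part; updated by EMA, not gradients
target_encoder.load_state_dict(context_encoder.state_dict())
predictor = nn.Linear(128, 128)   # predicts the target latent from the context latent

opt = torch.optim.AdamW(
    list(context_encoder.parameters()) + list(predictor.parameters()), lr=1e-4
)

def train_step(context_view, target_view, ema=0.996):
    z_context = context_encoder(context_view)
    with torch.no_grad():                      # target representation is a fixed regression target
        z_target = target_encoder(target_view)
    loss = nn.functional.mse_loss(predictor(z_context), z_target)

    opt.zero_grad()
    loss.backward()
    opt.step()

    # EMA update keeps the target encoder a slow-moving copy of the context encoder
    with torch.no_grad():
        for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
            p_t.mul_(ema).add_(p_c, alpha=1 - ema)
    return loss.item()

# e.g. two views of the same (flattened) image
loss = train_step(torch.randn(32, 784), torch.randn(32, 784))
```

Because the target encoder only moves via the slow EMA update, the model can’t trivially collapse to copying its input, and it can discard details (exact textures, noise) that aren’t predictable anyway.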
There have also been some neuroscience-inspired approaches, like proponents of Karl Friston’s Free Energy Principle trying to map Active Inference onto model architectures, evolutionary strategies, and so on, but I think I’ll leave it at that.
So, where does this leave us?
One thing that keeps coming back to mind is Embodied Robotics. For a robot to be truly intelligent and capable, it needs to be able to simulate or predict physical outcomes. A robot that can be trained in simulated worlds, or equipped with a model that gives it strong world-modelling capabilities, might just be able to achieve human-level intelligence.
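If you squint, the way such a robot might actually use a world model looks something like model predictive control: imagine a bunch of candidate action sequences inside the model, score them, execute the best first action, then re-plan. Here’s a toy sketch of that loop; the WorldModel class, its dimensions, and the random-shooting planner are all my own illustrative assumptions, not anyone’s published architecture.

```python
import torch
import torch.nn as nn

# Toy sketch: a robot "imagining" candidate action sequences inside a learned
# world model and picking the one predicted to score best
# (random-shooting model predictive control). Illustrative assumptions throughout.

class WorldModel(nn.Module):
    """Predicts (next_state, reward) from (state, action); assumed trained elsewhere
    on real or simulated rollouts."""
    def __init__(self, state_dim=16, action_dim=4):
        super().__init__()
        self.dynamics = nn.Linear(state_dim + action_dim, state_dim)
        self.reward = nn.Linear(state_dim + action_dim, 1)

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)
        return self.dynamics(x), self.reward(x).squeeze(-1)

def plan(model, state, horizon=10, n_candidates=256, action_dim=4):
    """Roll out random action sequences inside the model; return the first action of the best one."""
    actions = torch.randn(n_candidates, horizon, action_dim)   # candidate sequences
    states = state.expand(n_candidates, -1)
    total_reward = torch.zeros(n_candidates)
    with torch.no_grad():
        for t in range(horizon):
            states, reward = model(states, actions[:, t])      # imagined step
            total_reward += reward
    best = total_reward.argmax()
    return actions[best, 0]   # execute only the first action, then re-plan next step

model = WorldModel()
current_state = torch.zeros(1, 16)
next_action = plan(model, current_state)
print(next_action)
```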