Here's a summary of the key themes from the Hacker News discussion:
The Dominance of LLMs and the Eclipsed Status of Computer Vision
A central theme is the perception that Large Language Models (LLMs) have "sucked the entire energy out of computer vision," dominating job postings and research focus. For those still working in computer vision, this has created a feeling of being "left out of it all while still doing great stuff." A historical parallel is drawn to when "ML itself sucked all the energy out of computer vision," suggesting that research trends are cyclical.
- "It's hard to describe, but it's felt like LLMs have completely sucked the entire energy out of computer vision." - skwb
- "Like... I know CVPR still happens and there's great research that comes out of it, but almost every single job posting in ML is about LLMs to do this and that to the detriment of computer vision." - skwb
- "agreed about sucking the air out by LLM. The positive side is that its a good time to innovate in other areas while a chunk of ppl are absorbed in LLMs." - smath
- "And I remember when ML itself sucked all the energy out of computer vision. Time to pay the piper." - glitchc
- "It felt the same back in 2012-2015 when deep learning was flooding over computer vision." - whiplash451
LLMs vs. Spatial Intelligence: Limitations and Potential
A significant portion of the discussion revolves around the capabilities and limitations of LLMs concerning spatial reasoning and understanding the physical world. While LLMs excel at pattern recognition in text, they often struggle with precise spatial relationships, geometry, and understanding physical constraints.
- "weinzierl: I tried LLM's for geolocation recently and it is both amazing how good they are at recognizing patterns and how terrible they are with recognizing and utilizing basic spatial relationships." - weinzierl
- "Nobody controls a drone, missile or vehicle by taking a screenshot and sending it to ChatGPT and has it do math while it's on flight, anything that requires as the title of the thread says, spatial intelligence is unsuited for a language model" - Barrin92
- "LLMs really suck at some basic tasks like counting the sides of a polygon." - porphyra
- "weinzierl: ...the gist of it is that apparently it has no idea about spatial relationships.... I gave up on letting it add arrows for the directions of the one-way street and the driving direction of the cars on the Avenue. In at the end, letting it match that bird’s eye view against a map of Manhattan and finding the respective corner also did not work." - weinzierl
- "ansgri: I've tried to use various OpenAI models for OpenSCAD code generation, and while the code was valid, it absolutely couldn't get spatial relationships right. Even in simple cases, like a U-tube assembled from 3 straight and 2 curved segments. So this is definitely an area for improvement." - ansgri
- "AStrangeMorrow: Yeah even LLM generated code for a 2D optimization problems with many spatial relationships has being absolutely terrible, while I had great success in other domains." - AStrangeMorrow
- "dopadelic: Yes, classic LLMs (like GPT) operate as sequence predictors with no inductive bias for space, causality, or continuity. They're optimized for language fluency, not physical grounding." - dopadelic
The Need for and Challenges of Real-World Spatial Understanding
The discussion highlights the critical need for AI systems to possess robust spatial intelligence for applications in robotics, autonomous vehicles, AR/VR, and more. However, achieving this involves significant challenges, particularly around data acquisition, representation, and processing.
- "jandrewrogers: Most dynamics of the physical world are sparse, non-linear systems at every level of resolution. Most ways of constructing accurate models mathematically don’t actually work. LLMs, for better or worse, are pretty classic ( in an algorithmic information theory sense) sequential induction problems. We’ve known for well over a decade that you cannot cram real-world spatial dynamics into those models. It is a clear impedance mismatch." - jandrewrogers
- "lsy: Learning how the physical world works (not just kind of works a la videogames, actually works) is not only about a jillion times more complicated, there is really only one usable dataset: the world itself, which cannot be compacted or fed into a computer at high speed." - lsy
- "psb217: I think there's an implicit assumption here that interaction with the world is critical for effective learning. In that case, you're bottlenecked by the speed of the world... when learning with a single agent. One neat thing about artificial computational agents, in contrast to natural biological agents, is that they can share the same brain and share lived experience, so the "speed of reality" bottleneck is much less of an issue." - psb217
- "jgord: Robots that interact in the real world will need to make a 3D model in realtime and likely share it efficiently with comrades." - jgord
- "m-s-y: lol. the world Is breakable. Any model based on it will need to know this anyway. Am I missing your argument?" - m-s-y (responding to loa_in_'s point about breakable objects)
- "starchild3001: Her point about spatial intelligence being the next frontier after language really resonates. As someone who's been writing code for a while, it reminds me of the leap in complexity from single-threaded applications to concurrent, distributed systems. The state space just explodes." - starchild3001
- "starchild3001: If you lean heavily on synthetic data, you're in a constant battle with the "sim-to-real" gap. ... If you lean on real-world capture (e.g., massive-scale photogrammetry, NeRFs, etc.), the MLOps and data pipeline challenges seem staggering." - starchild3001
Alternative Approaches and Future Directions (ML/RL, Multimodal Models, New Representations)
Despite the dominance of LLMs, there's a strong current of optimism and exploration into methods that go beyond standard text-based AI for spatial understanding. This includes a renewed appreciation for Reinforcement Learning (RL) and classical ML techniques, the development of multimodal models, and novel data representations.
- "jgord: To me its totally obvious that we will have a plethora of very valuable startups who use RL techniques to solve realworld problems in practical areas of engineering .. and I just get blank stares when I talk about this :]" - jgord
- "jgord: You can think of RL as an optimization - greatly speeding up something like monte carlo tree search, by learning to guess the best solution earlier." - jgord
- "porphyra: I feel like 3D reconstruction/bundle adjustment is one of those things where LLMs and new AI stuff haven't managed to get a significant foothold. Recently VGGT won best paper which is good for them, but for the most part, stuff like NERF and Gaussian Splatting still rely on good old COLMAP for bundle adjustment using SIFT features." - porphyra
- "pzo: thought there been a lot of progress in last 2 years. (Video) Depth Anything, SAM2, grounding Dino, DFINE, VLM, Gaussian splats, Nerf. Sure less than progres in LLm but still I would say progress accelerated with LLM research." - pzo
- "dopadelic: But multimodal models like ViT, Flamingo, and Perceiver IO are a completely different lineage, even if they use transformers under the hood. They tokenize images (or video, or point clouds) into spatially-aware embeddings and preserve positional structure in ways that make them far more suited to spatial reasoning than pure text LLMs." - dopadelic
- "jgord: I have no doubt that we are on the brink of a massive improvement in 3D processing. Its clearly solvable with the ML/RL approaches we currently have .. we dont need AGI." - jgord
- "jgord: Perhaps the points Im trying to make are : - the normal techniques are useful but not quite enough [ heuristics, classical CV algorithms, colmap/SfM ] - NeRFs and gaussian splats are amazing innovations, but dont quite get us there - to solve 3D reconstruction, from pointclouds or photos, we need ML to go beyond our normal heuristics : 3D reality is complicated - ML, particularly RL, will likely solve 3D reconstruction quite soon, for useful things like buildings" - jgord
- "KaiserPro: Humans don't really keep complex dense reconstructions in our head. Its all about spatial relationships of landmarks." - KaiserPro
- "sega_sai: I feel that if words/phrases/whole texts can be embedded well in high dimensional spaces as points, the same must apply to the 3d world. I'm sure there will be embeddings of it (i.e. mapping the 3-d scene into a high-D vector) and then we'll be work with those embeddings as LLMs work with text" - sega_sai
- "voxleone: I'm trying to approach spatial reasoning by introducing quaternions to navigate graphs. It is a change in the unit of traversal — from positional increment to rotational progression. This reframing has cascading effects. It alters how we model motion, how we think about proximity, and ultimately how systems relate to space itself." - voxleone
- "jandrewrogers: For classic scalar data models, representations that preserve the relationships have the same dimensionality as the underlying data model. A set of points in 2-dimensions can always be represented in 2-dimensions such that they satisfy the cutting problem (e.g. a quadtree-like representation). For non-scalar types like rectangles, operations like equality and intersection are distinct and there are an unbounded number of relationships that must be preserved that touch on concepts like size and aspect ratio to satisfy cutting requirements. The only way to expose these additional relationships to cutting algorithms is to encode and embed these other relationships in a (much) higher dimensionality space and then cut that space instead." - jandrewrogers
The Nature of Intelligence and Learning
A philosophical layer to the discussion concerns the fundamental nature of intelligence, learning, and how it relates to embodiment, environment, and innate predispositions.
- "myspeed: Most of our spatial intelligence is innate, developed through evolution. We're born with a basic sense of gravity and the ability to track objects. When we learn to drive a car, we simply reassign these built-in skills to a new context" - myspeed
- "pzo: Is there any research about it ? This would mean we massing some knowledge in genes and when offspring born have some knowledge of our ancestors. This would mean the weights are stored in DNA?" - pzo
- "cma: Horses can be blindfolded at birth and when removed do basic navigation with no time for any training. Other non-visually precocious animals like cats, if they miss a critical development period without getting natural vision data, will never develop a functioning visual system. Baby chicks can do bipedal balance pretty much as soon as they dry off." - cma
- "moktonar: Intelligence is not only embodied (it needs a body), it is also embedded in the environment (it needs the environment). If you want an intelligence in your computer, you need an environment in your computer first, as the substrate from which the intelligence will evolve." - moktonar
- "epr: Human beings get by quite well with extremely oversimplified (low resolution) abstractions. There is no need whatsoever for something even approaching universal or perfect. Humans aren't thinking about fundamental particles or solving differential equations in their head when they're driving a car or playing sports." - epr
- "alganet: How can I be sure that spatial intelligence AIs will not be just intricate sensoring that ultimately fails to demonstrate actual intelligence?" - alganet