The Grounding Problem: Can Understanding Exist Without a World?

Imagine a dictionary written entirely in a language you don't speak. Every word is defined using other words in that same language. You can follow the cross-references indefinitely, from one entry to another to another, and you'll never learn what any of them mean. You'll learn that word A is defined in terms of words B and C, and that word B is defined in terms of words D and E, but you'll never connect any of them to the world. The definitions are circular. The symbols refer only to other symbols.
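The closed loop of definitions can be made concrete with a small sketch. The toy dictionary, its nonsense words, and the function names below are all invented for illustration. The code only shows the structural point: following cross-references from any entry never leaves the set of entries.

```python
from collections import deque

# Hypothetical toy dictionary: every word is defined only by other headwords.
TOY_DICTIONARY = {
    "blorp": ["zint", "kavu"],
    "zint": ["kavu", "mela"],
    "kavu": ["mela", "blorp"],
    "mela": ["blorp", "zint"],
}


def reachable_symbols(dictionary, start):
    """Follow definitions breadth-first and return every symbol reached."""
    seen = {start}
    queue = deque([start])
    while queue:
        word = queue.popleft()
        # A symbol with no entry has no definition to follow further.
        for part in dictionary.get(word, []):
            if part not in seen:
                seen.add(part)
                queue.append(part)
    return seen


def is_closed_over_symbols(dictionary, start):
    """True if everything reachable from start is itself just another entry."""
    return reachable_symbols(dictionary, start) <= set(dictionary)


print(sorted(reachable_symbols(TOY_DICTIONARY, "blorp")))
print(is_closed_over_symbols(TOY_DICTIONARY, "blorp"))
```

In this toy case the check passes from every starting word. Only an entry that pointed at something outside the dictionary would break the loop, and what that outside thing would have to be is exactly the open question.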

In 1990, the cognitive scientist Stevan Harnad formalized this intuition as the "symbol grounding problem": how can the meanings of symbols be made intrinsic to a system, rather than parasitic on the meanings in an external interpreter's head?^[1] Harnad's formulation explicitly connects to Searle's Chinese Room: the person in the room manipulates symbols that refer to other symbols, but none of them connect to anything in the world. The symbols are ungrounded. They float free of reality.

Large language models are trained on text. Text is symbols defined in terms of other symbols. The training data is, in a sense, the largest dictionary ever compiled: billions of instances of humans using words to describe, explain, argue, narrate, and define, all in terms of other words. The model learns the statistical relationships between these symbols with extraordinary precision. But has it ever touched the world those symbols describe?

The Embodiment Thesis

[Figure: Two halves of an image. On the left, a figure interacting with a rich physical world of objects and sensations; on the right, the same concepts represented only as floating text and symbols in an empty void. Caption: Can you understand a world you've never touched?]

The embodiment thesis holds that genuine understanding requires a body that interacts with the physical world. You can't understand "hot" without having experienced heat. You can't understand "heavy" without having lifted something. You can't understand "red" without having seen it. Understanding isn't a relationship between symbols. It's a relationship between symbols and the experiences they represent.

On this view, LLMs are fundamentally limited. They know about heat from millions of sentences describing heat, but they've never been burned. They know about color from descriptions of color, but they've never seen anything. They know about weight, texture, smell, and pain entirely through other people's reports of these experiences. Their knowledge is propositional (knowledge that) rather than experiential (knowledge of).

Frank Jackson's "Mary's Room" thought experiment sharpens this point. Mary is a brilliant neuroscientist who knows every physical fact about color vision: the wavelengths, the neural processes, the perceptual mechanisms. But she has lived her entire life in a black-and-white room. When she finally sees red for the first time, does she learn something new?^[2] Most people intuit that she does. She gains experiential knowledge that all her propositional knowledge couldn't provide.

If Mary learns something by experiencing red that she couldn't learn from complete physical descriptions of red, then LLMs, which have access only to descriptions and never to experiences, may be permanently missing something that matters for genuine understanding.

Grounding Once Removed

The counter-argument is subtle: LLMs are grounded, just not directly. They're grounded in human language, which is itself grounded in human experience. The training data isn't abstract symbol manipulation. It's the accumulated record of billions of humans describing their embodied experiences. When someone writes "the coffee burned my tongue," that sentence is grounded in the writer's experience of heat, surprise, and pain. The model learns from millions of such grounded descriptions.

Is that sufficient? The model has access to the linguistic traces of embodiment without embodiment itself. It knows what people say about being burned without ever being burned. It's learning from a map drawn by people who walked the territory. The map is extraordinarily detailed, drawn by billions of cartographers over decades. But it's still a map, not the territory.

The pragmatist response: if the map is detailed enough, and if the model uses it correctly in every context, does the distinction between map and territory matter functionally? A model that correctly predicts every consequence of being burned, that uses "hot" appropriately in every sentence, that can reason about thermal dynamics and pain responses and behavioral reactions, is that model missing something important? Or is it missing something metaphysically interesting but practically irrelevant?

Multimodal AI: Partial Grounding?

The grounding debate has shifted with the advent of multimodal models. Systems that process images alongside text, that can describe visual scenes, answer questions about photographs, and reason about spatial relationships, have a form of perceptual access that text-only models lack.

[Figure: A robotic hand reaching toward a glowing flame, sensors on the fingertips registering data, the boundary between physical experience and data processing made visible. Caption: The robot registers a sensor value. Does it experience heat, or merely process a signal?]

Does seeing images of fire give a model something closer to understanding "hot" than text alone provides? The image shows the visual appearance of fire. It doesn't convey the sensation of heat. It's a richer symbol, a visual one rather than a linguistic one, but it's still a symbol. The model sees what fire looks like without feeling what fire does.

Google's PaLM-E took this further by connecting a language model directly to robotic sensors and actuators, creating what the researchers called an "embodied multimodal language model."^[3] PaLM-E can plan actions for a physical robot based on natural language instructions, integrating visual perception, proprioceptive state, and language into a single model. Google DeepMind's RT-2 demonstrated that a vision-language model can translate natural language commands directly into robotic actions, showing generalization to novel objects and commands it wasn't explicitly trained on.^[4]

These systems represent a meaningful step toward grounding. They interact with physical environments. They receive sensory feedback. Their "words" connect to physical actions with physical consequences. If the robot reaches for a hot object and its temperature sensor registers a dangerously high reading, the system has something closer to embodied experience than a text-only model ever will.

But critics would note: the robot doesn't experience heat. It registers a sensor value. The question is whether there's a meaningful difference between experiencing heat and processing a signal that says "high temperature detected," or whether experience just is the processing of such signals at sufficient complexity.

The Developmental Argument

Human understanding is built on sensorimotor interaction that begins well before language arrives. Infants spend roughly twelve months exploring the physical world through touch, taste, movement, and observation before producing their first words. Their language, when it comes, is layered on top of a rich foundation of embodied experience. Words acquire meaning by being associated with experiences the child has already had.

AI systems skip this entirely. They begin with language, fully formed, and never pass through the developmental stage where symbols connect to sensory experience. They're born as adults who've read every book but never left the library.

The question is whether this matters. Does understanding require the developmental path, or just the end state? A child who learns "ball" by holding one and a model that learns "ball" from a million descriptions of balls end up with different relationships to the concept. But if their behavior with respect to the concept is identical in every testable context, does the different developmental path matter?

The Blind Scholar Problem

[Figure: A spectrum bar transitioning from dark empty space on the left, through increasingly rich sensory representations, to a fully embodied figure interacting with a vibrant physical world on the right. Caption: Perhaps grounding isn't binary. Understanding may exist on a spectrum from pure symbol manipulation to full embodied experience.]

There's a complication that cuts against strict embodiment requirements. Blind individuals develop rich understanding of concepts like color through linguistic and social interaction alone. A person blind from birth can use color terms correctly, reason about color relationships, and understand color metaphors. They lack the sensory experience of seeing red, but they clearly understand something about redness. Their understanding is grounded in language and social context rather than direct visual experience.

This suggests grounding may be more flexible than the strict embodiment thesis implies. If a blind person can genuinely understand color through linguistic interaction alone, then perhaps the route to understanding doesn't require direct sensory access to every concept. Linguistic grounding, learning a concept through its relationships to other concepts and to social contexts, may be sufficient for some forms of understanding, even if it produces a different kind of understanding than sensory grounding provides.

This doesn't fully resolve the LLM case, because a blind person has other forms of embodied experience (touch, sound, proprioception, emotion) that contextualize their linguistic learning. They aren't disembodied in the way an LLM is. But it demonstrates that the relationship between experience and understanding is more complex than "no experience, no understanding."

The Grounding Spectrum

Perhaps grounding isn't binary. Rather than "grounded" or "ungrounded," understanding might exist on a spectrum:

A thermostat has essentially no grounding. It responds to temperature with no representation of what temperature means.

A text-only LLM has linguistic grounding: it knows the relationships between symbols as humans use them, without direct sensory connection to what they represent.

A multimodal model has perceptual grounding: it connects visual and sometimes auditory representations to linguistic ones, though without physical interaction.

An embodied AI system has sensorimotor grounding: it interacts with a physical environment and its language connects to actions with consequences.

A human has full grounding: developmental, sensory, emotional, social, and linguistic, all interwoven over years of embodied experience.

Each step adds something. Whether any step constitutes "real understanding" may depend on what we demand understanding to be. If understanding requires human-style embodiment, only humans understand. If understanding requires connection to consequences, embodied AI may qualify. If understanding requires only correct functional relationships between concepts, even text-only models may possess some form of it.
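The way the verdict shifts with the definition can be laid out as a small sketch. It assumes, purely for illustration, that the five kinds of grounding form a strict ordering and that each demand on understanding maps to a minimum level. The names `Grounding`, `SYSTEMS`, `CRITERIA`, and `qualifying_systems` are hypothetical, and the thresholds encode one reading of the paragraph above, not a settled classification.

```python
from enum import IntEnum


class Grounding(IntEnum):
    """Assumed ordering of the grounding levels described in the text."""
    NONE = 0          # thermostat
    LINGUISTIC = 1    # text-only LLM
    PERCEPTUAL = 2    # multimodal model
    SENSORIMOTOR = 3  # embodied AI system
    FULL = 4          # human


SYSTEMS = {
    "thermostat": Grounding.NONE,
    "text-only LLM": Grounding.LINGUISTIC,
    "multimodal model": Grounding.PERCEPTUAL,
    "embodied AI": Grounding.SENSORIMOTOR,
    "human": Grounding.FULL,
}

# Hypothetical thresholds: the minimum level each demand on understanding implies.
CRITERIA = {
    "human-style embodiment": Grounding.FULL,
    "connection to consequences": Grounding.SENSORIMOTOR,
    "functional relationships between concepts": Grounding.LINGUISTIC,
}


def qualifying_systems(criterion):
    """Return the systems at or above the level this criterion demands."""
    threshold = CRITERIA[criterion]
    return [name for name, level in SYSTEMS.items() if level >= threshold]


for criterion in CRITERIA:
    print(f"{criterion}: {qualifying_systems(criterion)}")
```

Nothing in the sketch settles which criterion is the right one. It only shows that the list of systems that "understand" is a function of the threshold chosen, which is where the real disagreement lies.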

The grounding problem doesn't have a clean answer. It has a question that grows more nuanced with each new kind of AI system. What the Chinese Room asked about syntax and semantics, the grounding problem asks about symbols and the world: can you understand a world you've never touched? The answer may be less about yes or no, and more about what kind of understanding is possible without what kind of contact.

References

[1] Stevan Harnad, "The Symbol Grounding Problem," Physica D, Vol. 42, 1990, pp. 335-346. https://arxiv.org/html/cs/9906002v1/

[2] Frank Jackson, "Epiphenomenal Qualia," Philosophical Quarterly, Vol. 32, No. 127, 1982, pp. 127-136. For a comprehensive overview, see the Stanford Encyclopedia of Philosophy, "The Knowledge Argument." https://plato.stanford.edu/entries/qualia-knowledge/

[3] Danny Driess et al., "PaLM-E: An Embodied Multimodal Language Model," Google Research and Technical University of Berlin, 2023. https://palm-e.github.io/

[4] Google DeepMind, "RT-2: New model translates vision and language into action," July 2023. https://deepmind.google/discover/blog/rt-2-new-model-translates-vision-and-language-into-action/