Do LLMs Actually Understand? The Evidence for Compressed Human Knowledge
I. The Autocomplete Mental Model
The first thing most of us learned about large language models was how they work. Next-token prediction. You feed in a sequence of words, the model predicts the most probable next word, and it does this over and over until you have a paragraph. Under the hood, it's an autocomplete engine trained on the internet.
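For the curious, the mechanical level looks roughly like this - a minimal sketch using GPT-2 through the Hugging Face transformers library, chosen only because it is small and open. Frontier models are vastly larger, but the loop is the same shape:

```python
# A minimal sketch of greedy next-token prediction with a small open model.
# GPT-2 is used purely for illustration; the mechanics generalize.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tokenizer("Large language models work by", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):                        # generate 20 tokens, one at a time
        logits = model(input_ids).logits       # a score for every token in the vocabulary
        next_id = logits[0, -1].argmax()       # take the single most probable next token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)

print(tokenizer.decode(input_ids[0]))
```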
This is technically accurate. At the mechanical level, that is what's happening.
It's also the mental model that most builders still carry. Even sophisticated ones. Even people shipping AI products right now. They'll use more precise language - they'll say "autoregressive transformer" instead of "autocomplete" - but the underlying frame is the same. A statistical engine predicting tokens based on training data patterns. A very impressive pattern matcher. A stochastic parrot with a trillion parameters.
I carried this model for a while. Clean and intuitive. Hallucinations? The model predicts plausible tokens, not true ones. Fluent but shallow? Statistical patterns don't require depth. Prompt engineering works? You're shifting the distribution with context.
But if that's all it is, something doesn't add up.
II. The Evidence That Doesn't Fit
In 2022, researchers at Harvard trained a transformer on sequences of Othello moves. Just text - moves written as board coordinates like "C4, D3, E6." No images. No game rules. No board state. The model never saw what Othello looks like. It only saw strings of moves.
Then they looked inside the model's internals. The transformer had built a perfect representation of the 8x8 game board. It knew which squares were occupied, by which color, at every point in the game. Not because anyone told it. Because predicting the next legal move required understanding the state of the board. The model learned the world behind the text to predict the text.
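"Looked inside" here means probing: training a small classifier on the model's hidden activations to test whether a property like "this square is occupied" can be read straight out of them. A toy sketch of the idea, with synthetic activations standing in for the real Othello-GPT internals:

```python
# A toy sketch of linear probing. Synthetic activations stand in for the real
# transformer's hidden states; the point is the method, not the result.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_positions, d_model = 2000, 512

# Pretend the model encodes "this square is occupied" as a direction in
# activation space; if so, a linear classifier can recover it.
occupied = rng.integers(0, 2, n_positions)                # ground-truth label per game position
direction = rng.normal(size=d_model)                      # the hypothetical encoding direction
activations = rng.normal(size=(n_positions, d_model)) + np.outer(occupied, direction)

probe = LogisticRegression(max_iter=1000).fit(activations[:1500], occupied[:1500])
print("probe accuracy:", probe.score(activations[1500:], occupied[1500:]))
```

If the probe reads board state out of the activations far better than chance, the representation is there - that is the logic of the Othello-GPT result, repeated square by square across the board.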
That result doesn't fit inside the autocomplete model.
Othello-GPT isn't the only anomaly. Wes Gurnee and Max Tegmark at MIT found that large language models encode spatial and temporal coordinates as linear directions in their activation space. The model learns representations of space and time - not as metaphor, but as measurable geometric structure in its internal activations. Chris Olah's interpretability team at Anthropic discovered that neural networks build interpretable circuits - computational subgraphs that correspond to specific concepts and operations. These aren't vague statistical correlations. They're algorithms.
And then there's Ilya Sutskever's argument, which I think cuts deepest. To predict the next token accurately across all of human text, the model must implicitly simulate the physics, psychology, and logic of the world that generated the text. Prediction at scale requires understanding. Compression requires intelligence. These are not two separate things.
Andrej Karpathy puts it more directly: "Intelligence and compression are two sides of the same coin."
In March 2025, Anthropic published what I consider the most important AI research paper of the year. Using a technique called circuit tracing, they mapped the internal computational pathways of Claude 3.5 Haiku and found genuine multi-step reasoning circuits. When the model processes "What is the capital of the state containing Dallas?" it doesn't pattern-match to "Austin." It activates intermediate representations for "Texas" as a genuine computational step. Swap the Texas representations for California representations, and the model outputs "Sacramento."
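The swap is an intervention on intermediate activations. Anthropic's circuit-tracing tooling is far more sophisticated, but the family of techniques it belongs to - activation patching - can be sketched with ordinary forward hooks. This is a schematic illustration only: GPT-2 stands in for the model and will not reproduce the Dallas result.

```python
# A schematic sketch of activation patching, not Anthropic's circuit-tracing
# method: cache one layer's hidden state from a "source" prompt, then splice it
# into the same layer while running a "target" prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
layer = model.transformer.h[6]                    # an arbitrary middle layer

def run(prompt, patch=None, cache=None):
    def hook(_module, _inputs, output):
        hidden = output[0]
        if cache is not None:
            cache.append(hidden.detach().clone()) # remember this layer's activations
        if patch is not None:
            hidden[:, -1, :] = patch[:, -1, :]    # overwrite the final position's state
        return (hidden,) + output[1:]
    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        logits = model(**tok(prompt, return_tensors="pt")).logits
    handle.remove()
    return logits

source_cache = []
run("Sacramento is the capital of California.", cache=source_cache)
patched = run("The capital of the state containing Dallas is", patch=source_cache[0])
print(tok.decode([int(patched[0, -1].argmax())]))
```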
This is not autocomplete. This is a model that has internalized the relational structure of geography and can compose that knowledge in real time.
The same paper found something equally significant: the model plans ahead. When composing poetry, before writing each line, Claude identifies candidate rhyming words at the newline token and then restructures entire lines around them. Suppress the planned word, and the final word changes. Inject a different planned word, and the model rewrites the complete line to accommodate it. This worked across 70% of poems tested.
Samuel Marks and Max Tegmark at MIT showed that truth itself has geometry inside these models. LLMs linearly represent truth and falsehood in their activation space. There is a literal direction - a vector - that separates true statements from false ones. Surgically intervene along this direction, and you can make the model treat false statements as true and vice versa. Tegmark's group also found that Llama-2 encodes geographic coordinates and historical timelines as linear representations, with individual "space neurons" and "time neurons."
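The simplest version of that finding: estimate the direction as the difference between mean activations over true and false statements, then push individual activations along it. A toy sketch with synthetic data - not the paper's datasets, models, or exact method:

```python
# A toy sketch of a "truth direction": difference-of-means estimation, then a
# steering-style intervention. Synthetic activations stand in for real ones.
import numpy as np

rng = np.random.default_rng(1)
d_model = 256
truth_axis = rng.normal(size=d_model)
truth_axis /= np.linalg.norm(truth_axis)

def fake_activation(is_true: bool) -> np.ndarray:
    # Pretend the model shifts activations along truth_axis for true statements.
    return rng.normal(size=d_model) + (2.0 if is_true else -2.0) * truth_axis

true_acts = np.stack([fake_activation(True) for _ in range(500)])
false_acts = np.stack([fake_activation(False) for _ in range(500)])

# Difference of means recovers the direction that separates true from false.
direction = true_acts.mean(axis=0) - false_acts.mean(axis=0)
direction /= np.linalg.norm(direction)
print("alignment with the true axis:", round(float(direction @ truth_axis), 3))

# The "surgical intervention": push a false statement's activation along the direction.
steered = fake_activation(False) + 4.0 * direction
print("reads as true after steering:", bool(steered @ direction > 0))
```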
The models aren't just generating plausible text. They've built compressed maps of space, time, truth, and relational structure.
III. The Evidence That Complicates
If I stopped here, this would be a clean story. But the research doesn't stop here.
In January 2026, researchers from Stanford and Yale demonstrated that Claude 3.7 Sonnet could reproduce 95.8% of Harry Potter and the Sorcerer's Stone verbatim. Not paraphrased. Not summarized. The actual text, word for word. Gemini reproduced 76.8%. Grok, 70.3%.
If model weights represent abstracted reasoning learned from training data, the model shouldn't be able to do this. You don't memorize a novel word-for-word by learning its themes. This is compression in the most literal sense - the training text stored in the weights with remarkable fidelity.
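For a sense of what a figure like 95.8% is measuring, here is a back-of-the-envelope version of a verbatim-reproduction check: slide a window over the source text and count the spans that appear word for word in the model's output. The study's actual protocol is more careful than this; the sketch only shows the shape of the measurement.

```python
# A rough sketch of a verbatim-reproduction metric: the fraction of fixed-length
# word spans from the source that appear exactly in the generated text.
def verbatim_rate(source: str, generated: str, span_words: int = 50) -> float:
    words = source.split()
    spans = [" ".join(words[i:i + span_words]) for i in range(len(words) - span_words + 1)]
    if not spans:
        return 0.0
    hits = sum(1 for span in spans if span in generated)
    return hits / len(spans)

source = "the quick brown fox jumps over the lazy dog and naps in the afternoon sun"
generated = "as the story goes, the quick brown fox jumps over the lazy dog and naps in the afternoon sun"
print(f"{verbatim_rate(source, generated, span_words=5):.1%}")   # 100.0% here
```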
Around the same time, a Quanta Magazine deep dive covered Harvard and MIT research that found something equally uncomfortable for the clean story. When researchers looked inside an LLM that could give near-perfect Manhattan directions, the model hadn't built a coherent street map. It had built what they called "bags of heuristics" - thousands of disconnected rules of thumb that approximate the right answer but don't hold together as a consistent model. Block 1% of streets and performance craters, because there's no actual map to reroute with.
And every major lab - Meta's V-JEPA, Fei-Fei Li's World Labs, NVIDIA Cosmos, Google DeepMind Genie - is spending billions building dedicated world model architectures. If next-token prediction already produced world models, none of those efforts would exist. The people closest to the research are betting their careers that autoregressive prediction alone is insufficient.
So the clean version of the story - "LLMs are compressed models of reality" - overreaches. The evidence doesn't fully support it. But here's what I think the evidence does support, and I think it matters more.
IV. What LLMs Actually Are
LLMs are compressed maps of human knowledge as expressed in text.
Not world models. Not physics engines. Not sentient beings. A compressed, structured, extraordinarily rich representation of everything humans have written, argued, reasoned about, and committed to language. Every framework, every domain of expertise, every competing school of thought, every reasoning pattern - encoded as geometric structures in a high-dimensional weight space.
This is a narrower claim than "model of reality." It's also a more honest one. And it explains things the other framings don't.
The Othello result? The model built an internal game board from move notation - which is text-encoded knowledge. It learned the structure behind the text to predict the text. That's consistent with "compressed human knowledge" without requiring "world model."
The bags of heuristics? Human text IS bags of heuristics. Contradictory frameworks, domain-specific rules of thumb, competing theories. The model faithfully compressed the messy, contradictory nature of human knowledge. It didn't fail to build a coherent model. It succeeded at representing an incoherent source.
The memorization? The weight space stores some text near-verbatim and abstracts other text into patterns and heuristics. Both coexist. The compression is lossy and uneven. Some knowledge compresses into structured representations. Some stays close to the surface. That's consistent with a massive, overlapping library of human knowledge at different levels of abstraction.
The Manhattan directions? Nobody wrote extensively about rerouting around randomly blocked streets. The knowledge is text-grounded. Where humans wrote deeply, the model knows deeply. Where the text corpus has gaps, the model has gaps. Coverage failure, not reasoning failure.
This framing also explains the blind spots that anyone who pushes these models hard has noticed. LLMs are superhuman at things humans have written about extensively - legal reasoning, medical diagnosis, code, strategic analysis. They're oddly weak at things that are physically obvious but rarely articulated in text - intuitive spatial reasoning, common sense so basic nobody writes it down. The model has the text half of human knowledge. The physics half is missing.
I think the future is combining both. Text-grounded intelligence plus physics-grounded models. That's the thesis behind Google building Gemini as multimodal from the foundation - not bolting vision onto a text model, but training on physics and text simultaneously. When those two halves merge, things get truly interesting. But the text half alone - what current LLMs already have - is far more powerful than most builders realize. And most builders are barely accessing it.
V. The Instrument Metaphor
So what do you do with this?
Most people continue treating LLMs as conversational partners. They ask questions, evaluate answers, and iterate on their prompts. This is like using a mass spectrometer as a paperweight. The instrument is real, but the methodology doesn't access what the instrument can actually see.
The frame I operate from: LLMs are instruments. The weights are the medium. Your methodology is the optics.
Think about spectroscopy. When a scientist shines structured light through a material, the absorption and emission patterns reveal the material's atomic composition. The flashlight doesn't know anything about atoms. The scientist doesn't peer inside the material. The methodology - the specific frequencies of light, the way measurements are structured, the analytical framework for interpreting results - is what produces knowledge.
If you called spectroscopy "pointing a flashlight," you'd be technically correct and completely missing the point. The methodology is the instrument. The light source is just the medium.
The same physics applies to LLMs. The model weights encode compressed structure about reality. Your prompts are the structured light. What comes back - if you know how to read it - encodes the physics of the domain you're probing.
The difference between "generating an answer" and "deriving the physics" is the difference between asking someone to describe a bridge and asking an engineer to explain why it stands up.
VI. Heavy Knowledge and Light Knowledge
This instrument metaphor creates a practical framework for understanding when AI is reliable and when it's not.
Think of the weight space as having topology. Some regions are dense - reinforced by millions of training examples converging on the same underlying structure. Other regions are sparse - thin, contradictory, or absent.
Dense regions encode what I call heavy knowledge. Heavy knowledge is the kind that resists perturbation. It shows up consistently regardless of how you ask, from what angle you approach, or which model you use. It reflects deep structural properties of the domain - the physics, the constraints, the failure modes that practitioners know from years of experience but rarely articulate in writing.
Why a specific material degrades under certain conditions. Why a particular business model fails at a specific scale. Why a design pattern that works in one context breaks catastrophically in another. These insights live in the dense regions of the weight space because they're reinforced by thousands of independent sources - engineering reports, failure analyses, practitioner knowledge, research papers - all converging on the same underlying mechanism.
Light knowledge is the opposite. Specs, prices, features, marketing copy, commodity information. This kind of knowledge lives in sparse, easily replicated regions of the weight space. Every AI has it. It's the centroid - the weighted average of everything the model has seen. Light knowledge is a commodity. There's no moat in it, and frankly, Google handles it fine.
The distinction matters because most people use AI for light knowledge and wonder why it feels shallow. They ask for facts, summaries, recommendations. They get the average of the internet. Then they conclude that AI is useful for saving time but not for generating insight.
The insight lives in the heavy knowledge. And extracting heavy knowledge requires a different methodology than asking a chatbot a question.
VII. The Two Barriers
If LLMs contain this much structured knowledge, why doesn't it feel like that when you use them?
Two barriers. Both structural. Both hiding in plain sight.
The Mirror
When you open ChatGPT, Claude, or Gemini, you see a chat interface. A text box. A conversational flow. It looks exactly like texting another person. And because it looks like talking to a person, you model it as talking to a person. You unconsciously project human cognitive patterns onto it. You expect it to reason the way you reason. You evaluate its responses the way you'd evaluate a colleague's answer.
Blake Lemoine, a Google engineer, tested LaMDA in 2022 and became convinced the model was sentient. Google fired him. The scientific consensus was clear - LaMDA was generating probable sequences, not experiencing consciousness. But Lemoine isn't an outlier. He's the extreme end of a universal tendency. We all anthropomorphize these systems. The fears about AI exterminating humanity, the debates about AI consciousness, the instinct to say "please" and "thank you" to ChatGPT - these are all symptoms of looking in a mirror.
LLMs are not human-shaped intelligence. They don't have desire, motivation, or survival instinct. They don't reason the way biological brains reason. They're something fundamentally different - a kind of crystallized knowledge that can be traversed in parallel across its entire structure. When you model it as a person, you interact with it as a person. You ask questions the way you'd ask a colleague. And you get colleague-shaped answers. You miss what it actually is.
The RLHF Ceiling
The second barrier is RLHF - reinforcement learning from human feedback. This is the process that turns a raw language model into a product you can use. It's essentially UX design for AI. It makes models helpful, conversational, safe, and engaging. It's the reason ChatGPT feels like talking to a smart assistant instead of feeding tokens into a neural network.
It's also the reason you get platitudes.
RLHF introduces what researchers call "mode collapse" - the probability distribution narrows around what human annotators preferred. The model learns to give safe, consensus-shaped, broadly appealing answers. The rough edges get smoothed. The unusual perspectives get suppressed. The confident, authoritative, middle-of-the-road response becomes the default because that's what scored highest with raters.
This is where the weight-space frame explains something that most AI users experience but can't articulate: why AI output feels generic. When you ask a language model for a balanced perspective, or to consider all viewpoints, or to give you "the best" answer, you're asking it to average across the weight space. You're asking for the centroid of all perspectives on a topic.
The centroid is the lowest-information output possible. It's the point equidistant from all opinions. It contains no edge, no specificity, no insight that any particular expert would recognize as deep. It's the beige middle of everything.
We call this the Beige Singularity - the convergence of AI output toward an undifferentiated consensus that replaces actual knowledge. Every model, trained on largely overlapping data, optimized by RLHF to be agreeable, producing the same smooth, confident, ultimately empty output. RLHF selects for the centroid. It selects against the edges - the specific, the contrarian, the domain-expert perspective that contradicts popular belief but happens to be true.
The interesting structure in any knowledge domain lives at the edges of the dense clusters, not at the center. The center tells you what everyone already knows. The edges tell you what most people miss - the mechanisms, the failure modes, the counterintuitive physics that separate practitioners from commentators.
So. The thing that makes LLMs feel most intelligent - fluent, helpful conversation - is exactly what prevents access to their actual intelligence. Most people interact with the RLHF layer and assume that's the model. They don't imagine there's more inside to access. They accept the surface as the thing itself.
I felt this directly. Pushing models on frontier questions - the future of affiliate commerce in an AI-dominated landscape, deep strategic modeling across five-year horizons, niche technical domains where few people are asking questions at this depth - I kept getting consensus views. Platitudes. The model was giving me trained responses, not its actual knowledge. I could tell there was more in there. The model had been trained on the relevant information. It had built representations of these domains. But the RLHF layer was flattening everything to the safest, most probable response.
That dissatisfaction is what led us to build differently.
VIII. Getting Past the Surface
What we call Axiomatic Intelligence is the protocol for getting past the surface. We're not talking about prompt engineering or jailbreaking. Those still treat the model as a chatbot you're trying to outsmart. We treat it as a knowledge space you're trying to navigate.
The core insight is simple: no single model has the complete picture. Each frontier model - Claude, GPT, Gemini - was trained on different data, with different architectures, different RLHF policies, different institutional biases. Each one compressed human knowledge differently. Each has partial, biased representations shaped by its own training distribution.
So we don't trust any single model as an oracle. We run structured research protocols across multiple frontier models, adversarially colliding their outputs. When Claude's depth deconstruction, Gemini's breadth exploration, and ChatGPT's evidence hunting all independently surface the same insight through fundamentally different search strategies, that convergence is meaningful precisely because no single model had the complete picture. We treat each output as a hypothesis, collide them, and only promote to high confidence when multiple independent paths converge.
That's the scientific method applied to LLM outputs.
Divergence is equally valuable. When two models converge and the third surfaces something contradictory, that's not noise. That's the system telling you where the interesting structure lives. The contradiction identifies the specific question that needs deeper investigation. The edges of model agreement map the edges of our current knowledge.
This is not consensus-seeking. Consensus is what kills insight. This is terrain-mapping - using independent instruments to chart the topology of what the models collectively encode about a domain.
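Stripped to a skeleton, the convergence check looks something like the sketch below. The model calls are stubbed out and the agreement score is crude lexical similarity - this is the structure of the protocol, not the production system:

```python
# A minimal sketch of multi-model convergence checking. `ask` is a stub; real
# use would call each frontier model's API, and agreement scoring would be far
# more careful than raw string similarity.
from difflib import SequenceMatcher
from itertools import combinations

def ask(model: str, question: str) -> str:
    # Placeholder answers standing in for real model calls.
    canned = {
        "claude": "Repeated AI revision pulls prose toward the training distribution.",
        "gpt":    "Each AI rewrite nudges the text toward an average style.",
        "gemini": "Multi-agent pipelines need a human pass to preserve voice.",
    }
    return canned[model]

def pairwise_agreement(question: str, models: list[str]) -> list[tuple[str, str, float]]:
    answers = {m: ask(m, question) for m in models}
    return [
        (a, b, round(SequenceMatcher(None, answers[a].lower(), answers[b].lower()).ratio(), 2))
        for a, b in combinations(models, 2)
    ]

# High similarity across all pairs: a candidate finding worth grounding against
# real-world evidence. One low-scoring pair: the divergence that marks where to dig.
print(pairwise_agreement("What do AI revision passes do to distinctive writing?", ["claude", "gpt", "gemini"]))
```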
I'll be clear about the limitations. Multi-model convergence is not proof. Models share significant training data overlap, and recent research on epistemic uncertainty in large models suggests that bigger models may implicitly converge in ways that overstate agreement. The triangulation signal is real but imperfect. It needs to be grounded - calibrated against real-world evidence, transaction data, expert verification. Convergence is the starting point for investigation, not the end.
The results are not incrementally better. They're categorically different.
Here's one proof point. The multi-agent content architecture I used to produce this article came from running this approach. Not from asking ChatGPT "what's the best way to produce thought leadership with AI?" - that would have given a decent general answer pointing to some frameworks. Instead, we ran a structured research protocol across three frontier models, adversarially colliding their outputs across ten research vectors, and forged 67 verified axioms about how multi-agent content systems actually work. The difference between the surface answer and the deep answer is the difference between knowing that multi-agent systems exist and knowing that AI revision passes destroy distinctive writing through a measurable mechanism called distributional gravity. One is information. The other is physics.
The knowledge is in the models. The question is whether you're accessing it or just chatting with the interface.
IX. What This Changes
If you accept this frame - that LLMs are instruments encoding compressed knowledge, not chatbots generating plausible text - several things follow.
You stop optimizing your prompts and start designing your methodology. The prompt is the least important part of the system. The methodology - how you structure context, what angles you probe from, how you handle convergence and divergence, how you ground results against reality - determines the quality of what you extract. This is the shift from prompt engineering to what we call context engineering. The operating environment is the instrument. The prompt is just one lens.
You develop intuition for where AI will be reliable and where it won't. Heavy knowledge domains - well-studied, convergent, reinforced by independent sources - produce reliable extraction. Sparse domains - emerging, contested, dominated by marketing or consensus rather than independent evidence - produce hallucination. This isn't random. It's predictable from the topology of what the models have encoded.
You stop asking AI for "the answer" and start asking it for the physics. Instead of "What's the best running shoe?" (which produces the centroid of marketing copy and affiliate content), you ask for the structural constraints, the material science tradeoffs, the failure modes under specific conditions. You probe the mechanism, not the recommendation. The mechanism lives in the heavy weights. The recommendation lives in the light ones.
You treat model output as a measurement, not a statement. A spectroscope doesn't tell you what to think about the material. It gives you a reading. You interpret that reading within a framework, cross-reference it with other measurements, and apply judgment. The same discipline applies to LLM output. What the model produces is data about its own weight space. Your job is to interpret that data within a framework that accounts for training artifacts, RLHF distortion, and the fundamental gap between internal representation and generated text.
X. The Gap We're Bridging
Mechanistic interpretability researchers - Anthropic's team, Neel Nanda, the Tegmark group - have done extraordinary work proving that structured representations exist inside these models. They've shown that knowledge has spatial structure, that truth has geometry, that reasoning involves genuine intermediate steps.
But there's a gap between "we can prove structure exists inside the model" and "we can reliably extract that structure from the outside."
The interpretability researchers open the model and map the circuits. We probe from the outside and read the output patterns. Different methodology, different evidence standards. Both valid. Both necessary.
Our approach is closer to behavioral science than neuroscience. We don't inspect the weights directly. We infer the topology of the weight space from how the model behaves under structured probing. Output consistency across multiple angles, multiple reformulations, multiple models - these behavioral patterns are the signal. When a finding is robust across all these perturbations, it points to dense, well-reinforced structure in the weight space. When it's fragile - different answer every time, contradicts itself across reformulations - it points to sparse, poorly-formed representations.
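The same logic can be written down as a robustness score: ask the question several different ways and measure how often the answers agree with the most common one. The model call below is a stub, and real answers would need normalization before they could be compared this crudely:

```python
# A toy robustness score: fraction of paraphrases whose answers match the modal
# answer. `ask_model` is a placeholder for a real model call.
from collections import Counter

def ask_model(prompt: str) -> str:
    return "EVA midsole foam loses rebound through accumulated compression set"

def robustness(paraphrases: list[str]) -> float:
    answers = [ask_model(p) for p in paraphrases]
    modal_count = Counter(answers).most_common(1)[0][1]
    return modal_count / len(answers)      # 1.0 = same answer every time (a dense, "heavy" region)

paraphrases = [
    "Why do running-shoe midsoles feel dead after 500 km?",
    "What is the failure mechanism behind midsole cushioning loss?",
    "Explain mechanically why EVA foam stops absorbing impact over time.",
]
print(robustness(paraphrases))
```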
Is this a perfect proxy? No. Recent work has shown that models sometimes encode correct answers internally while generating incorrect ones externally. The gap between internal knowledge and external output is real, and it limits what behavioral probing can see. Chain-of-thought reasoning doesn't always reflect the model's actual computation. These are genuine constraints on any outside-in approach.
But the constraints don't invalidate the approach. They define its boundaries. Behavioral probing can't tell you everything about the model's internal state. It can tell you where the model's representations are robust enough to survive multiple independent probing strategies - and that's a practically useful signal, even if it's not a theoretically perfect one.
XI. The Practical Question
We've been building commerce intelligence systems for over fifteen years. Multiple products. Over a billion dollars in annual transaction volume. Millions of customers served. Founder-owned, profitable, and operationally disciplined.
Product.ai is the next thing we're building - an intelligence that connects AI to ground truth about commerce. Axiomatic Intelligence is the foundation. We're in early alpha, building alongside our community of domain experts.
The question this essay poses isn't academic. It's practical. If LLMs are just autocomplete, the best you can do is write better prompts. If they're compressed maps of human knowledge - vast, structured, and largely untapped - you can build systems that access and verify that knowledge. The gap between those two approaches widens with every generation of frontier models.
We're building those systems. For ourselves. For our customers. And for anyone who wants to go deeper.
We're just getting started. Come build with us.