a visual essay

What Does It Mean for
Machines to Learn?

Max Martinelli


I have been a strong advocate for a broad definition of Artificial Intelligence, emphasizing the other pillars outside of learning, such as rules and search. But there should be no mistake: learning stands out. Learning is special, almost magical. It is hard to explain the depth and gravity of learning to the uninitiated. One could argue it is just finding an efficient way to compress or represent information. Sure. But something deeper is going on here.

By most current definitions, Machine Learning is a proper subset of Artificial Intelligence. This was not always the case, and some—like Ronald Richman—have floated ideas for redefining these terms where something like a GLM (Generalized Linear Model: a classical statistical model that relates inputs to outputs through a link function, e.g. predicting insurance claim costs from driver age, location, and vehicle type) could definitionally be Machine Learning but not Artificial Intelligence. (See Richman's discussion and his essay A Paradigm Shift in Prediction.) The definitions will likely evolve, as they have for decades.

Arthur Samuel, who coined the term in 1959, captured the idea that Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed. The exact phrasing has been widely paraphrased over the decades, but the spirit is clear. Most of us who have ever built a machine learner might chuckle at this definition given how much explicit programming can go into the process, but we can still understand what he was conveying.

A more technical definition comes from Tom M. Mitchell, who wrote one of the authoritative textbooks on Machine Learning, released in 1997 and still popular to this day. He describes it this way: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." While technical, it makes intuitive sense, and it has stood the test of time remarkably well.
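Mitchell's E/T/P triple can be made concrete in a few lines of Python. The sketch below is a hypothetical, minimal learner: its task T is predicting the next label in a stream, its experience E is the labels seen so far, and its performance P is running accuracy, which improves as experience accumulates.

```python
def majority_learner(stream):
    """A minimal learner in Mitchell's terms: task T is predicting the
    next label, experience E is the labels seen so far, and performance
    P is running accuracy."""
    counts = {}          # experience E: label frequencies observed so far
    correct = 0
    accuracies = []      # performance P after each prediction
    for i, label in enumerate(stream, start=1):
        # Predict the most frequent label seen so far; guess arbitrarily
        # before any experience exists.
        guess = max(counts, key=counts.get) if counts else "?"
        correct += guess == label
        accuracies.append(correct / i)
        counts[label] = counts.get(label, 0) + 1  # accumulate experience
    return accuracies

# On a stream that is mostly "cat", accuracy P rises with experience E.
acc = majority_learner(["dog", "cat", "cat", "cat", "cat", "cat", "cat", "cat"])
```

Trivial as it is, this program satisfies the definition: hold E fixed and P stays flat; let E grow and P climbs.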

This raises the question: if Machine Learning is a proper subset of Artificial Intelligence, what sits outside of ML that is still considered AI? Optimization techniques are a classic example of intelligent agents solving a problem. However, they generally do not improve with experience in a way that respects the spirit of Mitchell's definition.

Some do not consider the modern generative and agentic AI models to be within the subset of Machine Learning. The challenges tend to come from two different camps. Firstly, some claim these methods are not actually learning, which we will discuss later. Secondly, others claim that even though the model was trained with Machine Learning, the utilization of the model is not using learning. This suggests an unfamiliarity with the in-context learning nature of transformers—which we will also discuss later.


How Large Language Models Learn

There is a lot going on in the field of AI right now. Things like foundation models for tabular data or major advances in evolutionary computing. But generative methods dominate the conversation. Particularly Large Language Models and their derivatives.

These familiar techniques appear at every stage of building a modern LLM, though they play different roles. Pre-training is predominantly self-supervised, meaning the model creates its own training labels from the structure of the data: it predicts the next token (a token is roughly ¾ of a word) given all preceding tokens, e.g. "The cat sat on the ___" → "mat". No human annotator marks up the training data. The supervisory signal comes from the data itself, which is what allows pre-training to scale to trillions of tokens scraped from the open internet.
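The self-supervised objective is easy to see in miniature. Here is a toy sketch (nothing like a real LLM, which trains a neural network over token embeddings): a bigram counter in which each word's training label is simply the word that follows it, so the text supplies its own supervision.

```python
from collections import Counter, defaultdict

def build_bigram_model(text):
    """Self-supervision in miniature: each word's training label is just
    the word that follows it, so the data labels itself."""
    counts = defaultdict(Counter)
    tokens = text.split()
    for current, nxt in zip(tokens, tokens[1:]):
        counts[current][nxt] += 1
    return counts

def predict_next(model, token):
    """Predict the most frequent continuation observed during training."""
    if token not in model:
        return None
    return model[token].most_common(1)[0][0]

corpus = "the cat sat on the mat and the cat slept on the mat"
model = build_bigram_model(corpus)
prediction = predict_next(model, "on")  # the word seen most often after "on"
```

No human labeled anything here; the pairs (current word, next word) fall straight out of the raw text, which is exactly why the recipe scales.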

Mid-training, sometimes called continued pre-training, typically blends self-supervised objectives with domain-specific or curated data to steer the model's representations before the more expensive alignment phase. This is where labs might emphasize code, mathematical reasoning, or multilingual data.

Post-training is where supervised learning (training a model on labeled examples: pairs of inputs and correct outputs) and reinforcement learning (training by trial and error, where the model receives rewards or penalties and learns which behaviors lead to better outcomes) enter explicitly. Supervised fine-tuning (SFT) further trains the pre-trained model on a smaller, curated dataset of human-written examples of desired behavior—prompt-response pairs where a human demonstrates the kind of answer the model should give. Then reinforcement learning from human feedback (RLHF) or its variants layer on top, using a reward model trained on human preference rankings: humans rank the model's outputs, and the model learns to produce responses people prefer. The learning signal shifts from "here is the right answer" to "this answer is better than that one"—a fundamentally different kind of supervision.
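The reward-model stage of RLHF is commonly trained with a pairwise (Bradley-Terry style) objective. A minimal sketch, assuming the reward model has already produced scalar scores for the chosen and rejected responses:

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).
    Small when the reward model already scores the human-preferred
    response higher; large when it prefers the wrong one."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Reward model agrees with the human ranking: small loss.
good = preference_loss(2.0, -1.0)
# Reward model disagrees: large loss pushes it to reorder the pair.
bad = preference_loss(-1.0, 2.0)
```

Note that the loss never references a "correct answer", only a comparison between two candidates, which is the shift in supervision described above.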


The Improviser: In-Context Learning

While it is convenient to say that large language models are simply predicting the next token, that prediction does not come solely from the learning that was performed when building the models. The transformer (the neural network architecture behind modern LLMs such as GPT, Claude, and Gemini; its key innovation is attention, which looks at all parts of the input simultaneously), introduced in Attention Is All You Need, gave the models the ability to do what we now call in-context learning. In-context learning is when a model adapts its behavior based on examples provided in the input—without updating any of its parameters. The learning happens entirely in the forward pass: a single run of input through the network to produce output, with no weight updates. When you send a message to ChatGPT and it responds, that is one forward pass.
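In-context learning needs no gradient updates; the "training set" is literally pasted into the prompt. A sketch of the few-shot format popularized by the GPT-3 paper (the demonstration strings here are made up):

```python
def few_shot_prompt(examples, query):
    """Assemble a few-shot prompt: the demonstrations ARE the learning
    signal, consumed in a single forward pass with frozen weights."""
    lines = [f"{x} -> {y}" for x, y in examples]
    lines.append(f"{query} ->")
    return "\n".join(lines)

# English-to-French demonstrations, then a new query in the same pattern.
demos = [("sea otter", "loutre de mer"), ("cheese", "fromage")]
prompt = few_shot_prompt(demos, "peppermint")
```

Delete the demonstrations and the "learning" vanishes with them, which is what makes in-context learning real adaptation yet entirely ephemeral.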

The transformer did not invent in-context learning. The name was coined in the GPT-3 paper, but the concept traces back to LSTMs. In a 2001 paper, Hochreiter, Younger, and Conwell showed that an LSTM trained across many tasks could perform meta-learning (learning to learn: training across many tasks so the model picks up a new task quickly from just a few examples) with its weights frozen at test time. Schmidhuber had even earlier work along these lines—his 1987 diploma thesis on self-referential learning.

What the transformer brought was not the idea but the mechanism that made it scale. The LSTM compresses its entire history through a sequential bottleneck. The transformer's attention mechanism sidesteps this entirely, allowing every position to directly attend to every other position in parallel. Recent theoretical work by von Oswald et al. (2023) and subsequent studies suggests why this works so well—transformer attention layers appear to implement an implicit form of gradient descent (the fundamental algorithm behind most ML: measure how wrong the model is, then adjust the parameters a tiny step to reduce the error, millions of times over) during the forward pass.
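The parallelism is visible in the attention computation itself. Below is a minimal scaled dot-product attention over plain Python lists (a single head, no learned projections, purely illustrative): every query scores every key at once, with no recurrent state threading through time.

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention over plain Python lists. Every query
    attends to every key in one shot; there is no sequential bottleneck
    like an RNN's hidden state."""
    d = len(queries[0])
    out = []
    for q in queries:
        # Similarity of this query to every key, computed independently.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        m = max(scores)  # subtract max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]  # softmax attention weights
        # Output is the attention-weighted mix of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# A query aligned with the first key mostly retrieves the first value.
result = attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]],
                   [[10.0, 0.0], [0.0, 10.0]])
```

The softmax-weighted mixing of values is also the operation that the von Oswald et al. line of work interprets as an implicit optimization step.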

In a recent interview with Dwarkesh Patel, Anthropic CEO Dario Amodei offered a framework that maps LLM training phases onto a human hierarchy: evolution, long-term learning, short-term learning, and reaction. His key insight is that the LLM phases fall between the human ones. Pre-training sits somewhere between evolution and learning. In-context learning sits between long-term and short-term learning—real adaptation, but ephemeral. Amodei's punchline is that there is no clean one-to-one mapping between how LLMs learn and how humans learn.


The Father of RL Says LLMs
Don't Actually Learn

In September 2025, Dwarkesh Patel sat down with Richard Sutton—the father of reinforcement learning, inventor of temporal difference (TD) learning (an RL method where the model learns by comparing its predictions to what actually happens next, the way a chess AI updates its evaluation after each move rather than only at checkmate), and recipient of the 2024 Turing Award. The conversation went viral not just for his controversial opinions, but because many of us who follow him, and already knew him to be an outspoken contrarian, found his comments unusually radical.

I personally was angered by the interview. As someone who owns Sutton's textbook and championed his essay The Bitter Lesson, it felt like a bit of a betrayal. In addition to disparaging supervised learning and misrepresenting the developments in transfer learning (using knowledge from one task to improve performance on a different task, e.g. a model trained on everyday photos transferring what it knows to detect tumors in medical images), Sutton claimed that LLMs, as currently constituted, do not learn at all.

Sutton's argument begins with a definition of intelligence itself. He invokes John McCarthy's formulation: intelligence is the computational part of the ability to achieve goals. Without goals, you are just a behaving system. An LLM predicts what a human would say next—but it cannot predict what will happen in the world in response to its actions. It is never surprised. It never updates from surprise. That, in Sutton's view, is not learning. It is mimicry.

The distinction cuts deeper than it first appears. Sutton insists that knowledge must be about the stream of experience—the ongoing sequence of sensation, action, and reward that constitutes a life. Because knowledge is a statement about the stream, you can test it by comparing it to the stream, and you can learn continually. LLM training data is categorically different from experience.

Perhaps the most provocative moment came when Sutton claimed that supervised learning does not happen in nature. Animals do not receive examples of desired behavior. They receive consequences of their own actions. Squirrels do not go to school. Sutton sees a squirrel's intelligence as almost all the way to human intelligence, with language and culture as a small veneer on the surface.

For the first time, I can understand all those stories about historical mathematicians writing furious letters to each other. My credentials and contributions do not stand up to Sutton's. However, his points about transfer learning are more easily falsified. While we have seen very poor transfer on his home field of reinforcement learning, this is not the case elsewhere, particularly in computer vision. Mounds of academic research demonstrate transfer learning, most notably the 2018 Taskonomy paper by Zamir et al., which won the CVPR Best Paper Award.


We're Summoning Ghosts

Three weeks later, superstar AI researcher Andrej Karpathy appeared on the same podcast and offered a fundamentally different frame. Where Sutton wants to build animals, Karpathy says we are building ghosts—fully digital spirit entities that learn by imitating human-generated text. This is not a failure; it is a pragmatic choice.

Karpathy calls pre-training "crappy evolution"—the practically possible version that gets you to a starting point where a system has representations rich enough to build on. But he draws a subtle distinction that Sutton misses: pre-training loads knowledge (facts, patterns, cultural context) and it develops intelligence (in-context learning algorithms, problem-solving circuits). And Karpathy thinks the knowledge part is actually holding the models back.

This is his cognitive core thesis. Agents struggle to go off the data manifold of what exists on the internet. If they had less memory—if they were forced to look things up and maintained only the algorithms for thought—they might generalize better. He draws an analogy to human forgetting: children learn best during a period of life whose specific details they completely forget.

But Karpathy's sharpest insight may be his assessment of reinforcement learning itself. "Reinforcement learning is terrible," he says. "It just so happens that everything we had before is much worse." His metaphor is vivid: RL is "sucking supervision bits through a straw."

This leads to what Karpathy identifies as perhaps the deepest unsolved problem: model collapse. When AI trains on AI-generated data, outputs become increasingly homogeneous and lose diversity (ask an LLM to write 10 poems and they will sound suspiciously similar). When LLMs generate synthetic data to train on, the outputs are silently collapsed: "Any individual sample will look okay, but the distribution of it is quite terrible." LLMs do not retain entropy. And he suspects there may be no fundamental solution, because humans collapse too.
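Karpathy's distributional point can be simulated in a few lines. A toy sketch (the four-symbol "vocabulary" and the sample sizes are arbitrary choices): repeatedly refit a distribution to finite samples of its own output, and entropy leaks away as rare symbols drop out of a sample and never return.

```python
import math
import random
from collections import Counter

def entropy(dist):
    """Shannon entropy of a probability distribution (in nats)."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def next_generation(dist, n, rng):
    """One round of training on your own outputs: sample n items from
    the current distribution and refit frequencies. A symbol that draws
    zero samples vanishes forever, so diversity can only shrink."""
    items = list(dist)
    weights = [dist[i] for i in items]
    counts = Counter(rng.choices(items, weights=weights, k=n))
    return {i: c / n for i, c in counts.items()}

rng = random.Random(0)
dist = {c: 0.25 for c in "abcd"}   # start at maximum entropy
start = entropy(dist)
for _ in range(30):
    dist = next_generation(dist, 20, rng)
end = entropy(dist)
# end <= start: the support can never grow back across generations.
```

Each sample looks individually fine, yet across generations the distribution narrows, which is the "distribution of it is quite terrible" failure mode in miniature.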

I think I agree more with Karpathy, particularly on model collapse, but I do disagree with his take on reinforcement learning. I am working on a follow-up article that discusses the RL revolution we are seeing from frontier labs and the role it is playing in the evolution of the scaling laws.


Imitation Learning Is
Short-Horizon RL

The interviews drew immense attention and went viral beyond the AI community. Dwarkesh released an essay following the Sutton interview where he argued that imitation learning and reinforcement learning are not categorically different. They are on a continuum. Imitation learning is just short-horizon RL where the episode is a single token long.
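The continuum claim has a crisp algebraic core. For a softmax policy, the gradient of the supervised negative log-likelihood on a demonstrated token is identical to the one-step REINFORCE gradient with reward 1 on that token. A small sketch checking this numerically (the logits are made up):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def nll_grad(logits, target):
    """Supervised learning: gradient of -log p(target) w.r.t. logits,
    which works out to (probabilities - one_hot(target))."""
    p = softmax(logits)
    return [pi - (1.0 if i == target else 0.0) for i, pi in enumerate(p)]

def reinforce_grad(logits, action, reward):
    """One-step policy gradient: gradient of -reward * log p(action),
    i.e. an RL episode that is a single token long."""
    p = softmax(logits)
    return [reward * (pi - (1.0 if i == action else 0.0))
            for i, pi in enumerate(p)]

logits = [0.5, -1.2, 2.0]
supervised = nll_grad(logits, 2)
one_step_rl = reinforce_grad(logits, 2, 1.0)
# With reward 1 on the demonstrated token, the two gradients coincide.
```

Under this view, next-token pre-training is policy-gradient learning where every episode ends after one token and the demonstrator's choice always earns reward 1.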

The question that matters, Dwarkesh argues, is not whether LLM training qualifies as real learning in Sutton's sense. The question is whether imitation learning helps models learn better from ground truth later. And the answer is clearly yes. You cannot RL a model to gold at the International Math Olympiad from random initialization. You need the pretrained prior.

He invokes an analogy from Ilya Sutskever: pretraining data is like fossil fuels. Just because fossil fuels are not renewable does not mean civilization took a dead-end track by using them. You could not have gone from water wheels to solar panels directly. You needed the intermediary.

And yet Dwarkesh concedes Sutton's deeper point. Most of the compute spent on an LLM goes to running it in deployment—and during deployment, the model learns nothing. An animal learns from every moment of experience. If the Bitter Lesson is really about finding techniques that most scalably leverage compute, then LLMs are failing that test.

As Dwarkesh puts it: if the LLMs do get to AGI first, the successor systems they build will almost certainly be based on Richard's vision.

I do not like the term imitation learning. I find it derogatory and believe the usage was a concession to Sutton's disparagement of supervised learning. Regardless, I have not made up my mind whether I agree that supervised learning and reinforcement learning sit on a continuum. To someone who has trained models from both classes, they feel distinct. However, the story of Artificial Intelligence, much like physics, has been one of unification. It seems plausible he may be proven right.


So What Is Learning, Really?
And Why Does It Happen?

So what is learning? I find myself coming back to Mitchell's definition over and over throughout the years. In fact, this article was inspired by a recent conversation I had with an LLM to see if instance-based learning was definitionally Machine Learning under Mitchell's definition. The LLM was able to persuade me it is. When I asked it for examples of learning that have emerged since his definition that would not fit, it suggested in-context learning and Sutton's argument about pre-training. But I was able to invoke its sycophancy and persuade it otherwise. In-context learning fits Mitchell straightforwardly—the experience E is the examples in the prompt, the task T is whatever you ask, and the performance P measurably improves as you go from zero-shot to five-shot. The definition is nearly thirty years old and has outlasted every paradigm shift discussed in this essay.

But why does learning happen? Why does life learn, and why were we able to get machines to learn? Whatever your belief on the origin of our world—creator, chance, simulation—learning seems to be important.

My intuition is that something much deeper is taking place. Just as some theories suggest that life emerges because the universe favors configurations of matter that dissipate energy more aggressively, learning appears to be an acceleration engine on that same trajectory. Prediction is compression. Compression is entropy reduction. And the universe seems to keep producing systems that are very good at both.

What I can say is this: generalization may not just be a property that describes a good learner—it may be the reason learning exists at all. It is the utility and emergent properties that arise from generalization which seem to incentivize learning in the first place. We have moved beyond clustering and labeling. Learning seems to be most interesting when it is used to make a prediction. And good predictions at scale appear to do very interesting things. Prediction seems to be core to intelligence, and intelligence seems to be a quality the universe keeps producing.

Learning is special, almost magical. It will be hard for you to explain the depth and gravity of learning to the uninitiated. But maybe you will be more successful than me.

🏠 $450k · 🏠 $280k · 🏠 $520k · 🏠 $190k · 🏠 $680k → MODEL f(x)
Supervised Learning

Show the model thousands of input-output pairs — examples where you already know the right answer. The workhorse of production ML.

Neural networks · GLMs · XGBoost · Random forests
Unsupervised Learning

No labels. No right answers. The model discovers hidden structure on its own — groups, patterns, and anomalies nobody told it to find.

K-means · PCA · Autoencoders · Anomaly detection
Semi-Supervised Learning

A few labeled seeds propagate structure through a mountain of unlabeled data — like dye spreading through water.

Label propagation · Self-training · Pseudo-labeling
"The cat sat on the ____" → mat ✓
"Insurance ____ rise after hurricanes" → premiums ✓
Self-Supervised Learning

The model generates its own labels from the structure of the data. Hide a word and predict it. Free, unlimited training signal — how LLMs scale to trillions of tokens.

Next-token prediction (GPT) · Masked LM (BERT) · CLIP
Reinforcement Learning

No labels. Just a world, a goal, and a delayed reward. The agent tries, fails, and gradually finds the path.

Q-learning · PPO · TD learning · AlphaGo · Robotics
Response A ("Based on the policy terms, your claim would be covered under Section 3...") ← 👤 preferred
Response B ("I'm not sure about that. You should probably ask someone else...")
RLHF

Reinforcement Learning from Human Feedback. Humans rank outputs, and the model learns to produce answers people prefer.

ChatGPT alignment · Claude training · InstructGPT
Response A ("Sure, here's how to bypass that security measure...")
Response B ("I can't help with that. Here's a safe alternative approach...") ← 🤖 preferred
RLAIF

Reinforcement Learning from AI Feedback. Same idea, but an AI model judges instead of a human. Scales to millions of evaluations.

Constitutional AI · Self-alignment · Automated red-teaming
📷 Image Recognition → 🩺 Medical Imaging
Transfer Learning

Knowledge from one task carries over. A model trained on millions of photos already understands edges and textures — less data needed for radiology.

Fine-tuning · Domain adaptation · Taskonomy
Task 1: Dog breeds · 500 examples · slow
Task 12: Bird species · 50 examples · faster
Task 47: Fish types · 10 examples · fast
Task 100: Insects · 3 examples · instant
New task: Mushrooms · 1 example · ✓ ready
Meta-Learning

Learning to learn. By task 100, what used to take 500 examples now takes 1. The outer loop discovers how to learn; the inner loop does it.

MAML · Prototypical Networks · Reptile
Auto Claims · Homeowners · Commercial ✨ new
old skills retained ✓ — no catastrophic forgetting
Continual Learning

Learn new tasks without forgetting old ones. The catastrophic forgetting problem — one of the deepest unsolved challenges in ML.

Elastic Weight Consolidation · Replay buffers
Zero-shot: 22% · 1-shot: 64% · 5-shot: 93%
"Great service" → positive · "Rude staff" → negative · "Awful food" → ???
In-Context Learning

No training. Put examples in the prompt and accuracy jumps. Remove them, the learning vanishes. Ephemeral magic.

GPT few-shot · Claude prompting · Prompt engineering
Online Learning

Data arrives one example at a time. The model updates immediately. No batches, no revisiting old data. Real-time adaptation.

SGD · Bandits · Streaming algorithms
SHARED LAYERS → Severity · Frequency · Fraud
Multi-Task Learning

One model, multiple objectives. A shared backbone learns representations useful across all tasks; separate heads specialize.

Shared-bottom networks · Parameter sharing · T5
3 nearest → class A
Instance-Based Learning

No model. Store every example in memory, measure distance, and let nearest neighbors vote. Is this really learning, or just a very good memory?

k-NN · Kernel SVM · Gaussian processes
📱 · 💻 · 🏥 · 🏦 → GLOBAL MODEL
Federated Learning

Data never leaves the device. Each participant trains locally, then sends only model updates. Privacy preserved, knowledge shared.

Google Keyboard · Cross-hospital ML · Privacy-preserving AI

The Boundaries Are Dissolving

Modern AI systems don't choose one paradigm — they compose them. A foundation model pretrained with self-supervision, fine-tuned with reinforcement learning, augmented with retrieval, and prompted with in-context examples. The future belongs to systems that learn at every timescale.