Bend, Not Shift

AI · April 25, 2026 · 10 min read · Arjun Srivastava


The missing kind of learning in the age of scaling laws

The strangest thing about neural scaling laws is not that models get better as they get larger. Of course they do. The strange thing is how smoothly they get better.

Across enormous changes in model size, data, and compute, loss often falls along a simple power law. On a log-log plot, the curve becomes a line. Add more compute, and the model improves by a predictable fraction. Add another order of magnitude, and it improves by another predictable fraction. The process is astonishingly reliable. It is also strangely mechanical.

That straightness should bother us.

If intelligence were recursively discovering better abstractions, why would progress look so much like steady statistical refinement? Why would each new order of magnitude of compute buy roughly the same fractional improvement as the last? Why does the curve not bend?

This is not an argument that neural networks do not learn abstractions. They obviously do. The sharper question is whether those abstractions make future learning cheaper.

So far, mostly, they do not. Current systems learn many useful local abstractions, but those abstractions do not yet reliably collapse the search space in a way that changes the macroscopic learning regime.

A better model shifts the line. A different kind of learner bends it.


Figure 1 — Shift vs Bend. Better engineering shifts the scaling curve. A different kind of learner bends it.

That distinction is the lens of this essay.


1. Power laws are useful, but slow

A common empirical form of neural scaling is:

$ L(C)-L_\infty = aC^{-\alpha} $

Here $L(C)$ is loss at compute scale $C$, $L_\infty$ is an irreducible floor, and $\alpha$ is the scaling exponent.

Kaplan et al. made this form central to modern language modeling by showing that loss scaled predictably with model size, dataset size, and compute across many orders of magnitude. Hestness et al. had already found similar power-law behavior across machine translation, language modeling, image processing, and speech recognition.

That breadth matters. Power-law scaling is not only a language-model fact, and it is not cleanly explained by any one data distribution, such as Zipfian word frequencies. The recurring object is broader: neural learners encountering tasks where structure has to be discovered from large effective spaces.

The practical lesson was obvious: scale works.

But scale works slowly.

If $\alpha = 0.05$, then increasing compute by 10x only multiplies the remaining gap above the floor by:

$ 10^{-0.05} \approx 0.89 $

That is about an 11% reduction in the gap to the floor. To halve the remaining gap, you need roughly:

$ 2^{1/0.05} = 2^{20} \approx 1,000,000 $

or about a million-fold increase in compute.
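These numbers are easy to sanity-check. A minimal sketch in plain Python (the value of $\alpha$ is illustrative, as above):

```python
# Power-law scaling arithmetic: L(C) - L_inf = a * C^(-alpha)
alpha = 0.05

# Multiplying compute by 10 multiplies the remaining gap by 10^(-alpha).
gap_factor_per_decade = 10 ** (-alpha)
print(f"gap factor per 10x compute: {gap_factor_per_decade:.3f}")  # ~0.891

# Fractional reduction in the gap per decade of compute.
reduction = 1 - gap_factor_per_decade
print(f"reduction per decade: {reduction:.1%}")                    # ~10.9%

# Compute multiplier needed to halve the gap: solve f^(-alpha) = 1/2.
halving_factor = 2 ** (1 / alpha)
print(f"compute needed to halve the gap: {halving_factor:,.0f}x")  # ~1,048,576x
```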

This does not make scaling unimportant. A predictable 11% reduction can be economically enormous. But a power law is not the curve one would naturally expect from a system that repeatedly discovers abstractions that make future learning dramatically cheaper.

It is the curve one expects when each step keeps extracting information from a large effective space at a roughly stable rate.


2. A dumb sampler can draw the same curve

A power law does not require understanding.

Imagine sampling $n$ points uniformly at random inside a $d$-dimensional box. Pick a target point and ask how close the nearest sample gets.

The volume of a radius-$r$ ball in $d$ dimensions scales like $r^d$. To get one sample inside that ball, we need approximately:

$ nr^d \approx 1 $

which gives:

$ r \approx n^{-1/d} $

That is a power law.

No architecture. No gradient descent. No representation learning. No abstraction. Just blind sampling and order statistics.
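The thought experiment is easy to run. A minimal Monte Carlo sketch (using numpy; the dimension, sample counts, and trial count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_distance(n, d, trials=200):
    """Average distance from the center of the unit cube to the
    nearest of n uniform random samples, over several trials."""
    target = np.full(d, 0.5)
    dists = []
    for _ in range(trials):
        samples = rng.random((n, d))
        dists.append(np.linalg.norm(samples - target, axis=1).min())
    return np.mean(dists)

d = 2
ns = np.array([100, 1_000, 10_000])
rs = np.array([nearest_distance(n, d) for n in ns])

# On log-log axes the slope of r vs n should be close to -1/d.
slope = np.polyfit(np.log(ns), np.log(rs), 1)[0]
print(f"fitted slope: {slope:.2f} (theory: {-1/d:.2f})")
```

No learning happens anywhere in this loop, yet the log-log plot comes out straight.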

HyperLogLog gives an even simpler version of the same intuition. A hash value with $k$ leading zero bits occurs with probability $2^{-k}$. To see such an event, you need about $2^k$ samples. The system improves by waiting for rare events. It remembers the best event so far, but it does not use that memory to change where future samples come from.

That is the warning. A smooth power law can look profound while coming from a process with no understanding at all.

Neural networks are not literally HyperLogLog, and this is not a mechanism claim. It is a thought experiment: a minimal system with memory, improvement, and a power law, but no abstraction. Once we see what keeps that system in the power-law regime — future samples still come from the same distribution — we can ask what a learner would need to change.
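The rare-event mechanic is just as easy to simulate. A sketch of the HyperLogLog-style intuition on a toy 32-bit hash stream (the checkpoint values are illustrative):

```python
import random

random.seed(0)

def leading_zeros(h, bits=32):
    """Count leading zero bits of a hash value in a fixed-width word."""
    for i in range(bits):
        if h & (1 << (bits - 1 - i)):
            return i
    return bits

best = 0
checkpoints = {}
for n in range(1, 100_001):
    h = random.getrandbits(32)
    best = max(best, leading_zeros(h))  # remember the best event so far...
    if n in (100, 10_000, 100_000):     # ...but never change the sampler
        checkpoints[n] = best

# Seeing k leading zeros needs ~2^k samples, so the best-so-far grows
# like log2(n): each 10x more samples buys only ~3.3 more bits.
for n, b in checkpoints.items():
    print(f"n={n:>7,}  best leading zeros: {b}")
```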

So the question is not simply whether the model is learning. It is. The question is whether what it learns changes what it needs to sample next.


Figure 2 — Blind Sampling vs Adaptive Sampling. A blind sampler remembers its best draw but keeps sampling the whole space. An adaptive sampler uses partial structure to change where future evidence comes from.


3. Local abstraction is not compounding abstraction

The phrase "learns abstractions" hides the most important distinction.

A local abstraction helps a model perform on some region of the data distribution. A compounding abstraction makes future learning easier by narrowing the hypothesis class.

That is the difference between learning a useful feature and discovering a coordinate system in which many future examples become cheap.

A model can learn many local abstractions while the global curve remains a power law. Michaud et al.'s quantization model points in this direction: if many discrete skills are learned in frequency order, and their frequencies are heavy-tailed, the aggregate loss can fall smoothly as a power law. The global curve can look continuous even if many local transitions are happening underneath.

So the claim cannot be "power law means no abstraction." That is too crude.

The stronger claim is this: current neural networks learn abstractions, but those abstractions do not yet appear to compound strongly enough to change the macroscopic learning regime.

They improve performance. They do not reliably make the next layer of learning dramatically cheaper.


Figure 3 — Local Abstraction vs Compounding Abstraction. Local abstractions improve performance in patches. Compounding abstractions reveal a shared structure that makes future learning cheaper.

This is why "does the model understand?" is often the wrong question. The better question is:

Does the model's understanding change the rate at which it can acquire more understanding?


4. Shift versus bend

The shift/bend distinction gives us a simple way to classify progress.

A shift means the system gets more performance per unit compute. The data is cleaner, the architecture is more efficient, the optimizer is better, the post-training is stronger, or the system uses compute more effectively. Shifts can be technically impressive and economically enormous.

A bend means something deeper: the learner has entered a different regime. Additional compute now buys more than the previous trend predicted.

The important invariant is not the exact value of $\alpha$. Exponents can move. Architecture, data quality, mixture, pruning, and optimization can all change slopes. The deeper question is whether the learner remains inside the same functional family, $L = aC^{-\alpha} + L_\infty$, or enters a different regime altogether. A slope change is still a better power law. A bend is evidence that the learning process itself has changed.

Many different mechanisms can produce power laws: manifold approximation, Zipf-distributed skill frequencies, percolation-like transitions, kernel spectra, order statistics, or mixtures of many small learning events. That is why reproducing the curve is not, by itself, a sharp explanation. It may be like arguing over whether the same message was sent in Morse code, UTF-8, or smoke signals: the mechanisms differ, but the coarser fact is the rate at which information gets through. Scaling curves may be similar. They may be less a fingerprint of one mechanism than a coarse measure of how quickly useful information can be extracted under a given training regime.

This essay is therefore not trying to decide which mechanistic account of scaling is "the real one." The question is simpler: what would count as a different kind of learning? Not a better exponent inside the same functional family; not another intercept gain; not a more elegant story for why the line is straight. A real change would show up as a bend — evidence that the learner has discovered structure that makes future learning cheaper.
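The shift/bend test can be made operational: fit $a$ and $\alpha$ on early checkpoints, then ask whether later checkpoints fall below the extrapolation. A minimal sketch on synthetic data (it assumes a known floor $L_\infty$ for simplicity; real fits usually estimate it jointly):

```python
import numpy as np

def fit_power_law(C, L, L_inf):
    """Fit L - L_inf = a * C^(-alpha) by least squares on log-log axes.
    Assumes the irreducible floor L_inf is known (a simplification)."""
    slope, intercept = np.polyfit(np.log(C), np.log(L - L_inf), 1)
    return np.exp(intercept), -slope  # a, alpha

def bend_ratio(C, L, L_inf, a, alpha):
    """Observed gap divided by the power-law extrapolation.
    ~1.0 means the trend holds (at best a shift); ratios clearly
    below 1 at large C are the signature of a bend."""
    return (L - L_inf) / (a * C ** (-alpha))

# Synthetic example: early checkpoints follow a pure power law...
L_inf, a_true, alpha_true = 0.5, 2.0, 0.1
C_early = np.logspace(3, 6, 8)
L_early = L_inf + a_true * C_early ** (-alpha_true)
a, alpha = fit_power_law(C_early, L_early, L_inf)
print(f"fitted a={a:.2f}, alpha={alpha:.3f}")

# ...while later checkpoints decay faster than the trend predicts.
C_late = np.logspace(6.5, 8, 4)
L_late = L_inf + a_true * C_late ** (-alpha_true) * np.exp(-(np.log10(C_late) - 6))
ratios = bend_ratio(C_late, L_late, L_inf, a, alpha)
print("bend ratios at large C:", np.round(ratios, 2))
```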

Much of modern progress is shift-like. Kaplan et al. showed that many architectural details mattered less than model size, data size, and compute across broad ranges. Hestness et al. observed that model improvements often shifted error curves more than they changed scaling exponents. These results do not say architecture is irrelevant. They say that within the studied regimes, many improvements changed efficiency more than they changed the curve class.

Sorscher et al.'s data-pruning work is the empirical hinge. Their paper does not merely show that smaller datasets can sometimes be better. It shows that if examples are ranked by a sufficiently good pruning metric, dataset-size scaling can break beyond the usual power law and, in the idealized theory, approach exponential scaling with respect to the pruned dataset size. They then report empirical signatures of this effect on image models trained on CIFAR-10, SVHN, ImageNet, and related vision settings.

That is the crack in the fatalistic reading of scaling laws. The curve is not sacred. Changing which examples the learner sees can change the scaling behavior itself.

But it also reveals the hard part: you only beat the power law if you know which examples matter. That moves the problem from the student to the teacher.

To bend the curve, you need more than a bigger model. You need a better way of deciding what the model should learn next.
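To make the pruning idea concrete, here is a deliberately toy sketch of a margin-based pruning metric — a stand-in for, not a reproduction of, the metrics Sorscher et al. actually study:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy binary classification: two Gaussian blobs in 2D.
n = 2000
X = np.vstack([rng.normal(-1, 1, (n // 2, 2)), rng.normal(+1, 1, (n // 2, 2))])
y = np.repeat([0.0, 1.0], n // 2)

def train_logreg(X, y, steps=500, lr=0.1):
    """Plain logistic regression by gradient descent (no libraries)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(X @ w + b)))
        g = p - y
        w -= lr * X.T @ g / len(y)
        b -= lr * g.mean()
    return w, b

# Pruning metric (illustrative): score examples by how confidently a
# cheap probe model already classifies them, then keep only the least
# confident (hardest) fraction for further training.
w, b = train_logreg(X, y)
margin = np.abs(X @ w + b)           # small margin = hard / informative
keep = np.argsort(margin)[: n // 4]  # keep the hardest 25%
print(f"kept {len(keep)} of {n}; mean margin of kept: "
      f"{margin[keep].mean():.2f} vs {margin.mean():.2f} overall")
```

The hard part, as the paper makes clear, is that everything depends on the quality of that ranking — a bad metric prunes exactly the examples the model needed.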


5. The sampler should learn too

Modern pretraining is no longer raw IID internet soup. Frontier systems use filtering, deduplication, data mixtures, synthetic data, annealing, post-training, preference optimization, and reinforcement learning. But most training pipelines are still not stateful teachers in the strongest sense.

They do not usually ask, at each stage:

  • What abstraction has the model just formed?
  • What examples would test it?
  • What examples would extend it?
  • What examples are now redundant because the abstraction already compresses them?
  • What examples are too hard now but useful after one more prerequisite?

This is not a new field from scratch. Curriculum learning, active learning, hard-example mining, data pruning, model-aware data selection, and reinforcement learning all orbit parts of this idea. The useful move here is to connect them to a sharper empirical test: does the training process merely shift the scaling curve, or does it help the learner discover structure that makes future learning cheaper?

The field is already moving in this direction. DoReMi treats pretraining mixture weights as something to optimize rather than accept. Data Mixing Laws and RegMix use smaller runs to predict better data mixtures for larger models. MATES makes data selection depend on the evolving state of the model during pretraining. OLMo-style late-stage curriculum and annealing show that the same data can matter differently depending on when the model sees it.

The direction is clear: data is no longer passive fuel. It is a control surface.

But most of this work still selects data by source, quality, influence, difficulty, or mixture. The stronger goal is data pedagogy.

Data curation asks:

What data belongs in the dataset?

Data pedagogy asks:

Given what the model currently understands, what should it see next?


Figure 4 — From Data Curation to Data Pedagogy. Data curation filters the world. Data pedagogy sequences it.

A teacher does not merely sample from the world. A teacher sequences the world.

Human education works this way. We do not teach calculus by sampling uniformly from all mathematical text. We build prerequisites, introduce notation, choose examples that expose structure, vary one factor at a time, test transfer, and then use the newly formed abstraction to compress the next layer.

LLM pretraining often asks the model to infer the curriculum from the same stream it is supposed to learn from. That is an extraordinary burden.

A more powerful training loop would train the model for some interval, probe what skills and abstractions have formed, identify mastered and frontier regions, select examples that extend those abstractions, maintain enough broad replay to preserve coverage, and repeat.
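That loop can be sketched schematically. Everything below is illustrative — the student is simulated, and the skill graph, probe, and thresholds are stand-ins for the real (and much harder) measurement problem:

```python
import random

random.seed(0)

SKILLS = ["digits", "addition", "carrying", "multiplication"]
PREREQ = {"addition": "digits", "carrying": "addition",
          "multiplication": "carrying"}
mastery = {s: 0.0 for s in SKILLS}  # hidden state of the simulated student

def train_on(skill):
    """Practice only helps once prerequisites are in place."""
    pre = PREREQ.get(skill)
    if pre is None or mastery[pre] > 0.8:
        mastery[skill] = min(1.0, mastery[skill] + 0.1)

def probe(skill, trials=50):
    """Noisy estimate of the student's accuracy on a skill."""
    return sum(random.random() < mastery[skill] for _ in range(trials)) / trials

for step in range(100):
    scores = {s: probe(s) for s in SKILLS}
    # Frontier: unmastered skills whose prerequisites look mastered.
    frontier = [s for s in SKILLS
                if scores[s] < 0.9
                and (PREREQ.get(s) is None or scores[PREREQ[s]] > 0.8)]
    # Mostly teach the frontier, with broad replay to preserve coverage.
    if frontier and random.random() < 0.8:
        skill = random.choice(frontier)
    else:
        skill = random.choice(SKILLS)
    train_on(skill)

print({s: round(m, 1) for s, m in mastery.items()})
```

Under uniform sampling the same student wastes most of its draws on skills it cannot yet absorb; the probe-then-select loop spends almost every step at the frontier.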

The model should learn. The sampler should learn too.


6. What would count as a real breakthrough?

The hard part is measuring which examples are pedagogically useful.

Loss is easy to measure. Abstraction is not.

A high-loss example may be useful, or it may be noise. An easy example may be redundant, or it may be the missing bridge that stabilizes a new concept. A rare example may be unimportant, or it may unlock an entire class.

So the target is not simply "train on the hardest examples." It is to find examples with high abstraction yield: examples that make future examples easier to learn.

An example has high abstraction yield if training on it reduces loss across a cluster, improves transfer to held-out variations of the same rule, connects two previously separate representation regions, clarifies a concept boundary, or turns many memorized cases into one compressed rule.
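One crude but concrete proxy for abstraction yield: the loss reduction on a whole cluster of related examples after a single update on one candidate example. A minimal sketch on a trivial model — the definition, not the model, is the point:

```python
import numpy as np

def cluster_loss(w, X, y):
    """Mean squared error of a scalar linear model over a cluster."""
    return np.mean((X * w - y) ** 2)

def abstraction_yield(w, x, y_x, X_cluster, y_cluster, lr=0.05):
    """Loss reduction on the cluster from one gradient step on (x, y_x)."""
    before = cluster_loss(w, X_cluster, y_cluster)
    grad = 2 * (x * w - y_x) * x  # d/dw of (x*w - y_x)^2
    after = cluster_loss(w - lr * grad, X_cluster, y_cluster)
    return before - after

rng = np.random.default_rng(0)
X_cluster = rng.uniform(1, 2, 50)
y_cluster = 3.0 * X_cluster  # the shared "rule": y = 3x
w = 0.0                      # untrained model

# An example that instantiates the rule has positive yield; a mislabeled
# example that contradicts it has negative yield.
print("rule example :", abstraction_yield(w, 1.5, 4.5, X_cluster, y_cluster))
print("noise example:", abstraction_yield(w, 1.5, -4.5, X_cluster, y_cluster))
```

For real models the same quantity is far harder to estimate — one step on one example barely moves the loss surface — but the shape of the measurement is the same: yield is defined against a cluster, not against the example itself.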

A real breakthrough would not simply produce a better benchmark number. It would show that after discovering an abstraction, the model learns related cases faster.

The clean experiment is not to start with frontier LLMs. Start where the true abstractions are known: modular arithmetic, compositional grammars, symbolic programs, sparse polynomials, and small games. Fit the early IID learning curve. Then introduce a teacher whose curriculum is conditioned on what the student has actually formed, not on the true abstraction known in advance. A real bend would not be a one-time jump; it would be a sustained trajectory that beats the early power-law extrapolation because related cases become cheaper after structure is discovered.

The decisive question is not whether final accuracy improves. It is whether learning one abstraction reduces the sample or compute required to learn a family of related cases. After structure is found, does the next region become cheaper?

If the model merely gets better at the same rate, we have a shift.

If learning one abstraction makes a family of future examples cheaper, we have the beginning of compounding.

A fixed sampler gives us power-law refinement.

A learned sampler gives us curriculum.

An abstraction-aware sampler may give us compounding.

That is the difference between training a model and training a learner.


Papers this essay is thinking with

Kaplan et al., "Scaling Laws for Neural Language Models". Important because it made the straightness of language-model scaling curves practically unavoidable.

Hestness et al., "Deep Learning Scaling is Predictable, Empirically". Important because it showed similar power-law behavior across several domains and observed that model improvements often shift error curves more than change exponents.

Michaud et al., "The Quantization Model of Neural Scaling". Useful because it shows how many discrete learned skills can aggregate into a smooth power law.

Sorscher et al., "Beyond Neural Scaling Laws: Beating Power Law Scaling via Data Pruning". Crucial because it demonstrates that better data selection can, in theory, break beyond power-law dataset scaling and potentially approach exponential scaling; it also reports better-than-power-law scaling in practice on ResNets trained on CIFAR-10, SVHN, and ImageNet.

DoReMi, Data Mixing Laws, RegMix, MATES, and OLMo-style curriculum work. Useful because they show the field moving from data as passive fuel toward data as an optimized control surface.

Grokking work. Important as a counterexample to any simplistic claim that neural networks never discover structure. They can; the open question is whether they can do so generally enough to change frontier-scale learning regimes.