Progressive Intelligence
Why AI may push the web toward a new delivery contract
Imagine opening a site, typing immediately, and getting a decent answer in two seconds. Not the best answer the system will ever be capable of, just one that is coherent enough to keep your train of thought alive. While you read, the system improves in the background. By the next turn it has more depth, more precision, or broader modality support. You never wait for the full stack to become ready before thinking. You just begin, and the system meets you there.
Not full capability delayed. Not a nicer loading screen. Not a dropdown where the user pre-selects a model tier.
Rather: early usefulness followed by stronger later turns.
That experience has a name in this essay: progressive intelligence, staged capability delivery across a session where the early state is already worth using and the later state inherits the earlier one rather than merely replacing it.
That distinction matters. A system can answer quickly with one path, answer better later with another, and still fail to be truly progressive if the stronger path cannot continue the weaker one's internal work. In that case, the product may feel smooth, but the architecture is still doing restart behavior: reprocessing from scratch behind a polished surface.
The core claim here is that AI may push the web toward a new delivery contract, not because models keep getting better, but because useful intelligence is becoming too heavy to deliver as a lump. If that is true, the frontier is not just model quality. It is the design of systems that let capability arrive in stages without breaking the session.
The bottleneck changed
The web beat desktop not because it was the best execution environment, but because it was the best distribution environment. A URL collapsed discovery, install, updates, and collaboration into one interface. It made software easier to arrive at than to install.
Platforms usually win by minimizing the dominant friction of their era. For the web, that friction was software logistics.
AI changes that.
Once software becomes intelligence, the challenge is no longer just getting code to the user. It is getting useful capability to the user under constraints like startup latency, memory movement, serving cost, and continuity. The bottleneck shifts from software logistics to inference logistics.
That shift changes the real design question. The question is no longer only "what is the best model?" It is also "what is the best way to deliver evolving capability in use?"
Much of today's discourse still frames this as a question of location: should AI be remote or local? But that misses the point. Remote systems and local systems each solve real problems, but neither addresses the deeper one: how should capability be delivered over the life of a session under real constraints?
If intelligence becomes too heavy to deliver as a lump, then the winning systems may be the ones that learn how to arrive in layers.
The hard part is not loading more model. It is preserving continuity.
At first glance, this looks like a straightforward scaling trick. Start with a smaller path, answer immediately, then add more model later.
But that picture is too shallow in two different ways.
The first problem is at the front edge: the startup floor is not only about reasoning depth. A model can have enough logic to begin answering and still be too heavy to start quickly, because too much of the minimum runnable state is tied up in representation.
Base interaction is not the same thing as base reasoning.
A model's reasoning depth can vary. Its representational substrate, its ability to map tokens into a coherent semantic space and to remain executable as a model, has to exist in some form from the beginning. This is why the naive answer, "just load the most important words first," does not work. If the tokenizer emits a token whose representation the model cannot support, you do not get a blurrier version of the same model. You get an invalid one.
So even the first-stage system is harder than it looks. It has to be not just smaller, but real: executable, coherent, and worth using.
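One way to make "executable from the start" concrete is a validity check before the first-stage model ever serves a token. A minimal sketch, assuming a runtime that refuses to serve a stage unless its embedding table covers the full tokenizer vocabulary and its layer stack is complete; every name and field here is illustrative, not a real API:

```python
from dataclasses import dataclass

# Hypothetical description of a staged checkpoint; illustrative fields only.
@dataclass(frozen=True)
class StageArtifact:
    vocab_rows: int      # rows actually present in the embedding table
    loaded_layers: int   # transformer layers fully resident in memory
    total_layers: int    # layers this stage is declared to have

def is_executable_stage(stage: StageArtifact, tokenizer_vocab: int) -> bool:
    # The tokenizer can emit any id in [0, tokenizer_vocab). Every such id
    # needs a real embedding row; otherwise the model is invalid, not blurry.
    covers_vocab = stage.vocab_rows >= tokenizer_vocab
    # A stage must be a complete runnable model, not a truncated download.
    complete_stack = stage.loaded_layers == stage.total_layers
    return covers_vocab and complete_stack

# "Load the most important words first" fails this check:
partial = StageArtifact(vocab_rows=20_000, loaded_layers=12, total_layers=12)
print(is_executable_stage(partial, tokenizer_vocab=32_000))  # False: vocab gap
```

The point of the sketch is that the gating condition is binary: a first-stage system either clears it and is real, or it is not a model at all.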
The second problem is at the back edge, and it is deeper.
The real wall is continuity.
You have probably already felt what happens when continuity breaks. You are three turns into a conversation. The system has been tracking your constraints, your tone, your intent. Then something shifts. The next answer ignores what you just said, or contradicts an earlier response, or feels like you are suddenly talking to a different mind. The session did not deepen. It reset, and you can feel it.
That is what happens, at the product level, when a stronger path arrives but cannot inherit what the weaker path was doing. A stronger path is not automatically a continuation of a weaker one. If hidden size changes, the residual stream changes shape. If KV-cache structure changes, prior conversational state may no longer be directly reusable. If routing or representational geometry changes too much, the later system may be "better" in general while still being unable to inherit the earlier system's actual live state.
This is the crucial distinction:
- staged loading sounds progressive
- staged replacement looks progressive
- true progressive intelligence requires continuity across capability upgrades
Without that, the session is not really deepening. It is just swapping.
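To see why staged replacement is not continuity, it helps to look at the live state a stronger path would have to inherit. A rough sketch, under the stated assumption that per-layer KV-cache tensors are reusable only when their geometry matches and that a deeper model can at best reuse the layers it shares with a shallower one; the types and the compatibility rule are illustrative, not any real serving framework's:

```python
from dataclasses import dataclass

# Illustrative stage descriptions, not a real framework's types.
@dataclass(frozen=True)
class StageShape:
    hidden_size: int   # width of the residual stream
    num_kv_heads: int  # key/value heads per layer
    head_dim: int      # per-head dimension
    num_layers: int    # depth

def can_inherit_kv_cache(old: StageShape, new: StageShape) -> bool:
    """Can the stronger stage reuse the weaker stage's live KV cache?

    Assumption: per-layer cache tensors transfer only when their geometry
    matches, and a deeper model can at most reuse the layers it shares.
    """
    same_geometry = (
        old.num_kv_heads == new.num_kv_heads
        and old.head_dim == new.head_dim
        and old.hidden_size == new.hidden_size
    )
    return same_geometry and new.num_layers >= old.num_layers

small = StageShape(hidden_size=2048, num_kv_heads=8, head_dim=128, num_layers=18)
wider = StageShape(hidden_size=4096, num_kv_heads=16, head_dim=128, num_layers=36)
deeper = StageShape(hidden_size=2048, num_kv_heads=8, head_dim=128, num_layers=36)

print(can_inherit_kv_cache(small, wider))   # False: residual stream changed shape
print(can_inherit_kv_cache(small, deeper))  # True: shared-layer caches line up
```

When the check fails, the "upgrade" can only replay the session from visible text, which is exactly the swap-not-deepen failure described above.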
This matters because the most seductive version of the idea is also the weakest one: "small model first, big model later." That can produce a nice UX. It can hide latency. It can demo well. But if the stronger system must reconstruct the session from visible text rather than inheriting the earlier path's state, then the architecture has not actually solved the thing that matters.
The systems problem the idea lives or dies on is that intermediate states must be:
- executable
- useful
- continuity-preserving
The economics look like they reward this direction, but it is not settled
The strongest objection to this whole thesis is that it might just be an elegant way of describing transitional engineering pain.
Maybe hosted inference gets so fast, cheap, and persistent that none of this matters strategically. Maybe remote systems become "progressive enough" in practice. Maybe better compression solves startup pain without requiring any deeper rethinking of session architecture.
That is a serious objection.
There is strong evidence that the cost to reach a fixed level of model capability has been falling quickly. Epoch AI documents sharp drops in LLM inference prices, though the drops are uneven by task. Recent work on price-performance argues that the cost required to reach a given benchmark level has been falling roughly 5x to 10x per year across several benchmark categories. (Epoch AI)
If that trend dominates everything else, much of the urgency behind this essay weakens.
But the relevant economic unit is not cost per token or cost per benchmark point. It is closer to:
What does it cost to deliver a satisfying session?
That reframing changes the picture. The price of a unit of capability falls. But the number of units a useful session consumes rises at the same time. Longer reasoning traces, more tool use, multimodal work, more persistent agent behavior, and more ambitious workflows all push demand upward. The Price of Progress reports that despite large price-performance gains, the cost of running benchmarks in its dataset has often stayed flat or increased because larger models and greater reasoning demand offset per-unit efficiency gains. (arXiv)
That is the gap this idea lives in: per-unit cost falls, but per-session cost resists falling at the same rate.
The question is not whether models are getting cheaper. They are. The question is whether useful interactive intelligence remains heavy enough, latency-sensitive enough, or session-shaped enough that staged delivery stays strategically valuable. Epoch AI's discussion of the "inference cost burden" points in the same direction: capability-adjusted costs can fall quickly while frontier systems still remain inference-hungry in practice. (Epoch AI)
That answer is not settled.
And that is why the idea is live.
So where is this starting to show up?
The claim is not that progressive intelligence already exists in mature form. The claim is narrower: the field is beginning to produce pieces of the stack that make the direction credible.
MatFormer matters because it breaks the fixed-checkpoint assumption. It shows that one trained model can expose multiple smaller submodels rather than forcing each deployment size to be a separate artifact. In their experiments, an 850M language model yielded a family of extracted submodels from 582M up to the full model. That does not solve progressive intelligence, but it does make "related capability tiers inside one model family" a real architectural object. (arXiv)
Gemma 3n pushes the idea further into practice. Google's public documentation explicitly describes structurally related capability tiers, including E2B inside E4B and intermediate "Mix-n-Match" configurations between them. That is meaningful not because it gives us the full browser-native progressive runtime, but because it suggests model families can be designed as staged capability spaces, not just disconnected small/medium/large checkpoints. (Google Developers Blog)
Gemma 4 is useful as a contrast. Its smaller models emphasize Per-Layer Embeddings for parameter efficiency and on-device deployability. That is valuable progress, but it is a different axis of progress. A model can become dramatically easier to run on-device without giving us the stronger property this essay cares about: intermediate states that are executable, useful, and continuity-preserving across capability upgrades. Efficient deployability is not yet the same thing as a true progressive runtime. (Google AI for Developers)
That is the right update to make from the evidence:
- stop assuming deployment size is the only useful abstraction
- start noticing model families as staged capability spaces
- do not confuse on-device efficiency with true progressiveness
- start looking for runtime semantics, not just better packaging
The architecture is beginning to hint at the possibility. The runtime semantics still lag behind the intuition.
That gap is where this idea becomes interesting.
What this changes if you are building
If the preceding evidence is directionally right-if session-level delivery is a real design problem, not just a transitional one-then product and runtime teams should be asking different questions.
Product teams should stop treating time to first token as the whole startup metric. The real question is not just "how fast did the system respond?" but "how quickly did the system become good enough to preserve momentum?" A fast but weak first answer can still derail the user's thinking. An equally prompt but directionally useful one can keep the session alive. That suggests different product metrics: time to useful first answer, second-turn quality, and whether the session feels like it is genuinely deepening rather than merely continuing.
Product teams should also stop designing around isolated prompt-response pairs and start designing around session arcs. If capability can deepen over time, then the design unit changes from single-response quality to session trajectory quality. That introduces different questions: when should capability escalate, what should be front-loaded versus warmed, and what kinds of later-turn improvement actually matter to users?
Runtime teams should stop thinking only in checkpoints and start thinking in execution states. If capability is going to deepen over the life of a session, then the runtime needs to reason about minimum valid early states, compatibility across upgrades, inheritance rather than replay, and how much of a session can truly be preserved. A checkpoint-centric system serving two sizes is still not a progressive runtime.
The optimization target may also change. The relevant unit may no longer be just page load, cost per token, or benchmark quality per dollar. It may become something closer to quality of capability delivered over the life of a session. That shifts attention from static efficiency to dynamic orchestration: which bytes or compute buy the most visible in-session improvement, when should the system deepen rather than front-load, and how should teams reason about session economics rather than request economics?
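One hypothetical way to operationalize "quality of capability delivered over the life of a session" is quality integrated over session time, per dollar spent. The objective itself is a made-up illustration of session economics versus request economics, not a standard metric:

```python
def session_value(samples, cost_dollars):
    """samples: (seconds, quality) pairs over the session, sorted by time.

    Trapezoidal integral of quality over time, divided by session cost.
    An illustrative objective, not an established measure.
    """
    area = 0.0
    for (t0, q0), (t1, q1) in zip(samples, samples[1:]):
        area += (q0 + q1) / 2.0 * (t1 - t0)
    return area / cost_dollars

# A session that deepens mid-way can beat a flat one at the same cost:
flat = [(0, 0.6), (60, 0.6)]
deepening = [(0, 0.5), (20, 0.7), (60, 0.9)]
print(session_value(flat, 0.01) < session_value(deepening, 0.01))  # True
```

The interesting property is that this objective rewards exactly the orchestration questions above: when deepening happens matters as much as how good the system eventually gets.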
What would make this real
This thesis strengthens if three things happen.
First, users have to care about the experience of beginning: first-turn usability, second-turn improvement, continuity, and reduced waiting.
Second, the technical stack has to move in this direction: elastic model families, staged capability paths, and runtimes that preserve continuity rather than merely swapping models.
Third, the economics have to keep rewarding staged delivery: enough pressure from latency, device limits, serving cost, and session shape that "full capability upfront" remains meaningfully expensive or awkward.
If hosted systems become so fast, cheap, and stateful that they already feel progressive enough, if compression solves most startup pain, and if session demands stop outrunning efficiency gains, then progressive intelligence may remain a useful framing with limited strategic force.
But if products start competing on how intelligence arrives, and if continuity becomes a real differentiator rather than a backend nuisance, then this stops being a nice metaphor and starts becoming a real design principle.
Conclusion
The old web contract was built for a world where the hard part was getting software to the user.
AI changes that.
The hard part is increasingly getting useful capability to the user without making them wait, without exhausting the device, and without breaking continuity as the system gets stronger.
Once an artifact becomes too heavy to deliver as a lump, the systems that survive are usually the ones that learn how to arrive in layers.
That has been true for pages. It has been true for media. It has been true for applications.
It may turn out to be true for intelligence too-though intelligence may demand a harder version of the same principle.