Cheaper pipeline parallelism that holds up at scale

A frontier model is too big to train on one chip, or even one rack of chips. So it is cut up and spread across racks. One way to cut it is by depth: rack 0 holds the first chunk of layers, rack 1 the next, and so on. A batch flows forward through the racks, a loss is computed at the end, and gradients flow all the way back. This is pipeline parallelism.

It has two costs. The first is idle time. A rack can’t start the backward pass until the forward pass has reached the end and turned around, so racks sit waiting. The more racks you split across, the more of them are idle at any moment. The second cost is traffic. Every gradient has to travel back across every link between racks, and the links between racks are the slowest wires in the building.

Both costs grow with model size, since bigger models need more racks. They are well known; it’s why Ilya Sutskever has said that we now know pipelining is not wise, and why the labs lean on it as little as they can. But depth-splitting is the natural thing to reach for when a model won’t fit, so the question is whether you can keep the split and drop the costs.

The idea is simple: stop doing the full-stack backward pass, and have each rack predict the next token from its own slice. Each rack gets a copy of the output head, computes a next-token loss on its own layers’ output, and backpropagates only into itself. The activation it passes to the next rack is detached, so no gradient ever crosses a rack boundary. No round trip, no bubble, and no gradient traffic on the slow link.

Predicting straight off an early rack’s layers is asking a lot of them, and on its own it hurts. The fix is to give each rack a few extra layers of its own before the output head, a small private decoder I’ll call a coda. The coda does the work of turning a mid-stack representation into a prediction, so the rack’s main layers don’t have to. This is close to the trick behind a looped language model, which reuses a block of layers to refine a representation before reading it out, except here every rack carries its own refining layers instead of sharing one block.
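To make the gradient cut concrete, here is a toy sketch in plain Python. It is not the training code behind these runs, and everything in it is an assumption made for illustration: each “rack” is a single tanh unit with weights w and b, its coda and output head are collapsed into one private read-out weight v, and the loss is squared error on a made-up regression task rather than next-token cross-entropy. The point is only the data flow: each block updates from its own loss and hands its output forward as a plain number, so no gradient with respect to the input is ever computed or sent back.

```python
import math

def make_block(w=0.5, b=0.0, v=0.5):
    # w, b: the block's own "layers"; v: its private read-out (stand-in for coda + head)
    return {"w": w, "b": b, "v": v}

def local_step(block, h_in, targets, lr=0.1):
    """One full-batch update of a single block from its own loss.

    h_in is treated as a constant: the gradient with respect to it is never formed.
    """
    n = len(h_in)
    gw = gb = gv = loss = 0.0
    h_out = []
    for x, y in zip(h_in, targets):
        h = math.tanh(block["w"] * x + block["b"])
        err = block["v"] * h - y
        loss += 0.5 * err * err / n
        dz = err * block["v"] * (1.0 - h * h)  # backprop through read-out and tanh
        gv += err * h / n
        gw += dz * x / n
        gb += dz / n
        h_out.append(h)  # forwarded as a plain float, i.e. "detached"
    block["w"] -= lr * gw
    block["b"] -= lr * gb
    block["v"] -= lr * gv
    return h_out, loss

def train(num_blocks, steps=200):
    xs = [-1.0, -0.5, 0.5, 1.0]      # toy inputs (made up)
    ys = [0.5 * x for x in xs]       # toy targets, shared by every block's local loss
    blocks = [make_block() for _ in range(num_blocks)]
    history = []
    for _ in range(steps):
        h, losses = xs, []
        for blk in blocks:           # forward only; each block updates itself
            h, loss = local_step(blk, h, ys)
            losses.append(loss)
        history.append(losses)
    return blocks, history
```

Because nothing flows backward, block 0 trains identically whether two blocks or three sit behind it: `train(2)` and `train(3)` leave its parameters equal.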

I’ll call the whole thing block-wise local training. That it costs less to run is easy to show. Whether the model is any good, after throwing away the gradient signal that ties the layers together, is the real question, and the answer is more interesting than yes or no.

What it costs to run

The idle time vanishes. After a one-step warmup every rack is busy every step, because there is no global backward pass to wait on. On four racks I measured a 38% throughput gain at the same batch size. I’d expect the gain to grow with rack count, because the bubble it removes does: in standard pipelining it goes from a quarter of your GPUs idle at four racks to most of them idle at thirty-two.
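The bubble figures are consistent with the usual idle-fraction formula for a GPipe-style schedule, (p - 1) / (m + p - 1) for p stages and m microbatches per step. A minimal sketch follows. The microbatch count of 9 is an assumption, not a figure from these runs; it is chosen only because it reproduces the quarter-idle number at four racks, and an interleaved schedule would give different values.

```python
def bubble_fraction(stages, microbatches):
    """Fraction of device-time spent idle in one GPipe-style pipeline step."""
    return (stages - 1) / (microbatches + stages - 1)

for racks in (4, 8, 16, 32):
    idle = bubble_fraction(racks, microbatches=9)  # 9 is an assumed value
    print(f"{racks:>2} racks: {idle:.1%} idle")
```

Under that assumption the idle fraction is 0.25 at four racks and 0.775 at thirty-two. Raising the microbatch count shrinks the bubble, but it also raises the number of activations held in flight.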

The traffic nearly vanishes too. Each rack still sends its activations forward, but it receives no gradients back. Over a fixed window the standard approach moved about 540 GB in each direction across the slow link. Block-wise local training received 38 GB, about fourteen times less. The one thing the racks could still share is the coda, which they could average occasionally; in practice you can skip even that and let each rack keep its own.

It also uses less memory, because a rack holds only one batch of activations in flight instead of one per pipeline stage.

These three wins are structural. They are properties of the schedule, not of the optimizer, so they hold at step one and at step a million, at 1B and at 1T. None of them depend on the quality question at all. If the model trained this way is even close to as good, you take the deal. So: is it?

Is the model any good?

I trained the same model both ways at four sizes (1B, 3B, 7B, 13B parameters), holding everything else fixed: same data, same order, same seed, same optimizer, same budget. Then I compared validation loss.

At 1B the local method is 1.5% worse, which is what you’d expect from throwing away gradient information. The penalty shrinks as the model grows, crosses zero past 1B, and by 13B the local method is 15% better on the same budget. The 13B result holds whether you compare at equal training time or equal tokens. In fact the standard method processed more tokens in the same time, 10.3 billion to the local method’s 9.7, and still ended up with higher loss, so this isn’t the local method winning by being faster.

That looks like a clean “gets better with scale” story. But there’s a confound I have to deal with before I’m allowed to tell it.

The honest problem

None of these runs is trained to the point of diminishing returns. Compute-optimal is roughly twenty tokens per parameter; my best-trained run saw under two. And real frontier models now train far past even that, perhaps 100x over Chinchilla, to squeeze more from a fixed model size. So a skeptic has a clean objection: maybe the local method only helps while a model is under-trained, and bigger models in my sweep are simply more under-trained. The advantage would then be a mirage that vanishes at convergence.

The objection bites because the most-trained model I have is the 1B, and the 1B is exactly where the local method loses. The more a run trained, the worse the local method did against the baseline. That is the skeptic’s pattern.

The headline numbers can’t separate the two stories, but a different cut can. The two predict different things at matched training progress. Plot the gap against tokens-per-parameter: if it’s under-training, every model collapses onto one curve; if it’s size, the curves stay separated and ordered by size.

They separate, ordered by size, by as much as fifteen percentage points. Under-training alone can’t produce that, and it’s the one piece of evidence here I’d defend: the effect is about scale.
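The matched-progress comparison can be sketched as follows. This is not the script behind the figure; the curve format, a list of (tokens_per_parameter, validation_loss) pairs per run, and the function names are assumptions for illustration. Each run’s loss is interpolated at the same tokens-per-parameter value and the relative gap is computed there, so runs of different sizes are compared at equal training progress rather than at equal wall-clock time or token count.

```python
from bisect import bisect_left

def loss_at(curve, tpp):
    """Linearly interpolate a loss curve at a tokens-per-parameter value.

    curve: [(tokens_per_param, loss), ...] sorted by tokens_per_param (assumed format).
    """
    xs = [p for p, _ in curve]
    if not xs[0] <= tpp <= xs[-1]:
        raise ValueError("tokens-per-parameter value outside the logged range")
    i = bisect_left(xs, tpp)
    if xs[i] == tpp:
        return curve[i][1]
    (x0, y0), (x1, y1) = curve[i - 1], curve[i]
    return y0 + (y1 - y0) * (tpp - x0) / (x1 - x0)

def gap_percent(baseline_curve, local_curve, tpp):
    """Relative loss gap at matched progress; positive means the local method is better."""
    base = loss_at(baseline_curve, tpp)
    return 100.0 * (base - loss_at(local_curve, tpp)) / base
```

Evaluating `gap_percent` on a shared grid for each model size gives one curve per size. Curves that collapse onto each other point to under-training; curves that stay separated and ordered point to size.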

I won’t oversell it. The same plot cuts both ways. The 3B line drifts toward zero as it trains, so some of the edge does decay. And nothing here passes 0.7 tokens per parameter, so I cannot see convergence. The 13B line is flat over the range I have, but “no sign yet” is not “won’t.”

Why it might be about size

If it is scale, here’s the mechanism I’d bet on. A small model has too few layers to specialize, so every layer does a bit of everything and adjacent layers are tightly coupled. Cut between two tightly coupled layers and the gradient you discarded was doing real work; the model suffers. A large model has room to specialize. Early layers encode, middle layers compute, late layers decode, and the boundaries between those roles are natural seams. Cut along a seam and the discarded gradient was barely holding anything together; the cut is nearly free.

There’s a second signal for this. Each rack has its own coda, so you can ask how much worse the shallowest rack’s prediction is than the deepest rack’s. If early layers produced useless features, that gap would be large. At 1B it’s 0.29 nats; from 3B up it collapses to around 0.1. Bigger models make each rack’s output independently decodable, which is what the specialization story predicts, and it’s measured at matched progress so it doesn’t suffer from the under-training confound.

A bonus, for inference

The same structure helps generation. Because every rack can decode its own slice, you can sample as a wavefront: rack 0 predicts a token, hands it down the racks to be refined, and starts on the next token without waiting for the round trip. Tokens are in flight on every rack at once.
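As a sketch of the schedule this implies (hypothetical, since none of it has been benchmarked): at tick t, rack r works on token t - r. The function name and the timing model are illustrative assumptions: each rack takes one tick per token, and rack 0 continues from its own draft rather than waiting for the final rack.

```python
def wavefront_schedule(num_racks, num_tokens):
    """Return, for each tick, the token index each rack is working on (None if idle)."""
    ticks = []
    for t in range(num_tokens + num_racks - 1):
        row = []
        for r in range(num_racks):
            tok = t - r  # rack r sees token `tok` r ticks after rack 0 drafted it
            row.append(tok if 0 <= tok < num_tokens else None)
        ticks.append(row)
    return ticks

for t, row in enumerate(wavefront_schedule(num_racks=4, num_tokens=6)):
    print(t, row)
```

After the first few ticks every rack holds a different token, and the only idle slots are the fill at the start and the drain at the end.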

That matters because of how decode is priced. Generating one token at a time is memory-bound: you pay to fetch the whole model from memory to produce a single token, and that fetch isn’t amortized over anything. Prefill, which processes many tokens together, is compute-bound and far cheaper per token. The game in serving is to raise decode’s arithmetic intensity until it looks more like prefill, and the lever for that is batching: more tokens sharing each weight fetch. A wavefront stacks neatly with that, since every rack stays busy on a different point in the batch instead of draining between tokens. The draft tokens also fall out for free, because every rack already holds a coda you can speculate with.
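A rough roofline calculation shows why batching is the lever. This sketch assumes a dense model with 2 FLOPs per parameter per token and 2-byte weights, and it ignores KV-cache and activation traffic; the function names are hypothetical, and the hardware numbers in the note below are round illustrative values, not measurements of any particular chip.

```python
def decode_intensity(tokens_per_weight_fetch, bytes_per_param=2.0):
    """FLOPs performed per byte of weights read (one multiply + one add per parameter per token)."""
    return 2.0 * tokens_per_weight_fetch / bytes_per_param

def ridge_point(peak_flops, mem_bandwidth_bytes):
    """Arithmetic intensity above which the chip is compute-bound rather than memory-bound."""
    return peak_flops / mem_bandwidth_bytes

def is_memory_bound(tokens_per_weight_fetch, peak_flops, mem_bandwidth_bytes,
                    bytes_per_param=2.0):
    intensity = decode_intensity(tokens_per_weight_fetch, bytes_per_param)
    return intensity < ridge_point(peak_flops, mem_bandwidth_bytes)
```

With a hypothetical accelerator at 1e15 FLOP/s and 3e12 bytes/s, the ridge point is about 333 FLOPs per byte. Under these assumptions a single-token decode step sits at 1 FLOP per byte, and decode stays memory-bound until a few hundred tokens share each weight fetch.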

I haven’t benchmarked any of this; it’s a projection from the training-side numbers. But the shape is suggestive: the thing that removes the bubble in training also fills the pipeline in inference.

What I’m claiming, and what I’m not

What I’m confident about: the method is much cheaper to run, by properties of the schedule that don’t depend on anything subtle. An order of magnitude less data over the slow link, no idle bubble, less memory. Those are the reasons to care, and they get more valuable as you add racks.

What I’m claiming carefully: at the four sizes I tested, on matched budgets, the local method goes from slightly worse at 1B to clearly better at 13B, and the advantage is ordered by model size even after controlling for how far each run trained. That’s real, and it’s the opposite of how cheap-training tricks usually behave, which is to look good on a 1B toy and collapse at scale.

What I’m not claiming: that this survives to convergence. Every run is far short of compute-optimal, one size shows the gap decaying, and no run reaches the regime a production model would use. The experiment that settles it is to fix one size and train both methods to twenty tokens per parameter. I haven’t run it.
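For a sense of what that experiment costs, here is a back-of-envelope sketch. It uses the twenty-tokens-per-parameter target from above and the common approximation of about 6 FLOPs per parameter per token for training a dense transformer. The 6ND rule and the function names are assumptions brought in for illustration, and the coda layers, which would add to the local method’s cost, are not counted.

```python
TOKENS_PER_PARAM = 20  # compute-optimal rule of thumb quoted in the text

def target_tokens(params):
    """Tokens needed to reach the tokens-per-parameter target."""
    return TOKENS_PER_PARAM * params

def training_flops(params, tokens):
    """Approximate training cost: ~6 FLOPs per parameter per token (assumed rule)."""
    return 6 * params * tokens

for params in (1e9, 3e9, 7e9, 13e9):
    tokens = target_tokens(params)
    print(f"{params / 1e9:>4.0f}B params: {tokens / 1e9:>5.0f}B tokens, "
          f"{training_flops(params, tokens):.2e} FLOPs per method")
```

The budget grows with the square of model size at a fixed tokens-per-parameter target, which is why fixing one size is the practical version of the test.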

So this isn’t a result, it’s a reason to run that experiment. But it’s a good one. A parallelism scheme that is strictly cheaper to operate, and that across the sizes I could afford is slightly worse only at the smallest and better as the model grows, is worth chasing down.

Trained on fineweb-edu across 4 nodes of 8×H100. Code, logs, and the scripts behind every figure are on GitHub. The framing of how frontier models are trained and served is from Reiner Pope’s lecture with Dwarkesh Patel.