Cheaper pipeline parallelism that holds up at scale
Pipeline parallelism wastes GPUs on idle time and ships gradients across the slowest link in the system. What if each rack just predicted the next token from its own layers instead of doing the full-stack backward pass? On its own it hurts, but a few private decoder layers per rack fix that. It moves an order of magnitude less data, never stalls, and at 13B it trains a better model than the standard method on the same budget.