Is it scale, or just under-training?

The gap at matched training progress. Pure under-training would put the lines on top of each other. They separate by size instead.
[Figure: val-loss gap, (BLT − PP) / PP, plotted against tokens per parameter (training progress), with one line each for the 1B, 3B, and 13B models. Above 0% BLT is worse; below 0% BLT is better.]
At a fixed tokens-per-parameter ratio, the bigger model has the larger advantage: an effect of scale, not duration. Two caveats: the 3B line drifts toward zero as it trains, and nothing here passes 0.7 tokens/param, far short of the ~20 that is compute-optimal.
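
To make the two axes concrete, here is a minimal sketch of how each plotted point could be computed. The function names and all numbers are made up for illustration; they are assumptions, not values read off the chart or taken from the underlying runs.

```python
def tokens_per_param(tokens_seen: float, n_params: float) -> float:
    """x-axis: training progress, normalised by model size."""
    return tokens_seen / n_params


def relative_gap(blt_loss: float, pp_loss: float) -> float:
    """y-axis: (BLT - PP) / PP. Negative means BLT has the lower val loss."""
    return (blt_loss - pp_loss) / pp_loss


# Hypothetical placeholder values, NOT results from the plot.
n_params = 1e9        # a 1B-parameter model
tokens_seen = 0.5e9   # tokens consumed so far
blt_loss, pp_loss = 1.9, 2.0

progress = tokens_per_param(tokens_seen, n_params)
gap = relative_gap(blt_loss, pp_loss)
print(f"{progress:.2f} tokens/param, gap {gap:+.1%}")

# The second caveat in numbers: 0.7 tokens/param against a ~20 target.
fraction_of_optimal = 0.7 / 20
print(f"about {fraction_of_optimal:.1%} of the compute-optimal budget")
```

Comparing models at the same `tokens_per_param` value, rather than at the same raw token count, is what lets the chart separate a size effect from a training-duration effect.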