BLT: scale vs under-training

The gap, compared at matched training progress: if under-training alone explained it, the lines for every model size would lie on top of each other. Instead they separate by size.

At a fixed tokens-per-parameter ratio the bigger model has the larger advantage, which points to an effect of scale rather than training duration. Two caveats: the 3B line drifts toward zero as training proceeds, and no run here exceeds 0.7 tokens per parameter, far short of the roughly 20 usually taken as compute-optimal.
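To make the second caveat concrete, the arithmetic can be sketched as below. This is only an illustration: the function names are hypothetical, the 20 tokens/param figure is the rough rule of thumb quoted above rather than a measured optimum, and applying the 0.7 ceiling to the 3B model is an assumption (the note only says that no run passes it).

```python
# Rough rule-of-thumb ratio quoted in the note; an approximation, not a measured value.
COMPUTE_OPTIMAL_TPP = 20.0


def tokens_per_param(tokens_seen: float, n_params: float) -> float:
    """Training progress expressed as tokens seen per model parameter."""
    return tokens_seen / n_params


def tokens_for_ratio(n_params: float, ratio: float) -> float:
    """Number of training tokens needed to reach a given tokens/param ratio."""
    return n_params * ratio


# Figures taken from the note: a 3B-parameter model and a 0.7 tokens/param ceiling.
# Assumption: the ceiling is treated as an upper bound for the 3B run.
n_params = 3e9
max_tpp = 0.7

tokens_at_ceiling = tokens_for_ratio(n_params, max_tpp)
tokens_at_optimal = tokens_for_ratio(n_params, COMPUTE_OPTIMAL_TPP)
fraction_of_optimal = max_tpp / COMPUTE_OPTIMAL_TPP

print(f"tokens at 0.7 tokens/param: {tokens_at_ceiling:.2e}")
print(f"tokens at ~20 tokens/param: {tokens_at_optimal:.2e}")
print(f"fraction of compute-optimal budget covered: {fraction_of_optimal:.1%}")
```

Under these assumptions the runs cover only a few percent of a compute-optimal token budget, so the size-dependent separation says little about how the gap behaves for models trained to completion.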

Is it scale, or just under-training?