The gap at matched training progress. Pure under-training would put the
lines on top of each other. They separate by size instead.
At a fixed tokens-per-parameter ratio, the bigger model has the larger advantage:
an effect of scale, not duration. Two caveats: the 3B line drifts toward zero as it trains, and
nothing here passes 0.7 tokens/param, far short of the ~20 that is compute-optimal.
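
To put the second caveat in numbers, here is a minimal sketch of the arithmetic. It assumes "3B" means 3e9 parameters and treats ~20 tokens/param as a rough target. The helper name is hypothetical, not from the original analysis.

```python
# Illustrative arithmetic only. The 3B size, the 0.7 ceiling, and the ~20
# target come from the text; everything else is an assumption.

def tokens_for_ratio(ratio: float, n_params: float) -> float:
    """Training tokens needed to reach a given tokens-per-parameter ratio."""
    return ratio * n_params

N_PARAMS_3B = 3e9             # assumed reading of "3B"
MAX_RATIO_HERE = 0.7          # highest tokens/param reached in these runs
COMPUTE_OPTIMAL_RATIO = 20.0  # approximate figure, not an exact threshold

tokens_reached = tokens_for_ratio(MAX_RATIO_HERE, N_PARAMS_3B)
tokens_optimal = tokens_for_ratio(COMPUTE_OPTIMAL_RATIO, N_PARAMS_3B)
fraction_of_optimal = MAX_RATIO_HERE / COMPUTE_OPTIMAL_RATIO

print(f"tokens at 0.7 tokens/param: {tokens_reached:.2e}")
print(f"tokens at ~20 tokens/param: {tokens_optimal:.2e}")
print(f"fraction of the way there:  {fraction_of_optimal:.1%}")
```

Under these assumptions the 3B run stops at about 2.1B tokens, against roughly 60B for the compute-optimal point, so the comparison covers only the first few percent of that range.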