The gap reverses as the model grows
Below the dashed line, block-wise local training has the lower validation loss.
equal quality
-15%
-10%
-5%
0%
1B
3B
7B
13B
+1.5%
-1.0%
-2.8%
-15.1%
model size (log scale)
val-loss gap (BLT − PP) / PP
Same data, seed, optimizer, and wall-clock budget at each size. The iso-token comparison (equal tokens seen) lands within 0.2 points of these.