The gap reverses as the model grows

Below the dashed line, block-wise local training has the lower validation loss.
[Figure: validation-loss gap, (BLT − PP) / PP, versus model size (log scale). The dashed line at 0% marks equal quality. Plotted values: 1B, +1.5%; 3B, -1.0%; 7B, -2.8%; 13B, -15.1%.]
Same data, seed, optimizer, and wall-clock budget at each size. The iso-token comparison (equal tokens seen) lands within 0.2 percentage points of these values.