BLT gap

The gap reverses as the model grows

Below the dashed line, block-wise local training has the lower validation loss.

Same data, seed, optimizer, and wall-clock budget at each size. The iso-token comparison (equal tokens seen) lands within 0.2 points of these.