13B head-to-head

Same wall-clock budget, same data, same seed. BLT reaches lower validation loss whether measured by tokens trained or by training time, and it is ahead at equal token counts even though the pipeline-parallel run saw more tokens.
[Figure, panel "by tokens": validation loss (nats, 2.50 to 7.40) vs. tokens trained (B, 0.0 to 10.3). Series: pipeline parallel, BLT.]
[Figure, panel "by training time": validation loss (nats, 2.50 to 7.40) vs. wall-clock (hours, 0.0 to 44.0). Series: pipeline parallel, BLT.]
Held-out validation loss. The gap opens after the first billion tokens.