13B head-to-head
Same wall-clock, same data, same seed. BLT reaches a lower validation loss in both views, and it wins per-token even though the pipeline-parallel run saw more tokens.
[Figure: two panels plotting validation loss (nats, axis from 2.50 to 7.40) for pipeline parallel and BLT. Panel "by tokens": x-axis is tokens trained (B), 0.0 to 10.3. Panel "by training time": x-axis is wall-clock (hours), 0.0 to 44.0.]
Held-out validation loss. The gap opens after the first billion tokens.