BLT bubble

One square is one pass on one rack in one step. F0 is microbatch 0's forward, B0 its backward.

The batch is split into microbatches so that racks can overlap their work: F0 to F3 are the forward passes of four different microbatches, not four reruns of the same one. Left, rack 0 idles from step 3 to 10, waiting for microbatch 0's gradient to return from the deeper racks. That gap is the bubble, and it widens with every rack added. Right, each rack runs its own local backward, so it never waits.

The pipeline bubble, and how block-wise local training removes it
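
The two schedules in the figure can be reproduced with a small simulation. This is a minimal sketch and not the figure's actual source. It assumes every forward and backward pass takes exactly one step, steps are counted from 0, the left panel uses a plain fill-then-drain schedule (all forwards, then all backwards), and the right panel alternates each microbatch's forward with its own local backward. The function names are hypothetical.

```python
def end_to_end_schedule(racks, microbatches):
    """Fill-then-drain pipeline: returns {rack: {step: label}}.

    Assumed model: rack r runs F_m at step r + m. Backward starts on the
    last rack right after its final forward and flows back one rack per step.
    """
    sched = {r: {} for r in range(racks)}
    first_backward = racks + microbatches - 1  # step of B0 on the last rack
    for r in range(racks):
        for m in range(microbatches):
            sched[r][r + m] = f"F{m}"
            sched[r][first_backward + (racks - 1 - r) + m] = f"B{m}"
    return sched


def local_schedule(racks, microbatches):
    """Block-wise local training: each rack runs B_m right after its own F_m."""
    sched = {r: {} for r in range(racks)}
    for r in range(racks):
        for m in range(microbatches):
            sched[r][r + 2 * m] = f"F{m}"      # activation arrives from rack r-1
            sched[r][r + 2 * m + 1] = f"B{m}"  # local gradient, no waiting
    return sched


def bubble(schedule, rack):
    """Idle steps on a rack between its first and last pass."""
    steps = sorted(schedule[rack])
    return (steps[-1] - steps[0] + 1) - len(steps)


if __name__ == "__main__":
    for racks in (4, 8):
        left = bubble(end_to_end_schedule(racks, 4), rack=0)
        right = bubble(local_schedule(racks, 4), rack=0)
        print(f"{racks} racks: end-to-end bubble = {left}, local bubble = {right}")
```

Under these assumptions, with four racks and four microbatches, rack 0 finishes F3 at step 3 and does not start B0 until step 10, leaving six idle steps. Each extra rack adds two more idle steps to rack 0 in the end-to-end schedule, while the local schedule leaves none.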