The pipeline bubble, and how block-wise local training removes it
One square is one pass on one rack in one step. F0 is microbatch 0's
forward, B0 its backward.
Legend: forward pass, backward pass, idle (bubble).
The batch is split into microbatches so racks overlap work: F0 to F3 are four microbatches,
not four reruns. Left, rack 0 idles from step 3 to 10, waiting for microbatch 0's gradient
to return from the deeper racks. That gap is the bubble, and it widens with every rack added.
Right, each rack runs its own local backward, so it never waits.
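
The growth of the bubble can be checked with a small simulation. The sketch below is not the schedule drawn in the figure. It assumes a simplified flush-style pipeline in which every forward and every backward takes exactly one step, all forwards finish before any backward starts, and backwards run in microbatch order. The function names (`simulate_pipeline`, `idle_steps`, `bubble_fraction`) and the rack counts are hypothetical, chosen only for illustration, and step numbers here are 0-indexed, so they may not line up with the figure's numbering.

```python
def simulate_pipeline(num_racks: int, num_microbatches: int) -> list[list[str]]:
    """Build a rack-by-step grid for a simplified flush-style pipeline.

    Assumptions (not taken from the figure): unit-time passes, all forwards
    complete before backwards begin, backwards flow from the deepest rack
    back to rack 0. Idle slots are marked ".".
    """
    p, m = num_racks, num_microbatches
    total_steps = 2 * (m + p - 1)
    grid = [["." for _ in range(total_steps)] for _ in range(p)]
    for rack in range(p):
        # The gradient reaches this rack only after passing through every
        # deeper rack, so shallower racks start their backward later.
        backward_start = (m + p - 1) + (p - 1 - rack)
        for mb in range(m):
            grid[rack][rack + mb] = f"F{mb}"            # forward of microbatch mb
            grid[rack][backward_start + mb] = f"B{mb}"  # backward of microbatch mb
    return grid


def idle_steps(grid: list[list[str]], rack: int) -> int:
    """Count the idle (bubble) slots on one rack."""
    return sum(1 for cell in grid[rack] if cell == ".")


def bubble_fraction(grid: list[list[str]]) -> float:
    """Fraction of all rack-steps that are idle."""
    total = sum(len(row) for row in grid)
    idle = sum(1 for row in grid for cell in row if cell == ".")
    return idle / total


# Four microbatches, as in the caption; the rack counts are illustrative.
for racks in (2, 3, 4, 5):
    grid = simulate_pipeline(racks, 4)
    print(racks, "racks:", idle_steps(grid, 0), "idle steps on rack 0,",
          f"bubble fraction {bubble_fraction(grid):.2f}")
```

Under these assumptions rack 0 idles for 2 × (racks − 1) steps, so each added rack costs it two more idle steps. With local backward passes there is no returning gradient to wait for, which is the case the right-hand panel illustrates.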