The pipeline bubble, and how block-wise local training removes it

One square is one pass on one rack in one step. F0 is microbatch 0's forward, B0 its backward.
Legend: forward pass, backward pass, idle (bubble).
[Figure: two panels, one row per rack (rack 0 to rack 3), time running left to right. Pipeline parallelism: every rack runs F0 F1 F2 F3 and later B0 B1 B2 B3. Block-wise local training: every rack alternates F0 B0 F1 B1 and so on; rack 0's row ends at B6, rack 1's at F6, rack 2's at B5, rack 3's at F5.]
The batch is split into microbatches so that racks can overlap their work: F0 to F3 are four different microbatches, not four reruns of the same one. In the pipeline-parallel panel, rack 0 idles from step 3 to 10, waiting for microbatch 0's gradient to return from the deeper racks. That gap is the bubble, and it widens with every rack added. In the block-wise panel, each rack runs its own local backward pass right after each forward pass, so it never waits on a gradient from another rack.
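The two schedules can be rebuilt in a few lines to check the size of the bubble. This is a minimal sketch rather than a real training loop, and it rests on assumptions read off the figure: forward and backward passes each take one step, hand-offs between racks are free, and the pipeline runs all forwards before any backward. The function names `pipeline_schedule`, `local_schedule`, and `idle_fraction` are hypothetical.

```python
def pipeline_schedule(racks, microbatches):
    """All-forward-then-all-backward pipeline; '--' marks an idle step."""
    steps = 2 * (microbatches + racks - 1)
    grid = [["--"] * steps for _ in range(racks)]
    # The deepest rack finishes its last forward at step microbatches + racks - 2,
    # so the first backward can start one step later.
    backward_start = microbatches + racks - 1
    for r in range(racks):
        for m in range(microbatches):
            # Forward of microbatch m reaches rack r after r hand-offs.
            grid[r][r + m] = f"F{m}"
            # Backward starts on the deepest rack and walks back toward rack 0.
            grid[r][backward_start + (racks - 1 - r) + m] = f"B{m}"
    return grid


def local_schedule(racks, steps):
    """Block-wise local training: each rack alternates forward and local backward."""
    grid = [["--"] * steps for _ in range(racks)]
    for r in range(racks):
        # Rack r waits r steps for its first input, then never idles again.
        for t in range(r, steps):
            i = t - r
            grid[r][t] = ("F" if i % 2 == 0 else "B") + str(i // 2)
    return grid


def idle_fraction(grid):
    """Share of (rack, step) squares in which the rack does nothing."""
    cells = [cell for row in grid for cell in row]
    return cells.count("--") / len(cells)


if __name__ == "__main__":
    pipe = pipeline_schedule(racks=4, microbatches=4)
    local = local_schedule(racks=4, steps=len(pipe[0]))
    for name, grid in (("pipeline", pipe), ("block-wise", local)):
        print(name, f"idle fraction = {idle_fraction(grid):.3f}")
        for r, row in enumerate(grid):
            print(f"  rack {r}: " + " ".join(row))
```

Under these assumptions rack 0's gap between its last forward and its first backward is 2 × (racks − 1) steps, which is why the bubble grows with every rack added. In the block-wise schedule the only idle squares are the start-up offsets at the far left.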