Wave Quantization

CUDA
Author

Imad Dabbura

Published

September 15, 2025

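Wave quantization is the idle tail you get when the number of GEMM output tiles (one thread block per tile) is not a multiple of the number of blocks the GPU can run at once, which equals the SM count when each SM holds one block. The GPU executes a grid's blocks in waves; a partially filled final wave hurts efficiency because the SMs with no work sit idle while the others finish.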
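As a minimal sketch of that definition, the wave count and the fraction of block slots doing useful work can be computed directly. The function name `wave_stats`, the `blocks_per_sm` parameter, and the 108-SM figure in the usage loop are illustrative assumptions, not values taken from the measurements in this post.

```python
import math


def wave_stats(num_tiles: int, num_sms: int, blocks_per_sm: int = 1):
    """Return (waves, utilization) for a launch of num_tiles thread blocks.

    Assumes every block takes the same time and that each SM can hold
    blocks_per_sm blocks concurrently (hypothetical occupancy figure).
    """
    if num_tiles <= 0 or num_sms <= 0 or blocks_per_sm <= 0:
        raise ValueError("all arguments must be positive")
    slots = num_sms * blocks_per_sm           # blocks that fit in one wave
    waves = math.ceil(num_tiles / slots)      # a partial wave still counts as a wave
    utilization = num_tiles / (waves * slots) # useful slots / slots paid for
    return waves, utilization


# Assumed example: a GPU with 108 SMs, one block resident per SM.
for tiles in (108, 109, 216, 217):
    waves, util = wave_stats(tiles, num_sms=108)
    print(f"{tiles} tiles -> {waves} wave(s), {util:.0%} of slots used")
```

Under these assumptions, going from 108 to 109 tiles doubles the wave count while only about half of the available block slots do useful work.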

1. Performance (TFLOP/s) oscillates in a saw-tooth pattern. As N increases, throughput rises smoothly, then suddenly drops at specific N values. Those drops happen when the tile count crosses a multiple of the SM count: a new wave begins, but that newly added final wave is nearly empty.

2. Kernel duration steps upward instead of growing smoothly. Duration vs. N is not linear: it stays on a plateau, then jumps sharply. Each jump corresponds to one extra wave. Even if the extra wave contains only a few blocks, it costs roughly as much as a full one, because a wave takes as long as its slowest (or only) block.

3. Tile count grows linearly with N, but waves grow discretely. The tile count plot is just a line, but every time it crosses a multiple of the SM count, the scheduler must add another wave. That mismatch between linear tiles and discrete waves is the root cause of the performance oscillations.
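The three observations above can be reproduced with a toy model. This is a sketch under stated assumptions, not the benchmark behind the plots: only N varies while M is held fixed (one setup in which tile count is linear in N), every wave costs one time unit, and the tile sizes, M, and SM count are hypothetical defaults.

```python
import math


def sweep(n_values, m=4096, tile_m=128, tile_n=128, num_sms=108):
    """Model tiles, waves, and efficiency for each N (all defaults are assumptions)."""
    tiles_m = math.ceil(m / tile_m)  # fixed number of tile rows
    rows = []
    for n in n_values:
        tiles = tiles_m * math.ceil(n / tile_n)  # grows linearly in N (in tile_n steps)
        waves = math.ceil(tiles / num_sms)       # grows in discrete jumps
        # Modeled duration is proportional to `waves`; efficiency is the
        # fraction of SM slots doing useful work, a proxy for TFLOP/s.
        efficiency = tiles / (waves * num_sms)
        rows.append((n, tiles, waves, efficiency))
    return rows


for n, tiles, waves, eff in sweep(range(128, 1537, 128)):
    print(f"N={n:5d}  tiles={tiles:4d}  waves={waves}  efficiency={eff:.2f}")
```

In this model the `waves` column is the stepped duration curve, and `efficiency` climbs within each plateau and falls each time `waves` increments, which is the saw-tooth.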

Minimal synthesis

In conclusion, wave quantization produces predictable cliffs in GEMM performance: