- AI = \(\frac{2MNK}{MK + KN + MN}\) FLOPs per element (a multiply-add counts as 2 FLOPs; divide by bytes per element for FLOP/byte). This assumes perfect reuse and is a theoretical upper bound, not what hardware actually achieves.
- Hidden Assumption: Infinite-cache model → every tile of (A, B, C) is loaded once and fully reused by all thread blocks.
- Edge Cases: If M=1 or N=1 → GEMM degenerates to GEMV → each element of the matrix operand is used only once → AI ≈ 2 FLOPs per element, i.e. ≲ 1 FLOP/byte, regardless of size → always memory-bound.
- GEMM becomes compute-bound only when all dimensions are reasonably large. NVIDIA guideline: any dimension \(\le 128\) → likely memory-bound.
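The infinite-cache AI formula above can be sketched numerically. A minimal example: compute AI for a large square GEMM and for the M=1 (GEMV) edge case, and compare each against a roofline ridge point. The peak FLOP/s and bandwidth numbers here are illustrative assumptions, not measurements of any specific GPU.

```python
def gemm_ai(M, N, K, bytes_per_elem=2):
    """Arithmetic intensity in FLOP/byte under the infinite-cache model:
    A (MxK), B (KxN), C (MxN) each moved to/from DRAM exactly once."""
    flops = 2 * M * N * K  # one multiply-add = 2 FLOPs
    bytes_moved = bytes_per_elem * (M * K + K * N + M * N)
    return flops / bytes_moved

# Hypothetical accelerator ridge point = peak FLOP/s / peak bytes/s.
# (Illustrative FP16 numbers: 312 TFLOP/s, 2 TB/s -> 156 FLOP/byte.)
ridge = 312e12 / 2.0e12

square = gemm_ai(4096, 4096, 4096)  # large square GEMM
gemv = gemm_ai(1, 4096, 4096)       # M = 1 degenerates to GEMV

print(f"square GEMM AI = {square:.1f} FLOP/B, compute-bound: {square > ridge}")
print(f"GEMV AI        = {gemv:.2f} FLOP/B, compute-bound: {gemv > ridge}")
```

The square case lands far above the ridge point (compute-bound in this model), while the GEMV case sits near 1 FLOP/byte, well below it, matching the edge-case note above.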