cuBLASLt Grouped GEMM
Grouped GEMM removes the restriction that every matrix in a batch must share the same dimensions, enabling developers to process "irregular" workloads, such as those found in Mixture-of-Experts (MoE) models or LoRA (Low-Rank Adaptation) fine-tuning, with significantly higher GPU efficiency.

Why Grouped GEMM?
For this, NVIDIA introduced specific Grouped GEMM features, often used through newer cuBLASLt extensions. In cuBLASLt, if you need variable sizes, you typically must either process the problems in sub-groups of identical sizes or use cublasLtMatmul with specific "Grouped" descriptors where your CUDA version provides them. Check the latest CUDA 12.x documentation for the Grouped GEMM extensions that are actually available.
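The "sub-groups of identical sizes" fallback can be sketched on the host side as a simple bucketing step. This is a minimal illustration, not part of cuBLASLt: the `ProblemShape` struct and the `bucket_by_shape` helper are hypothetical names introduced here, and the actual batched call for each bucket is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <tuple>
#include <vector>

// Hypothetical shape record for one GEMM: C (m x n) = A (m x k) * B (k x n).
struct ProblemShape {
    int m, n, k;
};

// Groups problem indices by identical (m, n, k), so that each bucket could be
// submitted as one uniform-size batched GEMM. Illustrative helper only; this
// is not a cuBLASLt function.
std::map<std::tuple<int, int, int>, std::vector<std::size_t>>
bucket_by_shape(const std::vector<ProblemShape>& problems) {
    std::map<std::tuple<int, int, int>, std::vector<std::size_t>> buckets;
    for (std::size_t i = 0; i < problems.size(); ++i) {
        const ProblemShape& p = problems[i];
        assert(p.m > 0 && p.n > 0 && p.k > 0);  // reject degenerate shapes
        buckets[std::make_tuple(p.m, p.n, p.k)].push_back(i);
    }
    return buckets;
}
```

With many distinct shapes, this approach degenerates into one launch per bucket, which is the overhead a true grouped interface is meant to avoid.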
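cublasLtMatmulDesc_t matmulDesc;
// The second argument is a cublasComputeType_t and the third is the scale type for alpha/beta.
// FP16 operand types, if intended, are set on the matrix layouts rather than here.
cublasLtMatmulDescCreate(&matmulDesc, CUBLAS_COMPUTE_32F, CUDA_R_32F);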
LLM decoding serves a batch of requests, each with its own sequence length (and thus a different M dimension for the attention projections). Grouped GEMM can handle these variable-length attention matrices in a single pass.
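To make the semantics concrete, the following CPU reference shows what a grouped GEMM computes: a list of independent products whose shapes may differ from one entry to the next. It is a sketch for checking results, not a GPU implementation. The `GemmProblem` struct and `grouped_gemm_reference` function are hypothetical names, and row-major storage is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical descriptor for one row-major GEMM in the group:
// C (m x n) = A (m x k) * B (k x n).
struct GemmProblem {
    int m, n, k;
    const float* A;
    const float* B;
    float* C;
};

// CPU reference: every problem in the group may have a different shape.
// A GPU grouped GEMM would compute the same results in a single launch.
void grouped_gemm_reference(const std::vector<GemmProblem>& group) {
    for (const GemmProblem& p : group) {
        assert(p.A != nullptr && p.B != nullptr && p.C != nullptr);
        for (int i = 0; i < p.m; ++i) {
            for (int j = 0; j < p.n; ++j) {
                float acc = 0.0f;
                for (int l = 0; l < p.k; ++l) {
                    acc += p.A[i * p.k + l] * p.B[l * p.n + j];
                }
                p.C[i * p.n + j] = acc;
            }
        }
    }
}
```

In the variable-sequence-length case, each request would contribute one `GemmProblem` with its own `m`, while `n`, `k`, and the weight matrix `B` are shared across requests.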
int m = params.m, n = params.n, k = params.k;
float h_alpha = params.alpha;
void* workspace = nullptr;
size_t workspaceSize = 32 * …
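The prototype below illustrates the general shape of a grouped matmul interface. Treat the function name and the per-operation plan type as illustrative, and confirm them against the cuBLASLt headers shipped with your CUDA toolkit before relying on them.

cublasStatus_t cublasLtMatmulGrouped(
    cublasLtHandle_t handle,
    cublasLtMatmulAlgo_t const *algo,
    cublasLtMatmulDesc_t matmulDesc,
    void *alpha,
    void *beta,
    void *A[],        // array of pointers to A matrices
    void *B[],        // array of pointers to B matrices
    void *C[],        // array of pointers to C matrices
    void *D[],        // array of pointers to D matrices (often same as C)
    int32_t groupCount,
    cublasLtGroupedMatmulPlan_t groupPlan[]  // per-operation plans
);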
// 1. Create Handle
cublasLtHandle_t ltHandle;
cublasLtCreate(&ltHandle);
While the standard grouped GEMMs described in the NVIDIA documentation improve efficiency by launching multiple operations in one kernel, they often suffer from load imbalance when one large GEMM is grouped with many tiny ones. A load-balanced scheduler would dynamically reassign thread blocks across the GPU to ensure that all Streaming Multiprocessors (SMs) finish their work at roughly the same time.

Why this is a "Good Feature":
// Repeat for B and C...
This feature is critical for performance in modern AI workloads, particularly in Mixture of Experts (MoE) models, multi-head attention, or batched inference where matrix sizes vary or are too small to saturate the GPU individually.
The library internally sorts, fuses, and dispatches these operations to the GPU's tensor cores (or CUDA cores) in the most efficient order, minimizing launch overhead and maximizing utilization.
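The effect of dispatch order on utilization can be estimated with simple tile arithmetic. The sketch below is a simplified model, not a description of how cuBLASLt schedules work internally. It compares assigning each whole problem to one worker against dealing all output tiles from a single flattened queue. The function names, the fixed tile shape, and the notion of a "worker" standing in for an SM are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Number of output tiles needed to cover an m x n result with tile_m x tile_n tiles.
int tile_count(int m, int n, int tile_m, int tile_n) {
    assert(tile_m > 0 && tile_n > 0);
    return ((m + tile_m - 1) / tile_m) * ((n + tile_n - 1) / tile_n);
}

// Static assignment: problem i is processed entirely by worker i % workers.
// Returns the largest number of tiles any single worker must process.
int max_load_per_problem(const std::vector<int>& tiles, int workers) {
    assert(workers > 0);
    std::vector<int> load(static_cast<std::size_t>(workers), 0);
    for (std::size_t i = 0; i < tiles.size(); ++i) {
        load[i % static_cast<std::size_t>(workers)] += tiles[i];
    }
    return *std::max_element(load.begin(), load.end());
}

// Flattened assignment: all tiles from all problems form one queue that is
// dealt out evenly, so the busiest worker gets ceil(total / workers) tiles.
int max_load_flattened(const std::vector<int>& tiles, int workers) {
    assert(workers > 0);
    long total = 0;
    for (int t : tiles) total += t;
    return static_cast<int>((total + workers - 1) / workers);
}
```

For one 1024 x 1024 problem grouped with three 128 x 128 problems, 128 x 128 tiles, and 4 workers, the per-problem assignment leaves one worker with 64 tiles while the others get 1 each. The flattened queue caps the busiest worker at 17 tiles.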