make DeepGEMM swapAB available for linear gemm SM90 #2101
📌 Description
In FlashInfer we already have the fp8_gemm_kernel_swapAB kernel for optimizing Mixture-of-Experts (MoE) GEMM and dense GEMM operations (reference 1, reference 2, reference 3).
This kernel improves performance in small-batch scenarios by swapping the operand order in the matrix multiplication; the idea is sketched after the list below.
These kernels are currently used for:
- MoE operations (exposed via the fused_moe module)
- Dense GEMM: available in the codebase but not exposed for linear/dense layers
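To make the swap concrete, here is a minimal sketch (plain PyTorch, not FlashInfer code) of the identity the swapAB kernels exploit: when the batch dimension M is small, computing C = A·Bᵀ as (B·Aᵀ)ᵀ lets the large weight dimension feed the leading side of the tensor-core tile instead of the tiny batch dimension. Shapes and tolerances below are illustrative assumptions.

```python
import torch

# Small batch M, large weight dims N and K (typical linear-layer decode shape).
M, N, K = 4, 4096, 4096
A = torch.randn(M, K)   # activations
B = torch.randn(N, K)   # weights, row-major (NT layout)

# Standard layout: the M side of the output tile is mostly idle when M is small.
C_standard = A @ B.T        # [M, N]

# swapAB layout: compute with operands swapped, then transpose the result.
C_swapped = (B @ A.T).T     # also [M, N], mathematically identical

torch.testing.assert_close(C_standard, C_swapped, rtol=1e-4, atol=1e-2)
```

The kernel-level benefit comes from how the swapped operand order maps onto SM90 WGMMA tile shapes; the identity above is only the algebraic justification.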
This PR aims to make the DeepGEMM swapAB path available for linear/dense GEMM on SM90. Files touched:
- csrc/fp8_blockscale_gemm_sm90_binding.cu
- flashinfer/jit/gemm/
- flashinfer/gemm/gemm_base.py
- flashinfer/tests/gemm/test_fp8_blockscale_gemm.py
- TODO
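As a rough illustration of how the exposed dense path might be driven from Python, the sketch below prepares DeepGEMM-style blockscaled fp8 inputs (1x128 activation scales, 128x128 weight block scales). The entry point name and signature in the commented call are hypothetical placeholders, not the API added by this PR; see flashinfer/gemm/gemm_base.py and the new test for the real interface.

```python
import torch

# Small-batch shape where the swapAB path is expected to help.
M, N, K = 8, 4096, 4096
a = torch.randn(M, K, device="cuda", dtype=torch.bfloat16)
b = torch.randn(N, K, device="cuda", dtype=torch.bfloat16)

# DeepGEMM-style block scaling: per-token 1x128 scales for activations,
# 128x128 block scales for weights. Unit scales here are dummy values;
# a real caller would quantize the tensors and compute the scales properly.
a_fp8 = a.to(torch.float8_e4m3fn)
b_fp8 = b.to(torch.float8_e4m3fn)
a_scale = torch.ones(M, K // 128, device="cuda", dtype=torch.float32)
b_scale = torch.ones(N // 128, K // 128, device="cuda", dtype=torch.float32)

# Hypothetical call (name and signature are placeholders, not this PR's API):
# out = flashinfer.gemm.fp8_blockscale_gemm_swapab(
#     a_fp8, b_fp8, a_scale, b_scale, out_dtype=torch.bfloat16
# )
```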
🔍 Related Issues
- vLLM #28427
- vLLM #28316
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- I have installed pre-commit by running `pip install pre-commit` (or used your preferred method).
- I have installed the hooks with `pre-commit install`.
- I have run `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests
- Tests have been added or updated as needed.
- All tests are passing (unittest, etc.).

Reviewer Notes