[Refactor] Introduce quantize components of TileLang and add testing for dequant gemm example #494
This pull request makes changes across multiple files to improve modularity and add new features for dequantization and flash attention operations. The most significant updates are new examples for dequantization and flash attention, refactoring of existing scripts to improve usability, and updates to the `tilelang` library to support these operations.

New Examples and Features:

Dequantization Enhancements:
- Added `examples/dequantize_gemm/example_dequant_gemv_fp16xint4.py` for dequantized GEMV operations with support for various configurations, including fast decoding and scaling. It includes a complete implementation of the `dequantize_gemv` function and a testable `main` function.
- Refactored `examples/dequantize_gemm/example_dequant_gemm_fp4_hopper.py` to encapsulate its logic within a `main` function, improving modularity. The script now accepts parameters via the `main` function or command-line arguments.

Flash Attention Enhancements:
- Added `examples/flash_attention/example_mha_fwd_bhsd_wgmma_pipelined.py` for flash attention with pipelined execution. This implementation includes kernel macros for matrix multiplication, softmax, and rescaling, along with support for autotuning configurations.

Testing and Licensing:
- Added `examples/dequantize_gemm/test_example_dequantize_gemm.py` to validate the dequantization examples across different configurations.

Library Updates:
- Updated `tilelang/quantize/__init__.py` to expose additional utility functions and intrinsics for quantization, such as `interleave_weight` and `get_lop3_intrin_group`.
- Adjusted `DataType` in `tilelang/__init__.py` to resolve potential issues with type handling.
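
For reviewers less familiar with what a dequantized fp16 x int4 GEMV computes, the following pure-Python sketch illustrates the reference semantics only. It does not call TileLang and is not the implementation in this PR. The packing layout (two unsigned 4-bit values per byte, low nibble first), the zero point of 8, and per-row scaling are assumptions chosen for illustration, and the names `unpack_int4` and `dequant_gemv` are hypothetical.

```python
# Reference semantics only: pure Python, no TileLang calls.
# Layout, zero point, and per-row scales below are illustrative assumptions.


def unpack_int4(packed):
    """Split each byte into two unsigned 4-bit values, low nibble first."""
    values = []
    for byte in packed:
        values.append(byte & 0x0F)         # low nibble
        values.append((byte >> 4) & 0x0F)  # high nibble
    return values


def dequant_gemv(x, packed_rows, scales, zero_point=8):
    """Compute y[n] = scales[n] * sum_k x[k] * (q[n][k] - zero_point).

    x           : list of K floats (the activation vector)
    packed_rows : N rows, each K // 2 bytes holding K packed 4-bit weights
    scales      : one float per output row (assumed per-row scaling)
    zero_point  : assumed offset mapping unsigned 0..15 to signed -8..7
    """
    y = []
    for row, scale in zip(packed_rows, scales):
        q = unpack_int4(row)
        if len(q) != len(x):
            raise ValueError("each packed row must hold len(x) 4-bit weights")
        acc = sum(xi * (qi - zero_point) for xi, qi in zip(x, q))
        y.append(acc * scale)
    return y


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    # Row 0 unpacks to [8, 9, 10, 11]; row 1 unpacks to all 8s (i.e. zeros).
    packed = [bytes([0x98, 0xBA]), bytes([0x88, 0x88])]
    print(dequant_gemv(x, packed, scales=[0.5, 1.0]))  # [10.0, 0.0]
```

A reference like this is mainly useful as a correctness oracle when testing a fused kernel: the kernel output can be compared against it on small inputs, within a floating-point tolerance.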