[Triton] DS fused custom ops #607

k50112113 · 2025-07-02T19:00:37Z

This PR contains the following:

Added fused elementwise multiply+addition in aiter/ops/triton/fused_elementwise.py
Added fused RoPE+concat in aiter/ops/triton/fused_concat.py
Added fused flatten+mxfp4 quantization and fused RMSnorm+addition+mxfp4 quantization in aiter/ops/triton/fused_quant.py
Fixed the config lookup in aiter/ops/triton/gemm_a16w16.py to use "K" instead of "2*K"
Added output options and batch size checks in aiter/ops/triton/batched_gemm_afp4wfp4_pre_quant.py
Added an atomic-add version of the bf16 GEMM in aiter/ops/triton/gemm_a16w16_atomic.py
Updated the Triton mxfp4 GEMM and bf16 GEMM configs
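
For readers unfamiliar with the mxfp4 format targeted by the fused quantization kernels, here is a rough pure-Python sketch of what block quantization computes numerically. It is not the code in aiter/ops/triton/fused_quant.py. It assumes the OCP Microscaling layout (blocks of 32 elements sharing one power-of-two scale, each element stored as E2M1), and the names quantize_block and quantize_mxfp4 are hypothetical. Rounding ties go to the smaller magnitude here, which is a simplification; the Triton kernels may round differently.

```python
import math

# Magnitudes representable by one E2M1 element (sign is stored separately).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 32      # elements that share one scale in MXFP4
E2M1_EMAX = 2   # exponent of the largest E2M1 power of two (4.0 == 2**2)


def quantize_block(block):
    """Return (shared_exponent, dequantized_values) for one block of floats.

    Hypothetical reference helper, not part of aiter.
    """
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # frexp gives amax == m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) == e - 1.
    _, e = math.frexp(amax)
    shared_exp = (e - 1) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    out = []
    for v in block:
        # Scale into the E2M1 range and saturate at the largest magnitude (6.0).
        mag = min(abs(v) / scale, E2M1_VALUES[-1])
        # Nearest representable magnitude; min() keeps the first (smaller) on ties.
        q = min(E2M1_VALUES, key=lambda c: abs(c - mag))
        out.append(math.copysign(q * scale, v))
    return shared_exp, out


def quantize_mxfp4(values):
    """Quantize a flat list block by block; the last block may be shorter."""
    exps, deq = [], []
    for start in range(0, len(values), BLOCK):
        exp, block_out = quantize_block(values[start:start + BLOCK])
        exps.append(exp)
        deq.extend(block_out)
    return exps, deq


if __name__ == "__main__":
    exps, deq = quantize_mxfp4([12.0, -1.0, 9.0, 0.3] + [0.0] * 28)
    print(exps, deq[:4])
```

A reference like this is mainly useful as a ground truth in tests: a fused kernel that flattens or normalizes its input and then quantizes it should produce values that dequantize to roughly the same result as running the unfused steps followed by this routine, up to rounding-mode differences.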

aiter/ops/triton/fused_concat.py

aiter/ops/triton/fused_elementwise.py

aiter/ops/triton/fused_quant.py

aiter/ops/triton/gemm_a16w16_atomic.py

rahulbatra85

Please see comments. Thanks!

…mains for batched gemm

* add fused concat
* add fused elementwise and pytest
* clean up
* add fused quant
* clean up
* add fused mxfp4 quant and pytest
* add gemm_a16w16_atomic and related tests
* extend tests and add config files for bf16 GEMM with atomic add
* fused quant. code cleanup
* formatting changes
* add rms norm quant tests and changes
* update/add DS configs for GEMMs
* add prequant config
* fix bf16 gemm
* fix fp4 gemm atomic add
* black reformatting
* tune fp4 prequant gemm atomic
* fix batched fp4 prequant and add new DS configs
* optimize shapes for deepseek
* black reformatting
* update bf16 atomic GEMM
* add documentation for fused_concat
* update rope qk cat
* rename fused elementwise
* add doc for fused quant
* rename fused_quant into fused_mxfp4_quant
* fix typo
* fix a minor bug
* black formatting
* update comments and drop unused arg.
* fix pre-quant GEMM tests based on func. sign. changes
* fix pre-quant GEMM tests based on func. sign. changes
* doc on fused_mul_add
* corner case fix
* corner case fix
* test
* Fix pytest errors with LRU cache inplace mod - big test case error remains for batched gemm
* Black linting change
* Fix linting error
* Move inplace config operations to _get_config

---------

Co-authored-by: Mehmet Cagri Kaymak <mehmet.kaymak@amd.com>
Co-authored-by: Lukasz Burzawa <lukasz.burzawa@amd.com>
Co-authored-by: William Zhou <William.Zhou@amd.com>

k50112113 and others added 20 commits July 2, 2025 18:41

add fused concat

95d0bec

add fused elementwise and pytest

3d551f7

clean up

38d2366

add fused quant

ae8b17e

clean up

2e7c081

add fused mxfp4 quant and pytest

ea161bc

add gemm_a16w16_atomic and related tests

9e29cca

extend tests and add config files for bf16 GEMM with atomic add

4a868f9

fused quant. code cleanup

26f468a

formatting changes

d028a7e

add rms norm quant tests and changes

aa7fe7c

update/add DS configs for GEMMs

a128d5b

add prequant config

8bec97a

fix bf16 gemm

2fd002e

fix fp4 gemm atomic add

7e1ff03

black reformatting

f14db5a

tune fp4 prequant gemm atomic

5534a9a

fix batched fp4 prequant and add new DS configs

d0e8c37

optimize shapes for deepseek

8d7fda5

black reformatting

2ccc45b

k50112113 requested review from azaidy and rahulbatra85 July 2, 2025 19:07

cagrikymk added 2 commits July 8, 2025 15:15

update bf16 atomic GEMM

302b717

Merge branch 'main' into shaoclee/ds_fused_custom_ops

c0e7ff6