
Conversation

@johnnynunez

Fix CUDA barrier init crash when num_consumers < NumThreadsPerWarpGroup

Previously, integer division caused num_consumer_warpgroups_per_cluster to be 0 when params.num_consumers (e.g., 32) was less than NumThreadsPerWarpGroup (128), leading to a compiler failure during barrier initialization. Changed to round-up division to ensure a minimum value of 1.
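
For illustration only (the actual change is in the CUDA host code), a minimal Python sketch of the arithmetic, assuming NumThreadsPerWarpGroup = 128 as stated above:

```python
NUM_THREADS_PER_WARPGROUP = 128  # value cited in the description above

def warpgroups_floor(num_consumers: int) -> int:
    # Old behaviour: floor division collapses to 0 when num_consumers < 128,
    # which later sizes the consumer barrier for zero warpgroups.
    return num_consumers // NUM_THREADS_PER_WARPGROUP

def warpgroups_round_up(num_consumers: int) -> int:
    # Fixed behaviour: round-up division guarantees at least 1 warpgroup.
    return (num_consumers + NUM_THREADS_PER_WARPGROUP - 1) // NUM_THREADS_PER_WARPGROUP

print(warpgroups_floor(32), warpgroups_round_up(32))    # 0 1
print(warpgroups_floor(256), warpgroups_round_up(256))  # 2 2
```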

tmm1 and others added 30 commits January 29, 2025 13:27
tridao and others added 24 commits August 12, 2025 14:51
…Lab#1795)

When the parameter `cache_seqlen` is a scalar, it should be expanded to a
vector of shape (batch_size). In the original code, whenever `block_table`
is used, the shape of `k_cache` is (num_blocks, page_size, ...), so
`cache_seqlen` was expanded to shape (num_blocks) instead of
(batch_size), which is wrong. This fix uses the shape of `q`, whose
leading dimension is always `batch_size`.
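
A rough Python sketch of the intended behaviour (the helper name is hypothetical; `q`, `k_cache`, and `cache_seqlens` follow the flash-attention kvcache interface, and the exact repo code may differ):

```python
import torch

def expand_cache_seqlens(cache_seqlens, q, k_cache):
    # A scalar cache_seqlens must be broadcast to one entry per batch element.
    if isinstance(cache_seqlens, int):
        batch_size = q.shape[0]  # q is (batch_size, ...) even when block_table is used
        # The buggy variant used k_cache.shape[0], which is num_blocks (not
        # batch_size) when k_cache is paged as (num_blocks, page_size, ...).
        cache_seqlens = torch.full(
            (batch_size,), cache_seqlens, dtype=torch.int32, device=q.device
        )
    return cache_seqlens
```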
Actually doesn't seem to make it faster
* use LPT order in varlen kernel

* add prefill decode benchmark script

* add sort in prepare

* add full implementation:

* add varlen kvhead swizzle

* add settings for swizzle ablation

* add correction term for sort when causal

* remove ablation options from frontend and clean up comments

* add comments in prepare kernel

* remove debug code and scripts

* put back defaults in tests

* remove excess Nones returned in python interface for varlen

* revert opinionated change to setup.py on cuda version 12.9

* force inline sort op and make east const

* more templating in varlen scheduler to cure some register spilling

* fix exploding build by splitting compilation and add qol macros for hdimdiff

* fix metadata mismatch with seqlenk in test script

* extend prepare kernel to >992 batches and always call it for varlen

* do inter-batch sort per every 992 batches

* better names in combine and fix prepare condition in api
Corrects comment documentation to reference total_q instead of total_k for the output tensor dimensions, ensuring consistency with the actual parameter being described.
When testing the deterministic option for the GQA case, we found it fell into a deadlock. Initializing dk_semaphore and dv_semaphore to zeros fixes this issue.
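
A minimal sketch of the idea behind the fix; the helper name, tensor shapes, and allocation site here are illustrative assumptions, not the repo's exact code:

```python
import torch

def alloc_bwd_semaphores(batch_size: int, num_heads_kv: int, device):
    # The deterministic GQA backward path spins on these counters, so they
    # must start at zero; uninitialized memory (torch.empty) can contain
    # stale non-zero values that stall the handshake and deadlock the kernel.
    dk_semaphore = torch.zeros(batch_size, num_heads_kv, dtype=torch.int32, device=device)
    dv_semaphore = torch.zeros(batch_size, num_heads_kv, dtype=torch.int32, device=device)
    return dk_semaphore, dv_semaphore
```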
* ci: Move build job to workflow template

Signed-off-by: oliver könig <okoenig@nvidia.com>

* check out right tag

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* revert

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci: Move build job to workflow template

Signed-off-by: oliver könig <okoenig@nvidia.com>

* check out right tag

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* revert

Signed-off-by: oliver könig <okoenig@nvidia.com>

* ci: Allow build/deploy of arbitrary configurations (Dao-AILab#1827)

* ci: Allow build/deploy of arbitrary configurations

Signed-off-by: oliver könig <okoenig@nvidia.com>

* add

Signed-off-by: oliver könig <okoenig@nvidia.com>

* cleanui

Signed-off-by: oliver könig <okoenig@nvidia.com>

* cxx11_abi

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* final

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>

* upload

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
* lse output

* style

* style

* revert test changes, introduce optional kwarg to output lse
* [BugFix] fix softcap condition

softcap should only be referenced when it's not None; currently the logic is reversed and results in an error

* [BugFix] fix sm80 cuteDSL error


1. The current condition on softcap is wrong and results in a RuntimeError. Change the code to align with sm_100 (a sketch of the corrected logic follows this list).
2. Make window_size_left and window_size_right optional to align with sm_100 and all other interfaces.
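
A minimal, hypothetical Python sketch of the corrected guards; the helper name and the -1 defaults are illustrative assumptions, not the actual sm80 cuteDSL code:

```python
def normalize_attention_kwargs(softcap=None, window_size_left=None, window_size_right=None):
    # softcap is only consulted when it is actually provided; the reversed
    # check (reading it when it *is* None) is what raised the RuntimeError.
    apply_softcap = softcap is not None and softcap > 0.0
    # Optional window sizes default to "no local window", matching the other
    # interfaces that treat -1 as unbounded.
    if window_size_left is None:
        window_size_left = -1
    if window_size_right is None:
        window_size_right = -1
    return apply_softcap, window_size_left, window_size_right
```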

* Fix typo of range_constexpr

* Fix seqlen
…e DSL (Dao-AILab#1858)

* update num_threads based on num wgs

* fix bug when not intra_wg_overlap and not mma_pv_is_rs
Fix CUDA barrier init crash when num_consumers < NumThreadsPerWarpGroup

Previously, integer division caused num_consumer_warpgroups_per_cluster to be 0
when params.num_consumers (e.g., 32) was less than NumThreadsPerWarpGroup (128),
leading to a compiler failure during barrier initialization. Changed to round-up
division to ensure a minimum value of 1.
* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* drop 12.4

* drop 12.4

* fix correct name

* fix correct name

* fix correct name

* fix correct name

* cibuildwheel.yml
@seemethere

Do we have a test for this?

Sorry, it's hard to grok exactly what this is changing from the one-line change.

I also don't have access to merge PRs in this repository unfortunately.

@johnnynunez
Author

Do we have a test for this?

Sorry, it's hard to grok exactly what this is changing from the one-line change.

I also don't have access to merge PRs in this repository unfortunately.

This is the same fix I made here: Dao-AILab@afc97c6

nvcc from CUDA 13 raises an error on that line, so the minimum value should be 1.

@johnnynunez
Author

Oh... it was using the original FA.

sfc-gh-yewang pushed a commit to sfc-gh-yewang/flash-attention that referenced this pull request Oct 13, 2025
* Enable Fwd and Backward

Enable Fwd and Backward

Enable Fwd and Backward

Enable fwd and varlen_fwd on AMD  (vllm-project#63)

* flash_attn_func works

Compress

This is a combination of 12 commits.

add scripts

save

add our kernel

import our kernel

round trip

use bshd layout

figure out segfault

fix

show backward failure with prints

save backward work

run forward only

test smallest config on everything

add test

fix

remove pre commit

install triton

skip dropout

pin d

32 factor d

just run power of 2

remove timeout

run serially

clean up

clean up 2

* Varlen works

This is a combination of 6 commits.

save

some tests passing

enable more

enable everything

move around

alibi works

* keep interface and kernel separate

* clean up

enable flash_attn_with_kvcache (vllm-project#68)

* Compress kvcache work

This is a combination of 11 commits.

kvcache work

This is a combination of 4 commits.

kvcache is not supported

save

save decode

save

clean up merge

save cases

save

save

save

save

key mask on triton side

fix q size issue

test combos

save

* fix causal. use cache_seqlens

* clean and test what works

* some configs work on new_kv but fails on 1,8

* cache overwrite correct

* new_kv works more or less

* test local

* work on paged kv attention

* prefill paged attention

* fix has_batch_idx and skip local and rotary emb

* save

* save

* save

* save

* handle new_kv when paged kv cache

* all except has_batch_idx works

* major options are green

* test all

* add tests

* save

* clean up

* minor clean up

* simplest config

* save debug true

* save

* refactor slightly

* save work

* need key masking

* force hip

* use is_hip

* save

* fix cache_seq_len issue

* work on new_kv

* pass new_kv data

* save

* benchmark fwd only

* disable debug

* pandas pdf

* save

* set methods

* record number of heads

* use configs

* flexible dim, n-heads, headofdim

* better benchmarking

* basic inplace update working

* works upto 64

* new_kv supported!

* test case for has_batch_idx

* has_batch_idx works!

* save

* save

* save

* save ref

* fix mqa and gqa by duplicating

* GQA and MQA working by kernel modifications

* fix new_kv with gqa

* cache index

* deal with nans on fwd_splitk

* save

* causal working on basic case

* causal works!

* alibi works!

* clean up

* clean prefill changes

* remove bwd stuff

* limit decode test to test_op_fwd

* add ref

* use bfloat

Fixes after rebase

Fixes after rebase

rebase fixes

deal with kvcache failure

new run for branch

cancel-in-progress

fix varlen_fwd bug

enable packed layouts and all configs (vllm-project#72)

Clean up for Upstream (vllm-project#81)

* Clean

Clean

This is a combination of 4 commits.

clean 1

clean 2

clean more

match main

typo fix

* use is_hip()

* clean up more

* skip odd d only

* fix bug

* skip randomly

* use Flag

* update readme

* remove quantization

* remove bwd

* minor

* print

* remove verbose print

* quantize zeroes out the d stride

Enable Vanilla Bwd and Refactor (vllm-project#86)

* Vanilla BWD

Vanilla BWD

This is a combination of 79 commits.

save test_flash_attn_output

use impl functions

pass layout

add ref

move around impls

fix stride issue

save oai kernel

add baseline impl

save bwd kernel working

remove old impl

remove block_ptrs from bwd

pass padded dmodel and apply masking. the old test cases work but cases with small d don't work

save

save

more prints

rename M to L

save

add notes

add old_bwd back

fa failure fails in kernels too

isolate new bwd and keep old bwd in place

clean up

softmax_lse does not match reference

LOG flag

softmax_lse with LN2

move qk_scale to loop

pass ln2 to fwd

just print kernel input

test softmax output from forward

test exp_scores_triton

save all the ref

create ref USE_EXP2 path

return scores

mask scores when returning them. Basic impl test passes

scores and output match

show max_diff

return score needs to be adjusted as we find new maxes

all good outputs. old style RCP2 example

prep bwd_impl test

save

try openai

save

fix softmax_lse bug

test_op_bwd_impl starting to work!

new kernel. exp2 works but exp is failing

fix bwd exp2

add m and n masks. small cases still don't work

match old and new kernel prints

compare old and new

print inputs

save

old kernel match on dv

dq works

compare to pytorch including softmax in forward

fix bwd impl bug

small sizes in bwd impl work

old bwd test pass. Moving on to kernel tests

dq, dk and dv are filled in place if given. Need to match cast to match fa

fix non bug

fix dv mismatch. use_exp2 was set to true in fwd

fix case up 128

refactor and clean up a bit more

issue is that dq and dk are not zeros

dq must be zeroed out

ignore segfaults

fa ref and my ref match!

all tests run

use tolerance 1e-3

we need to figure out preprocessing

save

clean up

save

test delta diff

move old impl out

new preprocess function

preprocessing_use_o flag

working _bwd_preprocess_use_p

basic cases pass

all green

fwd exp2 usage is done right before exp

* refactor

* refactor 2

* refactor 3

* fix bug

* try ci

* add flag

* rename to utils

* skip test_op_fwd_decode_int4_kv

* reduce head size

* try again

* go back to old head sizes

* Use Strides

Use Strides

This is a combination of 11 commits.

use strides in bwd

add layout test in forward

fix shape layout function

smaller tests

save

fix varlen error

no headsize passed to bwd

deal with varlen layout

save

save

save

save

* use gen scripts

* varlen fwd passing

* core fwd ref impl

* fix minor bugs

* wrap varlen- launcher attention_forward_pytorch_ref_impl

* varlen backward ref added

* add offsets for varlen

* fix delta bug

* varlen bwd working

* save

* runs on Mi200

* just test basics

* save

* fix bug

* fix varlen int64 bug

* add ref

* test_impl working with causal

* fix qkvpacked issue

* qkvpacked run tests

* remove test_backward

* save

* just test output

* dump into tensors

* softmaxlse layout for varlen

* small cases working

* bwd thd green. although maybe some oom

* forward out and lse are good. Something wrong with backward ref

* make varlen ref work

* save work, ref is working mostly

* 91 failed, 6542 passed, 6336 skipped, 1 warning

* ref is all green

* debug flag in utils

* found bad softmax_lse in varlen fwd

* fix bug in softmax lse. strides in varlen were not right

* add causal tests and 32*32 bwd does not have segfault

* save

* fix oom by reducing block size for small heads

* bwd ref with causal working

* test impl

* causal test passes

* causal working

* fix tests

* nicer bench

* fix qvpacked error

* fix varlen qvpacked bug

* fix minor bug

* bench prefill and prefill_old using the same script

* autotune configs for fwd

* autotune flag

* clean up decode impl

* clean up

* clean up more

* bench everything by default and return time

* clean up readmes

REBASE: fix interface changes in rebase

rename test to test_flash_attn_triton_amd

REBASE: fix unpad diffs

minor clean up in setup

FLASH_ATTENTION_TRITON_AMD flags

bench fwd and bwd

fix sequence_parallel

* clean up

* Enable sequence_parallel in bwd (vllm-project#89)

* sequence_parallel working on bwd_impl test

* fix qkv error

* save

* save

* save

* bwd 3 times faster

* clean up

* fix varlen bug

* use copy back dict

* fix qkvpacked bug

* reduce bench sizes

* print copy back

* clean more

* Autotune off by default

* update Triton commit readme (vllm-project#92)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.