
Conversation

@johnnynunez

Fix CUDA barrier init crash when num_consumers < NumThreadsPerWarpGroup

Previously, integer division caused num_consumer_warpgroups_per_cluster to be 0 when params.num_consumers (e.g., 32) was less than NumThreadsPerWarpGroup (128), leading to a compiler failure during barrier initialization. Changed to round-up division to ensure a minimum value of 1.
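
For illustration only (the actual change is in the CUDA host code), a minimal Python sketch of the arithmetic, assuming NumThreadsPerWarpGroup = 128 as stated above:

```python
NUM_THREADS_PER_WARPGROUP = 128  # value cited in the description above

def warpgroups_floor(num_consumers: int) -> int:
    # Old behaviour: floor division collapses to 0 when num_consumers < 128,
    # which later sizes the consumer barrier for zero warpgroups.
    return num_consumers // NUM_THREADS_PER_WARPGROUP

def warpgroups_round_up(num_consumers: int) -> int:
    # Fixed behaviour: round-up division guarantees at least 1 warpgroup.
    return (num_consumers + NUM_THREADS_PER_WARPGROUP - 1) // NUM_THREADS_PER_WARPGROUP

print(warpgroups_floor(32), warpgroups_round_up(32))    # 0 1
print(warpgroups_floor(256), warpgroups_round_up(256))  # 2 2
```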

tmm1 and others added 30 commits January 29, 2025 13:27
tridao and others added 24 commits August 12, 2025 14:51
…Lab#1795)

When the parameter `cache_seqlen` is a scalar, it should be expanded to a
vector of shape (batch_size). In the original code, whenever `block_table`
is used, the shape of `k_cache` is (num_blocks, page_size, ...), so
`cache_seqlen` was expanded to shape (num_blocks) instead of
(batch_size), which is wrong. This fix uses the shape of `q`, whose
leading dimension is always `batch_size`.
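
A rough Python sketch of the intended behaviour (the helper name is hypothetical; `q`, `k_cache`, and `cache_seqlens` follow the flash-attention kvcache interface, and the exact repo code may differ):

```python
import torch

def expand_cache_seqlens(cache_seqlens, q, k_cache):
    # A scalar cache_seqlens must be broadcast to one entry per batch element.
    if isinstance(cache_seqlens, int):
        batch_size = q.shape[0]  # q is (batch_size, ...) even when block_table is used
        # The buggy variant used k_cache.shape[0], which is num_blocks (not
        # batch_size) when k_cache is paged as (num_blocks, page_size, ...).
        cache_seqlens = torch.full(
            (batch_size,), cache_seqlens, dtype=torch.int32, device=q.device
        )
    return cache_seqlens
```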
Actually doesn't seem to make it faster
* use LPT order in varlen kernel

* add prefill decode benchmark script

* add sort in prepare

* add full implementation:

* add varlen kvhead swizzle

* add settings for swizzle ablation

* add correction term for sort when causal

* remove ablation options from frontend and clean up comments

* add comments in prepare kernel

* remove debug code and scripts

* put back defaults in tests

* remove excess Nones returned in python interface for varlen

* revert opinionated change to setup.py on cuda version 12.9

* force inline sort op and make east const

* more templating in varlen scheduler to cure some register spilling

* fix exploding build by splitting compilation and add qol macros for hdimdiff

* fix metadata mismatch with seqlenk in test script

* extend prepare kernel to >992 batches and always call it for varlen

* do inter-batch sort per every 992 batches

* better names in combine and fix prepare condition in api
Corrects comment documentation to reference total_q instead of total_k for the output tensor dimensions, ensuring consistency with the actual parameter being described.
When testing the deterministic option for the GQA case, we found it fell into a deadlock. Initializing dk_semaphore and dv_semaphore to zeros fixes this issue.
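
A minimal sketch of the idea behind the fix; the helper name, tensor shapes, and allocation site here are illustrative assumptions, not the repo's exact code:

```python
import torch

def alloc_bwd_semaphores(batch_size: int, num_heads_kv: int, device):
    # The deterministic GQA backward path spins on these counters, so they
    # must start at zero; uninitialized memory (torch.empty) can contain
    # stale non-zero values that stall the handshake and deadlock the kernel.
    dk_semaphore = torch.zeros(batch_size, num_heads_kv, dtype=torch.int32, device=device)
    dv_semaphore = torch.zeros(batch_size, num_heads_kv, dtype=torch.int32, device=device)
    return dk_semaphore, dv_semaphore
```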
* ci: Move build job to workflow template

Signed-off-by: oliver könig <okoenig@nvidia.com>

* check out right tag

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* revert

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci: Move build job to workflow template

Signed-off-by: oliver könig <okoenig@nvidia.com>

* check out right tag

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* revert

Signed-off-by: oliver könig <okoenig@nvidia.com>

* ci: Allow build/deploy of arbitrary configurations (Dao-AILab#1827)

* ci: Allow build/deploy of arbitrary configurations

Signed-off-by: oliver könig <okoenig@nvidia.com>

* add

Signed-off-by: oliver könig <okoenig@nvidia.com>

* cleanui

Signed-off-by: oliver könig <okoenig@nvidia.com>

* cxx11_abi

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* final

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>

* upload

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
* lse output

* style

* style

* revert test changes, introduce optional kwarg to output lse
* [BugFix] fix softcap condition

softcap should only be referenced when it's not None; currently the logic is reversed and results in an error

* [BugFix] fix sm80 cuteDSL error


1. The current condition on softcap is wrong and results in a RuntimeError. Change the code to align with sm_100 (a sketch of the corrected logic follows this list).
2. Make window_size_left and window_size_right optional to align with sm_100 and all other interfaces.
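
A minimal, hypothetical Python sketch of the corrected guards; the helper name and the -1 defaults are illustrative assumptions, not the actual sm80 cuteDSL code:

```python
def normalize_attention_kwargs(softcap=None, window_size_left=None, window_size_right=None):
    # softcap is only consulted when it is actually provided; the reversed
    # check (reading it when it *is* None) is what raised the RuntimeError.
    apply_softcap = softcap is not None and softcap > 0.0
    # Optional window sizes default to "no local window", matching the other
    # interfaces that treat -1 as unbounded.
    if window_size_left is None:
        window_size_left = -1
    if window_size_right is None:
        window_size_right = -1
    return apply_softcap, window_size_left, window_size_right
```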

* Fix typo of range_constexpr

* Fix seqlen
…e DSL (Dao-AILab#1858)

* update num_threads based on num wgs

* fix bug when not intra_wg_overlap and not mma_pv_is_rs
Fix CUDA barrier init crash when num_consumers < NumThreadsPerWarpGroup

Previously, integer division caused num_consumer_warpgroups_per_cluster to be 0
when params.num_consumers (e.g., 32) was less than NumThreadsPerWarpGroup (128),
leading to a compiler failure during barrier initialization. Changed to round-up
division to ensure a minimum value of 1.
* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* drop 12.4

* drop 12.4

* fix correct name

* fix correct name

* fix correct name

* fix correct name

* cibuildwheel.yml
@seemethere

Do we have a test for this?

Sorry, it's hard to grok exactly what this is changing from the one-line change.

I also don't have access to merge PRs in this repository unfortunately.

@johnnynunez
Author

Do we have a test for this?

Sorry, it's hard to grok exactly what this is changing from the one-line change.

I also don't have access to merge PRs in this repository unfortunately.

This is the same fix I made here: Dao-AILab@afc97c6

nvcc from CUDA 13 raises an error on that line, so the minimum value should be 1.

@johnnynunez
Author

Oh... it was using the original FA.

sfc-gh-yewang pushed a commit to sfc-gh-yewang/flash-attention that referenced this pull request Oct 13, 2025
* Enable Fwd and Backward

Enable Fwd and Backward

Enable Fwd and Backward

Enable fwd and varlen_fwd on AMD  (vllm-project#63)

* flash_attn_func works

Compress

This is a combination of 12 commits.

add scripts

save

add our kernel

import our kernel

round trip

use bshd layout

figure out segfault

fix

show backward failure with prints

save backward work

run forward only

test smallest config on everything

add test

fix

remove pre commit

install triton

skip dropout

pin d

32 factor d

just run power of 2

remove timeout

run serially

clean up

clean up 2

* Varlen works

This is a combination of 6 commits.

save

some tests passing

enable more

enable everything

move around

alibi works

* keep interface and kernel separate

* clean up

enable flash_attn_with_kvcache (vllm-project#68)

* Compress kvcache work

This is a combination of 11 commits.

kvcache work

This is a combination of 4 commits.

kvcache is not supported

save

save decode

save

clean up merge

save cases

save

save

save

save

key mask on triton side

fix q size issue

test combos

save

* fix causal. use cache_seqlens

* clean and test what works

* some configs work on new_kv but fails on 1,8

* cache overwrite correct

* new_kv works more or less

* test local

* work on paged kv attention

* prefill paged attention

* fix has_batch_idx and skip local and rotary emb

* save

* save

* save

* save

* handle new_kv when paged kv cache

* all except has_batch_idx works

* major options are green

* test all

* add tests

* save

* clean up

* minor clean up

* simplest config

* save debug true

* save

* refactor slightly

* save work

* need key masking

* force hip

* use is_hip

* save

* fix cache_seq_len issue

* work on new_kv

* pass new_kv data

* save

* benchmark fwd only

* disable debug

* pandas pdf

* save

* set methods

* record number of heads

* use configs

* flexible dim, n-heads, headofdim

* better benchmarking

* basic inplace update working

* works upto 64

* new_kv supported!

* test case for has_batch_idx

* has_batch_idx works!

* save

* save

* save

* save ref

* fix mqa and gqa by duplicating

* GQA and MQA working by kernel modifications

* fix new_kv with gqa

* cache index

* deal with nans on fwd_splitk

* save

* causal working on basic case

* causal works!

* alibi works!

* clean up

* clean prefill changes

* remove bwd stuff

* limit decode test to test_op_fwd

* add ref

* use bfloat

Fixes after rebase

Fixes after rebase

rebase fixes

deal with kvcache failure

new run for branch

cancel-in-progress

fix varlen_fwd bug

enable packed layouts and all configs (vllm-project#72)

Clean up for Upstream (vllm-project#81)

* Clean

Clean

This is a combination of 4 commits.

clean 1

clean 2

clean more

match main

typo fix

* use is_hip()

* clean up more

* skip odd d only

* fix bug

* skip randomly

* use Flag

* update readme

* remove quantization

* remove bwd

* minor

* print

* remove verbose print

* quantize zeroes out the d stride

Enable Vanilla Bwd and Refactor (vllm-project#86)

* Vanilla BWD

Vanilla BWD

This is a combination of 79 commits.

save test_flash_attn_output

use impl functions

pass layout

add ref

move around impls

fix stride issue

save oai kernel

add baseline impl

save bwd kernel working

remove old impl

remove block_ptrs from bwd

pass padded dmodel and apply masking. the old test cases work but cases with small d don't work

save

save

more prints

rename M to L

save

add notes

add old_bwd back

fa failure fails in kernels too

isolate new bwd and keep old bwd in place

clean up

softmax_lse does not match reference

LOG flag

softmax_lse with LN2

move qk_scale to loop

pass ln2 to fwd

just print kernel input

test softmax output from forward

test exp_scores_triton

save all the ref

create ref USE_EXP2 path

return scores

mask scores when returning them. Basic impl test passes

scores and output match

show max_diff

return score needs to be adjusted as we find new maxes

all good outputs. old style RCP2 example

prep bwd_impl test

save

try openai

save

fix softmax_lse bug

test_op_bwd_impl starting to work!

new kernel. exp2 works but exp is failing

fix bwd exp2

add m and n masks. small cases still don't work

match old and new kernel prints

compare old and new

print inputs

save

old kernel match on dv

dq works

compare to pytorch including softmax in forward

fix bwd impl bug

small sizes in bwd impl work

old bwd test pass. Moving on to kernel tests

dq, dk and dv are filled in place if given. Need to match cast to match fa

fix non bug

fix dv mismatch. use_exp2 was set to true in fwd

fix case up 128

refactor and clean up a bit more

issue is that dq and dk are not zeros

dq must be zeroed out

ignore segfaults

fa ref and my ref match!

all tests run

use tolerance 1e-3

we need to figure out preprocessing

save

clean up

save

test delta diff

move old impl out

new preprocess function

preprocessing_use_o flag

working _bwd_preprocess_use_p

basic cases pass

all green

fwd exp2 usage is done right before exp

* refactor

* refactor 2

* refactor 3

* fix bug

* try ci

* add flag

* rename to utils

* skip test_op_fwd_decode_int4_kv

* reduce head size

* try again

* go back to old head sizes

* Use Strides

Use Strides

This is a combination of 11 commits.

use strides in bwd

add layout test in forward

fix shape layout function

smaller tests

save

fix varlen error

no headsize passed to bwd

deal with varlen layout

save

save

save

save

* use gen scripts

* varlen fwd passing

* core fwd ref impl

* fix minor bugs

* wrap varlen- launcher attention_forward_pytorch_ref_impl

* varlen backward ref added

* add offsets for varlen

* fix delta bug

* varlen bwd working

* save

* runs on Mi200

* just test basics

* save

* fix bug

* fix varlen int64 bug

* add ref

* test_impl working with causal

* fix qkvpacked issue

* qkvpacked run tests

* remove test_backward

* save

* just test output

* dump into tensors

* softmaxlse layout for varlen

* small cases working

* bwd thd green. although maybe some oom

* forward out and lse are good. Something wrong with backward ref

* make varlen ref work

* save work, ref is working mostly

* 91 failed, 6542 passed, 6336 skipped, 1 warning

* ref is all green

* debug flag in utils

* found bad softmax_lse in varlen fwd

* fix bug in softmax lse. strides in varlen were not right

* add causal tests and 32*32 bwd does not have segfault

* save

* fix oom by reducing block size for small heads

* bwd ref with causal working

* test impl

* causal test passes

* causal working

* fix tests

* nicer bench

* fix qvpacked error

* fix varlen qvpacked bug

* fix minor bug

* bench prefill and prefill_old using the same script

* autotune configs for fwd

* autotune flag

* clean up decode impl

* clean up

* clean up more

* bench everything by default and return time

* clean up readmes

REBASE: fix interface changes in rebase

rename test to test_flash_attn_triton_amd

REBASE: fix unpad diffs

minor clean up in setup

FLASH_ATTENTION_TRITON_AMD flags

bench fwd and bwd

fix sequence_parallel

* clean up

* Enable sequence_parallel in bwd (vllm-project#89)

* sequence_parallel working on bwd_impl test

* fix qkv error

* save

* save

* save

* bwd 3 times faster

* clean up

* fix varlen bug

* use copy back dict

* fix qkvpacked bug

* reduce bench sizes

* print copy back

* clean more

* Autotune off by default

* update Triton commit readme (vllm-project#92)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.