Releases: ROCm/flash-attention

v2.7.0-cktile

15 Nov 05:57
  • Reduce LDS usage when num_splits <= 8
  • Use a smaller tile size to speed up small-seqlen cases
  • Fine-tune block mapping
  • Use a larger vector size for writing the workspace
  • Speed up the combine kernel
  • Fix an out-of-bounds read of the block table
  • Fix incorrect key/value ranges in each split
  • Avoid accessing the dropout seed & offset device pointers in the host API
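The combine kernel mentioned above merges the per-split partial attention outputs using their log-sum-exp statistics. A minimal NumPy sketch of that reduction (names and shapes are illustrative, not the kernel's actual interface):

```python
import numpy as np

def combine_splits(partial_out, partial_lse):
    """Merge per-split attention outputs with their log-sum-exp (LSE) stats.

    partial_out: (num_splits, seqlen_q, head_dim) partial attention outputs,
                 each normalized by its own split's softmax denominator
    partial_lse: (num_splits, seqlen_q) per-split log-sum-exp of the scores
    """
    lse_max = partial_lse.max(axis=0)                    # (seqlen_q,)
    # Global LSE across splits, computed stably.
    lse = lse_max + np.log(np.exp(partial_lse - lse_max).sum(axis=0))
    scale = np.exp(partial_lse - lse)                    # per-split weights
    return (scale[..., None] * partial_out).sum(axis=0)  # (seqlen_q, head_dim)
```

Each split's output is reweighted by the ratio of its local softmax denominator to the global one, so the combined result equals single-pass attention over all keys.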

v2.6.3-cktile

17 Sep 18:52
e2182cc

We sent these changes upstream as a PR:

  1. Update the ROCm composable kernel (CK) backend and adapt the call sites to the changed CK API.
  2. Improve backward-pass performance via the CK update in (1).
  3. Implement mha_fwd_kvcache().
  4. Change compile flags to support ROCm 6.2.
  5. Change bf16 rounding to RTN (round to nearest).
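For reference, the difference between simple truncation and RTN rounding when converting fp32 to bf16 (item 5) can be illustrated with a few lines of bit manipulation. This is a hand-rolled sketch of the standard round-to-nearest-even trick, not the CK implementation:

```python
import struct

def f32_bits(x: float) -> int:
    """Reinterpret a float32 as its 32-bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(b: int) -> float:
    """Reinterpret a 32-bit pattern as a float32."""
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]

def bf16_truncate(x: float) -> float:
    # bf16 keeps the top 16 bits of fp32; truncation just drops the rest.
    return bits_f32(f32_bits(x) & 0xFFFF0000)

def bf16_rtn(x: float) -> float:
    # Round to nearest (even): add a bias before truncating so that the
    # discarded low half rounds the kept half up when it exceeds 0x8000,
    # and ties round toward an even result.
    b = f32_bits(x)
    bias = 0x7FFF + ((b >> 16) & 1)
    return bits_f32(((b + bias) & 0xFFFFFFFF) & 0xFFFF0000)
```

Truncation always rounds the magnitude down, which biases accumulated results; RTN removes that bias at the cost of one extra integer add.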

v2.6.2-cktile

14 Aug 10:02

This release is the first version to support the composable kernel (CK) tile backend.

vllm-v2.5.9post1-90a.942-240719

19 Jul 14:18
23a2b1c
Pre-release

This release exists solely for convenient installation by vLLM. The attached wheel is built from the ck_tile branch as of 07/19/2024 (commit hash 23a2b1c2f21) for architectures gfx90a;gfx942, and is intended for use with torch==2.5.0.dev20240710 (this requirement is not strict) and ROCm 6.1.

To install the matching version of torch:

python3 -m pip install --no-cache-dir --pre \
    torch==2.5.0.dev20240710 torchvision==0.20.0.dev20240710 \
    --index-url https://download.pytorch.org/whl/nightly/rocm6.1