feat: enable head_dim=256 for attention kernels #132

Merged: 8 commits into main from head-dim-256, Feb 25, 2024
Conversation

@yzh119 (Collaborator) commented Feb 22, 2024

As mentioned in #130, the kernels for head_dim=256 are not compiled by default. This PR exposes these attention kernels in the pip wheels and adds unit tests and benchmarks for head_dim=256.
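
A minimal smoke test for the newly exposed kernels might look like the sketch below. This is illustrative only: it assumes flashinfer's `single_decode_with_kv_cache` Python entry point, and the shapes and dtypes are made up rather than taken from this PR's unit tests.

```python
# Hypothetical smoke test for head_dim=256 decode attention.
# Assumes the flashinfer pip wheel's single_decode_with_kv_cache API;
# the exact signature may differ across releases.
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim, kv_len = 32, 32, 256, 4096

q = torch.randn(num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

# Before this PR, the head_dim=256 kernels were not compiled into the
# pip wheel, so this call could not dispatch to a matching kernel.
o = flashinfer.single_decode_with_kv_cache(q, k, v)
print(o.shape)  # (num_qo_heads, head_dim)
```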

@yzh119 merged commit 0372acc into main on Feb 25, 2024
yzh119 added a commit that referenced this pull request Feb 25, 2024
The way #132 computes `num_kv_chunks` is buggy for short inputs, this PR
fixes the issue.
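
A hypothetical sketch of how that kind of bug typically arises (function names and the chunk size are made up; the real scheduling logic lives in the CUDA sources): floor division yields zero chunks whenever the KV length is shorter than the chunk size, so no work gets scheduled, and the fix is a ceiling division clamped to at least one chunk.

```python
# Hypothetical illustration of the short-input chunking pitfall;
# not the actual FlashInfer scheduler code.
def num_kv_chunks_buggy(kv_len: int, chunk_size: int) -> int:
    # Floor division returns 0 when kv_len < chunk_size,
    # so short inputs schedule no work at all.
    return kv_len // chunk_size

def num_kv_chunks_fixed(kv_len: int, chunk_size: int) -> int:
    # Ceiling division, clamped to at least one chunk.
    return max(1, (kv_len + chunk_size - 1) // chunk_size)

assert num_kv_chunks_buggy(100, 512) == 0  # buggy: zero chunks for a short input
assert num_kv_chunks_fixed(100, 512) == 1  # fixed: one chunk covers it
assert num_kv_chunks_fixed(1024, 512) == 2
```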
@MasterJH5574 deleted the head-dim-256 branch on February 26, 2024
@yzh119 mentioned this pull request Feb 27, 2024
yzh119 added a commit that referenced this pull request Mar 8, 2024
🤖 I have created a release *beep* *boop*
---


## [0.0.3](v0.0.3...v0.1.0) (2024-03-08)


### Features

* adding `sm_scale` field for all attention APIs ([#145](#145)) ([85d4018](85d4018))
* enable `head_dim=256` for attention kernels ([#132](#132)) ([0372acc](0372acc))
* pytorch api of fp8 kv-cache ([#156](#156)) ([66ee066](66ee066))
* support ALiBi ([#146](#146)) ([383518b](383518b))

### Misc

* add stream argument in BeginForwardFunction of TVMWrapper ([#164](#164)) ([fabfcb5](https://github.com/flashinfer-ai/flashinfer/tree/fabfcb5751dcc003137a5a7d2d5514f3afe2e302))

### Bug Fixes

* bugfix to pr 135 ([#136](#136)) ([3d55c71](3d55c71))
* fix bugs introduced in [#132](#132) ([#135](#135)) ([9b7b0b9](9b7b0b9))
* fix FindThrust.cmake ([#161](#161)) ([30fa584](30fa584))


### Performance Improvements

* multiply q by sm_scale in decode kernels ([#144](#144)) ([660c559](660c559))
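
The entry above exploits the identity softmax((qKᵀ)·s) = softmax((q·s)Kᵀ): scaling the query once up front replaces a multiply on every query-key dot product in the decode loop. A minimal numeric sketch of the equivalence (illustrative only, not the kernel code):

```python
# Illustrative check that pre-scaling q matches scaling every logit;
# the actual optimization happens inside the CUDA decode kernels.
import torch

head_dim, kv_len = 256, 1024
sm_scale = head_dim ** -0.5          # typical softmax scale: 1/sqrt(head_dim)
q = torch.randn(head_dim)
k = torch.randn(kv_len, head_dim)

per_logit = torch.softmax((k @ q) * sm_scale, dim=0)   # scale each q.k logit
pre_scaled = torch.softmax(k @ (q * sm_scale), dim=0)  # scale q once, up front
assert torch.allclose(per_logit, pre_scaled, atol=1e-6)
```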

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: yzh119 <expye@outlook.com>