New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

feat: initial support of logits hook #298

Merged

yzh119 merged 8 commits into main from logits-post-hook

Jun 14, 2024

Collaborator

yzh119 commented Jun 12, 2024

Implement the #257 feature.

yzh119 added 5 commits

June 13, 2024 23:55

upd

ea02a22


          Revert "upd"

dc6b9bf

This reverts commit 7587839.

upd

8532e71

fmt

ef2b222

upd

upd

upd

upd

upd

87b5b66

yzh119 force-pushed the logits-post-hook branch from edc108a to 87b5b66 Compare

June 14, 2024 00:21

yzh119 changed the title ~~[WIP] initial support of logits hook~~ feat: initial support of logits hook

yzh119 added 3 commits

June 14, 2024 06:03

upd

b06ede3

upd

efde73e


          test passed

127f752

yzh119 merged commit ab1e2ad into main

github-actions bot mentioned this pull request

chore(main): release 0.0.5 #232

Merged

yzh119 mentioned this pull request

[Feature request] Support attention logits cap with tanh #257

Closed

yzh119 deleted the logits-post-hook branch

June 14, 2024 21:00

yzh119 added a commit that referenced this pull request


          chore(main): release 0.0.5 (#232)

5c05676

🤖 I have created a release *beep* *boop*
---


##
[0.1.0](v0.0.4...v0.1.0)
(2024-06-20)

### Highlights

* Support any GQA group size support for tensor-cores kernels.
* Support any page size support for tensor-cores kernels.
* Support CUDA-Graph for prefill/decode APIs.
* Add an option to accelerate decode kernels with Tensor Cores.
* Support custom attention mask.
(https://docs.flashinfer.ai/tutorials/kv_layout.html#mask-layout-2d-ragged-tensor)
* Support logits cap in Grok-1 models.
* Fused GPU-sampling kernels: top-p, top-k, speculative verification.
(https://docs.flashinfer.ai/api/python/sampling.html)
* PyTorch wrapper of group-gemm cutlass kernels.
(https://docs.flashinfer.ai/api/python/sampling.html)

### Acknowledgement

We thank [@ibsidorenko](https://github.com/ibsidorenko),
[@LiuXiaoxuanPKU](https://github.com/LiuXiaoxuanPKU),
[@Yard1](https://github.com/Yard1)
[@AgrawalAmey](https://github.com/AgrawalAmey),
[@xuzhenqi](https://github.com/xuzhenqi),
[@mgerstgrasser](https://github.com/mgerstgrasser),
[@esmeetu](https://github.com/esmeetu),
[@yz-tang](https://github.com/yz-tang),
[@HSQ79815](https://github.com/HSQ79815),
[@Qubitium](https://github.com/Qubitium),
[@shreygupta2809](https://github.com/shreygupta2809),
[@sighingnow](https://github.com/sighingnow),
[@vinx13](https://github.com/vinx13),
[@tqchen](https://github.com/tqchen),
[@merrymercy](https://github.com/merrymercy),
[@comaniac](https://github.com/comaniac) and many others for their
contributions and helpful discussions for 0.0.5 release.

### Refactor

* support any GQA group size for tensor-cores kernels
([#301](#301))
([c111ca](c111ca6))
* support any page size for tensor-cores kernels
([#306](#306))
([82fd8c](82fd8c7))


### Features

* add `use_tensor_cores` option to decode kernels to accelerate GQA
([#317](#317))
([3b50dd5](3b50dd5))
* add group gemm operators
([#282](#282))
([e08ba42](e08ba42))
* initial support of distributed operators
([#289](#289))
([03553da](03553da))
* initial support of logits hook
([#298](#298))
([ab1e2ad](ab1e2ad))
* Separate Q and KV dtypes for decode
([#286](#286))
([5602659](5602659))
* support cuda graph for batched multi-query(prefill/append) attention
([#275](#275))
([83ceb67](83ceb67))
* support cuda graph for batched multi-query(prefill/append) attention
([#277](#277))
([24cc583](24cc583))
* support custom attention mask in prefill/append attention kernels
([#266](#266))
([7304282](7304282))
* fused speculative sampilng kernels
([#259](#259))
([cea2bb](cea2bb9))
* expose sampling APIs in pytorch
([#238](#238))
([092902](0929023))


### Performance Improvements

* initial cuda graph support
([#256](#256))
([7e9cc7f](7e9cc7f))
* split kv-cache for prefill/append kernels
([#310](#310))
([f0bb0a3](f0bb0a3))
* use packed bit array for attention mask
([#308](#308))
([3d43dc9](3d43dc9))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Zihao Ye <expye@outlook.com>

github-actions bot mentioned this pull request

chore(main): release 0.1.4 #415

Merged

github-actions bot mentioned this pull request

chore(main): release 0.3.0 #698

Open

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet