feat: add llama 3.1 style rope #401

Merged · 13 commits into main on Jul 27, 2024
Conversation

@yzh119 (Collaborator) commented Jul 27, 2024

Reference implementation: https://github.com/meta-llama/llama-models/blob/709a61fd810157f75fbb314e7287089eec06d9c3/models/llama3_1/api/model.py#L41

This PR also exposes BatchQKApplyRotaryInPlaceKernel through the PyTorch APIs; previously it was only used in the TVM wrappers.
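
For context, the Llama 3.1 rope variant differs from standard rope only in how the per-dimension frequencies are rescaled before the rotation is applied. The sketch below is an illustrative PyTorch paraphrase of the frequency-scaling step in the referenced meta-llama implementation (the constants `scale_factor=8`, `low_freq_factor=1`, `high_freq_factor=4`, and `old_context_len=8192` come from that reference); it is not the CUDA kernel added in this PR.

```python
import math
import torch

def apply_llama31_scaling(freqs: torch.Tensor) -> torch.Tensor:
    """Rescale rope frequencies the way Llama 3.1 does (illustrative sketch)."""
    # Constants taken from the referenced meta-llama implementation.
    scale_factor = 8.0
    low_freq_factor = 1.0
    high_freq_factor = 4.0
    old_context_len = 8192  # original Llama 3 training context length

    low_freq_wavelen = old_context_len / low_freq_factor
    high_freq_wavelen = old_context_len / high_freq_factor

    wavelen = 2 * math.pi / freqs
    # Interpolation weight for the mid-frequency band.
    smooth = (old_context_len / wavelen - low_freq_factor) / (
        high_freq_factor - low_freq_factor
    )
    # High-frequency components are kept as-is, low-frequency components are
    # divided by scale_factor, and the band in between is smoothly interpolated.
    return torch.where(
        wavelen < high_freq_wavelen,
        freqs,
        torch.where(
            wavelen > low_freq_wavelen,
            freqs / scale_factor,
            (1 - smooth) * freqs / scale_factor + smooth * freqs,
        ),
    )
```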

@yzh119 yzh119 merged commit 4c89dec into main Jul 27, 2024
yzh119 added a commit that referenced this pull request Jul 29, 2024
🤖 I have created a release *beep* *boop*
---

## [0.1.2](v0.1.1...v0.1.2) (2024-07-29)

### Bugfix

* Fix the sampling kernel bug for cu118 ([#386](#386), [#387](#387)) ([0cd499](0cd4994), [dc3f18](dc3f184))

### Features

* add llama 3.1 style rope ([#401](#401)) ([4c89dec](4c89dec))
* non-inplace rope operators ([#405](#405)) ([74ffba1](74ffba1))
* sliding window attention ([#406](#406)) ([28cffd3](28cffd3))
* support non-contiguous (packed) input for prefill kernels ([#404](#404)) ([68c3719](68c3719))


### Performance Improvements

* slight optimization on merge states ([#313](#313)) ([701c813](701c813))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Zihao Ye <expye@outlook.com>
@yzh119 yzh119 deleted the llama-3.1-rope branch August 3, 2024 00:20
@chenzhuofu commented

Awesome!

@chenzhuofu commented Aug 25, 2024

Looks like llama-3.1-rope hasn't been incorporated into PosEncodingMode, so I think I may explicitly call BatchQKApplyLlama31Rotary and then use PosEncodingMode::kNone in AttentionKernel. What do you think? @yzh119
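
To make the pattern described above concrete, here is a minimal PyTorch stand-in (not the actual FlashInfer bindings): `attention_with_explicit_rope` and `apply_llama31_rope_fn` are hypothetical names, and `scaled_dot_product_attention` stands in for the FlashInfer attention kernel running with PosEncodingMode::kNone. The idea is simply to apply the Llama 3.1 rope to q/k up front and let the attention kernel do no positional encoding of its own.

```python
import torch
import torch.nn.functional as F

def attention_with_explicit_rope(q, k, v, apply_llama31_rope_fn):
    """Placeholder sketch: rope applied outside the kernel, attention with no pos-enc.

    q, k, v: [batch, num_heads, seq_len, head_dim]
    apply_llama31_rope_fn: any callable wrapping BatchQKApplyLlama31Rotary
        (applied in place to q and k); the name is hypothetical.
    """
    # 1. Apply Llama 3.1 style rotary embedding to q/k explicitly.
    apply_llama31_rope_fn(q, k)
    # 2. Run attention with no built-in positional encoding, i.e. the
    #    equivalent of PosEncodingMode::kNone in the CUDA kernel.
    return F.scaled_dot_product_attention(q, k, v)
```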

@yzh119 (Collaborator, Author) commented Sep 1, 2024

@chenzhuofu, yes, the wheel size would explode if we folded Llama 3.1 style rope into PosEncodingMode.

I'm refactoring the codebase to use JIT compilation, and this issue should be resolved soon.
