Implement single_query_cached_kv_attention kernel #3
Merged (+2,140 −60)
v1nc3nt27 pushed a commit to v1nc3nt27/vllm that referenced this pull request on Sep 12, 2023:
Merge linear layers
xiangyuT pushed a commit to xiangyuT/vllm that referenced this pull request on Oct 18, 2023:
* add multiple requests test * fix
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request on Feb 13, 2024
Spycsh pushed a commit to Spycsh/vllm that referenced this pull request on Feb 27, 2024:
Co-authored-by: Mikhail Dvoretckii <mdvoretckii@habana.ai>
luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request on Mar 12, 2024:
Passing alibi_slopes and sliding_window to PagedAttention extension
luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request on Mar 20, 2024:
Added dockerfile with vLLM + openvino
mujjingun added a commit to gmlwns2000/vllm-timber that referenced this pull request on Apr 15, 2024
mzusman added a commit to mzusman/vllm that referenced this pull request on Apr 16, 2024:
* Remove assertion * adapting jamba vllm to changes after hf release, working on weight loading in modeling file * splitting the JambaDecoderLayer to JambaMambaDecoderLayer and JambaAttentionDecoderLayer * weight loading from hf checkpoint supposedly works, might be a mixup in the MoE between the gated and non-gated weights * Add mamba from jamba modeling file * Remove slow forward * Modifications to mamba_mixer * Save changes, WIP * Fix cache placement * Debugging * Additions and logging * Jamba with mamba cache handling * Clean up * Another cleanup * Use vllm's RMSNorm instead of JambaRMSNorm, Thier implementation is with fused kernel * Clean up and orginization of the objects to handle the mamba cache * Shorten the code for kv cache mem * Move cache handling inside the Mixer * Add mamba to the wheel requirements * Add mamba to the requirements script * Add mamba_metadata * Add to __init__ __all__ * Revert 2 commits ad1a3db 'Add mamba to the requirements script' 75ed2c8 'Add mamba to the wheel requirements' * Clean up * Naming * Apply whitespace suggestions from code review * pass tie_word_embeddings to PretrainedConfig init * Replace repeat with expand as expand doesn't require more mem * Allocate really small cache if needed , don't use meta * Fix for expanded --------- Co-authored-by: Mor Zusman <morz@ai21.com> Co-authored-by: Erez Schwartz <erezs@ai21.com> Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>
sfc-gh-hazhang referenced this pull request in sfc-gh-hazhang/vllm on May 7, 2024
linxihui pushed a commit to linxihui/vllm that referenced this pull request on May 14, 2024:
…ope-type minor change for LongRoPE config to account for rename from longrope …
Closed
This was referenced Jul 5, 2024
Closed
PanJason added a commit to PanJason/vllm that referenced this pull request on Sep 21, 2024:
* Add example disk swap config. Add unit tests for CC with memory tiering * Layered transfer for DRAM. Transfer in cuda streams * Fix the missing arg * Fix context caching online serving This commit enables layered transmission for DRAM first. Now the transmission is done in different cuda streams. xformers, flash infer and flash attention are supported. Optimized transfer for disk is still pending. Cherry-pick Yangshen's commit --------- Co-authored-by: yangshen <yangshen.d@outlook.com>
zeroorhero pushed a commit to zeroorhero/vllm that referenced this pull request on Sep 23, 2024:
Optimize the KV transfer pipe implementation
This PR adds the `single_query_cached_kv_attention` kernel.

Supported data types:
- half
- float
Tested models:
Tested GPUs:
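
For readers unfamiliar with the operation, the following is a minimal pure-Python reference sketch of what a single-query attention step over a cached, block-paged KV store could look like. It is not the kernel added in this PR: the function signature, the cache layout (a list of fixed-size blocks), and the names `block_table`, `context_len`, and `block_size` are all assumptions made for illustration, and it handles one head of one sequence only.

```python
import math


def single_query_cached_kv_attention_ref(query, key_cache, value_cache,
                                         block_table, context_len,
                                         block_size, scale=None):
    """Reference attention for ONE query vector over a paged KV cache.

    Hypothetical layout (an assumption, not the PR's actual layout):
      query        list[float], length head_size
      key_cache    list of physical blocks, each a list of block_size key vectors
      value_cache  same shape as key_cache, holding value vectors
      block_table  physical block index for each logical block of the sequence
      context_len  number of valid cached tokens for the sequence
    """
    head_size = len(query)
    if scale is None:
        scale = 1.0 / math.sqrt(head_size)

    # 1. Scaled dot product of the query with every cached key.
    logits = []
    for pos in range(context_len):
        block = block_table[pos // block_size]   # logical -> physical block
        offset = pos % block_size                # slot inside that block
        key = key_cache[block][offset]
        logits.append(scale * sum(q * k for q, k in zip(query, key)))

    # 2. Numerically stable softmax over the logits.
    max_logit = max(logits)
    weights = [math.exp(x - max_logit) for x in logits]
    total = sum(weights)

    # 3. Weighted sum of the cached values.
    out = [0.0] * head_size
    for pos, w in enumerate(weights):
        block = block_table[pos // block_size]
        offset = pos % block_size
        value = value_cache[block][offset]
        for i in range(head_size):
            out[i] += (w / total) * value[i]
    return out


if __name__ == "__main__":
    # Three cached tokens spread over physical blocks 2 and 0 (block_size = 2).
    keys = [[[1.0, 0.0], [9.0, 9.0]], [[9.0, 9.0], [9.0, 9.0]], [[1.0, 0.0], [1.0, 0.0]]]
    vals = [[[2.0, 2.0], [7.0, 7.0]], [[7.0, 7.0], [7.0, 7.0]], [[1.0, 0.0], [0.0, 1.0]]]
    print(single_query_cached_kv_attention_ref([1.0, 0.0], keys, vals,
                                               block_table=[2, 0],
                                               context_len=3, block_size=2))
```

The block table is what lets the cached tokens of one sequence live in non-contiguous physical blocks; slots beyond `context_len` are never read. A GPU kernel would parallelise these loops and work in `half` or `float`, which this sketch does not attempt to model.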