Parallel sampling eviction #157
Merged: sunggg merged 42 commits into octoml:batch-serving from masahi:parallel-sampling-eviction on Feb 2, 2024.
Commits (42, all by masahi)
e0ef4c6  add new model for evaluating logits over multiple queries using KV cache
4ccbb27  add test
f1314a5  clean
2bee022  Only the number of past tokens is needed
756b09f  fix build
09ef5b3  fix
7b67ba4  correctly handle num_past_tokens > sliding_window case
e0517fd  wip
cf89a5b  blac
9ca4806  wip
4541b4d  wip
5d376d2  remove cancel call back in eviction
59c36cc  Create MultiQueryDecodeRequest
f58acf7  only the number of past tokens is needed
d9dd2ca  wip
cb11761  wip
24f7bfa  wip
34da221  fix
d94e9d8  wip
4a3bb77  wip
0c6875e  wip
a46abe1  wip
c80bea2  working?
18239a4  remove dbg print
fd2b2bd  multi gpu works
6ac292b  fixed sliding window logic
2f9d1f7  remove dbug print
3a9f6d6  clean and fix
9fb9261  mypy
906b23b  generate signature update
2c1aa04  Merge branch 'batch-serving' into parallel-sampling-eviction
b197e71  more
2dfa28d  fix mypy
e287c5f  fix
417750c  Merge branch 'batch-serving' into parallel-sampling-eviction
c925c52  fix
a4d6e01  mypy fix
7360392  Merge branch 'batch-serving' into parallel-sampling-eviction
5dbf73e  refactor
78a6f77  fix
9189697  rename
d4fe2d7  Disallow preempting when a request has generated more than max_num_ba…
@sunggg @elvin-n Please be aware of this limitation. Due to this, there is still a case when a parallel-sampling request is cancelled rather than preempted.

In general, we don't have a good solution for preempting a request that has generated more than max_num_batched_tokens tokens. See also #163. The easiest solution would be to stop generation at max_num_batched_tokens, but then we cannot support "unlimited" generation.
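For context, the sketch below illustrates the limitation described in this comment. It is a minimal, hypothetical Python sketch, not the actual mlc-llm serving code: the Request class and the preempt_or_cancel helper are illustrative assumptions. The idea is that an evicted request must later be restored by re-processing all of its generated tokens in one batched step, so a request that has already generated more than max_num_batched_tokens tokens cannot be restored and has to be cancelled instead.

```python
# Hypothetical sketch of the preempt-vs-cancel decision. Names (Request,
# preempt_or_cancel) are illustrative, not the real mlc-llm API.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Request:
    request_id: int
    prompt_tokens: List[int]
    generated_tokens: List[int] = field(default_factory=list)


def preempt_or_cancel(request: Request, max_num_batched_tokens: int) -> str:
    """Decide what to do with a request when KV-cache space runs out."""
    # After eviction the request's KV-cache entries are freed. To resume it
    # later, the engine must re-evaluate all generated tokens in one batched
    # step, which is capped by max_num_batched_tokens.
    if len(request.generated_tokens) > max_num_batched_tokens:
        # The restore step would not fit in a single batch, so the request
        # is cancelled instead of preempted (the limitation noted above).
        return "cancel"
    # Otherwise the request can be evicted now and resumed later.
    return "preempt"


if __name__ == "__main__":
    short = Request(0, prompt_tokens=[1, 2, 3], generated_tokens=list(range(100)))
    long = Request(1, prompt_tokens=[1, 2, 3], generated_tokens=list(range(5000)))
    print(preempt_or_cancel(short, max_num_batched_tokens=4096))  # preempt
    print(preempt_or_cancel(long, max_num_batched_tokens=4096))   # cancel
```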