-
Notifications
You must be signed in to change notification settings - Fork 167
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat/evict req on client disconnect streaming case #223
Draft
bhimrazy
wants to merge
32
commits into
Lightning-AI:main
Choose a base branch
from
bhimrazy:feat/evict-req-on-client-disconnect-streaming-case
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
Changes from all commits
Commits
Show all changes
32 commits
Select commit
Hold shift + click to select a range
d613565
chore: Add SimpleDelayedStreamAPI for delayed streaming of output
bhimrazy 371cf56
add test_stream_client_disconnection
bhimrazy 9e7f841
add request_evicted_status param to run_streaming_loop
bhimrazy 7ce49ac
update test_stream_client_disconnection
bhimrazy 56c8587
adds functionality to evict the request if disconnected before comple…
bhimrazy f5961c4
Merge branch 'main' into feat/evict-req-on-client-disconnect-streamin…
bhimrazy 9330997
update exception
aniketmaurya 1f0bfe5
fix test
aniketmaurya d41db3c
Update src/litserve/server.py
aniketmaurya 4344720
Merge branch 'main' into feat/evict-req-on-client-disconnect-streamin…
aniketmaurya f177fcb
Merge branch 'main' into feat/evict-req-on-client-disconnect-streamin…
bhimrazy 4e5045a
Merge branch 'main' into feat/evict-req-on-client-disconnect-streamin…
bhimrazy 1d4677c
Merge branch 'main' into feat/evict-req-on-client-disconnect-streamin…
bhimrazy ca6fbc2
reverted changes to new updates
bhimrazy e61cdab
update
bhimrazy 6c2e0c6
update
bhimrazy 6668cc8
Merge branch 'main' into feat/evict-req-on-client-disconnect-streamin…
bhimrazy 3448ef3
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 2cfd68e
chore: Add test for streaming client disconnection
bhimrazy c95ee45
handle client disconnection streaming nonbatched case
bhimrazy bac5534
chore: Optimize streaming loop performance by checking for client dis…
bhimrazy f08ed4b
chore: Update streaming loop to include request eviction status
bhimrazy e060e39
Merge branch 'main' into feat/evict-req-on-client-disconnect-streamin…
bhimrazy 2b11fc7
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 5323c51
Refactor inference_worker function to remove optional parameters and …
bhimrazy 4368b57
update
bhimrazy 611a751
update
bhimrazy 5cc0f77
add missing param
bhimrazy 8d4a05d
add missing param
bhimrazy bd68b6c
add missing param for run streaming loop
bhimrazy 56f1076
test by removing the check interval
bhimrazy 49bed55
so there is performance drop with this check,
bhimrazy File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checking the
request_evicted_status
for each token appears to have a significant impact, reducing performance from 3600 to around 3100. However, it may not be necessary to perform this check on every token.While adding a check interval helps reduce the overhead and brings the performance closer to that of the main branch, but it still doesn't feel like an ideal solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for your patience with the PR and checking the speed issue @bhimrazy 🙌 .
yeah, and in case when the time-to-first-token is large but rest of the token stream speed is fast, it doesn't help much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is just single worker. with multiple workers it might impact even more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the overall design is correct, we are just way too aggressive checking the distributed dict and we get into contention problems.
One alternative that could reduce contention is getting a snapshot of the disconnected dictionary in every worker loop: so not use a managed dict but a shared value that the server publishes and that gets read as a whole by each worker periodically (every N seconds - we don't need a thread, we just check the time at every loop). This way every worker has a semi-up to date local dictionary that it can check as often as we want.
Having semi-up to date info on who disconnected every N seconds is totally fine, we don't need to react immediately.
This design also helps with ignoring items in the queue that come from clients that have been disconnected. For those we necessarily have to check at every request. If the local dictionary is not up to date we'll run some requests for nothing, but that's ok. One caveat is making sure the responses don't accumulate in the response dictionary on the webserver process, in this case (let's remember about this).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, @lantiga, for the valuable insights. This approach seems promising. I'll take some time to study the concept and work on the implementation shortly.