-
-
Notifications
You must be signed in to change notification settings - Fork 11.2k
[NIXL] refactor scheduler->worker request state synchronization #26172
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request refactors the send timeout tracking for KV block transfers in prefill instances by moving the logic from the worker side to the scheduler side. This simplifies the logic and better handles corner cases. The changes are well-structured and align with the stated goal. However, I've found a critical issue related to a type mismatch that will cause a runtime error. A bytes object is added to a set[str] in the worker, and then the scheduler attempts to decode it, which will fail. I've provided suggestions to fix this by ensuring type consistency across the worker-scheduler interface.
markmc
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments on what I'm not fully happy with yet
f26afba to
b5796cc
Compare
|
This pull request has merge conflicts that must be resolved before it can be |
b5796cc to
9894745
Compare
NickLucche
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work @markmc !
Personally the major thing that seems a bit off to me is moving the heterogenous-related stuff from Worker->Scheduler. I was hoping we would be able to keep that transparent from the Scheduler, and let the worker handle that as well as the nixl_notifs, which I believe are meant to stay within "nixl agent's reach" (ie Worker-side).
This is also leading to having to keep consumer_count in a dataclass, which I believe is making the original changes from last week a bit more complicated.
It remains hidden from the scheduler itself ... but yes, it's on the scheduler side of the NIXL connector logic
Nothing jumps out at me as a reason for this being a significant violation of the scheduler/worker split in the connector - the scheduler side is naturally a sort of coordinator, so it being aware that it needs to wait for multiple sent notifications ... doesn't seem breaking any significant encapsulation? Something needs to keep track of the number of notifications, and in the case of an abort, that counter needs to be freed/deleted ... if the worker is that something, then we can't avoid notifying the worker about the abort
For each request, we have a timeout and a consumer count ... I used a tuple initially, but it was a bit gross. It could be two dicts. I dunno ... The changes are more significant, because I've refactored and introduced |
9894745 to
d384771
Compare
|
Looking at moving the consumer notification count stuff back to the worker ... my conclusion is that all of the complexity here comes from having to synchronize the state of requests between the worker and scheduler side, and we will need to do that if the consumer count is tracked by the worker ... so just moving the timeout doesn't achieve anything So I tried some minor refactoring of the state synchronization
|
|
This pull request has merge conflicts that must be resolved before it can be |
d384771 to
63c07ee
Compare
|
This pull request has merge conflicts that must be resolved before it can be |
Use a SCHEDULED/FINISHED/ENUM rather than in_batch, to_send, and not_processed. Also move the expiry timestamp calculation to the worker side so we're not sending timestamps across process boundaries. Signed-off-by: Mark McLoughlin <markmc@redhat.com>
63c07ee to
9d47e78
Compare
In a prefill instance, we need to free KV blocks that have not been fetched after a timeout. See #20139.
In #26012, we're trying to deal with corner cases involved with doing this request timeout tracking on the worker side. This PR proposes refactoring the scheduler->worker request state synchronization to use a SCHEDULED/FINISHED/ABORTED enum rather than in_batch, to_send, and not_processed.
Note the expiry timer is switched back to monotonic time because the timestamp is no longer sent across process boundaries.