[Bugfix] Fix race condition when KV transfer times out before request finishes #26929
Conversation
[Bugfix] Fix race condition when KV transfer times out before request finishes

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Code Review
This pull request correctly addresses a critical race condition in the scheduler. The race condition occurred when a KV transfer timed out, causing a worker to report a request as finished_sending before the scheduler considered the request to be in a finished state. This led to an AssertionError and a crash when attempting to free blocks for a still-running request. The fix introduces robust checks to verify a request's status before attempting to free its blocks, replacing a dangerous assertion with graceful handling and a warning log. A targeted test case has been added to simulate this specific race condition, ensuring the fix is effective and preventing future regressions. The changes are well-implemented and I see no issues.
💡 Codex Review
Here are some automated review suggestions for this pull request.
```diff
 for req_id in kv_connector_output.finished_sending or ():
     logger.debug("Finished sending KV transfer for request %s", req_id)
-    assert req_id in self.requests
-    self._free_blocks(self.requests[req_id])
+    request = self.requests.get(req_id)
+    if request is None:
+        logger.warning(
+            "Got finished sending KV transfer for request %s, "
+            "but the request is already freed.",
+            req_id,
+        )
+    elif not request.is_finished():
+        logger.warning(
+            "Got finished sending KV transfer for request %s, "
+            "but the request is not finished (status=%s). "
+            "This may indicate the request was aborted or the KV "
+            "transfer timed out before the request completed.",
+            req_id,
+            request.status,
+        )
+    else:
+        self._free_blocks(request)
```
Track premature KV transfers to avoid leaked blocks
When a worker reports finished_sending before the scheduler marks the request as finished, the new branch only logs a warning and returns. No state is recorded that the transfer has already completed. Once the request does eventually finish, _free_request() still calls _connector_finished(), which typically returns delay_free_blocks=True for remote decode, so _free_blocks() is never invoked unless another finished_sending arrives. Since the worker already emitted its only finished_sending event during the timeout, the request is left in self.requests and its KV blocks remain allocated indefinitely, leaking cache space and preventing the scheduler from recycling memory. The handler needs to persist that the transfer already finished (or immediately free when the request later finishes) rather than merely warn.
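A minimal sketch of one way to address this, assuming a hypothetical `_early_finished_sending` set on the scheduler; apart from the names already visible in the diff (`_free_blocks`, `_connector_finished`, `is_finished()`), the helpers below are illustrative, not vLLM's actual implementation:

```python
# Hypothetical sketch: remember premature finished_sending events so the
# blocks still get freed once the request actually finishes.

def _handle_finished_sending(self, req_id: str) -> None:
    request = self.requests.get(req_id)
    if request is None:
        # Already freed; nothing left to do.
        return
    if not request.is_finished():
        # The worker's only finished_sending event arrived early (e.g. the
        # abort timeout fired); record it instead of just warning, so the
        # blocks can be freed when the request eventually finishes.
        self._early_finished_sending.add(req_id)
        return
    self._free_blocks(request)

def _free_request(self, request):
    assert request.is_finished()
    delay_free_blocks, kv_xfer_params = self._connector_finished(request)
    if request.request_id in self._early_finished_sending:
        # The transfer already completed (or timed out), so there is no
        # later finished_sending event to wait for.
        self._early_finished_sending.discard(request.request_id)
        delay_free_blocks = False
    if not delay_free_blocks:
        self._free_blocks(request)
    return kv_xfer_params
```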
It's a good question to wonder what will happen if ConnectorScheduler.request_finished() gets called after the request was already reported as finished_sending
markmc left a comment
My sense is there's a bug to be fixed here, not papered over.

> the KV transfer times out before the request is marked finished

That shouldn't happen (AIUI at least). If that is what's happening, I think we should figure out why.
```
but not actually in a finished state on the scheduler side.
This can happen when:
1. Worker-side NIXL connector times out waiting for decode workers
```
But this timeout should only be started once the request has finished
The timeout is started at the delay_free_blocks spot in ConnectorScheduler.request_finished()
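For reference, a simplified sketch of the flow being described, using hypothetical names (`_reqs_waiting_on_decode`, `ABORT_TIMEOUT_S`, `needs_remote_decode`, `build_kv_transfer_params`); the real NIXL connector code differs in detail:

```python
import time

ABORT_TIMEOUT_S = 120.0  # illustrative value; the real timeout is configurable

def request_finished(self, request, block_ids):
    # The scheduler only calls this once the request is finished, so the
    # send timeout below cannot start while the request is still running.
    delay_free_blocks = needs_remote_decode(request)  # hypothetical helper
    if delay_free_blocks:
        # The "delay_free_blocks spot": keep the KV blocks alive and start
        # the clock for the decode side to pull them.
        self._reqs_waiting_on_decode[request.request_id] = (
            time.perf_counter() + ABORT_TIMEOUT_S
        )
    return delay_free_blocks, build_kv_transfer_params(request, block_ids)  # hypothetical
```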
```
This can happen when:
1. Worker-side NIXL connector times out waiting for decode workers
2. Worker reports request in finished_sending to prevent stranding blocks
```
Requests are added to finished_sending by returning them from ConnectorWorker.get_finished() in two cases:
- We've received the required number of xfer notifications
- The timeout expired
Did you mean one of these? Or some other way that a request is added to finished_sending?
(e.g. are you thinking of some connector other than NIXL, or ...?)
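A rough sketch of those two paths, with hypothetical bookkeeping (`self._sending`, per-request `received`/`expected` counts and an `expires` deadline), not the actual connector code:

```python
import time

def get_finished(self):
    # Worker-side sketch: a request is reported in finished_sending either
    # because all transfer notifications arrived or because it timed out.
    now = time.perf_counter()
    finished_sending = set()
    for req_id, state in list(self._sending.items()):
        if state.received >= state.expected:
            # Case 1: the decode workers confirmed they pulled every block.
            finished_sending.add(req_id)
        elif now >= state.expires:
            # Case 2: the abort timeout expired; report the request anyway
            # so its blocks are not stranded on the prefill side.
            finished_sending.add(req_id)
    for req_id in finished_sending:
        del self._sending[req_id]
    return finished_sending, set()  # (finished_sending, finished_recving)
```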
```
This can happen when:
1. Worker-side NIXL connector times out waiting for decode workers
2. Worker reports request in finished_sending to prevent stranding blocks
3. Scheduler-side request hasn't reached a finished state yet
```
The decode side is somehow getting these block IDs before the prefill side has finished?
Just thinking about this possibility - on the prefill side we get a notification from decode that the blocks for this request have been transferred. But the request is still not finished on the prefill side ...
Assumption: it is impossible for the decode side to be notifying prefill about a request before prefill returns kv_transfer_params, which happens here:
```python
def _free_request(self, request: Request) -> dict[str, Any] | None:
    assert request.is_finished()
    delay_free_blocks, kv_xfer_params = self._connector_finished(request)
    ...
    if not delay_free_blocks:
        self._free_blocks(request)
    return kv_xfer_params
```
If the assumption above is correct, then the scenario looks impossible - the request must be finished before prefill returns kv_transfer_params?
Another theory: the request is preempted after it has finished and is now waiting for KV blocks to be fetched? Doesn't seem possible - we only choose requests to preempt from Scheduler.running.
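An illustrative sketch of that selection (hypothetical names; vLLM's scheduler differs in detail) - preemption only ever pops from the running queue, so a finished request cannot re-enter a waiting-for-KV state this way:

```python
# Illustrative only: preemption candidates come from self.running, never
# from finished requests, so this path cannot produce the observed state.
if not can_allocate_new_blocks:                 # hypothetical condition
    preempted = self.running.pop()              # lowest-priority running request
    self.kv_cache_manager.free(preempted)       # release its KV blocks
    preempted.status = RequestStatus.PREEMPTED
    self.waiting.appendleft(preempted)          # retry it later from the front
```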
| "Got finished sending KV transfer for request %s, " | ||
| "but the request is already freed.", | ||
| req_id, | ||
| ) |
This was deliberately removed by #25067 - it's a bug if it happens, the warning is not actionable by users, at best it should be a "fix this bug!" debug statement
| "Got finished sending KV transfer for request %s, " | ||
| "but the request is not finished (status=%s). " | ||
| "This may indicate the request was aborted or the KV " | ||
| "transfer timed out before the request completed.", |
Again, not actionable by a user - it's either an expected scenario that we can safely ignore (with no logging) or something that's a bug if it happens
```python
# reports this request as finished_sending, even though the request
# is still RUNNING on the scheduler side.
# This simulates the timeout scenario in NIXL connector.
kv_connector_output = KVConnectorOutput(finished_sending={request.request_id})
```
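The full test is not reproduced in this view; a hedged sketch of the simulation, assuming the helpers used in vLLM's existing v1 scheduler tests (`create_scheduler`, `create_requests`) and a hypothetical `make_model_runner_output` helper, might look like:

```python
def test_finished_sending_before_request_finished():
    scheduler = create_scheduler()                 # assumed test helper
    request = create_requests(num_requests=1)[0]   # assumed test helper
    scheduler.add_request(request)
    scheduler_output = scheduler.schedule()        # request is now RUNNING

    # Worker reports finished_sending early (e.g. NIXL abort timeout) even
    # though the scheduler has not finished the request.
    kv_connector_output = KVConnectorOutput(
        finished_sending={request.request_id}
    )
    model_runner_output = make_model_runner_output(   # hypothetical helper
        [request], kv_connector_output=kv_connector_output
    )

    # Before this fix, this raised AssertionError while freeing blocks for a
    # still-running request; with the fix it must not crash.
    scheduler.update_from_output(scheduler_output, model_runner_output)
    assert request.request_id in scheduler.requests
```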
It's a similar situation to the abort-after-finished race condition in #25067 - I wrote a unit test to artificially simulate it, but it took quite a bit of digging to understand how it could happen and figure out how we wanted to handle it
Was this preceded by
Thanks @markmc, I had a similar conclusion: the error in question should not be possible, and we should get to the bottom of it and fix the root cause rather than adding these new checks. My only theory as to how it could have happened is if the prefiller received more than one request with the same ID.
Ok, I think I can reproduce your repeated-request-ID theory @njhill with some effort.

Here's the server:

And I spin up lots of long-random-context requests with the same request ID, hit Ctrl-C, and repeat until crash (see req.sh here):

At some point I seem to have gotten lucky and (running this PR) got:

but mostly I'm seeing this crash:
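A Python equivalent of that loop might look like the sketch below; the server URL, model name, and the use of the `X-Request-Id` header to force a repeated request ID are assumptions about the setup, not taken from the PR:

```python
import random
import string

import requests  # third-party HTTP client

URL = "http://localhost:8000/v1/completions"  # assumed prefill-side endpoint

def send_long_request() -> None:
    # Long random context so the request stays in flight for a while.
    prompt = " ".join(random.choices(string.ascii_lowercase, k=4000))
    requests.post(
        URL,
        headers={"X-Request-Id": "repeated-request-id"},  # same ID every time
        json={"model": "my-model", "prompt": prompt, "max_tokens": 1},
        timeout=600,
    )

if __name__ == "__main__":
    while True:
        send_long_request()  # hit Ctrl-C mid-flight, rerun, repeat until crash
```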
To answer my own question, these were preceded by:
This fixes a race condition where the KV transfer times out before the request is marked finished and adds a unit test.
Here's a log from the failure: