[Nixl] make deletion atomic in nixl request timeout handling #24268
Conversation
Signed-off-by: Will Eaton <weaton@redhat.com>
Code Review
This pull request provides a hotfix for a race condition in NIXL timeout handling by making a dictionary deletion atomic. The change is correct, but as noted in the detailed comment, it appears to be incomplete, as a similar race condition persists in another part of the code that modifies the same shared state.
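For context, the bug class being fixed here is a non-atomic check-then-act on shared state. A minimal standalone illustration of the two patterns (simplified stand-in names, not the actual vLLM code):

```python
reqs_to_send: dict[str, float] = {"req-1": 123.4}
req_id = "req-1"

# Racy pattern: the membership test and the deletion are separate steps, so
# two paths (the consumer-notification handler and the timeout sweep) can
# both pass the test and both attempt the free, i.e. the reported double-free.
if req_id in reqs_to_send:
    del reqs_to_send[req_id]

# Atomic pattern: a single dict.pop() picks exactly one winner. On CPython a
# built-in dict.pop() is indivisible, so only one caller gets the entry back;
# the loser gets the None default and skips freeing.
entry = reqs_to_send.pop(req_id, None)
if entry is not None:
    pass  # this caller won the race; free the KV blocks exactly once
```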
Signed-off-by: Will Eaton <weaton@redhat.com>
Force-pushed from 4922107 to 857953c.
/gemini review
Code Review
This pull request aims to fix a race condition leading to a double-free by making the request deletion from _reqs_to_send atomic. The introduction of try_remove_request is a good approach for this. However, the changes in _get_new_notifs introduce a memory leak by processing notifications for untracked requests. I have added a review comment with a suggested fix to prevent this leak while maintaining the atomicity of the deletion.
```diff
         self.consumer_notification_counts_by_req[req_id] += 1
         # Wait all consumers (D) to be done reading before freeing.
         if self.consumer_notification_counts_by_req[req_id] == int(
                 tp_ratio):
-            notified_req_ids.add(req_id)
-            del self.consumer_notification_counts_by_req[req_id]
-            del self._reqs_to_send[req_id]
+            if self.try_remove_request(req_id, "consumer_complete"):
+                notified_req_ids.add(req_id)
+            else:
+                logger.debug(
+                    "Request %s completed by all consumers but was"
+                    " already removed (likely timed out)", req_id)
```
This change introduces a potential memory leak. By removing the check for req_id in self._reqs_to_send, a notification for a request that has already been completed or timed out will now increment self.consumer_notification_counts_by_req.
If tp_ratio > 1 and this is a late notification for an already-handled request, the condition self.consumer_notification_counts_by_req[req_id] == int(tp_ratio) may not be met. In this case, the entry for req_id will remain in self.consumer_notification_counts_by_req, causing a memory leak.
A check for req_id in self._reqs_to_send should be restored at the beginning of the loop to prevent this. This will also make the else branch of try_remove_request an exceptional case that should be logged as an error.
Suggested change:

```diff
-self.consumer_notification_counts_by_req[req_id] += 1
-# Wait all consumers (D) to be done reading before freeing.
-if self.consumer_notification_counts_by_req[req_id] == int(
-        tp_ratio):
-    if self.try_remove_request(req_id, "consumer_complete"):
-        notified_req_ids.add(req_id)
-    else:
-        logger.debug(
-            "Request %s completed by all consumers but was"
-            " already removed (likely timed out)", req_id)
+if req_id not in self._reqs_to_send:
+    logger.debug("Ignoring notification for untracked request %s",
+                 req_id)
+    continue
+self.consumer_notification_counts_by_req[req_id] += 1
+if self.consumer_notification_counts_by_req[req_id] == int(
+        tp_ratio):
+    del self.consumer_notification_counts_by_req[req_id]
+    if self.try_remove_request(req_id, "consumer_complete"):
+        notified_req_ids.add(req_id)
+    else:
+        logger.error("BUG: Failed to remove request %s", req_id)
```
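The leak the comment describes is easy to demonstrate in isolation. A self-contained toy run (simplified stand-ins for the connector's state):

```python
from collections import defaultdict

consumer_notification_counts_by_req: defaultdict[str, int] = defaultdict(int)
reqs_to_send: dict[str, float] = {}  # "req-1" already timed out and was removed
tp_ratio = 2

# A single late notification arrives for the already-removed request.
# Without the membership guard, the counter entry is created anyway.
req_id = "req-1"
consumer_notification_counts_by_req[req_id] += 1
if consumer_notification_counts_by_req[req_id] == tp_ratio:
    del consumer_notification_counts_by_req[req_id]  # never reached: 1 != 2

# The stale counter entry is never cleaned up: this is the leak.
assert consumer_notification_counts_by_req == {"req-1": 1}
```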
Signed-off-by: Will Eaton <weaton@redhat.com>
Note the reporter was running v0.10.0. Here's what I see: each prefill server failed with the same pattern, where both prefill TP workers received 2 notifications for the same request, apparently after the request had already been deleted from reqs_to_send. The number of notifications checks out: decode is TP=4, so the TP ratio is 2. The most straightforward explanation is that the request had already timed out before we received these notifications. And #21753 was merged in v0.10.1 with what looks like a guard against this situation, so with v0.10.1 I expect we'd see the expiry warning instead.
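For readers following the race: the competing writer is the expiry sweep on the prefill side. A rough sketch of that path, assuming _reqs_to_send maps request id to an expiration timestamp and reusing the PR's atomic helper (not the verbatim vLLM code):

```python
import logging
import time

logger = logging.getLogger(__name__)

def sweep_expired(reqs_to_send: dict[str, float], try_remove_request) -> None:
    """Release KV blocks for requests whose decode consumers never finished."""
    now = time.monotonic()
    for req_id, expires in list(reqs_to_send.items()):
        if now < expires:
            continue
        # The same atomic helper arbitrates: if a consumer notification
        # already removed the request, this pop loses the race and the
        # sweep skips the free, avoiding the double-free seen on v0.10.0.
        if try_remove_request(req_id, "timeout"):
            logger.warning("Releasing expired KV blocks for request %s",
                           req_id)
```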
@markmc good catch, I'll ask the reproducer to do a bisect and test on v0.10.1 to see whether the expiry warnings pop up; we have an image we can use for this.
I think we can close this in favor of #25067
This pull request has merge conflicts that must be resolved before it can be merged.
Hotfix for a race condition (double-free) reported in the llm-d project here: llm-d/llm-d#187
Test plan: Generating a standalone image build and testing under load; this issue only comes up in high-load P/D scenarios.