
Conversation

@wseaton (Contributor) commented on Sep 4, 2025

Hotfix for a race condition (double-free) reported in the llm-d project: llm-d/llm-d#187

Test plan: generate a standalone image build and test under load; this issue only comes up in high-load P/D scenarios.
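
For readers without the diff open, the shape of the bug is a non-atomic check-then-delete on shared state. A minimal sketch of the difference, assuming a plain dict keyed by request ID (the names below are illustrative, not the connector's actual code):

```python
import time

# Illustrative stand-in for the connector's _reqs_to_send mapping
# (req_id -> expiry deadline); names are simplified for the example.
reqs_to_send = {"req-1": time.monotonic() + 5.0}
req_id = "req-1"

# Racy pattern: another code path (e.g. the timeout sweep) can delete the key
# between the membership check and the del, so the second deleter either
# raises KeyError or frees the same resources twice.
if req_id in reqs_to_send:
    del reqs_to_send[req_id]

# Atomic pattern: pop() combines the check and the removal in one step and
# tells the caller whether this path actually owned the removal.
if reqs_to_send.pop(req_id, None) is None:
    print("req-1 already removed by another path; skip freeing")
```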

Signed-off-by: Will Eaton <weaton@redhat.com>
@wseaton changed the title from "make deletion atomic in nixl timeout handling" to "Draft: make deletion atomic in nixl timeout handling" on Sep 4, 2025
@gemini-code-assist bot left a comment

Code Review

This pull request provides a hotfix for a race condition in NIXL timeout handling by making a dictionary deletion atomic. The change is correct, but as noted in the detailed comment, it appears to be incomplete, as a similar race condition persists in another part of the code that modifies the same shared state.

Signed-off-by: Will Eaton <weaton@redhat.com>
@wseaton (Contributor, Author) commented on Sep 4, 2025

/gemini review

@gemini-code-assist bot left a comment

Code Review

This pull request aims to fix a race condition leading to a double-free by making the request deletion from _reqs_to_send atomic. The introduction of try_remove_request is a good approach for this. However, the changes in _get_new_notifs introduce a memory leak by processing notifications for untracked requests. I have added a review comment with a suggested fix to prevent this leak while maintaining the atomicity of the deletion.
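
The try_remove_request helper referenced here only appears in the diff, so the following is a rough sketch of what such a helper can look like; the class name, attribute types, and log message are assumptions for illustration, not the PR's exact implementation:

```python
import logging

logger = logging.getLogger(__name__)


class NixlRequestTrackerSketch:
    """Illustrative owner of the shared _reqs_to_send state."""

    def __init__(self) -> None:
        # req_id -> expiry deadline (monotonic seconds)
        self._reqs_to_send: dict[str, float] = {}

    def try_remove_request(self, req_id: str, reason: str) -> bool:
        """Atomically remove req_id; return True only if we removed it."""
        # dict.pop() is a single operation, so there is no window between
        # "is it still tracked?" and "delete it" for the timeout path and
        # the consumer-notification path to race on.
        if self._reqs_to_send.pop(req_id, None) is None:
            logger.debug("Request %s already removed (%s)", req_id, reason)
            return False
        return True
```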

Comment on lines 1066 to 1076

```diff
 self.consumer_notification_counts_by_req[req_id] += 1
 # Wait all consumers (D) to be done reading before freeing.
 if self.consumer_notification_counts_by_req[req_id] == int(
         tp_ratio):
-    notified_req_ids.add(req_id)
     del self.consumer_notification_counts_by_req[req_id]
-    del self._reqs_to_send[req_id]
+    if self.try_remove_request(req_id, "consumer_complete"):
+        notified_req_ids.add(req_id)
+    else:
+        logger.debug(
+            "Request %s completed by all consumers but was"
+            "already removed (likely timed out)", req_id)
```

Severity: high

This change introduces a potential memory leak. By removing the check for req_id in self._reqs_to_send, a notification for a request that has already been completed or timed out will now increment self.consumer_notification_counts_by_req.

If tp_ratio > 1 and this is a late notification for an already-handled request, the condition self.consumer_notification_counts_by_req[req_id] == int(tp_ratio) may not be met. In this case, the entry for req_id will remain in self.consumer_notification_counts_by_req, causing a memory leak.

A check for req_id in self._reqs_to_send should be restored at the beginning of the loop to prevent this. This will also make the else branch of try_remove_request an exceptional case that should be logged as an error.
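
To make the leak concrete, here is a small worked example of the scenario described above, assuming tp_ratio == 2 and a single late notification for a request the timeout path has already removed (simplified, module-level stand-ins for the instance attributes):

```python
from collections import defaultdict

tp_ratio = 2
consumer_notification_counts_by_req = defaultdict(int)
reqs_to_send = {}  # "req-1" was already removed by the timeout path

# One late notification arrives for the already-removed request. Without the
# "req_id in reqs_to_send" guard, a counter entry is created anyway.
req_id = "req-1"
consumer_notification_counts_by_req[req_id] += 1
if consumer_notification_counts_by_req[req_id] == tp_ratio:
    del consumer_notification_counts_by_req[req_id]  # never reached: 1 != 2

print(dict(consumer_notification_counts_by_req))  # {'req-1': 1} -> stale entry
```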

Suggested change (replacing the hunk quoted above):

```python
if req_id not in self._reqs_to_send:
    logger.debug("Ignoring notification for untracked request %s",
                 req_id)
    continue
self.consumer_notification_counts_by_req[req_id] += 1
if self.consumer_notification_counts_by_req[req_id] == int(
        tp_ratio):
    del self.consumer_notification_counts_by_req[req_id]
    if self.try_remove_request(req_id, "consumer_complete"):
        notified_req_ids.add(req_id)
    else:
        logger.error("BUG: Failed to remove request %s", req_id)
```

Signed-off-by: Will Eaton <weaton@redhat.com>
@wseaton changed the title from "Draft: make deletion atomic in nixl timeout handling" to "[Nixl] make deletion atomic in nixl request timeout handling" on Sep 5, 2025
@markmc (Member) commented on Sep 9, 2025

Note the reporter was running v0.10.0

Here's what I see: each prefill server failed with the same pattern; both prefill TP workers received 2 notifications for the same request, apparently after the request had already been deleted from reqs_to_send. The number of notifications checks out: decode is TP=4, so the TP ratio is 2.

It certainly seems like the most straightforward explanation is that the request had already been timed out before we received these notifications. And #21753 was merged in v0.10.1, with what looks like a guard against this situation. With v0.10.1, I expect we'd see "Releasing expired KV blocks" warnings and "may have expired" errors instead of the tracebacks?
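
A quick check of the numbers, assuming the ratio is decode tensor parallelism divided by prefill tensor parallelism and that prefill ran at TP=2 (implied by the thread, not stated outright):

```python
decode_tp = 4
prefill_tp = 2                       # assumption inferred from "the TP ratio is 2"
tp_ratio = decode_tp // prefill_tp   # -> 2

# Each prefill TP worker waits for tp_ratio consumer notifications per request
# before freeing its KV blocks, which matches both prefill workers seeing
# 2 notifications for the same request.
assert tp_ratio == 2
```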

@wseaton (Contributor, Author) commented on Sep 9, 2025

@markmc good catch. I'll ask the reproducer to do a bisect and test on v0.10.1 and see if the expiry warnings pop up; we have an image we can use for this.

@markmc (Member) commented on Sep 17, 2025

I think we can close this in favor of #25067

@mergify bot added the kv-connector label on Sep 19, 2025
@mergify bot commented on Sep 19, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wseaton.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label on Sep 19, 2025
@wseaton closed this on Oct 7, 2025