[Nixl] make deletion atomic in nixl request timeout handling #24268
Conversation
Signed-off-by: Will Eaton <weaton@redhat.com>
Code Review
This pull request provides a hotfix for a race condition in NIXL timeout handling by making a dictionary deletion atomic. The change is correct, but as noted in the detailed comment, it appears to be incomplete, as a similar race condition persists in another part of the code that modifies the same shared state.
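For context, the bug class being fixed here is a non-atomic check-then-act on shared state. A minimal standalone illustration of the two patterns (simplified stand-in names, not the actual vLLM code):

```python
reqs_to_send: dict[str, float] = {"req-1": 123.4}
req_id = "req-1"

# Racy pattern: the membership test and the deletion are separate steps, so
# two paths (the consumer-notification handler and the timeout sweep) can
# both pass the test and both attempt the free, i.e. the reported double-free.
if req_id in reqs_to_send:
    del reqs_to_send[req_id]

# Atomic pattern: a single dict.pop() picks exactly one winner. On CPython a
# built-in dict.pop() is indivisible, so only one caller gets the entry back;
# the loser gets the None default and skips freeing.
entry = reqs_to_send.pop(req_id, None)
if entry is not None:
    pass  # this caller won the race; free the KV blocks exactly once
```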
Signed-off-by: Will Eaton <weaton@redhat.com>
Force-pushed from 4922107 to 857953c.
/gemini review
Code Review
This pull request aims to fix a race condition leading to a double-free by making the request deletion from _reqs_to_send atomic. The introduction of try_remove_request is a good approach for this. However, the changes in _get_new_notifs introduce a memory leak by processing notifications for untracked requests. I have added a review comment with a suggested fix to prevent this leak while maintaining the atomicity of the deletion.
```diff
         self.consumer_notification_counts_by_req[req_id] += 1
         # Wait all consumers (D) to be done reading before freeing.
         if self.consumer_notification_counts_by_req[req_id] == int(
                 tp_ratio):
-            notified_req_ids.add(req_id)
-            del self.consumer_notification_counts_by_req[req_id]
-            del self._reqs_to_send[req_id]
+            if self.try_remove_request(req_id, "consumer_complete"):
+                notified_req_ids.add(req_id)
+            else:
+                logger.debug(
+                    "Request %s completed by all consumers but was"
+                    " already removed (likely timed out)", req_id)
```
This change introduces a potential memory leak. By removing the check for req_id in self._reqs_to_send, a notification for a request that has already been completed or timed out will now increment self.consumer_notification_counts_by_req.
If tp_ratio > 1 and this is a late notification for an already-handled request, the condition self.consumer_notification_counts_by_req[req_id] == int(tp_ratio) may not be met. In this case, the entry for req_id will remain in self.consumer_notification_counts_by_req, causing a memory leak.
A check for req_id in self._reqs_to_send should be restored at the beginning of the loop to prevent this. This will also make the else branch of try_remove_request an exceptional case that should be logged as an error.
Suggested change:

```diff
-self.consumer_notification_counts_by_req[req_id] += 1
-# Wait all consumers (D) to be done reading before freeing.
-if self.consumer_notification_counts_by_req[req_id] == int(
-        tp_ratio):
-    if self.try_remove_request(req_id, "consumer_complete"):
-        notified_req_ids.add(req_id)
-    else:
-        logger.debug(
-            "Request %s completed by all consumers but was"
-            " already removed (likely timed out)", req_id)
+if req_id not in self._reqs_to_send:
+    logger.debug("Ignoring notification for untracked request %s",
+                 req_id)
+    continue
+self.consumer_notification_counts_by_req[req_id] += 1
+if self.consumer_notification_counts_by_req[req_id] == int(
+        tp_ratio):
+    del self.consumer_notification_counts_by_req[req_id]
+    if self.try_remove_request(req_id, "consumer_complete"):
+        notified_req_ids.add(req_id)
+    else:
+        logger.error("BUG: Failed to remove request %s", req_id)
```
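The leak the comment describes is easy to demonstrate in isolation. A self-contained toy run (simplified stand-ins for the connector's state):

```python
from collections import defaultdict

consumer_notification_counts_by_req: defaultdict[str, int] = defaultdict(int)
reqs_to_send: dict[str, float] = {}  # "req-1" already timed out and was removed
tp_ratio = 2

# A single late notification arrives for the already-removed request.
# Without the membership guard, the counter entry is created anyway.
req_id = "req-1"
consumer_notification_counts_by_req[req_id] += 1
if consumer_notification_counts_by_req[req_id] == tp_ratio:
    del consumer_notification_counts_by_req[req_id]  # never reached: 1 != 2

# The stale counter entry is never cleaned up: this is the leak.
assert consumer_notification_counts_by_req == {"req-1": 1}
```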
Signed-off-by: Will Eaton <weaton@redhat.com>
Note the reporter was running v0.10.0. Here's what I see: each prefill server failed with the same pattern, where both prefill TP workers received 2 notifications for the same request, apparently after the request had already been deleted from reqs_to_send. The number of notifications checks out: decode is TP=4, so the TP ratio is 2. The most straightforward explanation is that the request had already timed out before we received these notifications. And #21753 was merged in v0.10.1 with what looks like a guard against this situation, so with v0.10.1 I expect we'd see the expiry warning instead.
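For readers following the race: the competing writer is the expiry sweep on the prefill side. A rough sketch of that path, assuming _reqs_to_send maps request id to an expiration timestamp and reusing the PR's atomic helper (not the verbatim vLLM code):

```python
import logging
import time

logger = logging.getLogger(__name__)

def sweep_expired(reqs_to_send: dict[str, float], try_remove_request) -> None:
    """Release KV blocks for requests whose decode consumers never finished."""
    now = time.monotonic()
    for req_id, expires in list(reqs_to_send.items()):
        if now < expires:
            continue
        # The same atomic helper arbitrates: if a consumer notification
        # already removed the request, this pop loses the race and the
        # sweep skips the free, avoiding the double-free seen on v0.10.0.
        if try_remove_request(req_id, "timeout"):
            logger.warning("Releasing expired KV blocks for request %s",
                           req_id)
```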
@markmc good catch, I'll ask the reproducer to do a bisect and test on v0.10.1 to see whether the expiry warnings pop up; we have an image we can use for this.
I think we can close this in favor of #25067
This pull request has merge conflicts that must be resolved before it can be merged.
Hotfix for a race condition (double-free) reported in the llm-d project here: llm-d/llm-d#187
Test plan: Generating a standalone image build and testing under load; this issue only comes up in high-load P/D scenarios.