
Conversation

@wseaton (Contributor) commented on Oct 14, 2025:

Purpose

In some situations an operator may not want KV load failure recovery to fall back to a local prefill on a decode node under any circumstances. This PR provides the plumbing to make KV load failures bubble up to the API server as a 500 that can be handled properly (either at the proxy layer in a P/D setup, or by clients).

We introduce a new FINISHED_ERROR RequestStatus that the API server process can check for in order to raise the correct semantic error.
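For orientation, a minimal sketch of the new status plumbing described above (the members and numbering around ERROR/FINISHED_ERROR are assumptions for illustration, not the PR's literal diff):

import enum

class FinishReason(enum.IntEnum):
    # existing reasons (sketched); the integer value doubles as an index
    # into FINISH_REASON_STRINGS when rendered for API responses
    STOP = 0
    LENGTH = 1
    ABORT = 2
    ERROR = 3  # new: KV load failed and the "abort" policy is in effect

class RequestStatus(enum.IntEnum):
    # ...existing WAITING/RUNNING/FINISHED_* states elided...
    FINISHED_ERROR = enum.auto()  # new terminal state checked by the API server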

Test Plan

Added unit tests; also manually spun up a 1P/1D H100 deployment using the NixlConnector and injected faults in UCX. The PR behaves as expected.

@gemini-code-assist (bot) left a comment:

Code Review

This pull request introduces a configurable policy for handling KV cache load failures, allowing operators to choose between recomputing failed blocks or aborting the request. The implementation involves adding a new FinishReason.ERROR and RequestStatus.FINISHED_ERROR, updating the scheduler to handle the new policy, and propagating the error up to the OpenAI API layer to return an appropriate error to the client.

The changes are well-structured. However, I've found one critical issue where an internal data structure (FINISH_REASON_STRINGS) was not updated to reflect the new error state, which will lead to an IndexError and a server crash when an error needs to be reported through the API. Please see the detailed comment.
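The fix the bot is asking for would look roughly like this (a sketch; the existing entries assume the stop/length/abort reasons that FinishReason already indexes into):

# Rendered finish reasons, indexed by FinishReason's integer value.
# Without the new "error" entry, str(FinishReason.ERROR) would raise IndexError.
FINISH_REASON_STRINGS = ("stop", "length", "abort", "error")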

@chatgpt-codex-connector (bot) left a comment:

💡 Codex Review

Here are some automated review suggestions for this pull request.


@wseaton force-pushed the configurable-prefill-recovery branch from 7b72907 to 755e628 on October 14, 2025 at 17:58.
@wseaton (Author) commented on Oct 14, 2025:

@njhill @NickLucche this is ready for review; also cc @sdavidbd since it interacts with the block-level recovery mechanism.


# abort and free the request
request.status = RequestStatus.FINISHED_ERROR
kv_transfer_params = self._free_request(request)
Member commented:
Any reason not to use finish_requests() here? At a glance, it would replace much of this logic?
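For reference, the suggested consolidation might look like this (a sketch assuming the scheduler's existing finish_requests() helper, which takes a request ID and a terminal status):

# replaces the hand-rolled status assignment + free logic above
self.finish_requests(request.request_id, RequestStatus.FINISHED_ERROR)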

# Mark requests with async KV load failures; they will be rescheduled
# once loading completes.
self.failed_recving_kv_req_ids |= async_affected_req_ids
total_requests_to_reschedule = len(async_affected_req_ids)
Member commented:
"requests to reschedule" no longer seems appropriate naming

@wseaton (Author) replied:

Also renamed to "affected", since I think it matches better.


# create EngineOutput for the aborted request
outputs[request.client_index].append(
    EngineCoreOutput(
Member commented:
AFAICS EngineCoreOutput instances are only ever created in update_from_output() at the moment - I think it would be nicer if we could maintain that... just return the aborted request IDs from this function?
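One way to honor that suggestion, sketched (the handler and helper names here are hypothetical, not the PR's final code):

def _handle_kv_load_failures(self) -> set[str]:
    """Abort requests whose KV load failed and return their IDs, so that
    update_from_output() remains the single place EngineCoreOutputs are built."""
    aborted_ids: set[str] = set()
    for request in self._kv_load_failed_requests():  # hypothetical helper
        request.status = RequestStatus.FINISHED_ERROR
        self._free_request(request)
        aborted_ids.add(request.request_id)
    return aborted_ids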

kv_load_retry_policy: Literal["recompute", "abort"] = "recompute"
"""Policy for handling KV cache load failures.
'recompute': reschedule the request to recompute failed blocks (default)
'abort': immediately abort the request with an error finish reason"""
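The two policies then dispatch roughly as follows (a sketch of the semantics described in the docstring above, not the scheduler's literal control flow):

if self.kv_load_retry_policy == "abort":
    # fail fast: finish with FINISHED_ERROR so the API server returns a 500
    self.finish_requests(request.request_id, RequestStatus.FINISHED_ERROR)
else:  # "recompute" (default)
    # reschedule so the failed blocks are recomputed locally on this node
    self.failed_recving_kv_req_ids.add(request.request_id)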
Member commented:
AFAIU #24520 mentions a similar need for a policy in the preemption case?

@kfirwolfson commented on Oct 16, 2025:

AFAIU #24520 mentions a similar need for policy in the preemption case

More or less. In #24520 we added a field that gives the calling entity (e.g. router) control over how much recompute is allowed.

@wseaton (Author) replied:

Is it correct to think of cache-hit-threshold as basically an intermediate option between these two extremes? The impetus for landing this is that nixl_connector now defaults to "recompute" in all cases; we need that to be tunable, and more importantly we need correct client semantics (e.g. not returning empty output).

Reply:

I guess so. As you suggested offline, the enum can be changed to have a third, KV-cache-threshold option. Like I mentioned in another comment, if loading succeeded for the first 95% of the tokens, you may prefer "recompute" rather than "abort" behavior.
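If that third option were added, the config might grow into something like this (purely illustrative; neither the "threshold" value nor the second field exists in this PR):

kv_load_retry_policy: Literal["recompute", "abort", "threshold"] = "recompute"
kv_load_recompute_threshold: float = 0.95
# hypothetical: with "threshold", recompute only if at least this fraction of
# tokens loaded successfully before the failure; otherwise abort with an error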

for output in outputs:
    if output.finish_reason == "error":
        logger.error(
            "KV cache load failure detected for request %s",
Member commented:
I wouldn't be so specific here about the cause of the error - can "KV cache load failure" be bubbled up from the lower level and used here?

@wseaton (Author) commented on Oct 16, 2025:

It can be; it would require making a new specific finish reason (or propagating some kind of error-context object in the EngineCoreOutput to switch on). Do you have a preference on approach? I tried the error-context approach previously; there are just not a lot of errors today that require custom error handling at the API server level. I still quite like the idea as a "future infra" investment.
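For the record, the error-context alternative sketched (a hypothetical dataclass; EngineCoreOutput gaining such a field is the idea under discussion, not merged code):

from dataclasses import dataclass

@dataclass
class ErrorContext:
    """Hypothetical structured cause carried alongside finish_reason so the
    API server can map specific failures to specific HTTP responses."""
    kind: str     # e.g. "kv_load_failure"
    message: str  # human-readable detail for logs / the error body

# EngineCoreOutput would gain an optional field to switch on at the API layer:
#     error_context: ErrorContext | None = None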

wseaton and others added 29 commits November 4, 2025 14:29
@wseaton force-pushed the configurable-prefill-recovery branch from 5777b84 to 7d0142c on November 4, 2025 at 19:29.

Labels

frontend, kv-connector, ready, v1