
Conversation

@yannicks1 (Owner) commented on Sep 24, 2025

This PR enables one last decode step when the context length equals the max model length.
It is a follow-up (the second part) to this PR, which (re)enabled token generation at the max model length for prefill.

Note that Hugging Face Transformers recently enabled the same behavior (see this PR).
This PR therefore restores consistent behavior between vLLM and HF.
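
For illustration, the boundary condition being relaxed can be sketched as follows. This is not the plugin's actual code path; `can_schedule_decode`, `context_len`, and `max_model_len` are hypothetical names. The point is only that a sequence whose context length already equals the max model length still gets one final decode step instead of being cut off.

```python
def can_schedule_decode(context_len: int, max_model_len: int) -> bool:
    """Hypothetical helper, for illustration only.

    Previously a sequence whose context length had reached max_model_len was
    no longer scheduled (strict '<'); with this change it is still granted
    one last decode step ('<=').
    """
    return context_len <= max_model_len


# With max_model_len = 8, a context of exactly 8 tokens still decodes once more.
assert can_schedule_decode(8, 8)
assert not can_schedule_decode(9, 8)
```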

tasks to do:

  • Assert that the HF Transformers warning "This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length" (see source code here) is not emitted during HF text generation (see the sketch below).
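
One possible shape for that check, sketched below and not part of this PR: attach a temporary handler to the `transformers` logger around the generation call and assert the reminder text never appears. `run_hf_generation` is a hypothetical placeholder for whatever drives HF text generation in the test suite; if the library emits the reminder through a warn-once path, the check is most reliable in a fresh process.

```python
import io
import logging

# Prefix of the Transformers reminder that must not appear.
FRIENDLY_REMINDER = (
    "This is a friendly reminder - the current text generation call will "
    "exceed the model's predefined maximum length"
)


def test_no_max_length_warning_during_hf_generation() -> None:
    # Attach a temporary handler directly to the "transformers" logger so the
    # check does not depend on the library's log-propagation settings.
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    hf_logger = logging.getLogger("transformers")
    hf_logger.addHandler(handler)
    hf_logger.setLevel(logging.WARNING)
    try:
        run_hf_generation()  # hypothetical helper wrapping the HF generate() call
    finally:
        hf_logger.removeHandler(handler)

    assert FRIENDLY_REMINDER not in stream.getvalue()
```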

Signed-off-by: Yannick Schnider <yannick.schnider1@ibm.com>
@yannicks1 deleted the branch enable-prefill-of-max-model-len on October 3, 2025 at 12:25
@yannicks1 closed this on Oct 3, 2025
@yannicks1 deleted the enable-decode-of-max-model-len branch on October 3, 2025 at 12:25
