We hit a vLLM crash because the prompt length was larger than max_model_len (despite having max_prompt_length set in the Trinity config).
I tried to trace the effect of max_prompt_length in the code, but it does not appear to trigger any dataset filtering or prompt truncation:
https://github.com/search?q=repo%3Amodelscope%2FTrinity-RFT+max_prompt_length&type=code
I propose that max_prompt_length should lead to dataset filtering (or prompt truncation) during Trinity data prep, and that truncate_prompt_tokens=max_prompt_length should also be set on the vLLM side (it is a vllm.SamplingParams field) as a last resort, to prevent vLLM from throwing an exception (vllm-project/vllm#16732).
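A minimal sketch of what I mean, assuming a HF tokenizer is available at data-prep time (filter_long_prompts and the config value are hypothetical, not existing Trinity-RFT code; truncate_prompt_tokens is the vLLM SamplingParams field):

```python
from transformers import AutoTokenizer
import vllm

# Hypothetical data-prep helper: drop samples whose prompt exceeds
# max_prompt_length before they reach the rollout engine
# (truncating the prompt instead would be the alternative).
def filter_long_prompts(samples, tokenizer, max_prompt_length):
    return [
        s for s in samples
        if len(tokenizer(s["prompt"])["input_ids"]) <= max_prompt_length
    ]

max_prompt_length = 4096  # placeholder for the value from the trinity config
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

samples = [{"prompt": "short prompt"}, {"prompt": "very long prompt ... " * 10000}]
samples = filter_long_prompts(samples, tokenizer, max_prompt_length)

# Last-resort safety net on the vLLM side: keep only the last
# max_prompt_length prompt tokens instead of raising the length error.
sampling_params = vllm.SamplingParams(
    max_tokens=1024,
    truncate_prompt_tokens=max_prompt_length,
)
```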
Here is another related piece of code that seems to have a bug:
```python
self.default_sampling_params = vllm.SamplingParams(
    n=1,
    temperature=0.0,
    max_tokens=config.max_response_tokens,
    min_tokens=1,
    skip_special_tokens=True,
    include_stop_str_in_output=False,
    output_kind=RequestOutputKind.FINAL_ONLY,
    logprobs=0,
)
```
The problem is that vLLM appears to treat max_tokens as a limit on len(prompt_tokens) + len(response_tokens), not on response_tokens alone. So it should not be surprising that the actual len(response_tokens) is always much smaller and never reaches max_response_tokens.
So this means that max_response_tokens can be set as high as max_model_len.
Also, len(prompt_tokens) + len(response_tokens) is limited by max_model_len (since Qwen3 does not seem to support sliding window in vLLM yet).
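Whichever interpretation is correct, a per-request clamp keeps prompt + response inside max_model_len. A rough sketch (model name, config values, and the llm handle are placeholders, not Trinity code; it assumes the prompt itself already fits in the context):

```python
import vllm

MAX_MODEL_LEN = 32768
MAX_RESPONSE_TOKENS = 8192  # placeholder config value

llm = vllm.LLM(model="Qwen/Qwen3-8B", max_model_len=MAX_MODEL_LEN)
tokenizer = llm.get_tokenizer()

prompt = "Explain GRPO in one paragraph."
prompt_len = len(tokenizer(prompt)["input_ids"])

# Give the request at most the remaining context, so that
# len(prompt_tokens) + len(response_tokens) <= max_model_len.
response_budget = min(MAX_RESPONSE_TOKENS, MAX_MODEL_LEN - prompt_len)

params = vllm.SamplingParams(
    n=1,
    temperature=0.0,
    max_tokens=response_budget,
    min_tokens=1,
)
outputs = llm.generate([prompt], params)
print(len(outputs[0].outputs[0].token_ids))  # number of generated tokens
```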
Here is our original exception from vLLM:

```
ERROR 08-17 23:32:15 scheduler.py:86] ValueError: The decoder prompt (length 42861) is longer than the maximum model length of 32768. Make sure that max_model_len is no smaller than the number of text tokens.
```