Description
See current V1 workarounds here: vllm-project/vllm#14242
Requests must have both a prompt length and a requested number of tokens less than or equal to the corresponding values of a single warmup shape. A request that matches no warmup shape in this way must be rejected.
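The matching rule above can be sketched as a small predicate. This is a minimal illustration, not the actual vllm-spyre implementation; the dataclass and field names (`WarmupShape`, `prompt_length`, `new_tokens`) are assumptions for the sketch:

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class WarmupShape:
    # Hypothetical field names; the real config may differ.
    prompt_length: int
    new_tokens: int


def matches_any_shape(
    prompt_len: int,
    requested_tokens: int,
    shapes: Sequence[WarmupShape],
) -> bool:
    """A request is servable only if a single warmup shape covers
    both its prompt length and its requested token count."""
    return any(
        prompt_len <= s.prompt_length and requested_tokens <= s.new_tokens
        for s in shapes
    )
```

Note that both bounds must hold against the *same* shape: a request whose prompt fits one shape and whose token count fits a different shape still matches nothing.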
In the V0 implementation, the scheduler checks this constraint and marks any request that matches no warmup shape as ignored. In V1, this currently does not work because the engine has no logic to handle requests that are rejected immediately.
We could implement this logic in the engine, or we could explore extending the platform API to let it validate requests as they are added, rather than at schedule time. That alternative approach may allow rejecting requests with a 400-type error instead of returning empty results.
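The add-time validation alternative could look roughly like the sketch below. Everything here is hypothetical: `RequestRejectedError`, `validate_request`, and the idea that the front end maps the exception to an HTTP 400 are assumptions about how such a platform hook might be wired, not the existing API:

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class WarmupShape:
    # Hypothetical field names; the real config may differ.
    prompt_length: int
    new_tokens: int


class RequestRejectedError(ValueError):
    """Raised when a request is added, so the serving layer can
    translate it into a 400-type client error instead of letting
    the request reach the scheduler and return empty results."""


def validate_request(
    prompt_len: int,
    requested_tokens: int,
    shapes: Sequence[WarmupShape],
) -> None:
    # Reject at add-time if no single warmup shape covers the request.
    if not any(
        prompt_len <= s.prompt_length and requested_tokens <= s.new_tokens
        for s in shapes
    ):
        raise RequestRejectedError(
            f"request (prompt_len={prompt_len}, "
            f"requested_tokens={requested_tokens}) fits no warmup shape"
        )
```

Failing fast here keeps invalid requests out of the scheduler entirely, which is what makes the 400-style response possible.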