[Hardware] add platform-specific request validation api #16291
Conversation
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
cc @NickLucche this is what we were discussing a couple of days ago...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Thanks @joerunde!
```python
# TODO(woosuk): Support encoder-decoder models.

from vllm.platforms import current_platform
current_platform.validate_request(
```
Wondering whether we should remove the call to `supports_structured_output` and have the default impl of `validate_request` call that instead. Actually, maybe we could remove the `supports_structured_output` interface method and have `validate_request` only call it if it exists in the same class?
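A rough sketch of that "only call it if it exists" default. Everything here (the base class shape, `supports_structured_output` as a classmethod, the `guided_decoding` attribute, and `FakeParams`) is a simplified assumption for illustration, not the actual vLLM interface:

```python
from typing import Any


class Platform:
    @classmethod
    def validate_request(cls, prompt: Any, params: Any) -> None:
        """Hypothetical default impl: accept the request unless the
        subclass still defines a legacy supports_structured_output hook
        and the request actually asks for structured output."""
        wants_structured = getattr(params, "guided_decoding", None) is not None
        legacy_check = getattr(cls, "supports_structured_output", None)
        if wants_structured and legacy_check is not None and not legacy_check():
            raise ValueError(
                "Structured output is not supported on this platform.")


class LegacyPlatform(Platform):
    # An out-of-tree platform that never overrode validate_request but
    # still carries the old per-feature hook.
    @classmethod
    def supports_structured_output(cls) -> bool:
        return False


class FakeParams:
    """Stand-in for SamplingParams; only models the one field checked."""
    def __init__(self, guided_decoding=None):
        self.guided_decoding = guided_decoding
```

With this shape, plain requests pass through, while a structured-output request on `LegacyPlatform` raises before reaching the scheduler.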
Sure. Maybe the simplest thing to do is to just add an impl for the TPU backend and have it reject structured output requests?
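A minimal sketch of that option, with the TPU platform overriding the new hook to reject structured output requests up front. The class and field names here are illustrative stand-ins, not vLLM's real definitions:

```python
from typing import Any, Optional


class Platform:
    @classmethod
    def validate_request(cls, prompt: Any, params: Any) -> None:
        """Default: accept every request."""


class SamplingParamsStub:
    """Stand-in for vLLM's SamplingParams; only the guided_decoding
    field matters for this sketch."""
    def __init__(self, guided_decoding: Optional[dict] = None):
        self.guided_decoding = guided_decoding


class TpuPlatform(Platform):
    @classmethod
    def validate_request(cls, prompt: Any, params: Any) -> None:
        # Fail fast, before the request ever reaches the scheduler.
        if getattr(params, "guided_decoding", None) is not None:
            raise ValueError(
                "Structured output is not supported on this platform.")
```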
@njhill I went with the 🔥🔥🔥 option, WDYT?
The only difference in behavior now should be that all out-of-tree platforms will need to explicitly reject structured output in `validate_request` instead of inheriting the default impl of `supports_structured_output`.
vllm/platforms/interface.py (outdated)

```python
    cls,
    prompt: PromptType,
    params: Union[SamplingParams, PoolingParams],
    lora_request: Optional[LoRARequest] = None,
```
Do we need to include lora_request here? Wouldn't a platform either support lora or not, and if not isn't this something that could be checked at startup time?
Ah, yeah, that's true. I was thinking there might be a case where something about the adapter needs validation, but I think you're right: anything about supporting LoRA would be checked either at boot time or at adapter load time.
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Thanks @joerunde looks great ... would be good for @NickLucche to take a look too!
Ah nice, everybody needs friends!
This PR has broken the benchmark_serving.py command; can we please roll back or fix it? Traceback (most recent call last):
@yarongmu-google could you share the benchmark_serving.py command that failed? I tried a simple command and it worked.
`python benchmarks/benchmark_serving.py` where MODEL is llama3 70B. Note that this is run on a clean machine created only for perf benchmarks. @yaochengji also saw the breakage. Chengji, what's your command?
Note that the breakage is on TPU |
I only saw this breakage in a CI test; my benchmarking command on the llama-8B model is good.
Hmmm... maybe it's fixed somehow later?? Let's give it a bit more time. Sorry for flooding this PR :)
I don't think so. It's the latest commit at the moment.
I have posted a fix here; it is specific to TPU V1: #16369
Shoot, sorry for the breakage! I had wrongly assumed that TPU tests would catch that case during the CI for this PR, before merging to main :(
This PR adds a generic `validate_request` api to the platform interface. This allows platforms to implement any runtime checks on each request to ensure that all the requested features are supported before scheduling it. There is already one existing check in this category, `supports_structured_output`, and I'd like to avoid a proliferation of more platform apis for individual features like this.

Currently, the spyre plugin needs to implement some extra validation around the shape of inputs, as we have more constraints on valid prompt lengths and max token requests. This new api would let us do that without needing to hack around rejecting requests from the scheduler.
FIX vllm-project/vllm-spyre#77
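To illustrate the kind of check the description mentions, here is a sketch of a platform rejecting prompts outside its supported shapes through the new hook. The 64-token limit, the class names, and the token-id-list signature are all invented for the example:

```python
from typing import Any, Sequence


class Platform:
    @classmethod
    def validate_request(cls, prompt_token_ids: Sequence[int],
                         params: Any) -> None:
        """Default: accept every request."""


class SpyreLikePlatform(Platform):
    # Invented constraint for illustration only.
    MAX_PROMPT_TOKENS = 64

    @classmethod
    def validate_request(cls, prompt_token_ids: Sequence[int],
                         params: Any) -> None:
        # Reject requests whose prompt shape the platform cannot serve,
        # instead of hacking rejection into the scheduler.
        if len(prompt_token_ids) > cls.MAX_PROMPT_TOKENS:
            raise ValueError(
                f"Prompt length {len(prompt_token_ids)} exceeds the "
                f"platform limit of {cls.MAX_PROMPT_TOKENS} tokens.")
```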