
Conversation

@joerunde (Collaborator) commented Apr 8, 2025

This PR adds a generic validate_request API to the platform interface. It allows platforms to implement runtime checks on each request, ensuring that all requested features are supported before the request is scheduled. There is already one existing check in this category, supports_structured_output, and I'd like to avoid a proliferation of more platform APIs for individual features like this.

Currently, the spyre plugin needs to implement some extra validation around the shape of inputs, since we have tighter constraints on valid prompt lengths and max-token requests. This new API would let us do that without needing to hack around rejecting requests from the scheduler.
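
For illustration, here is a rough sketch of what a plugin platform's validate_request override could look like for this kind of shape validation. The hook signature follows the one proposed in this PR; MyAcceleratorPlatform and the warmup limit are hypothetical and not taken from the spyre plugin.

```python
# Hypothetical sketch of a plugin platform using the proposed hook to reject
# requests that don't fit its warmup shapes. MyAcceleratorPlatform and
# MAX_WARMUP_NEW_TOKENS are made up for illustration.
from typing import Union

from vllm.platforms.interface import Platform
from vllm.pooling_params import PoolingParams
from vllm.sampling_params import SamplingParams

MAX_WARMUP_NEW_TOKENS = 1024  # hypothetical largest warmed-up decode length


class MyAcceleratorPlatform(Platform):

    @classmethod
    def validate_request(
        cls,
        prompt,
        params: Union[SamplingParams, PoolingParams],
    ) -> None:
        """Raise ValueError if this request cannot be served on this platform."""
        if (isinstance(params, SamplingParams)
                and params.max_tokens is not None
                and params.max_tokens > MAX_WARMUP_NEW_TOKENS):
            raise ValueError(
                f"max_tokens={params.max_tokens} exceeds the largest "
                f"warmed-up decode length ({MAX_WARMUP_NEW_TOKENS})")
```

The engine's request processing can then call this hook for each incoming request and surface the error to the client, instead of scheduling a request that would fail on the device.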

FIX vllm-project/vllm-spyre#77

Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
github-actions bot commented Apr 8, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Apr 8, 2025
@njhill (Member) commented Apr 8, 2025

cc @NickLucche this is what we were discussing a couple of days ago...

Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>

@njhill (Member) left a comment

Thanks @joerunde!

# TODO(woosuk): Support encoder-decoder models.

from vllm.platforms import current_platform
current_platform.validate_request(

@njhill (Member) commented:

Wondering whether we should remove the call to supports_structured_output and have the default impl of validate_request call that instead. Actually maybe we could remove the supports_structured_output interface method and have validate_request only call it if it exists in the same class?
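
To make the idea concrete, here is a sketch (illustrative only, not the code that was merged, and assuming the guided_decoding field on SamplingParams marks structured-output requests) of a default validate_request that keeps the structured-output check but only consults supports_structured_output when the concrete platform class still defines it:

```python
# Illustrative default implementation: delegate to supports_structured_output()
# only if the platform subclass defines that method, and reject structured
# output requests when it returns False.
from typing import Union

from vllm.pooling_params import PoolingParams
from vllm.sampling_params import SamplingParams


class Platform:

    @classmethod
    def validate_request(
        cls,
        prompt,
        params: Union[SamplingParams, PoolingParams],
    ) -> None:
        supports_so = getattr(cls, "supports_structured_output", None)
        if (supports_so is not None
                and isinstance(params, SamplingParams)
                and params.guided_decoding is not None
                and not supports_so()):
            raise ValueError(
                "Structured output is not supported on this platform")
```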

@joerunde (Collaborator Author) replied:

Sure. Maybe the simplest thing to do is to just add an impl for the TPU backend and have it reject structured output requests?

@joerunde (Collaborator Author) replied:

@njhill I went with the 🔥🔥🔥 option; WDYT?
The only difference in behavior now should be that all out-of-tree platforms will need to explicitly reject structured output in validate_request instead of inheriting the default impl of supports_structured_output.
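
Under that approach, an out-of-tree platform that does not support structured output would need something like the following (a hypothetical sketch, again assuming params.guided_decoding marks structured-output requests):

```python
# Hypothetical out-of-tree platform that now rejects structured output itself
# in validate_request instead of inheriting a default supports_structured_output.
from vllm.platforms.interface import Platform
from vllm.sampling_params import SamplingParams


class MyOutOfTreePlatform(Platform):

    @classmethod
    def validate_request(cls, prompt, params) -> None:
        if (isinstance(params, SamplingParams)
                and params.guided_decoding is not None):
            raise ValueError(
                "Structured output is not supported on this platform")
```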

cls,
prompt: PromptType,
params: Union[SamplingParams, PoolingParams],
lora_request: Optional[LoRARequest] = None,

@njhill (Member) commented:

Do we need to include lora_request here? Wouldn't a platform either support LoRA or not, and if not, isn't this something that could be checked at startup time?

@joerunde (Collaborator Author) replied:

Ah, yeah, that's true. I was thinking there might be a case where something about the adapter needs validation, but I think you're right that anything about supporting LoRA would be checked either at boot time or at adapter load time.

joerunde added 2 commits April 8, 2025 16:40
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
@mergify mergify bot added the tpu Related to Google TPUs label Apr 8, 2025

@njhill (Member) left a comment

Thanks @joerunde, looks great ... would be good for @NickLucche to take a look too!

@NickLucche (Collaborator) left a comment

Fine work @joerunde, thanks!
I already have another check for TPU here (#16172), so the structured output exception will feel less lonely.

@joerunde (Collaborator Author) commented Apr 9, 2025

> the structured output exception will feel less lonely

Ah nice, everybody needs friends!

@joerunde joerunde added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 9, 2025
@njhill njhill merged commit cb391d8 into vllm-project:main Apr 9, 2025
57 checks passed
@joerunde joerunde deleted the platform-request-validation branch April 9, 2025 19:53

@yarongmu-google (Contributor) commented:

This PR has broken the benchmark_serving.py command; can we please roll back or fix?

Traceback (most recent call last):
File "/workspace/vllm/benchmarks/benchmark_serving.py", line 1083, in
main(args)
File "/workspace/vllm/benchmarks/benchmark_serving.py", line 684, in main
benchmark_result = asyncio.run(
File "/usr/local/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/workspace/vllm/benchmarks/benchmark_serving.py", line 297, in benchmark
raise ValueError(
ValueError: Initial test run failed - Please make sure benchmark arguments are correctly specified. Error: Never received a valid chunk to calculate TTFT.This response will be marked as failed!

Repro:
(this) cb391d8 -> failed
(one before) fee5b8d -> good

@mgoin (Member) commented Apr 9, 2025

@yarongmu-google could you share the benchmark_serving.py command that failed? I tried a simple command and it worked

vllm serve meta-llama/Llama-3.1-8B-Instruct --port 9000

python benchmarks/benchmark_serving.py --backend vllm --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 1 --dataset-name random --random-input 1024 --random-output 512 --port 9000
============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  6.83      
Total input tokens:                      1024      
Total generated tokens:                  512       
Request throughput (req/s):              0.15      
Output token throughput (tok/s):         74.92     
Total Token throughput (tok/s):          224.77    
---------------Time to First Token----------------
Mean TTFT (ms):                          21.53     
Median TTFT (ms):                        21.53     
P99 TTFT (ms):                           21.53     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.33     
Median TPOT (ms):                        13.33     
P99 TPOT (ms):                           13.33     
---------------Inter-token Latency----------------
Mean ITL (ms):                           13.33     
Median ITL (ms):                         10.36     
P99 ITL (ms):                            29.87     
==================================================

@yarongmu-google (Contributor) replied:

@mgoin

python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model $MODEL \
    --dataset-name sonnet \
    --dataset-path benchmarks/sonnet_4x.txt \
    --sonnet-input-len 1800 \
    --sonnet-output-len 128 \
    --ignore-eos

where MODEL is Llama 3 70B. Note that this is run on a clean machine created only for perf benchmarks.

@yaochengji also saw the breakage. Chengji what's your command?

@yarongmu-google (Contributor) commented:

Note that the breakage is on TPU

@yaochengji (Collaborator) replied:

> @yaochengji also saw the breakage. Chengji what's your command?

I only saw this breakage in the CI test; my benchmarking command on the Llama-8B model works fine.

@yarongmu-google (Contributor) replied:

Hmmm .. maybe it's fixed somehow later?? Let's give it a bit more time. Sorry for flooding this PR :)

@yaochengji (Collaborator) replied:

> maybe it's fixed somehow later

I don't think so. It's the latest commit at the moment.

@mgoin (Member) commented Apr 10, 2025

I have posted a fix here; it is specific to TPU V1: #16369

@joerunde (Collaborator Author) commented:

Shoot, sorry for the breakage!

I had wrongly assumed that TPU tests would catch that case during the CI for this PR, before merging to main :(

yangw-dev pushed a commit to yangw-dev/vllm that referenced this pull request Apr 21, 2025
…#16291)

Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Signed-off-by: Yang Wang <elainewy@meta.com>
jikunshang pushed a commit to jikunshang/vllm that referenced this pull request Apr 29, 2025
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
…#16291)

Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>

Labels

ready (ONLY add when PR is ready to merge/full CI is needed), tpu (Related to Google TPUs), v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add platform api for request validation to reject requests that don't fit warmup shapes

6 participants