Closed
Labels: usage (How to use vllm)
Description
Your current environment
env: 16 × H800 GPUs
model: DeepSeek-R1
vLLM version: 0.7.2
start script: python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 80 --max-model-len 128000 --trust-remote-code --pipeline-parallel-size 2 --tensor-parallel-size 8 --gpu-memory-utilization 0.8 --served-model-name deepseek --model /mnt/workspace/models/public-models/llm/DeepSeek-R1
Log output:
WARNING 02-16 15:37:32 scheduler.py:947] Input prompt (2501 tokens) is too long and exceeds limit of 2048
INFO 02-16 15:37:32 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 9.3 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
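The warning reports a limit of 2048 even though --max-model-len is 128000, so the check appears to use a limit other than the model length. The sketch below is a simplified, hypothetical illustration of such a check, not vLLM's actual code. It assumes the prompt limit is min(max_model_len, max_num_batched_tokens) when chunked prefill is disabled and max_model_len when it is enabled. It also assumes a per-batch token budget of 2048 in this deployment. The function names are made up for illustration.

```python
from typing import Optional


# HYPOTHETICAL helper: assumes the scheduler caps prompts at the per-batch
# token budget unless chunked prefill is enabled (an assumption, not vLLM code).
def prompt_limit(max_model_len: int, max_num_batched_tokens: int,
                 chunked_prefill: bool) -> int:
    if chunked_prefill:
        return max_model_len
    return min(max_model_len, max_num_batched_tokens)


# HYPOTHETICAL helper: returns a warning message if the prompt is too long,
# mirroring the wording of the logged warning, else None.
def check_prompt(num_prompt_tokens: int, limit: int) -> Optional[str]:
    if num_prompt_tokens > limit:
        return (f"Input prompt ({num_prompt_tokens} tokens) is too long "
                f"and exceeds limit of {limit}")
    return None


# Values from the issue; the 2048 batch budget is an assumption.
limit = prompt_limit(max_model_len=128000, max_num_batched_tokens=2048,
                     chunked_prefill=False)
print(limit)
print(check_prompt(2501, limit))
```

Under these assumptions, the per-batch token budget (vLLM's --max-num-batched-tokens engine argument), not --max-model-len, would determine the 2048 limit.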
How would you like to use vllm
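Which parameter of the startup command can solve this problem? The prompt is only 2501 tokens, well below --max-model-len 128000, yet it is rejected for exceeding a limit of 2048.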