Your current environment
The output of python collect_env.py
I don't currently have access to the machine, but this is reproducible across all vLLM versions from 0.7 to the current 0.11. The most recent test was on 0.11 on an 8xH100 node.
🐛 Describe the bug
While serving microsoft/Phi-3.5-mini-instruct (or any other model using LongRoPE), concurrent requests produce random token outputs once generation passes the LongRoPE transition point from the short factor to the long factor, in this case 4K tokens. When a prompt starts below the ~4K boundary and generation continues past it, the model switches from the short to the long scaling factors mid-sequence. The existing KV cache was encoded with the short-factor RoPE parameters, so it becomes invalid under the long-factor scales, and the output degenerates into random tokens beyond that point. Synchronous requests aren't affected because they use consistent scaling without shared cache state.
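To make the failure mode concrete, here is a minimal, self-contained sketch of LongRoPE-style factor selection (illustrative only, not vLLM's code; the head dim, base, and factor values are made up). It shows that rotary angles for the same positions differ once the factor set flips, so keys cached under the short factors no longer line up with queries computed under the long factors:

```python
# Minimal sketch of LongRoPE-style factor selection; names, dimensions, and
# factor values are illustrative, not vLLM's actual implementation.
import numpy as np

ORIGINAL_MAX_POS = 4096   # short/long transition point for Phi-3.5-mini
HEAD_DIM = 96
BASE = 10000.0

def inv_freq(rescale_factors: np.ndarray) -> np.ndarray:
    # Per-dimension inverse frequencies, rescaled by the LongRoPE factors.
    exponents = np.arange(0, HEAD_DIM, 2) / HEAD_DIM
    return 1.0 / (rescale_factors * BASE ** exponents)

def rotary_angles(positions: np.ndarray, short_factor, long_factor) -> np.ndarray:
    # The factor set is chosen from the *current* sequence length, so a request
    # that starts below ORIGINAL_MAX_POS and generates past it flips from
    # short_factor to long_factor mid-sequence.
    seq_len = positions.max() + 1
    factors = np.asarray(short_factor if seq_len <= ORIGINAL_MAX_POS else long_factor)
    return np.outer(positions, inv_freq(factors))

# Keys cached while seq_len <= 4096 were rotated with short-factor angles.
short_angles = rotary_angles(np.arange(4000), short_factor=[1.0] * 48, long_factor=[2.0] * 48)
# Once the sequence crosses 4096, new queries use long-factor angles, so the
# angles for the *same* cached positions no longer match.
long_angles = rotary_angles(np.arange(4200), short_factor=[1.0] * 48, long_factor=[2.0] * 48)
print(np.allclose(short_angles, long_angles[:4000]))  # False: cached keys are stale
```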
The only current workaround is to modify the model’s config.json to either set the long factor scales equal to the short factor or set the short factor equal to the long factor. This removes the changing scales at the transition point, ensuring consistency throughout the generation.
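For reference, the fields involved can be checked directly from the downloaded config; a small sketch assuming a Phi-3.5-style config.json under ./phi3.5/ (the path used in the workaround commands below):

```python
# Quick check of the LongRoPE-related fields; field names follow Phi-3.5-style
# configs and the ./phi3.5/ path matches the workaround commands below.
import json

with open("./phi3.5/config.json") as f:
    cfg = json.load(f)

rope = cfg["rope_scaling"]
print("rope scaling type:   ", rope.get("type", rope.get("rope_type")))
print("transition point:    ", cfg.get("original_max_position_embeddings"))
print("max positions:       ", cfg.get("max_position_embeddings"))
# After applying the workaround this should print True.
print("short == long factor:", rope["short_factor"] == rope["long_factor"])
```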
Related issues:
Reproduction:
Reproducible with any Phi-3.5 variant using the following workload designed to cross the 4K boundary:
```bash
vllm serve microsoft/Phi-3.5-mini-instruct --port 8080
```

```bash
guidellm benchmark \
--target "http://localhost:8080" \
--rate-type concurrent \
--rate 16 \
--data "prompt_tokens=3900,output_tokens=500" \
--max-requests 100
```

In the request outputs stored in benchmark.json, every request starts producing random tokens after roughly the 4000th token.
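If guidellm isn't available, a rough equivalent can be scripted against the OpenAI-compatible endpoint. The sketch below assumes `pip install openai`, a server on port 8080, and an approximate ~3900-token filler prompt; none of these specifics come from the original report:

```python
# Rough reproduction sketch: fire concurrent completions whose prompt sits just
# below the 4K transition and whose generation crosses it, then inspect the tails.
# Assumes `pip install openai` and a vLLM server on localhost:8080.
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")
MODEL = "microsoft/Phi-3.5-mini-instruct"
# Roughly 3900 tokens of filler so generation crosses the 4096 boundary.
PROMPT = "The quick brown fox jumps over the lazy dog. " * 390

def one_request(i: int) -> str:
    resp = client.completions.create(
        model=MODEL,
        prompt=PROMPT,
        max_tokens=500,
        temperature=0.0,
    )
    return resp.choices[0].text

with ThreadPoolExecutor(max_workers=16) as pool:
    outputs = list(pool.map(one_request, range(16)))

# With the bug present, the later portion of each output degenerates into
# unrelated tokens once the sequence crosses the short/long factor boundary.
for i, text in enumerate(outputs):
    print(f"--- request {i}: last 200 chars ---")
    print(text[-200:])
```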
Current Solution/Workaround:
Download the model and set the long_factor and short_factor scales equal to each other, then rerun:
```bash
huggingface-cli download microsoft/Phi-3.5-mini-instruct --local-dir ./phi3.5/
jq '.rope_scaling.long_factor = .rope_scaling.short_factor' ./phi3.5/config.json > config.tmp && mv config.tmp ./phi3.5/config.json
vllm serve ./phi3.5 --port 8080
```
```bash
guidellm benchmark \
--target "http://localhost:8080" \
--rate-type concurrent \
--rate 16 \
--data "prompt_tokens=3900,output_tokens=500" \
--max-requests 100
```

The request outputs will now be generated without randomness after the 4K token threshold.
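For environments without jq, the same config patch can be applied with a short Python sketch (assumes the ./phi3.5/ download path from above):

```python
# Python equivalent of the jq one-liner above: copy short_factor over
# long_factor so the RoPE scales no longer change at the 4K transition.
import json
from pathlib import Path

config_path = Path("./phi3.5/config.json")
cfg = json.loads(config_path.read_text())
cfg["rope_scaling"]["long_factor"] = cfg["rope_scaling"]["short_factor"]
config_path.write_text(json.dumps(cfg, indent=2))
print("long_factor now equals short_factor:",
      cfg["rope_scaling"]["long_factor"] == cfg["rope_scaling"]["short_factor"])
```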