### Your current environment

<details>
<summary>The output of <code>python collect_env.py</code></summary>

Your output of `python collect_env.py` here

</details>
### 🐛 Describe the bug
When using an EAGLE head with a compressed-tensors quantized model, the acceptance rate silently falls to zero and performance degrades completely.
**Root cause:** The EAGLE layers are registered as additional layers of the target model, but in a compressed-tensors checkpoint they are not part of the ignore list. vLLM therefore thinks the EAGLE layers are quantized, even though they are not. Surprisingly, this does not raise an error; instead the acceptance rate drops essentially to zero because the EAGLE head output is now rubbish.
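To illustrate, here is a minimal sketch of that decision (simplified to prefix matching; the real compressed-tensors matching supports more pattern styles, and `is_layer_quantized` is a made-up helper, not vLLM code):

```python
def is_layer_quantized(layer_name: str, ignore_list: list[str]) -> bool:
    """A layer counts as quantized unless it matches the checkpoint's ignore list."""
    return not any(layer_name.startswith(prefix) for prefix in ignore_list)

# Typical compressed-tensors ignore list: only lm_head is excluded.
ignore = ["lm_head"]
print(is_layer_quantized("model.layers.10.self_attn.q_proj", ignore))  # True (correct)
print(is_layer_quantized("model.layers.32.fc", ignore))                # True (wrong: EAGLE layer)
```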
**The structural problem:** The bigger problem, in my opinion, is that the `EagleProposer` simply takes the `VllmConfig` from the target model. This is not robust whenever the draft model has a different configuration. In principle this is also the root cause of why we needed the hacky fix in #25667 to use a non-multimodal drafter with a multimodal target model.
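A more robust direction (purely a sketch; the function and field names here are hypothetical, not existing vLLM API) would be for the proposer to build a dedicated config for the drafter rather than reusing the target's:

```python
from copy import deepcopy

def make_draft_vllm_config(target_vllm_config, draft_model_config):
    """Hypothetical sketch: inherit runtime settings (parallelism, scheduler,
    etc.) from the target, but keep the drafter's own model config so an
    unquantized EAGLE head is not pushed through the target's quantization."""
    draft_vllm_config = deepcopy(target_vllm_config)
    draft_vllm_config.model_config = draft_model_config
    return draft_vllm_config
```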
**When did this happen?**

[Update]: The PR that introduced the bug is #24982.

**Steps to reproduce:**

I am serving Llama 3.1 8B with the same EAGLE head, switching between the BF16 model and the compressed-tensors model.
```bash
# MODEL_PATH=meta-llama/Llama-3.1-8B-Instruct
MODEL_PATH=RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8-dynamic

python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_PATH \
    --tensor-parallel-size 1 \
    --max-num-seqs 8 \
    --port 8088 \
    --served-model-name llama-3.1-EAGLE \
    --enforce-eager \
    --speculative_config '{"method": "eagle", "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 5}' \
    --no-enable-prefix-caching
```
I then send some lines of the Sonnet dataset and look at the logs (a minimal client example is sketched below).
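For reference, a minimal client call along these lines (the prompt is just an example Sonnet line; endpoint and model name match the server command above):

```python
# Minimal request against the OpenAI-compatible server started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8088/v1", api_key="EMPTY")
resp = client.completions.create(
    model="llama-3.1-EAGLE",
    prompt="Shall I compare thee to a summer's day?",
    max_tokens=128,
)
print(resp.choices[0].text)
```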
**Desired behavior (BF16)**

```text
SpecDecoding metrics:
Mean acceptance length: 1.99,
Accepted throughput: 0.17 tokens/s,
Drafted throughput: 0.88 tokens/s,
Accepted: 162 tokens,
Drafted: 820 tokens,
Per-position acceptance rate: 0.616, 0.244, 0.104, 0.018, 0.006,
Avg Draft acceptance rate: 19.8%
```
**Bug (using the compressed-tensors model)**

```text
SpecDecoding metrics:
Mean acceptance length: 1.00,
Accepted throughput: 0.00 tokens/s,
Drafted throughput: 13.23 tokens/s,
Accepted: 0 tokens,
Drafted: 1990 tokens,
Per-position acceptance rate: 0.000, 0.000, 0.000, 0.000, 0.000,
Avg Draft acceptance rate: 0.0%
```
**Hacky fix (using the compressed-tensors model)**

We can fix this in a hacky way (just to illustrate the root cause) by adding:

```python
# Extract the layer index from the weight prefix, e.g. "model.layers.32.fc".
layer_idx = int(layer_name.split("layers.")[-1].split(".")[0])
# Llama-3.1-8B has 32 decoder layers (0-31), so higher indices belong to the EAGLE head.
if layer_idx >= 32:
    logger.warning_once(
        f"Skipping quantization for {layer_name} because it seems to be part of the EAGLE head"
    )
    return None
```

Then we obtain good acceptance rates again:
```text
SpecDecoding metrics:
Mean acceptance length: 2.52,
Accepted throughput: 22.40 tokens/s,
Drafted throughput: 73.50 tokens/s,
Accepted: 224 tokens,
Drafted: 735 tokens,
Per-position acceptance rate: 0.714, 0.401, 0.224, 0.116, 0.068,
Avg Draft acceptance rate: 30.5%
```
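A slightly less brittle variant of the same hack (still only a sketch; it assumes the standard `hf_config.num_hidden_layers` field is reachable at this point) would derive the cutoff from the target model's config instead of hardcoding 32:

```python
def is_eagle_layer(layer_name: str, model_config) -> bool:
    """Sketch: layers indexed past the target's decoder stack (e.g. index 32
    for the 32-layer Llama-3.1-8B) belong to the EAGLE head."""
    num_target_layers = model_config.hf_config.num_hidden_layers
    layer_idx = int(layer_name.split("layers.")[-1].split(".")[0])
    return layer_idx >= num_target_layers
```

Ultimately, though, the clean fix is probably to give the drafter its own (quantization) config, as argued above.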
### Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.