
[Bug]: EAGLE incompatible w/ Compressed Tensors Quantized Target Model #26402

@jmkuebler

Description

Your current environment

The output of python collect_env.py
Your output of `python collect_env.py` here

🐛 Describe the bug

When using an EAGLE head with a compressed-tensors quantized target model, the acceptance rate silently drops to zero and performance degrades completely.

Root Cause: The EAGLE layers are registered as additional layers of the target model, but in a compressed-tensors checkpoint they are not part of the ignore list. vLLM therefore assumes the EAGLE layers are quantized, even though they are not. Surprisingly this does not raise an error; instead the EAGLE head produces garbage and the acceptance rate drops to essentially zero.
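For context, the layer-matching behaves roughly like the following sketch (simplified and illustrative, not the actual vLLM code; should_quantize_layer and ignore_patterns are made-up names):

import fnmatch

def should_quantize_layer(layer_name: str, ignore_patterns: list[str]) -> bool:
    # A layer is left unquantized only if it matches an entry in the
    # checkpoint's ignore list; everything else gets the quantized scheme.
    for pattern in ignore_patterns:
        if layer_name == pattern or fnmatch.fnmatch(layer_name, pattern):
            return False
    return True

# The EAGLE head is registered under the target model's namespace (e.g. as
# layer index 32), but the compressed-tensors checkpoint only lists its own
# modules in `ignore`, so the EAGLE layer gets a quantized scheme anyway.
print(should_quantize_layer("model.layers.32.self_attn.q_proj", ["lm_head"]))  # True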

The structural problem: The bigger problem IMO is that the EagleProposer simply reuses the VllmConfig of the target model. This is not robust whenever the draft model has a different configuration. In principle, that is also the root cause of why we needed the hacky fix in #25667 to use a non-multimodal drafter with a multimodal target model.
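To illustrate what a more robust design could look like, here is a minimal sketch of deriving a separate draft config instead of inheriting the target's config verbatim (the dataclass and field names are purely illustrative, not vLLM's actual VllmConfig):

from dataclasses import dataclass, replace

@dataclass
class DrafterRuntimeConfig:
    model: str
    quantization: str | None   # e.g. "compressed-tensors" or None
    is_multimodal: bool

def derive_draft_config(target: DrafterRuntimeConfig, draft_model: str) -> DrafterRuntimeConfig:
    # Start from the target config, but re-resolve every field that the
    # draft checkpoint defines for itself instead of inheriting it blindly.
    return replace(
        target,
        model=draft_model,
        quantization=None,     # EAGLE heads are typically plain BF16
        is_multimodal=False,   # covers the non-mm drafter for a mm target (#25667)
    )

target_cfg = DrafterRuntimeConfig(
    model="RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8-dynamic",
    quantization="compressed-tensors",
    is_multimodal=False,
)
draft_cfg = derive_draft_config(target_cfg, "yuhuili/EAGLE-LLaMA3.1-Instruct-8B")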

When did this happen?
[Update]: The PR that introduced the bug is #24982.

Steps to reproduce:

I am serving Llama 3.1 8B with the same EAGLE head and switching between the BF16 model and the compressed-tensors model.

# MODEL_PATH=meta-llama/Llama-3.1-8B-Instruct
MODEL_PATH=RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8-dynamic

python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_PATH \
    --tensor-parallel-size 1 \
    --max-num-seqs 8 \
    --port 8088 \
    --served-model-name llama-3.1-EAGLE \
    --enforce-eager \
    --speculative_config '{"method": "eagle", "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 5}' \
    --no-enable-prefix-caching

I then send some lines of the Sonnet dataset and look at the logs.
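Something like the following client is enough to drive the server (a sketch; it assumes a local sonnet.txt containing the dataset lines and the serving flags from above):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8088/v1", api_key="EMPTY")

with open("sonnet.txt") as f:
    prompts = [line.strip() for line in f if line.strip()][:20]

for prompt in prompts:
    resp = client.completions.create(
        model="llama-3.1-EAGLE",   # matches --served-model-name
        prompt=prompt,
        max_tokens=64,
    )
    print(resp.choices[0].text)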

Desired behavior (BF16)

SpecDecoding metrics: 
  Mean acceptance length: 1.99, 
  Accepted throughput: 0.17 tokens/s, 
  Drafted throughput: 0.88 tokens/s, 
  Accepted: 162 tokens, 
  Drafted: 820 tokens, 
  Per-position acceptance rate: 0.616, 0.244, 0.104, 0.018, 0.006, 
  Avg Draft acceptance rate: 19.8%

Bug (using compressed tensors model)

SpecDecoding metrics: 
  Mean acceptance length: 1.00, 
  Accepted throughput: 0.00 tokens/s, 
  Drafted throughput: 13.23 tokens/s, 
  Accepted: 0 tokens, 
  Drafted: 1990 tokens, 
  Per-position acceptance rate: 0.000, 0.000, 0.000, 0.000, 0.000, 
  Avg Draft acceptance rate: 0.0%

Hacky fix (using the compressed tensors model)

We can fix this in a hacky way (just to illustrate the root cause) by adding

        # extract the layer id; the target Llama-3.1-8B has 32 decoder
        # layers, so anything at index >= 32 belongs to the EAGLE head
        layer_idx = int(layer_name.split("layers.")[-1].split(".")[0])
        if layer_idx >= 32:
            logger.warning_once(
                f"Skipping quantization for {layer_name} because it seems "
                "to be part of the EAGLE head"
            )
            return None

into https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py#L633-L647

then we obtain good acceptance rates again (a more general variant of this check is sketched after the metrics below):

SpecDecoding metrics: 
  Mean acceptance length: 2.52, 
  Accepted throughput: 22.40 tokens/s, 
  Drafted throughput: 73.50 tokens/s, 
  Accepted: 224 tokens, 
  Drafted: 735 tokens, 
  Per-position acceptance rate: 0.714, 0.401, 0.224, 0.116, 0.068, 
  Avg Draft acceptance rate: 30.5%
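For completeness, a slightly less brittle variant of the same hack would compare against the target model's layer count instead of a hard-coded 32 (still only a sketch; how to reach the HF config from inside the compressed-tensors quant config is an assumption here):

        # Illustrative only: anything beyond the target model's own decoder
        # layers (num_hidden_layers in the HF config, 32 for Llama-3.1-8B)
        # must belong to the EAGLE head, so skip quantization for it.
        num_target_layers = hf_config.num_hidden_layers  # assumed to be reachable here
        if "layers." in layer_name:
            layer_idx = int(layer_name.split("layers.")[-1].split(".")[0])
            if layer_idx >= num_target_layers:
                return None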

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
