[BUG] [Qwen] Draft model produce garbage output #674

Open · Nepherpitou opened this issue Nov 14, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@Nepherpitou commented Nov 14, 2024

OS

Windows

GPU Library

CUDA 12.x

Python version

3.12

Pytorch version

2.4.1+cu121

Model

Qwen/Qwen2.5-72B-Instruct

Describe the bug

Qwen 2.5 72B Instruct with a draft model, whether it's Qwen 2.5 0.5B or 1.5B, produces garbage. Sometimes it takes a few requests to lose context, but with a long-context conversation (15,900 tokens) it always goes insane and inconsistent, with repetitions and garbage. It goes from "okay, that's expected", through a lot of typos and "wow, such Chinese!", to infinite repetition of somewhat related trash like

...
- **Line **, the).**
- **Line **, the).**
- **Line **, the).**
- **Line **, the).**

The same model without the draft produces consistent, good output (with slower tps 😄).

Reproduction steps

Using TabbyAPI

config.yml

tensor_parallel: true # I have 3090 + 4090
gpu_split_auto: true
gpu_split: [21.0, 24.0] # Loads the draft model and half of the main model onto the 4090; with OS overhead it won't fit in 24 GB otherwise
cache_mode: Q6 # Q4 doesn't change anything
chunk_size: 2048
fasttensors: true # tried false as well

draft_cache_mode: Q6 # Q4 works the same

cuda_malloc_backend: true
uvloop: true

Load model

POST http://192.168.1.5:5000/v1/model/load
Authorization: Bearer KEY
Content-Type: application/json

{
  "model_name": "Qwen2.5-72B-Instruct-exl2",
  "draft": {
    "name": "qwen2.5-0.5b-instruct-exl2"
  }
}

Qwen2.5-72B-Instruct-exl2 - 4.0bpw from exllama 2.4.3
Qwen2.5-0.5b-instruct-exl2 - 4.0bpw from exllama 2.4.3

The quants were created from the original models, downloaded today from the official Qwen repository.
Qwen2.5-72B-Instruct-exl2 without the draft model works fine.
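
To isolate whether this is a TabbyAPI issue or an exllamav2 issue, here is a minimal standalone sketch that drives the same main + draft pair through exllamav2's dynamic generator directly, without tensor parallelism. The class and argument names follow the speculative-decoding example in the exllamav2 repo as far as I recall them, and the paths are placeholders; treat the exact signatures as assumptions to verify against the installed version.

# Minimal standalone repro sketch (no TabbyAPI, no tensor_parallel); names are
# taken from exllamav2's speculative decoding example as I recall them --
# verify against your installed version. Paths are placeholders.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

def load(model_dir):
    config = ExLlamaV2Config(model_dir)
    model = ExLlamaV2(config)
    cache = ExLlamaV2Cache(model, lazy=True)
    model.load_autosplit(cache)  # plain autosplit load, not the TP loader
    return model, cache, config

model, cache, config = load("/models/Qwen2.5-72B-Instruct-exl2")
draft_model, draft_cache, _ = load("/models/qwen2.5-0.5b-instruct-exl2")
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(
    model=model, cache=cache, tokenizer=tokenizer,
    draft_model=draft_model, draft_cache=draft_cache,
)
print(generator.generate(prompt="Explain speculative decoding in two sentences.",
                         max_new_tokens=200))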

Generate chat completions

I'm using Open Web UI, but I don't think it matters much.

Here are the generation settings logged by TabbyAPI (everything at default):

{'request_id': '527a410a9da145a3966cb4e2bf82e4ee', 'max_tokens': 32485, 'min_tokens': 0,
'stream': True, 'token_repetition_penalty': 1.0, 'token_repetition_range': -1, 'token_repetition_decay': 0,
'token_frequency_penalty': 0.0, 'token_presence_penalty': 0.0, 'temperature': 1.0, 'smoothing_factor': 0.0, 'min_temp':
1.0, 'max_temp': 1.0, 'temp_exponent': 1.0, 'top_k': 0, 'top_p': 1.0, 'top_a': 0.0, 'min_p': 0.0, 'tfs': 1.0, 'typical':
1.0, 'skew': 0.0, 'temperature_last': False, 'mirostat': False, 'mirostat_tau': 1.5, 'mirostat_eta': 0.3, 'mirostat_mu':
None, 'token_bias': None, 'cfg_scale': None, 'post_sampling_hooks': [], 'dry_allowed_length': 2, 'dry_base': 1.75,
'dry_multiplier': 0.0, 'dry_sequence_breakers': None, 'dry_range': 0, 'dry_max_ngram': 20, 'ngram_trie': None,
'ngram_index': 0, 'ngram_history': deque([]), 'xtc_probability': 0.0, 'xtc_threshold': 0.1, 'xtc_ignore_tokens': None,
'token_healing': False, 'auto_scale_penalty_range': False, 'generate_window': 4096, 'bos_token_id': 151643,
'eos_token_id': [151645, 151643], 'add_bos_token': True, 'ban_eos_token': False, 'skip_special_tokens': True,
'speculative_ngram': False, 'logprobs': 0, 'stop_conditions': [151645, 151643], 'banned_tokens': [], 'allowed_tokens':
[], 'banned_strings': [], 'logit_bias': None, 'filters': []}
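
One additional check worth running, since speculative decoding is normally designed to verify draft tokens against the main model and leave the output distribution unchanged: send a greedy request (temperature 0) with and without the draft model loaded and compare the outputs. A sketch against TabbyAPI's OpenAI-compatible endpoint follows; the host, key and prompt are placeholders, and parameter handling may differ between server versions.

# Sketch: compare greedy output with and without the draft model loaded.
# If the speculative path is exact, greedy output should be (near) identical
# in both cases; host/key/prompt below are placeholders.
import requests

resp = requests.post(
    "http://192.168.1.5:5000/v1/chat/completions",
    headers={"Authorization": "Bearer KEY"},
    json={
        "model": "Qwen2.5-72B-Instruct-exl2",
        "messages": [{"role": "user", "content": "Summarize speculative decoding."}],
        "temperature": 0,
        "max_tokens": 200,
    },
    timeout=600,
)
print(resp.json()["choices"][0]["message"]["content"])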

Expected behavior

The draft model shouldn't affect the output quality of the base model.

Logs

No response

Additional context

I've noticed in the models' configs that Qwen 72B has a slightly bigger vocab_size than Qwen 0.5B. It looks like the Qwen models from 0.5B to 14B have "vocab_size": 151936, while 32B and 72B have "vocab_size": 152064. I don't know whether this may affect generation.
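
For reference, a minimal sketch for checking that mismatch locally by reading each model directory's config.json; the paths are placeholders and the numbers in the comments are the ones quoted above.

# Compare the vocab_size entries of the main and draft model configs.
import json
from pathlib import Path

def vocab_size(model_dir):
    with open(Path(model_dir) / "config.json") as f:
        return json.load(f)["vocab_size"]

main = vocab_size("/models/Qwen2.5-72B-Instruct-exl2")    # 152064 for Qwen2.5 32B/72B
draft = vocab_size("/models/qwen2.5-0.5b-instruct-exl2")  # 151936 for Qwen2.5 0.5B-14B
print(main, draft, "mismatch" if main != draft else "match")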

Acknowledgements

  • I have looked for similar issues before submitting this one.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will ask my questions politely.
Nepherpitou added the bug label on Nov 14, 2024
@turboderp (Owner)

Do you get the same results without tensor_parallel?

@Nepherpitou (Author)

> Do you get the same results without tensor_parallel?

Just tried it. It works even better overall than with tensor parallelism. What's the trick? :D

@turboderp (Owner)

The TP implementation in ExLlama is a little half-baked. The problem is that CUDA isn't great for controlling multiple devices in parallel from a single process, so there needs to be a big rewrite to split inference into multiple processes. There's also just a lot of device synchronization that needs to be optimized. It will happen at some point; until then, TP is only situationally useful.

@Nepherpitou (Author)

After fairly long testing, I still find that ollama/llama.cpp output outperforms exl2 by far. While exl2 runs ~20% faster without a draft model and almost 100% faster with one, its output in both cases is often corrupted by small mistakes which accumulate quickly. I can't figure out why it's so much worse. I compared IQ4_XS and 4.0bpw quants for 72B, and Q6_K and 6.0bpw for 32B; the results are the same. Maybe I'm missing something? Could it be a template issue, or something obvious and simple to mitigate?
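
On the template question: a quick way to see what the Qwen chat template actually produces is to render it with Hugging Face transformers and compare it against the prompt the frontend/server builds. A minimal sketch, assuming the model directory ships the standard Qwen2.5 tokenizer files; the path and messages are placeholders.

# Render the chat template shipped with the model and print it for comparison
# with the prompt the server/frontend actually sends.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("/models/Qwen2.5-72B-Instruct-exl2")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)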
