
[BUG] After updating to exllamav2-0.1.9 (from 0.1.8) cannot load Mistral Large 2 123B with a draft model #177

Closed · 9 comments
Lissanro opened this issue on Aug 24, 2024
Labels: bug (Something isn't working), exl2 issue (Exl2 issue, may be fixed in its dev branch)

@Lissanro

OS

Linux

GPU Library

CUDA 12.x

Python version

3.12

Describe the bug

I updated TabbyAPI recently, and now I get an error every time I try to load a large model. When loading Mistral Large 2 with Mistral v0.3 7B 3.5bpw as the draft model, I got the following error:

ERROR:    Traceback (most recent call last):
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/endpoints/core/utils/model.py", line 106, in stream_model_load
ERROR:        async for module, modules, model_type in load_status:
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/common/model.py", line 79, in load_model_gen
ERROR:        async for module, modules in load_status:
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/backends/exllamav2/model.py", line 528, in load_gen
ERROR:        async for value in iterate_in_threadpool(model_load_generator):
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/common/concurrency.py", line 30, in iterate_in_threadpool
ERROR:        yield await asyncio.to_thread(gen_next, generator)
ERROR:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/usr/lib/python3.12/asyncio/threads.py", line 25, in to_thread
ERROR:        return await loop.run_in_executor(None, func_call)
ERROR:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
ERROR:        result = self.fn(*self.args, **self.kwargs)
ERROR:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/common/concurrency.py", line 20, in gen_next
ERROR:        return next(generator)
ERROR:               ^^^^^^^^^^^^^^^
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 56, in generator_context
ERROR:        response = gen.send(request)
ERROR:                   ^^^^^^^^^^^^^^^^^
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/backends/exllamav2/model.py", line 646, in load_model_sync
ERROR:        for value in self.model.load_autosplit_gen(
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/model.py", line 587, in load_autosplit_gen
ERROR:        module.load()
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR:        return func(*args, **kwargs)
ERROR:               ^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/attn.py", line 276, in load
ERROR:        self.q_handle = ext_c.make_q_attn(
ERROR:                        ^^^^^^^^^^^^^^^^^^
ERROR:    RuntimeError: q_proj is wrong shape
ERROR:    Sent to request: q_proj is wrong shape

Then I tried without the draft model, and now the error is different:

ERROR:    Traceback (most recent call last):
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/endpoints/core/utils/model.py", line 106, in stream_model_load
ERROR:        async for module, modules, model_type in load_status:
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/common/model.py", line 79, in load_model_gen
ERROR:        async for module, modules in load_status:
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/backends/exllamav2/model.py", line 528, in load_gen
ERROR:        async for value in iterate_in_threadpool(model_load_generator):
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/common/concurrency.py", line 30, in iterate_in_threadpool
ERROR:        yield await asyncio.to_thread(gen_next, generator)
ERROR:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/usr/lib/python3.12/asyncio/threads.py", line 25, in to_thread
ERROR:        return await loop.run_in_executor(None, func_call)
ERROR:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
ERROR:        result = self.fn(*self.args, **self.kwargs)
ERROR:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/common/concurrency.py", line 20, in gen_next
ERROR:        return next(generator)
ERROR:               ^^^^^^^^^^^^^^^
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 56, in generator_context
ERROR:        response = gen.send(request)
ERROR:                   ^^^^^^^^^^^^^^^^^
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/backends/exllamav2/model.py", line 646, in load_model_sync
ERROR:        for value in self.model.load_autosplit_gen(
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/model.py", line 587, in load_autosplit_gen
ERROR:        module.load()
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR:        return func(*args, **kwargs)
ERROR:               ^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/attn.py", line 224, in load
ERROR:        self.q_proj.load(device_context = device_context)
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR:        return func(*args, **kwargs)
ERROR:               ^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/linear.py", line 152, in load
ERROR:        self.q_handle = ext.make_q_matrix(w,
ERROR:                        ^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/ext.py", line 340, in make_q_matrix
ERROR:        w["q_group_map"] = make_group_map(w["q_groups"], w["q_weight"].shape[0])
ERROR:                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/ext.py", line 314, in make_group_map
ERROR:        rows = qrows * 32 // bits
ERROR:               ~~~~~~~~~~~^^~~~~~
ERROR:    ZeroDivisionError: integer division or modulo by zero
ERROR:    Sent to request: integer division or modulo by zero

Strangely enough, if I load Llama 8B, unload it, and then try to load Mistral Large 2 on its own, it loads successfully.

Reproduction steps

  1. Try to load Mistral Large 2 4bpw ( https://huggingface.co/LoneStriker/Mistral-Large-Instruct-2407-4.0bpw-h6-exl2/tree/main ) with https://huggingface.co/bartowski/Mistral-7B-Instruct-v0.3-exl2/tree/3_5 as the draft model.

  2. Get an error (RuntimeError: q_proj is wrong shape)

  3. Try to load Mistral Large 2 on its own after the first failure, but this time without the draft model.

  4. Get another error (ZeroDivisionError: integer division or modulo by zero)

  5. The error does not go away even if I try again to load Mistral Large 2 on its own. The only thing that helps is loading and unloading a small model (like Mistral v0.3 7B or Llama 3.1 8B) and then trying to load the large model again. Unloading and reloading the large model on its own does not trigger the bug; only an attempt to load it with a draft model does.

In case it matters, I am loading the model via the SillyTavern extension ( https://github.com/theroyallab/ST-tabbyAPI-loader ). A rough sketch of the configuration used is shown below.
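
For reference, a minimal config.yml sketch for this setup might look roughly like the following. The section and key names follow my reading of TabbyAPI's sample config at the time, and the folder names are just example local directory names, so adjust both to your own setup:

model:
  model_dir: models
  # local folder containing the 4.0bpw Mistral Large 2 quant (example name)
  model_name: Mistral-Large-Instruct-2407-4.0bpw-h6-exl2
  # fasttensors turned out to be relevant to this bug (see the discussion below)
  fasttensors: true
  draft:
    draft_model_dir: models
    # local folder containing the 3.5bpw Mistral v0.3 7B quant (example name)
    draft_model_name: Mistral-7B-Instruct-v0.3-exl2-3_5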

Expected behavior

Both the main and draft models should load without errors; instead, I get errors when loading the main model.

Logs

No response

Additional context

Reverting to ecaddec and running ./update_scripts/update_deps.sh (which downgrades to exllamav2-0.1.8) allows loading both the main and the draft model without issues.
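
For anyone needing the same workaround, the downgrade amounts to roughly the following, run from the tabbyAPI checkout (a sketch only; ecaddec is the short commit hash mentioned above):

git checkout ecaddec
./update_scripts/update_deps.sh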

Acknowledgements

  • I have looked for similar issues before submitting this one.
  • I have read the disclaimer, and this issue is related to a code bug. If I have a question, I will use the Discord server.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will ask my questions politely.
@Lissanro added the bug (Something isn't working) label on Aug 24, 2024
@Inktomi93

I experience this issue as well and have been able to reproduce it.

@Alexey-Akishin

Yep, I can also confirm the problem. After the upgrade, I can't load big models with draft models anymore. This is true both for Mistral Large + Mistral 7B and for Llama 70B + Llama 8B; the same issue occurs with fine-tuned large models.

The bug affects many large models, both Llama and Mistral. This is what I get when trying to load Llama 70B with an 8B draft model:

INFO:     Attempting to load a prompt template if present.
INFO:     Using template "from_tokenizer_config" for chat completions.
INFO:     Loading draft model: /home/alex/llm/models/Llama-3.1-8B-Instruct-3.0bpw-exl2
INFO:     Loading model: /home/alex/llm/models/Llama-3.1-70B-Instruct-6.0bpw-h6-exl2
INFO:     Loading with autosplit
Loading draft modules ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 67/67   0:00:00
Loading model modules ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   1%   1/163 -:--:--
ERROR:    Traceback (most recent call last):
ERROR:      File "/home/alex/llm/tabbyAPI/endpoints/core/utils/model.py", line 106, in stream_model_load
ERROR:        async for module, modules, model_type in load_status:
ERROR:      File "/home/alex/llm/tabbyAPI/common/model.py", line 79, in load_model_gen
ERROR:        async for module, modules in load_status:
ERROR:      File "/home/alex/llm/tabbyAPI/backends/exllamav2/model.py", line 528, in load_gen
ERROR:        async for value in iterate_in_threadpool(model_load_generator):
ERROR:      File "/home/alex/llm/tabbyAPI/common/concurrency.py", line 30, in iterate_in_threadpool
ERROR:        yield await asyncio.to_thread(gen_next, generator)
ERROR:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/usr/lib/python3.12/asyncio/threads.py", line 25, in to_thread
ERROR:        return await loop.run_in_executor(None, func_call)
ERROR:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
ERROR:        result = self.fn(*self.args, **self.kwargs)
ERROR:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/alex/llm/tabbyAPI/common/concurrency.py", line 20, in gen_next
ERROR:        return next(generator)
ERROR:               ^^^^^^^^^^^^^^^
ERROR:      File "/home/alex/llm/tabbyAPI/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 56, in generator_context
ERROR:        response = gen.send(request)
ERROR:                   ^^^^^^^^^^^^^^^^^
ERROR:      File "/home/alex/llm/tabbyAPI/backends/exllamav2/model.py", line 646, in load_model_sync
ERROR:        for value in self.model.load_autosplit_gen(
ERROR:      File "/home/alex/llm/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/model.py", line 587, in load_autosplit_gen
ERROR:        module.load()
ERROR:      File "/home/alex/llm/tabbyAPI/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR:        return func(*args, **kwargs)
ERROR:               ^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/alex/llm/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/attn.py", line 224, in load
ERROR:        self.q_proj.load(device_context = device_context)
ERROR:      File "/home/alex/llm/tabbyAPI/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR:        return func(*args, **kwargs)
ERROR:               ^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/alex/llm/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/linear.py", line 152, in load
ERROR:        self.q_handle = ext.make_q_matrix(w,
ERROR:                        ^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/alex/llm/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/ext.py", line 340, in make_q_matrix
ERROR:        w["q_group_map"] = make_group_map(w["q_groups"], w["q_weight"].shape[0])
ERROR:                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/alex/llm/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/ext.py", line 314, in make_group_map
ERROR:        rows = qrows * 32 // bits
ERROR:               ~~~~~~~~~~~^^~~~~~
ERROR:    ZeroDivisionError: integer division or modulo by zero
ERROR:    Sent to request: integer division or modulo by zero

After hitting the error, I cannot load Llama 70B anymore; removing the draft model does not help. But I can still load Llama 8B after this error.

Thanks for sharing the workaround; downgrading to ecaddec and running ./update_scripts/update_deps.sh helped. Hopefully this will get fixed soon, since I wanted to try the tensor parallel features in the new version (maybe those new features unintentionally caused the issue, but that is just a guess; I am not familiar enough with the code to debug this).

@turboderp
Collaborator

I believe this is a synchronization issue with streams and the custom safetensors loader.

Do you have fasttensors: true in the config? If so, it should be fixed in the exllamav2 dev branch; disabling fasttensors should be a temporary workaround.
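
In TabbyAPI that temporary workaround is a config.yml setting; assuming the key still sits under the model section as in the sample config, something like this should disable it:

model:
  # temporary workaround: disable the fast safetensors loading path
  fasttensors: false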

@Inktomi93

> I believe this is a synchronization issue with streams and the custom safetensors loader.
>
> Do you have fasttensors: true in the config? If so, it should be fixed in the exllamav2 dev branch; disabling fasttensors should be a temporary workaround.

Switched to the dev branch, pulled the update, and built from source. After building, I loaded with fasttensors: true and got the following error:
RuntimeError: q_proj is wrong shape

But with fasttensors: false I am now able to successfully load both the model and the draft model.

@turboderp
Collaborator

Did you include the latest commit, 7e15947?

@turboderp
Collaborator

Okay, I pushed another commit which might help.

@Inktomi93

> Okay, I pushed another commit which might help.

Just built off the most recent commit and everything seems to work as expected now.

@bdashore3 bdashore3 added the exl2 issue Exl2 issue, may be fixed in its dev branch label Aug 26, 2024
@pchristidis

I cloned and built last night. I can load Mistral Large with the draft model, but it outputs gibberish. It's not an issue if I disable the draft model, which I've done, since 22 t/s is plenty.

@Lissanro
Author

I updated TabbyAPI, and in the exllamav2 folder (using the dev branch) I ran the following commands to install it within TabbyAPI's venv:

../tabbyAPI/venv/bin/pip install -r requirements.txt
../tabbyAPI/venv/bin/pip install .
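
To double-check which exllamav2 build the venv actually picked up, something like the following should print the installed version (this assumes exllamav2 exposes __version__, which recent releases do):

../tabbyAPI/venv/bin/python -c "import exllamav2; print(exllamav2.__version__)"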

Now I was able to successfully load Mistral-Large-Instruct-2407-123B-5.0bpw-exl2 with Mistral-7B-Instruct-v0.3-exl2-3.5bpw as a draft model. So far, it seems to work correctly (even with fasttensors enabled). I also tested the 4bpw version of Mistral Large 2.

In addition to that, I tested with Llama-3.1-70B-Instruct-6.0bpw-h6-exl2 + Llama-3.1-8B-Instruct-3.0bpw-exl2 as a draft model, and that worked too.

@turboderp, thank you very much for fixing this bug.
