
[BUG] After updating to exllamav2-0.1.9 (from 0.1.8) cannot load Mistral Large 2 123B with a draft model #177

Closed · 9 comments
Lissanro opened this issue on Aug 24, 2024
Labels: bug (Something isn't working), exl2 issue (Exl2 issue, may be fixed in its dev branch)

@Lissanro

OS

Linux

GPU Library

CUDA 12.x

Python version

3.12

Describe the bug

I updated TabbyAPI recently, and now I get an error every time I try to load a large model. When loading Mistral Large 2 with Mistral v0.3 7B 3.5bpw as the draft model, I got the following error:

ERROR:    Traceback (most recent call last):
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/endpoints/core/utils/model.py", line 106, in stream_model_load
ERROR:        async for module, modules, model_type in load_status:
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/common/model.py", line 79, in load_model_gen
ERROR:        async for module, modules in load_status:
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/backends/exllamav2/model.py", line 528, in load_gen
ERROR:        async for value in iterate_in_threadpool(model_load_generator):
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/common/concurrency.py", line 30, in iterate_in_threadpool
ERROR:        yield await asyncio.to_thread(gen_next, generator)
ERROR:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/usr/lib/python3.12/asyncio/threads.py", line 25, in to_thread
ERROR:        return await loop.run_in_executor(None, func_call)
ERROR:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
ERROR:        result = self.fn(*self.args, **self.kwargs)
ERROR:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/common/concurrency.py", line 20, in gen_next
ERROR:        return next(generator)
ERROR:               ^^^^^^^^^^^^^^^
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 56, in generator_context
ERROR:        response = gen.send(request)
ERROR:                   ^^^^^^^^^^^^^^^^^
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/backends/exllamav2/model.py", line 646, in load_model_sync
ERROR:        for value in self.model.load_autosplit_gen(
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/model.py", line 587, in load_autosplit_gen
ERROR:        module.load()
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR:        return func(*args, **kwargs)
ERROR:               ^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/attn.py", line 276, in load
ERROR:        self.q_handle = ext_c.make_q_attn(
ERROR:                        ^^^^^^^^^^^^^^^^^^
ERROR:    RuntimeError: q_proj is wrong shape
ERROR:    Sent to request: q_proj is wrong shape

Then I tried without the draft model, and now the error is different:

ERROR:    Traceback (most recent call last):
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/endpoints/core/utils/model.py", line 106, in stream_model_load
ERROR:        async for module, modules, model_type in load_status:
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/common/model.py", line 79, in load_model_gen
ERROR:        async for module, modules in load_status:
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/backends/exllamav2/model.py", line 528, in load_gen
ERROR:        async for value in iterate_in_threadpool(model_load_generator):
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/common/concurrency.py", line 30, in iterate_in_threadpool
ERROR:        yield await asyncio.to_thread(gen_next, generator)
ERROR:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/usr/lib/python3.12/asyncio/threads.py", line 25, in to_thread
ERROR:        return await loop.run_in_executor(None, func_call)
ERROR:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
ERROR:        result = self.fn(*self.args, **self.kwargs)
ERROR:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/common/concurrency.py", line 20, in gen_next
ERROR:        return next(generator)
ERROR:               ^^^^^^^^^^^^^^^
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 56, in generator_context
ERROR:        response = gen.send(request)
ERROR:                   ^^^^^^^^^^^^^^^^^
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/backends/exllamav2/model.py", line 646, in load_model_sync
ERROR:        for value in self.model.load_autosplit_gen(
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/model.py", line 587, in load_autosplit_gen
ERROR:        module.load()
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR:        return func(*args, **kwargs)
ERROR:               ^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/attn.py", line 224, in load
ERROR:        self.q_proj.load(device_context = device_context)
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR:        return func(*args, **kwargs)
ERROR:               ^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/linear.py", line 152, in load
ERROR:        self.q_handle = ext.make_q_matrix(w,
ERROR:                        ^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/ext.py", line 340, in make_q_matrix
ERROR:        w["q_group_map"] = make_group_map(w["q_groups"], w["q_weight"].shape[0])
ERROR:                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/lissanro/pkgs/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/ext.py", line 314, in make_group_map
ERROR:        rows = qrows * 32 // bits
ERROR:               ~~~~~~~~~~~^^~~~~~
ERROR:    ZeroDivisionError: integer division or modulo by zero
ERROR:    Sent to request: integer division or modulo by zero

Strangely enough, if I load Llama 8B, unload it, and then try to load Mistral Large 2 on its own, it loads successfully.

Reproduction steps

  1. Try to load Mistral Large 2 4bpw ( https://huggingface.co/LoneStriker/Mistral-Large-Instruct-2407-4.0bpw-h6-exl2/tree/main ) with https://huggingface.co/bartowski/Mistral-7B-Instruct-v0.3-exl2/tree/3_5 as the draft model.

  2. Get an error (RuntimeError: q_proj is wrong shape)

  3. Try to load Mistral Large 2 on its own after the first failure, but this time without the draft model.

  4. Get another error (ZeroDivisionError: integer division or modulo by zero)

  5. The error does not go away even if I try again to load Mistral Large 2 on its own. The only thing that helps is loading and unloading a small model (like Mistral v0.3 7B or Llama 3.1 8B) and then trying to load the large model again. Unloading and reloading the large model on its own does not trigger the bug; only an attempt to load it with a draft model does.

In case it matters, I am loading the model via the SillyTavern extension ( https://github.com/theroyallab/ST-tabbyAPI-loader ). A rough sketch of the configuration used is shown below.
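
For reference, a minimal config.yml sketch for this setup might look roughly like the following. The section and key names follow my reading of TabbyAPI's sample config at the time, and the folder names are just example local directory names, so adjust both to your own setup:

model:
  model_dir: models
  # local folder containing the 4.0bpw Mistral Large 2 quant (example name)
  model_name: Mistral-Large-Instruct-2407-4.0bpw-h6-exl2
  # fasttensors turned out to be relevant to this bug (see the discussion below)
  fasttensors: true
  draft:
    draft_model_dir: models
    # local folder containing the 3.5bpw Mistral v0.3 7B quant (example name)
    draft_model_name: Mistral-7B-Instruct-v0.3-exl2-3_5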

Expected behavior

Both the main and draft models should load without errors; instead, I get errors when loading the main model.

Logs

No response

Additional context

Reverting to ecaddec and running ./update_scripts/update_deps.sh (which downgrades to exllamav2-0.1.8) allows loading both the main and the draft model without issues.
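
For anyone needing the same workaround, the downgrade amounts to roughly the following, run from the tabbyAPI checkout (a sketch only; ecaddec is the short commit hash mentioned above):

git checkout ecaddec
./update_scripts/update_deps.sh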

Acknowledgements

  • I have looked for similar issues before submitting this one.
  • I have read the disclaimer, and this issue is related to a code bug. If I have a question, I will use the Discord server.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will ask my questions politely.
@Lissanro added the bug (Something isn't working) label on Aug 24, 2024
@Inktomi93

I experience this issue as well and have been able to reproduce it.

@Alexey-Akishin

Yep, I can also confirm the problem. After the upgrade, I can't load big models with draft models anymore. This is true both for Mistral Large + Mistral 7B and for Llama 70B + Llama 8B; the same issue occurs with fine-tuned large models.

The bug affects many large models, both Llama and Mistral. This is what I get when trying to load Llama 70B with an 8B draft model:

INFO:     Attempting to load a prompt template if present.
INFO:     Using template "from_tokenizer_config" for chat completions.
INFO:     Loading draft model: /home/alex/llm/models/Llama-3.1-8B-Instruct-3.0bpw-exl2
INFO:     Loading model: /home/alex/llm/models/Llama-3.1-70B-Instruct-6.0bpw-h6-exl2
INFO:     Loading with autosplit
Loading draft modules ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 67/67   0:00:00
Loading model modules ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   1%   1/163 -:--:--
ERROR:    Traceback (most recent call last):
ERROR:      File "/home/alex/llm/tabbyAPI/endpoints/core/utils/model.py", line 106, in stream_model_load
ERROR:        async for module, modules, model_type in load_status:
ERROR:      File "/home/alex/llm/tabbyAPI/common/model.py", line 79, in load_model_gen
ERROR:        async for module, modules in load_status:
ERROR:      File "/home/alex/llm/tabbyAPI/backends/exllamav2/model.py", line 528, in load_gen
ERROR:        async for value in iterate_in_threadpool(model_load_generator):
ERROR:      File "/home/alex/llm/tabbyAPI/common/concurrency.py", line 30, in iterate_in_threadpool
ERROR:        yield await asyncio.to_thread(gen_next, generator)
ERROR:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/usr/lib/python3.12/asyncio/threads.py", line 25, in to_thread
ERROR:        return await loop.run_in_executor(None, func_call)
ERROR:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
ERROR:        result = self.fn(*self.args, **self.kwargs)
ERROR:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/alex/llm/tabbyAPI/common/concurrency.py", line 20, in gen_next
ERROR:        return next(generator)
ERROR:               ^^^^^^^^^^^^^^^
ERROR:      File "/home/alex/llm/tabbyAPI/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 56, in generator_context
ERROR:        response = gen.send(request)
ERROR:                   ^^^^^^^^^^^^^^^^^
ERROR:      File "/home/alex/llm/tabbyAPI/backends/exllamav2/model.py", line 646, in load_model_sync
ERROR:        for value in self.model.load_autosplit_gen(
ERROR:      File "/home/alex/llm/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/model.py", line 587, in load_autosplit_gen
ERROR:        module.load()
ERROR:      File "/home/alex/llm/tabbyAPI/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR:        return func(*args, **kwargs)
ERROR:               ^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/alex/llm/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/attn.py", line 224, in load
ERROR:        self.q_proj.load(device_context = device_context)
ERROR:      File "/home/alex/llm/tabbyAPI/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR:        return func(*args, **kwargs)
ERROR:               ^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/alex/llm/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/linear.py", line 152, in load
ERROR:        self.q_handle = ext.make_q_matrix(w,
ERROR:                        ^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/alex/llm/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/ext.py", line 340, in make_q_matrix
ERROR:        w["q_group_map"] = make_group_map(w["q_groups"], w["q_weight"].shape[0])
ERROR:                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR:      File "/home/alex/llm/tabbyAPI/venv/lib/python3.12/site-packages/exllamav2/ext.py", line 314, in make_group_map
ERROR:        rows = qrows * 32 // bits
ERROR:               ~~~~~~~~~~~^^~~~~~
ERROR:    ZeroDivisionError: integer division or modulo by zero
ERROR:    Sent to request: integer division or modulo by zero

After hitting the error, I cannot load Llama 70B anymore; removing the draft model does not help. But I can still load Llama 8B after this error.

Thanks for sharing the workaround; downgrading to ecaddec and running ./update_scripts/update_deps.sh helped. Hopefully this will get fixed soon, since I wanted to try the tensor parallel features in the new version (maybe those new features unintentionally caused the issue, but that is just a guess; I am not familiar enough with the code to debug this).

@turboderp
Collaborator

I believe this is a synchronization issue with streams and the custom safetensors loader.

Do you have fasttensors: true in the config? If so, it should be fixed in the exllamav2 dev branch; disabling fasttensors should be a temporary workaround.
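
In TabbyAPI that temporary workaround is a config.yml setting; assuming the key still sits under the model section as in the sample config, something like this should disable it:

model:
  # temporary workaround: disable the fast safetensors loading path
  fasttensors: false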

@Inktomi93

> I believe this is a synchronization issue with streams and the custom safetensors loader.
>
> Do you have fasttensors: true in the config? If so, it should be fixed in the exllamav2 dev branch; disabling fasttensors should be a temporary workaround.

Switched to the dev branch, pulled the update, and built from source. After building, I loaded with fasttensors: true and got the following error:
RuntimeError: q_proj is wrong shape

But with fasttensors: false I am now able to successfully load both the model and the draft model.

@turboderp
Collaborator

Did you include the latest commit, 7e15947?

@turboderp
Collaborator

Okay, I pushed another commit which might help.

@Inktomi93

> Okay, I pushed another commit which might help.

Just built off the most recent commit and everything seems to work as expected now.

@bdashore3 bdashore3 added the exl2 issue Exl2 issue, may be fixed in its dev branch label Aug 26, 2024
@pchristidis

I cloned and built last night. I can load Mistral Large with the draft model, but it outputs gibberish. It's not an issue if I disable the draft model, which I've done, since 22 t/s is plenty.

@Lissanro
Author

I updated TabbyAPI, and in the exllamav2 folder (using the dev branch) I ran the following commands to install it within TabbyAPI's venv:

../tabbyAPI/venv/bin/pip install -r requirements.txt
../tabbyAPI/venv/bin/pip install .
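
To double-check which exllamav2 build the venv actually picked up, something like the following should print the installed version (this assumes exllamav2 exposes __version__, which recent releases do):

../tabbyAPI/venv/bin/python -c "import exllamav2; print(exllamav2.__version__)"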

Now I was able to successfully load Mistral-Large-Instruct-2407-123B-5.0bpw-exl2 with Mistral-7B-Instruct-v0.3-exl2-3.5bpw as a draft model. So far, it seems to work correctly (even with fasttensors enabled). I also tested the 4bpw version of Mistral Large 2.

In addition to that, I tested with Llama-3.1-70B-Instruct-6.0bpw-h6-exl2 + Llama-3.1-8B-Instruct-3.0bpw-exl2 as a draft model, and that worked too.

@turboderp, thank you very much for fixing this bug.
