Mixtral #223
So far I've heard it requires 2x 80GB GPUs to run, or 4x 40GB, which is just shy of the 140GB needed for a 70B at fp16. So if it's possible to run a 70B on 24GB, then I hope it should be possible to run the MoE on a single 24GB GPU.
I don't doubt that it's possible. The main challenges are:
I look forward to some sort of announcement from Mistral. Maybe even some reference code.
It runs on 4x 3090s taking under 92GB in fp16 - not sure where these other figures are from. I got it running yesterday.
No, it can run on 2x 3090 with 8-bit or 4-bit quantization using bitsandbytes, but it runs extremely slowly. The only way to make it practical is with exllama or similar. Not even GPTQ works right now.
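For reference, a minimal sketch of what a bitsandbytes 4-bit load through transformers looks like; the model ID and generation settings here are just placeholders, not a tuned setup, and `device_map="auto"` is what shards the weights across e.g. 2x 3090:

```python
# Minimal sketch: loading Mixtral with 4-bit bitsandbytes quantization via transformers.
# Model ID and generation settings are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-v0.1"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit weights, roughly half the 8-bit footprint
    bnb_4bit_compute_dtype=torch.float16,  # dequantize to fp16 for the matmuls
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                     # shard across available GPUs
)

inputs = tokenizer("The Mixtral architecture is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```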
https://twitter.com/4evaBehindSOTA/status/1733551103105720601?t=SiKV8qH1IIoKiQlRBiIR5w&s=19 Tim from bitsandbytes says he's done MoE quantisation before, and that it's actually easier than with vanilla transformers.
I really hope it gets exllama support. The model is multilingual and seems very powerful. Hope it can be implemented. The official inference code for vLLM etc. dropped hours ago.
The Mixtral PR got merged into transformers, so you can look at their implementation: huggingface/transformers#27942
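For anyone wondering why it's "a different architecture": below is a toy sketch, not the transformers code (the real expert MLP is a SwiGLU rather than this simplified stack), of the sparse top-2 routing Mixtral adds on top of a dense transformer - a gate picks 2 of 8 expert FFNs per token and mixes their outputs, which is exactly the part existing dense-model loaders don't handle:

```python
# Toy sketch of Mixtral-style sparse MoE routing (simplified, not the HF implementation):
# a learned gate selects the top-2 of 8 expert FFNs per token and mixes their outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEBlock(nn.Module):
    def __init__(self, hidden_size=4096, ffn_size=14336, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_size, ffn_size), nn.SiLU(), nn.Linear(ffn_size, hidden_size))
            for _ in range(num_experts)
        ])

    def forward(self, x):                        # x: [tokens, hidden]
        logits = self.gate(x)                    # [tokens, num_experts]
        weights, idx = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize over the chosen experts only
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e            # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

block = SparseMoEBlock(hidden_size=64, ffn_size=128)   # tiny sizes just for a quick test
print(block(torch.randn(5, 64)).shape)                 # torch.Size([5, 64])
```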
When you do implement it, please don't forget the ROCm implementation 🙇 |
NICE, glad to see this is being talked about 👍 |
When attempting to use Mixtral-8x7B-v0.1-GPTQ from https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GPTQ with chat.py, this error occurs:
It's a different architecture. It won't work out of the box. Pls wait for Turbo lol. |
He is working on it. In the experimental branch there is already a preview (which works but is unoptimized).
All done. For now. |
Not an issue, but seeing that EXL2 2-bit quants of a 70B model can fit on a single 24GB GPU, I'm wondering whether it's possible to run a quantized version of Mixtral 7B*8 on a single 24GB GPU, and whether that's something exllama2 could support or would need a completely different project (rough memory estimate below).
Mistral MoE 7b*8 model
https://twitter.com/MistralAI/status/1733150512395038967?t=6jDOugc19MUNyOV1KK6Ing&s=19
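For a rough sense of scale, a back-of-envelope weight-only estimate, assuming the commonly cited ~47B total parameters for Mixtral (the eight expert FFNs share the attention weights, so it is not simply 8x7B) and ignoring KV cache and runtime overhead:

```python
# Back-of-envelope weight memory for Mixtral at different bit widths.
# Assumes ~47B total parameters; KV cache and runtime overhead not included.
total_params = 47e9

for bits in (16, 8, 4, 2):
    gib = total_params * bits / 8 / 1024**3
    print(f"{bits:>2}-bit weights: ~{gib:.0f} GiB")

# 16-bit weights: ~88 GiB   (consistent with the "under 92GB in fp16 on 4x 3090" report above)
#  8-bit weights: ~44 GiB
#  4-bit weights: ~22 GiB
#  2-bit weights: ~11 GiB   (why a 2-bit EXL2 quant could plausibly fit on a 24GB card)
```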