Mixtral #223

Closed
nivibilla opened this issue Dec 10, 2023 · 13 comments

Comments

@nivibilla
Contributor

Not an issue, but seeing that EXL2 2-bit quants of a 70B model can fit on a single 24 GB GPU, I'm wondering if it's possible to run a quantized version of Mixtral 8x7B on a single 24 GB GPU, and whether that's something exllamav2 could support or a completely different project.

Mistral MoE 8x7B model:
https://twitter.com/MistralAI/status/1733150512395038967?t=6jDOugc19MUNyOV1KK6Ing&s=19
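
For context, this is roughly what loading an EXL2-quantized dense model looks like today with exllamav2's Python API (a minimal sketch; the Mixtral directory name below is purely hypothetical, since no such quant exists at this point):

```python
# Minimal sketch of exllamav2's existing loading/generation API (as used for
# dense 70B EXL2 quants). The model directory is a hypothetical placeholder.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/Mixtral-8x7B-exl2-2.4bpw"  # hypothetical EXL2 quant dir
config.prepare()

model = ExLlamaV2(config)
model.load()                              # load weights onto the visible GPU(s)
tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8

print(generator.generate_simple("Mixture-of-experts models are", settings, 64))
```

The question is essentially whether this same path can work once the loader understands the MoE layer layout and a low-bpw quant of the experts exists.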

@nivibilla
Contributor Author

So far I've heard it requires 2x 80 GB GPUs to run, or 4x 40 GB, which is in the same ballpark as the ~140 GB needed for a 70B in fp16. So if it's possible to run a 70B on 24 GB, I hope it should be possible to run the MoE on a single 24 GB GPU.

@turboderp
Member

I don't doubt that it's possible.

The main challenges are:

  1. quantization: making sure the calibration data triggers all of the experts often enough, so that you don't end up calibrating a rarely-activated expert on a very small sample, and
  2. fast batching (and prompt ingestion), since each token in the batch triggers its own pair of experts (see the sketch below).

I look forward to some sort of announcement from Mistral. Maybe even some reference code.
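
As a rough illustration of point 2, here is what per-token top-2 routing looks like in a sparse MoE layer (a generic PyTorch sketch, not Mistral's reference code or anything in exllamav2): every token in the batch picks its own experts, so a naive batched implementation has to gather and scatter tokens per expert, and a given expert may see very few (or zero) tokens, which is also the calibration concern in point 1.

```python
import torch
import torch.nn.functional as F

def moe_forward(x, gate, experts, top_k=2):
    """Generic top-k MoE routing sketch (illustrative only).

    x: (num_tokens, hidden), gate: nn.Linear(hidden, num_experts),
    experts: list of per-expert MLP modules.
    """
    logits = gate(x)                                     # (tokens, num_experts)
    weights, chosen = torch.topk(logits, top_k, dim=-1)  # each token picks its own experts
    weights = F.softmax(weights, dim=-1)                 # renormalize over the chosen experts

    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        token_idx, slot = (chosen == e).nonzero(as_tuple=True)
        if token_idx.numel() == 0:
            continue                                     # this expert saw no tokens in the batch
        out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert(x[token_idx])
    return out

# Tiny usage example with random weights (the expert MLP here is a plain FFN,
# simpler than Mixtral's gated w1/w2/w3 structure):
hidden, num_experts, tokens = 64, 8, 5
gate = torch.nn.Linear(hidden, num_experts, bias=False)
experts = [torch.nn.Sequential(torch.nn.Linear(hidden, 4 * hidden),
                               torch.nn.SiLU(),
                               torch.nn.Linear(4 * hidden, hidden))
           for _ in range(num_experts)]
print(moe_forward(torch.randn(tokens, hidden), gate, experts).shape)  # torch.Size([5, 64])
```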

@MarkMakers

> So far I've heard it requires 2x 80 GB GPUs to run, or 4x 40 GB, which is in the same ballpark as the ~140 GB needed for a 70B in fp16. So if it's possible to run a 70B on 24 GB, I hope it should be possible to run the MoE on a single 24 GB GPU.

It runs on 4x 3090s, taking under 92 GB in fp16 - not sure where those other figures are from. I got it running yesterday.
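
For reference, that roughly matches a back-of-the-envelope estimate from the commonly quoted parameter count (approximate figures, not an official breakdown):

```python
# Rough weight-memory estimate for Mixtral 8x7B in fp16 (approximate figures).
total_params = 46.7e9        # ~46.7B parameters in total (~12.9B active per token)
bytes_per_param = 2          # fp16
print(f"~{total_params * bytes_per_param / 2**30:.0f} GiB of weights")   # ~87 GiB
```

That is weights only; the KV cache and activations come on top, but it fits in 4x 24 GB.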

@ortegaalfredo

No, it can run on 2x 3090s with 8-bit or 4-bit quantization using bitsandbytes, but it runs extremely slowly. The only way to make it practical is with exllama or something similar. Not even GPTQ works right now.
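
For anyone who wants to try that route, the bitsandbytes path is the standard transformers 4-bit loading API (a minimal sketch, assuming the official mistralai/Mixtral-8x7B-v0.1 checkpoint and a transformers version with Mixtral support; it works, but as noted it is slow):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-v0.1"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # roughly 23-24 GB of weights in 4-bit, before KV cache
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                    # spread layers across the available GPUs
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```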

@nivibilla
Contributor Author

https://twitter.com/4evaBehindSOTA/status/1733551103105720601?t=SiKV8qH1IIoKiQlRBiIR5w&s=19

Tim from bitsandbytes says he's done MoE quantisation before, and that it's actually easier than for vanilla transformers.

@CyberTimon
Contributor

I really hope it gets exllama support. The model is multilingual and seems very powerful. Hope it can be implemented.

The official inference code for vLLM etc. dropped hours ago.

@CyberTimon
Contributor

CyberTimon commented Dec 11, 2023

The Mixtral PR got merged into transformers, so you can look at their implementation: huggingface/transformers#27942

Edit:
llama.cpp is now also adding Mixtral: ggerganov/llama.cpp#4406

@DutchEllie

When you do implement it, please don't forget the ROCm implementation 🙇

@neutrino84

NICE, glad to see this is being talked about 👍

@DogeLord081

When attempting to use Mixtral-8x7B-v0.1-GPTQ from https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GPTQ with chat.py, this error occurs:

python examples/chat.py -m C:\Users\danu0\Downloads\Artificial-Intelligence\Mixtral-8x7B-v0.1-GPTQ -mode raw
 -- Model: C:\Users\danu0\Downloads\Artificial-Intelligence\Mixtral-8x7B-v0.1-GPTQ
 -- Options: ['rope_scale 1.0', 'rope_alpha 1.0']
Traceback (most recent call last):
  File "C:\Users\danu0\Downloads\Artificial-Intelligence\exllamav2\examples\chat.py", line 81, in <module>
    model, tokenizer = model_init.init(args)
  File "C:\Users\danu0\AppData\Local\Programs\Python\Python310\lib\site-packages\exllamav2\model_init.py", line 64, in init
    config.prepare()
  File "C:\Users\danu0\AppData\Local\Programs\Python\Python310\lib\site-packages\exllamav2\config.py", line 133, in prepare
    raise ValueError(f" ## Could not find {prefix}.* in model")
ValueError:  ## Could not find model.layers.0.mlp.down_proj.* in model 

@nivibilla
Contributor Author

It's a different architecture. It won't work out of the box. Pls wait for Turbo lol.
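
To see why the loader bails out: exllamav2's Llama-style config looks for dense model.layers.N.mlp.down_proj tensors (as the error above shows), while the Mixtral checkpoint stores its feed-forward weights per expert (in the HF implementation the names are along the lines of block_sparse_moe.experts.N.w1/w2/w3 plus a block_sparse_moe.gate router). A quick way to check what a downloaded checkpoint actually contains (a sketch using safetensors; the shard file name is a placeholder, use whatever your download has):

```python
# List the tensor names in one shard of the checkpoint to inspect the layer layout.
# The shard file name below is a placeholder - check the actual file in the folder.
from safetensors import safe_open

shard = r"C:\Users\danu0\Downloads\Artificial-Intelligence\Mixtral-8x7B-v0.1-GPTQ\model.safetensors"
with safe_open(shard, framework="pt") as f:
    for name in sorted(f.keys()):
        if ".layers.0." in name:
            print(name)

# Expect names like model.layers.0.block_sparse_moe.experts.0.w1.* and
# model.layers.0.block_sparse_moe.gate.*, rather than the mlp.down_proj.* that a
# Llama-style loader looks for - hence the error above.
```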

@CyberTimon
Contributor

He is working on it. There is already a preview in the experimental branch (which works but is unoptimized).

@turboderp
Member

All done. For now.
