Mixtral #223
So far I've heard it requires 2x 80GB GPUs to run, or 4x 40GB, which is just shy of the 140GB needed for a 70B at fp16. So if it's possible to run a 70B on 24GB, then I hope it should be possible to run the MoE on a single 24GB GPU.
I don't doubt that it's possible. The main challenges are:
I look forward to some sort of announcement from Mistral. Maybe even some reference code.
It runs on 4x 3090s taking under 92GB in fp16 - not sure where these other figures are from. I got it running yesterday.
No, it can run on 2x 3090 with 8-bit or 4-bit quantization using bitsandbytes, but it runs extremely slowly. The only way to make it practical is with exllama or similar. Not even GPTQ works right now.
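For reference, a minimal sketch of what a bitsandbytes 4-bit load through transformers looks like; the model ID and generation settings here are just placeholders, not a tuned setup, and `device_map="auto"` is what shards the weights across e.g. 2x 3090:

```python
# Minimal sketch: loading Mixtral with 4-bit bitsandbytes quantization via transformers.
# Model ID and generation settings are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-v0.1"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit weights, roughly half the 8-bit footprint
    bnb_4bit_compute_dtype=torch.float16,  # dequantize to fp16 for the matmuls
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                     # shard across available GPUs
)

inputs = tokenizer("The Mixtral architecture is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```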
https://twitter.com/4evaBehindSOTA/status/1733551103105720601?t=SiKV8qH1IIoKiQlRBiIR5w&s=19 Tim from bitsandbytes says he's done MoE quantisation before, and that it's actually easier than with vanilla transformers.
I really hope it gets exllama support. The model is multilingual and seems very powerful. Hope it can be implemented. The official inference code for vLLM etc. dropped hours ago.
The Mixtral PR got merged into transformers, so you can look at their implementation: huggingface/transformers#27942
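For anyone wondering why it's "a different architecture": below is a toy sketch, not the transformers code (the real expert MLP is a SwiGLU rather than this simplified stack), of the sparse top-2 routing Mixtral adds on top of a dense transformer - a gate picks 2 of 8 expert FFNs per token and mixes their outputs, which is exactly the part existing dense-model loaders don't handle:

```python
# Toy sketch of Mixtral-style sparse MoE routing (simplified, not the HF implementation):
# a learned gate selects the top-2 of 8 expert FFNs per token and mixes their outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEBlock(nn.Module):
    def __init__(self, hidden_size=4096, ffn_size=14336, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_size, ffn_size), nn.SiLU(), nn.Linear(ffn_size, hidden_size))
            for _ in range(num_experts)
        ])

    def forward(self, x):                        # x: [tokens, hidden]
        logits = self.gate(x)                    # [tokens, num_experts]
        weights, idx = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize over the chosen experts only
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e            # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

block = SparseMoEBlock(hidden_size=64, ffn_size=128)   # tiny sizes just for a quick test
print(block(torch.randn(5, 64)).shape)                 # torch.Size([5, 64])
```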
When you do implement it, please don't forget the ROCm implementation 🙇 |
NICE, glad to see this is being talked about 👍 |
When attempting to use Mixtral-8x7B-v0.1-GPTQ from https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GPTQ with chat.py, this error occurs:
It's a different architecture. It won't work out of the box. Pls wait for Turbo lol. |
He is working on it. In the experimental branch there is already a preview (which works but is unoptimized).
All done. For now. |
Not an issue, but seeing that EXL2 2-bit quants of a 70B model can fit on a single 24GB GPU, I'm wondering whether it's possible to run a quantized version of Mixtral 7B*8 on a single 24GB GPU, and whether that's something exllama2 could support or would need a completely different project (rough memory estimate below).
Mistral MoE 7b*8 model
https://twitter.com/MistralAI/status/1733150512395038967?t=6jDOugc19MUNyOV1KK6Ing&s=19
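For a rough sense of scale, a back-of-envelope weight-only estimate, assuming the commonly cited ~47B total parameters for Mixtral (the eight expert FFNs share the attention weights, so it is not simply 8x7B) and ignoring KV cache and runtime overhead:

```python
# Back-of-envelope weight memory for Mixtral at different bit widths.
# Assumes ~47B total parameters; KV cache and runtime overhead not included.
total_params = 47e9

for bits in (16, 8, 4, 2):
    gib = total_params * bits / 8 / 1024**3
    print(f"{bits:>2}-bit weights: ~{gib:.0f} GiB")

# 16-bit weights: ~88 GiB   (consistent with the "under 92GB in fp16 on 4x 3090" report above)
#  8-bit weights: ~44 GiB
#  4-bit weights: ~22 GiB
#  2-bit weights: ~11 GiB   (why a 2-bit EXL2 quant could plausibly fit on a 24GB card)
```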