[Feature] Add MoE Loras in ggml #2672
Comments
That's a nice idea. It could be the closest thing we can get to MoE, at least for now.
Multiple people are now working on it. The idea is to switch between specifically trained LoRA adapters based on the task. Hopefully CUDA acceleration will be ready by then.
@ggerganov Jon Durbin, creator of Airoboros, has now published his proof of concept. LMoE is basically one of the first open-source approaches to Mixture of Experts, which could significantly improve model performance. Here, it's done by switching specifically trained LoRAs on the fly based on the task. https://twitter.com/jon_durbin/status/1694360998797250856 Do you think this can be implemented in llama.cpp? It could potentially revolutionize open-source models.
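To make the idea concrete, here is a minimal sketch of what "switching LoRAs on the fly based on the task" could look like when driving llama.cpp from Python. This is not Jon Durbin's actual LMoE implementation; the adapter files, task descriptions, and model paths are made-up assumptions, and routing is done with a simple sentence-embedding similarity over task descriptions.

```python
# Sketch only: task-based LoRA routing around llama.cpp's `main` example.
# Assumes per-task LoRA adapters already converted for llama.cpp and an f16
# base model available for --lora-base (needed when the model is quantized).
import subprocess
from sentence_transformers import SentenceTransformer, util

# Hypothetical adapter registry: task description -> LoRA adapter file.
ADAPTERS = {
    "write or debug source code": "loras/coding.bin",
    "answer factual questions with step-by-step reasoning": "loras/reasoning.bin",
    "creative writing and roleplay": "loras/creative.bin",
}

embedder = SentenceTransformer("all-MiniLM-L6-v2")
task_embs = embedder.encode(list(ADAPTERS.keys()), convert_to_tensor=True)

def route(prompt: str) -> str:
    """Pick the adapter whose task description is most similar to the prompt."""
    query = embedder.encode(prompt, convert_to_tensor=True)
    best = int(util.cos_sim(query, task_embs).argmax())
    return list(ADAPTERS.values())[best]

def generate(prompt: str) -> str:
    lora = route(prompt)
    # llama.cpp's main accepts --lora / --lora-base; at the time of this
    # discussion LoRA application was CPU-only for quantized models.
    result = subprocess.run(
        ["./main", "-m", "models/llama-2-7b.ggmlv3.q4_0.bin",
         "--lora", lora, "--lora-base", "models/llama-2-7b.f16.bin",
         "-p", prompt, "-n", "256"],
        capture_output=True, text=True, check=True)
    return result.stdout

print(generate("Write a Python function that reverses a linked list."))
```

A native implementation inside llama.cpp would presumably keep several adapters resident and swap or re-apply them per request rather than relaunching the binary, but the routing step would look much the same.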
LoRA doesn't work with quantized models if you use the GPU :( GPU is important for the bigger models that would really benefit. Otherwise you will get a mixture of imbeciles.
Woah, hey now :) I feel any 4-bit CPU implementation could still greatly enhance the model, giving best-quality responses with a 7B on mobile (Android and SBCs).
https://github.com/jondurbin/airoboros#lmoe Regarding vLLM and quantization: it could soon be possible to run quantized AWQ models with this, even the largest models on a local setup. That would be extremely fast and high quality. Still, I feel a performant CPU llama implementation would be the most exciting, freely available to a massive number of people on almost every device, with no cloud reliance or maxed-out hardware setup.
The point is that it should work with both CPU and GPU. A useful MoE is going to be 34B/70B. GGML has the best current quant options, but it can't use LoRAs + GPU. AWQ is interesting, but it's still new, and the same issue arises: does AWQ support LoRAs? Right now exllama/AutoGPTQ do, and so does BnB.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Using this as a reference: https://github.com/sail-sg/lorahub. This uses FLAN models though; not sure if they are supported in ggml.
Wouldn't some version of this be very nice to see in main?
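For anyone skimming the lorahub link, the core operation is composing several existing LoRAs into one by learning mixing weights on a handful of examples. Below is a rough numpy sketch of what that composition boils down to for a single weight matrix; the shapes, scaling, and the weight-search step (LoRAHub uses a gradient-free optimizer over few-shot examples) are simplified assumptions, not the library's actual API.

```python
# Sketch: base weight plus a weighted sum of LoRA deltas (each delta is B @ A).
import numpy as np

def compose(W_base, adapters, mix_weights, alpha=16, rank=8):
    """adapters: list of (A, B) pairs with A of shape (r, in), B of shape (out, r)."""
    W = W_base.copy()
    for (A, B), w in zip(adapters, mix_weights):
        W += w * (alpha / rank) * (B @ A)   # scaled low-rank update per adapter
    return W

# Toy example: one 4x4 base matrix, two rank-2 adapters, mixed 70/30.
rng = np.random.default_rng(0)
W_base = rng.normal(size=(4, 4))
adapters = [(rng.normal(size=(2, 4)), rng.normal(size=(4, 2))) for _ in range(2)]
print(compose(W_base, adapters, mix_weights=[0.7, 0.3], alpha=16, rank=2))
```

Since the result is just another dense weight update, a composed adapter like this could in principle be exported and applied the same way a single LoRA is today, which is what makes it interesting for ggml.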