[Feature] Add MoE Loras in ggml #2672

Closed
BarfingLemurs opened this issue Aug 19, 2023 · 7 comments

BarfingLemurs (Contributor) commented Aug 19, 2023

Using this as a reference: https://github.com/sail-sg/lorahub. This uses Flan models, though, and I'm not sure whether they are supported in ggml.

Wouldn't some version of this be very nice to see in main?

@Dampfinchen

That's a nice idea. It could be the closest thing we can get to MoE, at least for now.

Dampfinchen commented Aug 22, 2023

Multiple people are now working on it. The idea is to switch between specifically trained LoRA adapters based on the task. Hopefully CUDA acceleration will be ready by then.

@Dampfinchen

@ggerganov Jon Durbin, creator of Airoboros, has now published his proof of concept. LMoE is basically one of the first open-source approaches to Mixture of Experts, which could significantly improve model performance. Here, it's done by switching between specifically trained LoRAs on the fly, based on the task.

https://twitter.com/jon_durbin/status/1694360998797250856

Do you think this can be implemented in llama.cpp? It could potentially revolutionize open source models.
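
The routing mechanism described above is simple in principle: score the incoming prompt against a description (or example prompts) for each expert, then apply the best-matching LoRA before generating. Below is a minimal sketch of just that routing step, assuming sentence-transformers is installed; the expert names, descriptions, and adapter paths are hypothetical placeholders, and this is not Jon Durbin's actual implementation.

```python
# Minimal sketch of task-based LoRA routing (not the actual LMoE code).
from sentence_transformers import SentenceTransformer
import numpy as np

# One specially trained LoRA adapter per task "expert" (hypothetical paths/descriptions).
EXPERTS = {
    "code":      ("adapters/code-lora.bin",      "Write, explain, or debug source code."),
    "reasoning": ("adapters/reasoning-lora.bin", "Solve logic puzzles and math word problems."),
    "writing":   ("adapters/writing-lora.bin",   "Draft stories, essays, and creative prose."),
}

encoder = SentenceTransformer("all-MiniLM-L6-v2")
expert_names = list(EXPERTS)
expert_vecs = encoder.encode(
    [desc for _, desc in EXPERTS.values()], normalize_embeddings=True
)

def route(prompt: str) -> str:
    """Return the path of the LoRA adapter whose task description
    is most similar to the prompt (cosine similarity)."""
    q = encoder.encode([prompt], normalize_embeddings=True)[0]
    best = int(np.argmax(expert_vecs @ q))  # unit-length embeddings, so dot = cosine
    return EXPERTS[expert_names[best]][0]

# Applying the chosen adapter to the base model is backend-specific
# (and in llama.cpp happens at load time), so it is omitted here.
print(route("Fix the segfault in this C function"))  # -> adapters/code-lora.bin
```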

Ph0rk0z commented Aug 23, 2023

LoRA doesn't work for quants if you use the GPU :(

GPU is important for bigger models that would really benefit.

Otherwise you will get a mixture of imbeciles.

@BarfingLemurs (Contributor, Author)

> mixture of imbeciles

Woah, hey now :)

I feel any 4-bit CPU implementation could still greatly enhance the model, giving the best-quality responses from a 7B on mobile (Android and SBCs).

> GPU is important for bigger models that would really benefit.

https://github.com/jondurbin/airoboros#lmoe

vLLM quantization: vllm-project/vllm#744

It could soon be possible to run even the largest models locally as quantized AWQ models with this. It would be extremely fast and high quality.

But still, I just feel a CPU-performant llama implementation would be the most exciting: freely available to a massive number of people on almost every device, with no reliance on the cloud or a maxed-out hardware setup.

Ph0rk0z commented Aug 25, 2023

The point is that it should work with both CPU and GPU. Useful MoE is going to be 34B/70B.

GGML has the best current quant options, but they can't use LoRAs + GPU.

AWQ is interesting, but it's still new. And the same issue arises: does AWQ support LoRAs? Right now exllama/autogptq do, and so does BnB.
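
For context on why the quant + LoRA combination is awkward: a LoRA adapter is a low-rank delta that has to be added onto the base weights, and doing that on already-quantized tensors is lossy (llama.cpp's --lora-base option, which points at an f16 base model, exists for this reason). A minimal numpy sketch of the merge step, with illustrative shapes, rank, and alpha:

```python
# Minimal numpy sketch of merging a LoRA delta into a base weight matrix.
# Shapes, rank, and alpha below are illustrative only.
import numpy as np

d_out, d_in, r, alpha = 4096, 4096, 8, 16

W = np.random.randn(d_out, d_in).astype(np.float32)  # base weight, full precision
A = np.random.randn(r, d_in).astype(np.float32)      # LoRA down-projection
B = np.random.randn(d_out, r).astype(np.float32)     # LoRA up-projection

# Merged weight the runtime actually multiplies by: W' = W + (alpha / r) * B @ A
W_merged = W + (alpha / r) * (B @ A)

# If W is only available 4-bit quantized, applying the delta means
# dequantize -> add -> requantize, which loses precision; hence the
# preference for a full-precision base model when merging adapters.
```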

github-actions bot commented Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed on Apr 9, 2024.