[Feature] Add MoE Loras in ggml #2672

Closed
BarfingLemurs opened this issue Aug 19, 2023 · 7 comments

BarfingLemurs (Contributor) commented Aug 19, 2023

Using this as a reference: https://github.com/sail-sg/lorahub. This uses Flan models, though, and I'm not sure whether they are supported in ggml.

Wouldn't some version of this be very nice to see in main?

@Dampfinchen

That's a nice idea. It could be the closest thing we can get to MoE, at least for now.

Dampfinchen commented Aug 22, 2023

Multiple people are now working on it. The idea is to switch between specifically trained LoRA adapters based on the task. Hopefully CUDA acceleration will be ready by then.

@Dampfinchen

@ggerganov Jon Durbin, creator of Airoboros, has now published his proof of concept. LMoE is basically one of the first open-source approaches to Mixture of Experts, which could significantly improve model performance. Here, it's done by switching between specifically trained LoRAs on the fly, based on the task.

https://twitter.com/jon_durbin/status/1694360998797250856

Do you think this can be implemented in llama.cpp? It could potentially revolutionize open source models.
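
The routing mechanism described above is simple in principle: score the incoming prompt against a description (or example prompts) for each expert, then apply the best-matching LoRA before generating. Below is a minimal sketch of just that routing step, assuming sentence-transformers is installed; the expert names, descriptions, and adapter paths are hypothetical placeholders, and this is not Jon Durbin's actual implementation.

```python
# Minimal sketch of task-based LoRA routing (not the actual LMoE code).
from sentence_transformers import SentenceTransformer
import numpy as np

# One specially trained LoRA adapter per task "expert" (hypothetical paths/descriptions).
EXPERTS = {
    "code":      ("adapters/code-lora.bin",      "Write, explain, or debug source code."),
    "reasoning": ("adapters/reasoning-lora.bin", "Solve logic puzzles and math word problems."),
    "writing":   ("adapters/writing-lora.bin",   "Draft stories, essays, and creative prose."),
}

encoder = SentenceTransformer("all-MiniLM-L6-v2")
expert_names = list(EXPERTS)
expert_vecs = encoder.encode(
    [desc for _, desc in EXPERTS.values()], normalize_embeddings=True
)

def route(prompt: str) -> str:
    """Return the path of the LoRA adapter whose task description
    is most similar to the prompt (cosine similarity)."""
    q = encoder.encode([prompt], normalize_embeddings=True)[0]
    best = int(np.argmax(expert_vecs @ q))  # unit-length embeddings, so dot = cosine
    return EXPERTS[expert_names[best]][0]

# Applying the chosen adapter to the base model is backend-specific
# (and in llama.cpp happens at load time), so it is omitted here.
print(route("Fix the segfault in this C function"))  # -> adapters/code-lora.bin
```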

Ph0rk0z commented Aug 23, 2023

LoRA doesn't work for quants if you use the GPU :(

GPU is important for bigger models that would really benefit.

Otherwise you will get a mixture of imbeciles.

@BarfingLemurs (Contributor, Author)

> mixture of imbeciles

Woah, hey now :)

I feel any 4-bit CPU implementation could still greatly enhance the model, giving the best-quality responses from a 7B on mobile (Android and SBCs).

> GPU is important for bigger models that would really benefit.

https://github.com/jondurbin/airoboros#lmoe

vLLM quantization: vllm-project/vllm#744

It could soon be possible to run even the largest models locally as quantized AWQ models with this. It would be extremely fast and high quality.

But still, I just feel a CPU-performant llama implementation would be the most exciting: freely available to a massive number of people on almost every device, with no reliance on the cloud or a maxed-out hardware setup.

Ph0rk0z commented Aug 25, 2023

The point is that it should work with both CPU and GPU. Useful MoE is going to be 34B/70B.

GGML has the best current quant options, but they can't use LoRAs + GPU.

AWQ is interesting, but it's still new. And the same issue arises: does AWQ support LoRAs? Right now exllama/autogptq do, and so does BnB.
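
For context on why the quant + LoRA combination is awkward: a LoRA adapter is a low-rank delta that has to be added onto the base weights, and doing that on already-quantized tensors is lossy (llama.cpp's --lora-base option, which points at an f16 base model, exists for this reason). A minimal numpy sketch of the merge step, with illustrative shapes, rank, and alpha:

```python
# Minimal numpy sketch of merging a LoRA delta into a base weight matrix.
# Shapes, rank, and alpha below are illustrative only.
import numpy as np

d_out, d_in, r, alpha = 4096, 4096, 8, 16

W = np.random.randn(d_out, d_in).astype(np.float32)  # base weight, full precision
A = np.random.randn(r, d_in).astype(np.float32)      # LoRA down-projection
B = np.random.randn(d_out, r).astype(np.float32)     # LoRA up-projection

# Merged weight the runtime actually multiplies by: W' = W + (alpha / r) * B @ A
W_merged = W + (alpha / r) * (B @ A)

# If W is only available 4-bit quantized, applying the delta means
# dequantize -> add -> requantize, which loses precision; hence the
# preference for a full-precision base model when merging adapters.
```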

github-actions bot commented Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed on Apr 9, 2024.