Adding Ling/Ring (a.k.a., Bailing-MoE2) support #833
Conversation
Haha, the Ling-mini-2.0 model that I'm using for testing seems to be yet another one of those models where
This is without even using an imatrix.
May I ask, do you have any idea why this is? In this case the difference is quite substantial. Would this signal an issue with the quality of the model in general, or is it just pure coincidence?
Not sure. It could be many things. But in this specific case I would wait for the grouped expert routing to be implemented correctly before drawing any conclusions.
It should not crash; it works for me on CUDA and CPU. Do you have more details?
OK, it no longer crashes after your fix. But now Wikitext perplexity is much higher:
Is this expected?
Yes, I saw that too, and I think so. Basically, what group selection does is remove half the experts: the experts are divided into 8 groups and only 4 of them are used, based on the highest scores within each group. I don't know how the model was trained, but I imagine each group has some sort of specialization, so due to the somewhat random nature of Wikitext a higher PPL is to be expected. In case anyone wonders, the current group selection implementation selects groups based on the single top score within each group, while the disabled one (disabled because it is inefficient and only has CPU support) selects groups based on the sum of the top two (like the original implementation). Both implementations yield a similarly higher PPL.
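To make the mechanism described above concrete, here is a minimal standalone C++ sketch of the group-limited routing idea as I read this description, for a single token's expert scores. It is not the PR's actual ggml code: the expert count, top-k value, and the "sum of the top two" group score are illustrative assumptions based on the discussion in this thread.

```cpp
// Toy sketch of group-limited expert routing for ONE token's expert scores.
// Illustrative assumptions (not taken from the PR's ggml code): 32 experts,
// 8 groups, 4 groups kept, 8 experts selected, group score = sum of the
// group's top-2 expert scores (the "sum of two" variant discussed above).
#include <algorithm>
#include <cstdio>
#include <functional>
#include <limits>
#include <utility>
#include <vector>

static std::vector<int> group_limited_topk(std::vector<float> scores,
                                           int n_group, int n_group_used, int top_k) {
    const int n_expert = (int) scores.size();
    const int group_sz = n_expert / n_group;   // assumes n_expert % n_group == 0 and group_sz >= 2

    // 1. Score each group by the sum of its two best expert scores.
    std::vector<std::pair<float, int>> group_scores(n_group);
    for (int g = 0; g < n_group; ++g) {
        std::vector<float> grp(scores.begin() + g*group_sz, scores.begin() + (g + 1)*group_sz);
        std::partial_sort(grp.begin(), grp.begin() + 2, grp.end(), std::greater<float>());
        group_scores[g] = { grp[0] + grp[1], g };
    }

    // 2. Keep the n_group_used best groups; mask the experts of the others with -inf.
    std::sort(group_scores.begin(), group_scores.end(), std::greater<>());
    for (int i = n_group_used; i < n_group; ++i) {
        const int g = group_scores[i].second;
        std::fill(scores.begin() + g*group_sz, scores.begin() + (g + 1)*group_sz,
                  -std::numeric_limits<float>::infinity());
    }

    // 3. Normal top-k selection over the partially masked expert scores.
    std::vector<int> idx(n_expert);
    for (int i = 0; i < n_expert; ++i) idx[i] = i;
    std::partial_sort(idx.begin(), idx.begin() + top_k, idx.end(),
                      [&scores](int a, int b) { return scores[a] > scores[b]; });
    idx.resize(top_k);
    return idx;
}

int main() {
    std::vector<float> scores(32);
    for (int i = 0; i < 32; ++i) scores[i] = 0.01f * float((i*7919) % 97);  // arbitrary dummy scores
    for (int e : group_limited_topk(scores, /*n_group=*/8, /*n_group_used=*/4, /*top_k=*/8)) {
        std::printf("selected expert %d\n", e);
    }
    return 0;
}
```

The single-score variant mentioned above would simply use `grp[0]` instead of `grp[0] + grp[1]` as the group score; the masking and top-k steps are unchanged.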
What if we took something less random? Say, Pride and Prejudice from Project Gutenberg. We get
If we did have the situation where "each group has some sort of specialization", one would expect that no relevant groups get discarded when reading a well-known novel (which with almost 100% probability has been in the training data), so the grouped expert routing shouldn't have a 50% higher PPL, and it shouldn't be higher compared to just using the top 4 experts with standard routing. For comparison, Llama-3.1-8B-Instruct gets PPL = 2.7745 on Pride and Prejudice.
True, but a PPL of ~32 with standard routing is extremely high to begin with; I would say this model does not have it in its training data...
OK, I'll merge this PR without enabling grouped expert routing. As far as I can tell, the grouping does not improve inference (if anything, it makes matters worse). It also results in a significant performance degradation. If we get reports that
Here are some experimental results for Wikitext perplexity computed with different u-batch sizes.
The "diff" column shows the difference to the average of the 6 results for the respective expert routing type. Differences in the range of 0.1% between different parameters / hardware / etc. are normal, and this is what we see for the standard routing approach. For the grouped routing the results vary wildly, well beyond the expected 0.1% range. Looking at the code this is kind of expected, considering that the selected experts for each token will depend on the tokens present in the batch. But perhaps I'm misunderstanding the implementation? Can you explain in plain English what you are trying to accomplish? Thanks.
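(For context on how such numbers are typically produced: these would be runs of the perplexity tool with only the u-batch size varied, e.g. something like `./llama-perplexity -m <model>.gguf -f wiki.test.raw -ub 512` versus the same command with `-ub 2048`. The exact command lines are not shown in this thread, so treat that as an illustrative assumption.)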
That's correct.
I'm merely reproducing the original implementation: all the experts are "grouped" into 8 groups, each group is then scored, and the 4 lowest-scoring groups are masked out (their experts' scores are set to -inf) before proceeding to the normal top-k selection.
As far as I can tell, the code in your
OK, that was my intention at least; I might have done something wrong, of course. My understanding of what the original (and my) code does is what I wrote in plain English, though.
Derived from https://github.com/ggml-org/llama.cpp/pull/16063/files
Did not add the expert grouping thing yet, as it is not working in the mainline PR (I get a crash when I try to run it).
Nevertheless, based on very rudimentary testing it seems to be working OK. Hence publishing it for testing by others.
~~Did not yet add the necessary changes to `convert_hf_to_gguf.py`, so model conversion needs to be done with the mainline PR for now.~~ The conversion script seems to be working fine, at least with the model I tested with.
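For anyone following along, the conversion step itself is just the usual script invocation, something along the lines of `python convert_hf_to_gguf.py /path/to/Ling-mini-2.0 --outfile Ling-mini-2.0-F16.gguf --outtype f16`; the paths and output file name here are placeholders, not taken from this PR.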
Here is a quick `llama-sweep-bench` result using this Q4_K_M quantized model (RTX-4080, `-c 32768 -ub 2048 -t 1 -ngl 100 -fa -fmoe`).

Closes #813
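For reference, assembling those flags into a full invocation would look roughly like `./llama-sweep-bench -m Ling-mini-2.0-Q4_K_M.gguf -c 32768 -ub 2048 -t 1 -ngl 100 -fa -fmoe`; the binary path and model file name are placeholders, and only the listed flags come from the PR description.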