model : add BailingMoeV2 support #16063
Conversation
Hello, I think it's necessary to split and permute the Q/K weights because of a difference in the implementation of Rotary Position Embedding (RoPE). The Hugging Face transformers implementation splits the features into two contiguous halves and then rotates them, whereas the implementation in llama.cpp is equivalent to interleaving the features, selecting even and odd indices separately before rotating (both layouts are sketched after this comment). Therefore, to ensure compatibility, the model parameters for Q and K would need to be split and permuted to match the llama.cpp RoPE implementation, similar to the approach in this commit: 09e3df4 |
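For reference, here is a rough numpy sketch of the two layouts being contrasted (my illustration, not the actual transformers or llama.cpp code); `x` is a single attention head's vector at one position:

```python
import numpy as np

def rope_half_split(x, pos, theta=10000.0):
    # HF-transformers style: split the head dim into two contiguous halves,
    # rotate_half(x) = concat(-x[d/2:], x[:d/2])
    d = x.shape[-1]
    freqs = pos / theta ** (np.arange(0, d, 2) / d)          # (d/2,)
    cos = np.concatenate([np.cos(freqs), np.cos(freqs)])     # (d,)
    sin = np.concatenate([np.sin(freqs), np.sin(freqs)])
    x1, x2 = x[:d // 2], x[d // 2:]
    return x * cos + np.concatenate([-x2, x1]) * sin

def rope_interleaved(x, pos, theta=10000.0):
    # Interleaved style: rotate adjacent pairs (x0, x1), (x2, x3), ...
    d = x.shape[-1]
    freqs = pos / theta ** (np.arange(0, d, 2) / d)          # (d/2,)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * np.cos(freqs) - x[1::2] * np.sin(freqs)
    out[1::2] = x[0::2] * np.sin(freqs) + x[1::2] * np.cos(freqs)
    return out
```

The two are interchangeable for attention if the Q/K projection rows are permuted so that pair (i, i + d/2) lands on positions (2i, 2i + 1); as the next reply points out, this model uses the NeoX (half-split) RoPE mode, so no such permutation is needed here.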
Nope, this is standard NeoX RoPE. |
Got it, thanks for pointing that out! |
Hello, could we flip this PR to “Open” and put it up for review? We're looking forward to being able to use it directly on our laptops as soon as possible. It's a lightweight model that also offers sufficiently high inference speed, and we believe it has the potential to be many people's next-generation AI assistant. |
I will after I've reimplemented the expert group selection, pending the merge of #16159 |
Hi, I've observed some unexpected behavior in the current PR during testing, particularly with long-context processing and multi-turn dialogues.

Screenshots: a comparison of llama-cli with the same input, and a comparison of the eval callback (values before RoPE). [screenshots omitted]

Reproduce:
Step 1: Download the Ling mini 2.0 model: https://huggingface.co/inclusionAI/Ling-mini-2.0/tree/main
Step 2: Convert HF to GGUF: python convert_hf_to_gguf.py --outtype bf16 ./Ling-mini-2.0
Step 3: Run the eval callback: ./build-arm64-apple-clang-debug/bin/llama-eval-callback --log-prefix -v -m ./Ling-mini-2.0-BF16.gguf -p "你好,你是谁?" --log-file ling-mini-bf16.txt

Detailed eval callback logs are attached. |
@im0qianqian Thanks for testing, but first make sure you are using the same chat template; yours does not match the jinja one, so use that instead. As for the eval callback, it would be more interesting to compare against HF to see in which direction each may deviate (one possible way to do that is sketched below). |
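For anyone wanting to run that comparison, here is one possible way (an assumption on my part, not what was actually used here) to dump per-layer activations on the HF side with forward hooks, so they can be lined up against llama-eval-callback output; the module-name filter is a guess and may need adjusting for this architecture:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model id taken from the reproduce steps above.
model_id = "inclusionAI/Ling-mini-2.0"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
)

acts = {}

def make_hook(name):
    def hook(_module, _inputs, output):
        out = output[0] if isinstance(output, tuple) else output
        acts[name] = out.detach().float()
    return hook

for name, module in model.named_modules():
    if name.endswith(("self_attn", "mlp")):  # assumed module names
        module.register_forward_hook(make_hook(name))

with torch.no_grad():
    model(**tok("你好,你是谁?", return_tensors="pt"))

for name, tensor in acts.items():
    print(name, tuple(tensor.shape), tensor.flatten()[:8].tolist())
```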
Sorry — my baseline eval-callback log was flawed.
This is the new eval callback log: ling-mini-bf16-topk-router.txt |
There is probably some issue with it right now, yes; let's retest once I reimplement the expert group selection with set_rows. No need to disable it in the meantime. |
Just checking in on this, are we waiting on another PR to merge? I noticed Inclusion themselves published a GGUF ahead of official support. It would be nice to get official support merged! https://huggingface.co/inclusionAI/Ling-mini-2.0-GGUF |
Sigh, it's preferable not to release GGUFs before a PR is merged, there may be breaking changes during review... |
they have their own fork for inference |
Hi, when is this PR expected to be merged into master? |
Hello, perhaps we have a better solution. Ling mini 2.0 is a small-sized model with strong capabilities and exceptionally fast inference performance, and many developers in the community currently hope to use it on platforms like llama.cpp / ollama / lm studio. Therefore, we could prioritize merging a version without group routing support; that way, at least everyone can use the model normally, and group routing can be added back once the remaining issue is resolved. In my view, group-wise routing is primarily a pre-training strategy that combines optimizations like DeepEP under 256 experts to achieve better training efficiency; for inference, the model can still perform excellently even without group routing (a rough sketch of the scheme follows below). |
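To make the discussion concrete, here is a rough numpy sketch of what group-limited expert selection typically looks like (my illustration, loosely following DeepSeek-style routers; the exact group scoring and layout used by BailingMoeV2 may differ):

```python
import numpy as np

def group_limited_topk(scores, n_groups, topk_groups, topk_experts):
    # scores: (n_experts,) router scores for a single token,
    # with experts assumed to be laid out group-contiguously.
    n_experts = scores.shape[0]
    grouped = scores.reshape(n_groups, n_experts // n_groups)

    # Score each group (here: by its best expert) and keep the top groups.
    group_scores = grouped.max(axis=-1)
    keep = np.argsort(-group_scores)[:topk_groups]

    # Mask out the experts of discarded groups, then do the usual top-k.
    masked = np.full_like(grouped, -np.inf)
    masked[keep] = grouped[keep]
    flat = masked.reshape(-1)
    return np.argsort(-flat)[:topk_experts]

# e.g. 256 experts in 8 groups, keep 4 groups, route each token to 8 experts
expert_ids = group_limited_topk(np.random.randn(256), 8, 4, 8)
```

In this sketch the group step only pre-filters which experts are eligible; the final top-k over experts is unchanged, which is why serving the model without it is a reasonable interim option.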
I will try to resolve this today/this weekend, but as you say there really is no issue in releasing a version without it (as long as the metadata is there to support it in the future). |
soo long |
Is there any way to get the reviewers to notice this? |
This comment answers this question.

CUDA build: [details omitted]

CPU-only build: [details omitted]

The model is from here. |
Thanks, reproduced, looking into it... |
@ikawrakow Fixed, thanks again for the report. |
Does this implementation cover linear attention ones as well? |
No, it's a different architecture that requires Lightning Attention to be implemented. |
Could we contact the reviewers to get this PR merged? |
third time's the charm?
Thanks for your help, I had time to roll a mainline-compatible Ling-1T test quant. Early tests with this PR 👈 (details below):

model=/mnt/data/models/ubergarm/Ling-1T-GGUF/Ling-1T-smol-IQ2_XXS.gguf
numactl -N "$SOCKET" -m "$SOCKET" \
./build/bin/llama-server \
--model "$model"\
--alias ubergarm/Ling-1T-smol-IQ2_XXS \
--ctx-size 65536 \
-fa 1 \
-ctk q8_0 -ctv q8_0 \
-ub 4096 -b 4096 \
--parallel 1 \
--threads 128 \
--threads-batch 192 \
--numa numactl \
--host 127.0.0.1 \
--port 8080 \
--no-mmap
llama_model_loader: - type f32: 473 tensors
llama_model_loader: - type q8_0: 8 tensors
llama_model_loader: - type q4_K: 161 tensors
llama_model_loader: - type q5_K: 80 tensors
llama_model_loader: - type q6_K: 153 tensors
llama_model_loader: - type iq2_xxs: 228 tensors

model=/mnt/raid/hf/Ling-1T-GGUF/smol-IQ2_XXS/Ling-1T-smol-IQ2_XXS-00001-of-00006.gguf
numactl -N "$SOCKET" -m "$SOCKET" \
./build/bin/llama-perplexity \
-m "$model" \
-f wiki.test.raw \
--seed 1337 \
-fa 1 \
--ctx-size 512 \
-ub 4096 -b 4096 \
--numa numactl \
--threads 128 \
--threads-batch 192 \
--no-mmap
system_info: n_threads = 128 (n_threads_batch = 192) / 768 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 435.37 ms
perplexity: calculating perplexity over 594 chunks, n_ctx=512, batch_size=4096, n_seq=8
perplexity: 88.07 seconds per pass - ETA 1 hours 48.98 minutes
[1]2.4932,[2]3.8882,[3]3.3771,[4]2.7160,[5]2.3823,[6]2.2384,[7]2.1330,[8]2.0418,[9]2.0070,[10]1.9620,[11]1.8993,[12]1.8480,[13]1.8423,[14]1.8780,[15]1.9077,[16]1.9946,
With some luck I'll be able to clear up my huggingface account and upload it. I can provide the exact recipe if needed, just holler. Thanks, great work, and I appreciate all the collaboration!

UPDATE: The final perplexity looks good to me:

Final estimate: PPL = 2.5870 +/- 0.01279 |
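For readers unfamiliar with the metric, the reported number is the exponential of the mean negative log-likelihood over all evaluated tokens; a minimal sketch with assumed variable names (not llama.cpp's actual code):

```python
import numpy as np

def perplexity(token_logprobs):
    # token_logprobs: natural-log probabilities assigned to each evaluated token
    return float(np.exp(-np.mean(token_logprobs)))
```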
The new LLaDA2Moe model uses this method too; make it generally available regardless of architecture.
@ggerganov If you have the time a review would be appreciated to celebrate the 1 month anniversary of this PR. :) |
The expert grouping logic seems correct to me. But ideally you should compare the logits with the reference implementation just in case.
I've verified it manually now; it should be correct. Additionally, the perplexity tests confirm that it works well. |
Adds support for Ling(-flash) 2.0 and Ring 2.0, base and instruction-tuned models (sans MTP, but the layer is included).

Includes expert group selection implemented as a ggml_custom function; not sure if there's a better way, or if it makes sense to implement some sort of masking op. Requires #16060 to work on CUDA; there also seems to be some issue with CPU, will check later. Either way, will reimplement expert group selection with set_rows once possible.