
Optimized DeepSeek V2/V3 implementation (MLA) #11446


Closed
wants to merge 10 commits

Conversation

fairydreaming
Collaborator

@fairydreaming commented Jan 27, 2025

This PR introduces various optimizations for the DeepSeek V2/V3 implementation.

Note that you need to reconvert the model to use this implementation.

Performance compared to the previous "naive" implementation:

[plots: deepseek-mla, deepseek-lite-mla-pp, deepseek-r1-mla, deepseek-mla-pp]

CUDA performance is worse for short context lengths, but the curve is flatter:

[plots: deepseek-lite-mla, deepseek-lite-cuda-mla-pp]

TODO:

  • remove unused kv_b tensor from the model
  • maybe add support for old model files (compute k_b and v_b during inference with reduced performance)
  • wait for completion of: llama : refactor llama_kv_cache, llama_context and llm_build_context #11213
  • implement MLA KV cache
  • address regressions in prompt processing performance (different permutations of tensors?) - I don't think it's possible, as this implementation is more compute-intensive than the regular attention implementation

@fairydreaming marked this pull request as a draft on January 28, 2025 at 11:23
@wronkiew

@fairydreaming do you have a converted model available or instructions for replicating your setup? I would like to run some benchmarks on these changes.

@fairydreaming
Collaborator Author

@fairydreaming do you have a converted model available or instructions for replicating your setup? I would like to run some benchmarks on these changes.

@wronkiew What model would you like to test?

@wronkiew

@fairydreaming do you have a converted model available or instructions for replicating your setup? I would like to run some benchmarks on these changes.

@wronkiew What model would you like to test?

V3/R1, Q4_K_S.

@fairydreaming
Collaborator Author

@fairydreaming do you have a converted model available or instructions for replicating your setup? I would like to run some benchmarks on these changes.

@wronkiew What model would you like to test?

V3/R1, Q4_K_S.

@wronkiew I don't have the model uploaded (my upload bandwidth is too low); you have to download the original model, convert it to bf16, convert it to GGUF, and quantize it yourself (or download one that is already converted to bf16, which will save you one step).

@fairydreaming
Collaborator Author

I spent some time investigating this hint from the DeepSeek V2 paper:

Fortunately, due to the associative law of matrix multiplication, we can absorb $W^{UK}$ into $W^{UQ}$, and $W^{UV}$ into $W^O$

At first glance it looks reasonable: each absorbed matrix allows us to replace two matrix multiplications with a single multiplication, thus reducing the number of operations.

However, when we take a look at the dimensions of these matrices, this stops being reasonable. For example, in DeepSeek V2 Lite:

  • $W^{UQ}$ tensor has shape [2048, 2048], that is [16, 2048, 128] after reshaping to 3D and permutation
  • $W^{UK}$ tensor has shape [128, 8192], that is [16, 512, 128] after reshaping to 3D and permutation
  • combined "absorbed" tensor has shape [16, 512, 2048]

So (let's ignore the head dimension) this allows us to replace two multiplications, with a [2048, 128] matrix and a [512, 128] matrix, with a single multiplication with a [512, 2048] matrix. The combined matrix has over 3x the number of elements of the two individual matrices, so it will take more memory and it will actually be slower to multiply than the two multiplications with smaller matrices.

With $W^{UV}$ and $W^O$ it's the same story:

  • $W^{UV}$ tensor has shape [2048, 512], that is [16, 512, 128] after reshaping to 3D and permutation
  • $W^O$ tensor has shape [2048, 2048], that is [16, 2048, 128] after reshaping to 3D and permutation
  • combined "absorbed" tensor has shape [16, 512, 2048]
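To make the cost comparison concrete, here is a rough numpy sketch (mine, not code from the PR) using the DeepSeek V2 Lite shapes above, for a single head and a single cached position; the absorbed form produces the same attention score while storing and multiplying through roughly 3x more weight elements:

import numpy as np

d_model, d_head, kv_rank = 2048, 128, 512   # DeepSeek V2 Lite numbers from above, one head

W_uq = np.random.randn(d_model, d_head)     # per-head slice of W^UQ: [2048, 128]
W_uk = np.random.randn(kv_rank, d_head)     # per-head slice of W^UK: [512, 128]
W_absorbed = W_uq @ W_uk.T                  # "absorbed" matrix: [2048, 512]

x = np.random.randn(d_model)                # hidden state of the current token
c = np.random.randn(kv_rank)                # cached latent kv of one past position

# two small matmuls vs one big one: identical score, different cost
score_naive    = (x @ W_uq) @ (c @ W_uk)    # ~2048*128 + 512*128 = 327,680 mults
score_absorbed = (x @ W_absorbed) @ c       # ~2048*512           = 1,048,576 mults

print(np.allclose(score_naive, score_absorbed))      # True
print(W_uq.size + W_uk.size, "vs", W_absorbed.size)  # 327680 vs 1048576 (~3.2x)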

I also found this blog post: https://github.com/xjdr-alt/mla_blog_translation where they mention:

Compared to performing projection with these particularly large low-rank matrices, it is obviously more advantageous to multiply them successively according to the low-rank decomposition form. Therefore, we believe that this optimization step is not very necessary.

So it looks like a dead end; it won't give us any speed gains.

@divine-taco

I ran into an issue with DeepSeek-R1-UD-Q2_K_XL from unsloth/DeepSeek-R1-GGUF

llama_model_load: error loading model: missing tensor 'blk.0.attn_k_b.weight'
llama_model_load_from_file_impl: failed to load model

@fairydreaming
Collaborator Author

fairydreaming commented Jan 31, 2025

I ran into an issue with DeepSeek-R1-UD-Q2_K_XL from unsloth/DeepSeek-R1-GGUF

llama_model_load: error loading model: missing tensor 'blk.0.attn_k_b.weight'
llama_model_load_from_file_impl: failed to load model

As I wrote in the PR:

Note that you need to reconvert the model to use this implementation.

Existing GGUFs won't work, you have to convert and quantize one with the code from this PR.

@danielhanchen
Contributor

Ohh hmm should I re-quantize the ones in https://huggingface.co/unsloth/DeepSeek-R1-GGUF?

@fairydreaming
Collaborator Author

Ohh hmm should I re-quantize the ones in https://huggingface.co/unsloth/DeepSeek-R1-GGUF?

I think it's best to wait a bit until this is stable and merged; it's possible that there will be some changes that would cause them to stop working and you'd have to repeat the conversion.

@fairydreaming
Collaborator Author

I updated the token generation performance plots in the PR post and also added some new ones showing prompt processing performance. The optimized implementation generally performs WORSE in prompt processing - DeepSeek R1 671B Q4_K_S running on CPU performs only a little worse (~10% with a 4k prompt), but DeepSeek V2 Lite Q8_0 running on an RTX 4090 performs MUCH WORSE (~30% with a 16k prompt), and in both cases the gap widens as the prompt length increases. So it's not all sunshine and rainbows.

Considering all these performance regressions I think the best course of action would be to put the optimized implementation into a separate model architecture (LLM_ARCH_DEEPSEEK2_MLA or something like this). This will prevent issues with existing GGUFs - they would keep working with the existing architecture. I guess in this case the convert script would have to allow selection of the target model architecture with some option, but that shouldn't be difficult to add. @ggerganov what do you think?

Comment on lines +6406 to +6409
// whether to use n_tokens or n_head as the matrix dimension during multiplication
// n_tokens is higher during prompt processing, which allows optimizing for that case
bool pp_opt = n_tokens > n_head;

Member

I'm not really sure this is the right approach. Haven't followed through the logic yet, but it seems strange to involve so many permutes and conts.

I would first look into improving the FA kernels to support DeepSeek head sizes.

Collaborator Author

I'm not really sure this is the right approach. Haven't followed through the logic yet, but it seems strange to involve so many permutes and conts.

Hmm? I'm quite sure there's only one ggml_cont() call (excluding the ones for CUDA compatibility that already existed in the previous implementation).

As for the permutes, the idea is to multiply by a matrix with a second dimension equal to the number of heads instead of the number of tokens (which is 1 during single-sequence token generation); that increased the performance on a CPU a bit.

So during prompt processing we have 2 permutes and 1 cont. During token generation we have 5 permutes (yeah, that may be a lot) and 0 conts.
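For illustration only (this is not the PR's ggml code), here is a numpy sketch of the trade-off the pp_opt flag is choosing between, with toy shapes and an MQA-style shared latent cache assumed from the discussion in this thread (576 = 512 latent + 64 rope dims): during single-token generation, putting heads on the matrix dimension turns 16 width-1 mat-vec products into one GEMM with a width-16 right-hand side.

import numpy as np

n_head, kv_dim, n_kv, n_tokens = 16, 576, 1024, 1    # token generation: n_tokens == 1

cache = np.random.randn(n_kv, kv_dim)                # latent cache, shared by all heads
q     = np.random.randn(n_head, n_tokens, kv_dim)    # per-head queries mapped into latent space

# option 1: n_tokens (= 1) as the second matrix dimension -> 16 skinny mat-vec products
scores_a = np.stack([cache @ q[h].T for h in range(n_head)])                  # [n_head, n_kv, 1]

# option 2: permute so n_head becomes the matrix dimension -> one [1024, 576] x [576, 16] GEMM
scores_b = cache @ q.transpose(2, 0, 1).reshape(kv_dim, n_head * n_tokens)    # [n_kv, n_head]

print(np.allclose(scores_a[:, :, 0].T, scores_b))    # True: same scores, different matmul shapes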

Member

Thanks for the correction - I did imagine the extra conts when I saw the permutes.

@ggerganov
Member

Considering all these performance regressions I think the best course of action would be to put the optimized implementation into a separate model architecture (LLM_ARCH_DEEPSEEK2_MLA or something like this). This will prevent issues with existing GGUFs - they would keep working with the existing architecture. I guess in this case the convert script would have to allow selection of the target model architecture with some option, but that shouldn't be difficult to add. @ggerganov what do you think?

While this is possible to do, I think it has a lot of cons. It will make it difficult for everyone to know which model variation on which hardware to use for better performance. Ideally, we want to have a single implementation that is optimal in all use cases, which can be deprecated at some point for a better alternative. But having 2 alternatives neither of which is optimal is not great.

Also, I'm not sure how this implementation fits with multiple parallel sequences and it introduces extra KV cache logic, specific to this type of arch.

I know there is a lot of interest in the DeepSeek arch right now and such optimizations are really important for people. But I think that we have to keep this work in a PR for a while. It is much more important to fix the software architecture in libllama after which such changes should become easier.

@fairydreaming
Collaborator Author

Considering all these performance regressions I think the best course of action would be to put the optimized implementation into a separate model architecture (LLM_ARCH_DEEPSEEK2_MLA or something like this). This will prevent issues with existing GGUFs - they would keep working with the existing architecture. I guess in this case the convert script would have to allow selection of the target model architecture with some option, but that shouldn't be difficult to add. @ggerganov what do you think?

While this is possible to do, I think it has a lot of cons. It will make it difficult for everyone to know which model variation on which hardware to use for better performance. Ideally, we want to have a single implementation that is optimal in all use cases, which can be deprecated at some point for a better alternative. But having 2 alternatives neither of which is optimal is not great.

That may not be possible - IMHO an MLA attention implementation that caches "compressed" latent kv representations introduces unavoidable computational overhead due to the need to "decompress" these representations in order to calculate attention scores and attention output. So a "naive" attention implementation that caches full K/V vectors will always use less compute but more memory bandwidth, while caching latent representations results in using more compute but less memory bandwidth. So there can't be a single implementation optimal in all use cases. I'd be happy to be proven wrong about this, though.
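As a back-of-the-envelope illustration of that trade-off (my numbers, assuming DeepSeek V2 Lite-like dimensions consistent with the shapes quoted earlier in the thread: 16 heads, 128/64 nope/rope dims, 128 value dims, 512 latent rank), the per-token per-layer cache shrinks by roughly 9x, which is exactly the memory-bandwidth-for-compute exchange described above:

# per-token KV cache elements per layer, naive vs MLA (DeepSeek V2 Lite-style sizes, assumed)
n_head, qk_nope, qk_rope, v_dim, kv_rank = 16, 128, 64, 128, 512

naive = n_head * (qk_nope + qk_rope) + n_head * v_dim   # full per-head K and V vectors
mla   = kv_rank + qk_rope                               # one shared latent + shared rope part

print(naive, mla, round(naive / mla, 1))                # 5120 576 8.9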

Also, I'm not sure how this implementation fits with multiple parallel sequences and it introduces extra KV cache logic, specific to this type of arch.

I think there shouldn't be any problems with this, as there is a straightforward direct mapping between the cached representations and full K/V vectors.

I know there is a lot of interest in the DeepSeek arch right now and such optimizations are really important for people. But I think that we have to keep this work in a PR for a while. It is much more important to fix the software architecture in libllama after which such changes should become easier.

That's fine with me. I'm taking a break from this anyway, got bored with tensor shuffling looking for 0.1 t/s more performance. 😉

@saood06

saood06 commented Feb 2, 2025

@fairydreaming
Is there any reason this should cause issues with RPC?
Encountered:

ggml_cuda_compute_forward: cannot compute kqv-31: src0->ne[3] = 1, src1->ne[3] = 2 - fallback to CPU
evaluate_and_capture_cuda_graph: op not supported kqv-31 (MUL_MAT)
[...]\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:2660: GGML_ASSERT(ok) failed

I don't have a quant on hand that I can test without this branch. This branch does give me a nice performance boost for TG at longer contexts, but RPC to CUDA does not work.

@saood06

saood06 commented Mar 17, 2025

I'm not sure if this tells the whole story though, as the Q8_0 definitely seems to think longer and seems to do 2-3 more "Oh wait" type paths...

If you want to quantify this you might be able to modify the hellaswag or winogrande sections of the perplexity example to also measure how many thought tokens are generated. I haven't looked into it, so maybe another test harness would be easier, but I agree that perplexity numbers wouldn't really give much insight into that.
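Independent of the perplexity example, a minimal sketch of the counting idea (the <think> delimiters and the whitespace tokenizer are just stand-ins for whatever delimiters and tokenizer the harness actually uses):

import re

def thought_token_count(text: str, tokenize=str.split) -> int:
    # count tokens inside <think>...</think> blocks; str.split is a stand-in tokenizer
    blocks = re.findall(r"<think>(.*?)</think>", text, flags=re.S)
    return sum(len(tokenize(b)) for b in blocks)

sample = "<think>Oh wait, let me redo that step.</think>The answer is 4."
print(thought_token_count(sample))   # 7 whitespace tokens of "thought"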

Edit: This article https://mp.weixin.qq.com/s/vIrvbVJ6Nv00Ehre1zZwMw says that they found more thought tokens generated at Q4 than at Q8

(Machine translated part of the article):

we found that the average length of the chain of thought of Q4 is 45% longer than that of Q8, that is to say, it outputs 45% more invalid tokens, so even though the speed of generating tokens is higher in Q4, the completion of the task will be even slower.

@Thomas-MMJ

Probably depends on the quantization method (how outliers are handled) not necessarily the degree of quantization.

@jukofyork
Collaborator

jukofyork commented Mar 31, 2025

@fairydreaming Sorry I don't have a Reddit account, but just to say you can definitely order the RTX PRO 6000 in the UK:

https://www.scan.co.uk/shop/computer-hardware/gpu-nvidia-workstation/nvidia-workstation-visualisation-graphics-cards

I've preordered a 300w Max-Q and they emailed me to say it was expected 21st April (scan.co.uk is usually pretty accurate with its restock dates).

I'm surprised they don't have a queue for these like they do for the RTX 5090s:

https://www.scan.co.uk/nvidia/rtx-50-series

I'm hoping most scalpers will go for the 600w version (which is supposedly expected 30th April), but surprisingly that doesn't seem to have a queue either, and it seems pretty good value compared to the 5090s if you can deal with the heat.

(I'll likely be power-limiting mine to 200w and only care about the 1.7TB/s memory bandwidth...)

@jukofyork
Collaborator

jukofyork commented Mar 31, 2025

I should hopefully have some good draft models for r1 and v3 ready later this week too!

I wasn't having much luck until someone linked me this paper:

https://huggingface.co/rdsm/QwenPhi-4-0.5b-Draft/discussions/1#67e97e3c6887b70da5509715

and I think I've finally got it working properly by including 30% raw code in the mix - without this it seems to lose its ability to work well as a draft model altogether:

[screenshot]

I don't really understand why this would have such a huge effect (30% vs 65%+ acceptance rate!?), but it definitely seems super-important.


I think I've also perfected "trimming" the qwen-2.5:0.5b models so it should work even better in llama.cpp and add almost no latency:

#10466 (comment)

Small draft model is needed (sine qua non). 0.5B size seems to work well. Any model in the range of 8G or above can benefit by distilling a 0.5B draft and speculating the model. Returns fall off rapidly as draft gets bigger, already questionable at 1.5B and not really useful at 3B draft.

I've successfully removed 1/2 the layers and also halved the intermediate size (which was surprisingly large at 5.7x the hidden size in the qwen-2.5:0.5b - compared to llama-3.2:1b) so they should run at equivalent speed to a qwen-2.5:0.2b model (but in practice are slightly larger due to having to untie the embedding and lm_head tensors for fine-tuning).

I will post the HF links when they are done later this week.

@jukofyork
Collaborator

jukofyork commented Mar 31, 2025

Probably depends on the quantization method (how outliers are handled) not necessarily the degree of quantization.

The MLA version is a bit strange as it's the attn_k_b tensor that causes the overflow problems, and it's almost like you are causing a KQ overflow before you've done the actual KQ multiply (due to the way it expands the 192-element MHA into 576-element MQA).

I've ended up just leaving this as BF16 for this reason, but for the PR this isn't really an acceptable solution as some backends won't work with BF16. Leaving it as F32 also fixes the problem, but for the CUDA backend it really hurts performance (Q8_0 seems to both hurt performance and cause overflows/weirdness for me too).
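For anyone wondering why bf16 survives where fp16 doesn't, here is a tiny generic PyTorch demo (not tied to attn_k_b specifically): fp16 tops out around 65504, while bf16 keeps float32's exponent range at the cost of mantissa bits.

import torch

x = torch.tensor([70000.0])    # a value past float16's max normal (~65504)
print(x.to(torch.float16))     # tensor([inf], dtype=torch.float16) -> overflow
print(x.to(torch.bfloat16))    # finite, just coarsely rounded (8 exponent bits, like fp32)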

@saood06

saood06 commented Apr 2, 2025

@jukofyork

and I think I've finally got it working properly by including 30% raw code in the mix - without this it seems to lose its ability to work well as a draft model altogether:

[screenshot]

I don't really understand why this would have such a huge effect (30% vs 65%+ acceptance rate!?), but it definitely seems super-important.

Because they are testing the accept rate while inferencing code. The table right below it (Table 9, pasted below) shows that code matters less for natural language tasks:

[Table 9]

This is still interesting because it shows that mostly code and a little bit of natural language in your dataset mix is better than mostly natural language and some code, even when the text is natural language based.

It might be too late, but maybe try having majority code in your dataset mix?

@jukofyork
Collaborator

jukofyork commented Apr 2, 2025

@jukofyork

and I think I've finally got it working properly by including 30% raw code in the mix - without this it seems to lose its ability to work well as a draft model altogether:
[screenshot]
I don't really understand why this would have such a huge effect (30% vs 65%+ acceptance rate!?), but it definitely seems super-important.

Because they are testing the accept rate while inferencing code. The table right below it (Table 9, pasted below) shows that code matters less for natural language tasks:

[Table 9]

This is still interesting because it shows that mostly code and a little bit of natural language in your dataset mix is better than mostly natural language and some code, even when the text is natural language based.

It might be too late, but maybe try having majority code in your dataset mix?

I've already restarted it (for a different reason I'll explain below), but I did see that and thought the differences weren't really enough justification to do this:

  • Compared to the 30% vs 60% hit-rate for not including any code in the pre-training mix, these differences are small.
  • The way that r1 writes in the thinking stage is often much more natural language compared to a non-thinking model when asked to act on a bit of code.
  • The 2 large datasets I have which are real r1 samples still have quite a lot of code in them.

The key finding of this paper, which was very different from turboderp's original Qwama-0.5B-Instruct methodology, is to include pure code in the (continued) pre-training data. I was following this and just used the same common-crawl-sample first and then fine-tuned that model with the 2 large datasets I have which are real r1 samples.


So I've now restarted as I've found a way to trim the hidden state / heads (as well as the layers and intermediate size), and this needs more data to recover.

Here is a trimmed version of Qwen2.5-0.5B-Instruct I'm training currently:

Input model configuration:
- Target vocabulary size    : 129280 (used = 128815, unused = 465)
- Donor vocabulary size     : 151936
- Donor num layers          : 24 (tied embeddings = True)
- Donor hidden size         : 896
- Donor attention heads     : 14
- Donor intermediate size   : 4864 (ratio = 1:5.4)
- Donor total parameters    : 494032768 (0.49B)
-- Embedding parameters     : 136134656 (0.14B)
-- Non-embedding parameters : 357898112 (0.36B)

Processing 3 automatic token overrides:
✔ 'bos_token_id' : 0 '<|begin▁of▁sentence|>' → [151643] '<|endoftext|>'
✔ 'eos_token_id' : 1 '<|end▁of▁sentence|>' → [151645] '<|im_end|>'
✘ 'pad_token_id' : 1 is already mapped to [151645]

Processing 16 manual token overrides:
✔      2 : '<|▁pad▁|>' → [151643] '<|endoftext|>'
✔ 128800 : '<|fim▁hole|>' → [151660] '<|fim_middle|>'
✔ 128801 : '<|fim▁begin|>' → [151659] '<|fim_prefix|>'
✔ 128802 : '<|fim▁end|>' → [151661] '<|fim_suffix|>'
✔ 128803 : '<|User|>' → [151644, 872, 198] '<|im_start|>user\n'
✔ 128804 : '<|Assistant|>' → [151644, 77091, 198] '<|im_start|>assistant\n'
✔ 128805 : '<|EOT|>' → [151643] '<|endoftext|>'
✔ 128806 : '<|tool▁calls▁begin|>' → [151657] '<tool_call>'
✔ 128808 : '<|tool▁call▁begin|>' → [151657] '<tool_call>'
✔ 128810 : '<|tool▁outputs▁begin|>' → [151657] '<tool_call>'
✔ 128812 : '<|tool▁output▁begin|>' → [151657] '<tool_call>'
✔ 128807 : '<|tool▁calls▁end|>' → [151658] '</tool_call>'
✔ 128809 : '<|tool▁call▁end|>' → [151658] '</tool_call>'
✔ 128811 : '<|tool▁outputs▁end|>' → [151658] '</tool_call>'
✔ 128813 : '<|tool▁output▁end|>' → [151658] '</tool_call>'
✔ 128814 : '<|tool▁sep|>' → [151658] '</tool_call>'

NOTE: Using an "untied" copy of 'embed_tokens.weight' as new 'lm_head.weight' tensor...

Transplanting tokens: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 128815/128815 [00:32<00:00, 3927.38token/s]

Transplant mappings:
- 1 to 1  : 83687 (65%)
- 2 to 1  : 38380 (30%)
- 3 to 1  : 4585 (3.6%)
- 4 to 1  : 923 (0.72%)
- 5 to 1  : 273 (0.21%)
- 6 to 1  : 91 (0.071%)
- 7 to 1  : 35 (0.027%)
- 8 to 1  : 22 (0.017%)
- 9 to 1  : 8 (0.0062%)
- 10 to 1 : 4 (0.0031%)
- 11 to 1 : 4 (0.0031%)
- 13 to 1 : 1 (0.00078%)
- 14 to 1 : 10 (0.0078%)
- 15 to 1 : 91 (0.071%)
- 16 to 1 : 699 (0.54%)
- 19 to 1 : 1 (0.00078%)
- 21 to 1 : 1 (0.00078%)

Head initialized with:
- Copies : 83687 (65%)
- Means  : 45128 (35%)
- Zeros  : 465 (0.36%)

Trimming layers 14 through 21 (inclusive): 
- Old layer count : 24 (layers 0-23)
- New layer count : 16 (keeping layers 0-13 and 22-23)
- Removed 96 tensors from state_dict
- Renamed 192 layer tensors to new indices
- Updated model configuration: num_hidden_layers = 16

Trimming hidden size from 896 to 512: 
- Old hidden size : 896
- New hidden size : 512
- Updated model configuration: hidden_size = 512
- Updated model configuration: num_attention_heads = 8
- Trimmed 163 tensors in state_dict

Trimming intermediate size from 4864 to 2048: 
- Old intermediate size : 4864
- New intermediate size : 2048
- Updated model configuration: intermediate_size = 2048
- Trimmed 48 tensors in state_dict

Output model configuration:
- Output vocabulary size    : 129280
- Output num layers         : 16 (tied embeddings = False)
- Output hidden size        : 512
- Output attention heads    : 8
- Output intermediate size  : 2048 (ratio = 1:4.0)
- Output total parameters   : 193229312 (0.19B)
-- Embedding parameters     : 132382720 (0.13B)
-- Non-embedding parameters : 60846592 (0.06B)

Even though the final model appears to be 0.19B parameters, notice the Non-embedding parameters:

- Donor total parameters    : 494032768 (0.49B)
-- Embedding parameters     : 136134656 (0.14B)
-- Non-embedding parameters : 357898112 (0.36B)
- Output total parameters   : 193229312 (0.19B)
-- Embedding parameters     : 132382720 (0.13B)
-- Non-embedding parameters : 60846592 (0.06B)

That's nearly 1/6th of the original size, and the "untied" copy of embed_tokens.weight will essentially have no effect (ie: it just acts as a lookup table and llama.cpp even keeps it in RAM because of this).

So the real/effective size (ie: the parameters that actually get multiplied in GEMM operations) is:

132382720/2 + 60846592 = 127037952 = ~0.13B

vs (for the original Qwen2.5-0.5B-Instruct):

136134656 + 357898112 = 494032768 = ~0.49B

Like I said before, this needs more training data to recover, so I am now using ~5B tokens taken from:

https://huggingface.co/datasets/agentlans/common-crawl-sample
https://huggingface.co/datasets/bigcode/the-stack-smol-xl
https://huggingface.co/datasets/open-thoughts/OpenThoughts-Unverified-173k
https://huggingface.co/datasets/cognitivecomputations/dolphin-r1/viewer/reasoning-deepseek (only the 300k deepseek-r1 part)

with the first two formatted just between <|end▁of▁sentence|> tags, and the second two using the proper deepseek-r1 Jinja template (with <think> tags added around the reasoning, etc).

I'm also not splitting into 2 stages and just running 2 epochs (ie: 10B tokens total) with a large-ish batch size:

[chart: training metrics]

It's nearly done with the first epoch and looks to have recovered most of the performance (eval/mixed_data/top1_accuracy is the most useful metric for draft models).

If this works then I will train up 4 more models of increasing sizes (0.2B = this, 0.25B, 0.33B, 0.5B and 0.6B = untrimmed with untied lm_head), and maybe find the optimal draft-size which we can use for other models' drafts:

#10466

Conclusions and potential for running big LLMs on consumer grade GPUs:

Small draft model is needed (sine qua non). 0.5B size seems to work well. Any model in the range of 8G or above can benefit by distilling a 0.5B draft and speculating the model. Returns fall off rapidly as draft gets bigger, already questionable at 1.5B and not really useful at 3B draft. Coding is far more efficient than general text gen with speculation. Qwen 2.5 series is perfect for exploiting the potential of speculation.

@saood06

saood06 commented Apr 2, 2025

If this works then I will train up 4 more models of increasing sizes (0.2B = this, 0.25B, 0.33B, 0.5B and 0.6B = untrimmed with untied lm_head), and maybe find the optimal draft-size which we can use for other models' drafts:

Thanks for the update, and also thanks for doing all this.

@jukofyork
Collaborator

jukofyork commented Apr 2, 2025

I should add (for anybody wondering why trimming the tensors like this has any chance of working):

  • The trimmed tensors are still spanning a subspace of the original tensors, so all the 18T tokens that qwen used for training are still mostly what the final model will be composed of (vs my comparably minimal 10B tokens).
  • If there weren't any layer_norm modules then you could do way better than just trimming like this via SVD or random projection matrices.
  • Most of what is getting recovered during the "repair" stage is the layer_norm modules' gamma scale parameters:

[chart: final hidden state magnitude]

as you can see the model's final hidden state's magnitude is completely off, and this in turn causes the logits output by the lm_head transform to become too small, which in turn causes the output of the softmax function to have too high entropy:

[chart: softmax entropy]

(ie: the outputs will be almost uniform and similar to using a huge temperature at inference time)

As soon as this is recovered, the model will have to adapt the remaining heads to take over the job of those we removed (but there is likely quite a lot of redundancy between different layers' heads anyway); most of the degradation is due to the layer_norm scale parameter, and overall it's not as crazy an idea as it first seems! 😁
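A quick standalone demo of that last point (random logits, nothing model-specific): shrinking the logit scale, which is what an off hidden-state magnitude effectively does, pushes the softmax entropy toward the uniform limit of log(vocab).

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
logits = rng.normal(scale=4.0, size=100)    # stand-in for well-scaled lm_head outputs

for scale in (1.0, 0.25, 0.05):             # smaller scale ~ hidden state magnitude being off
    print(scale, round(entropy(softmax(scale * logits)), 2))
# entropy climbs toward log(100) ~ 4.61 as the logits shrink, i.e. an almost uniform output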

@jukofyork
Collaborator

jukofyork commented Apr 2, 2025

If this works then I will train up 4 more models of increasing sizes (0.2B = this, 0.25B, 0.33B, 0.5B and 0.6B = untrimmed with untied lm_head), and maybe find the optimal draft-size which we can use for other models' drafts:

Thanks for the update, and also thanks for doing all this.

No problem - I just love fiddling about with stuff like this and would really like to get r1 (and v3) running a bit faster myself!

I'm also intrigued to find out exactly how small of a draft model we can use for all models (ie: even trimming Qwen2.5-0.5B-Instruct to work with Qwen2.5-72B-Instruct might be useful).

@davidsyoung

davidsyoung commented Apr 2, 2025

@jukofyork this is awesome work and I'm also very interested in this. Would you have code to reproduce all of this?

Also, is there a way to increase the number of attention heads? For ex. The Qwen 0.5B model has 14, whereas for vLLM tensor parallelism I actually need 16 to evenly split them over the GPUs.

@jukofyork
Collaborator

jukofyork commented Apr 2, 2025

@jukofyork this is awesome work and I'm also very interested in this. Would you have code to reproduce all of this?

The code for this is here:

https://github.com/jukofyork/transplant-vocab

but beware it's all a bit cronky/experimental currently.

I will split off the trimming code later this week and also add a third program I'm hoping to use to sample distributionally accurate outputs for any model as the only big model-specific datasets I can find are for r1 and without that it's going to be very hard to repeat this for other models.

Also, is there a way to increase the number of attention heads? For ex. The Qwen 0.5B model has 14, whereas for vLLM tensor parallelism I actually need 16 to evenly split them over the GPUs.

Yeah, you can quite easily do this, but it will likely need a very large amount of data to train on compared to reducing the heads.

If you need powers of 2 then I will have the 8-headed version ready for testing by tomorrow or Friday, and it would be useful to get feedback on it (eg: someone said the old Unsloth versions are using a different pad token, etc).

@jukofyork
Collaborator

I'm using qlora-pipe for the training:

https://github.com/tdrussell/qlora-pipe

but am running into a few problems due to it not being intended to be used like this nor written with multinode training in mind (I have 6x A6000s in 3 machines).

So it may be better to use a different training engine like Unsloth or similar.

@jukofyork
Collaborator

jukofyork commented Apr 2, 2025

So now this PR has been merged:

#11397

I'm gonna revive my version of this PR.

But the question is: is there actually any point in having the non-MLA version of this?

Adding the --mla option last time ended up being about 3x more work than the actual model code, and realistically what is the point of using the non-MLA version if it uses 10GB per 2048 context?

I'm also a bit stumped with what to do with the attn_k_b and attn_v_b tensors:

  • Leaving as float32 would be most fitting and match the expert routing tensors, but this gives a big hit to performance.
  • Letting these be quantized freely is a really bad idea: it seems to cause weird numerical problems and also seems to run really badly in the CUDA backend (also, attn_v_b can only use the older pre-K quants due to having row lengths of 128; see the quick check after this list).
  • attn_k_b can't be left as float16 or it overflows, and I'm not sure you would want to do this anyway.
  • attn_k_b works the best when left as bfloat16 but not all backends support this.
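A quick sanity check of the row-length constraint mentioned in the list above, assuming ggml's usual block sizes (256-element super-blocks for the k-quants, 32-element blocks for the legacy quants):

row_len = 128   # attn_v_b row length
print("k-quants possible:", row_len % 256 == 0)   # False -> no Q4_K / Q5_K / Q6_K
print("legacy quants ok: ", row_len % 32 == 0)    # True  -> Q4_0 / Q5_0 / Q8_0 still work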

I honestly don't know what to do with it now and probably have to lean towards leaving them as float32 to be consistent with how llama.cpp handles other "do not quantize" tensors like this... :/

I'll hopefully have this back working later today, so I'm open to input if anyone has a better idea for it.

@jukofyork
Collaborator

One other advantage of leaving them as float32 is that we could just leave them as attn_kv_b and do the slicing when loading (or, if the --mla option is added, when needed for that).

This would probably help other backends that load the GGUF files (Ktransformers, etc) not have to deal with the missing attn_kv_b.

I see ikawrakow has added some code to his fork to allow this tensor to be sliced on loading so as to save people from having to requantise the model like this PR does.

I don't like the idea of slicing the quantised version of attn_kv_b though as this tensor seems to have the most potential to ruin the model with numerical problems :/

@davidsyoung

@jukofyork this is awesome work and I'm also very interested in this. Would you have code to reproduce all of this?

The code for this is here:

https://github.com/jukofyork/transplant-vocab

but beware it's all a bit cronky/experimental currently.

I will split off the trimming code later this week and also add a third program I'm hoping to use to sample distributionally accurate outputs for any model as the only big model-specific datasets I can find are for r1 and without that it's going to be very hard to repeat this for other models.

Also, is there a way to increase the number of attention heads? For ex. The Qwen 0.5B model has 14, whereas for vLLM tensor parallelism I actually need 16 to evenly split them over the GPUs.

Yeah, you can quite easily do this, but it will likely need a very large amount of data to train on compared to reducing the heads.

If you need powers of 2 then I will have the 8-headed version ready for testing by tomorrow or Friday, and it would be useful to get feedback on it (eg: someone said the old Unsloth versions are using a different pad token, etc).

Sounds good to me!

I'm not sure a power-of-2 is needed, or if it needs to be the same tensor split on vLLM, so I believe it might need to be 16 (as I'm running with TP=16) or a multiple of 16 (I'm running 16 GPUs), but when you have the 8-headed version ready I'll test that and see if it works! Happy to test.

@jukofyork
Collaborator

See #12725 for the new version.

@jukofyork
Collaborator

jukofyork commented Apr 3, 2025

The tiny draft model was surprisingly good:

Create a single HTML file containing CSS and JavaScript to generate an animated weather card. The card should visually represent the following weather conditions with distinct animations: Wind: (e.g., moving clouds, swaying trees, or wind lines) Rain: (e.g., falling raindrops, puddles forming) Sun: (e.g., shining rays, bright background) Snow: (e.g., falling snowflakes, snow accumulating) Show all the weather card side by side The card should have a dark background. Provide all the HTML, CSS, and JavaScript code within this single file. The JavaScript should include a way to switch between the different weather conditions (e.g., a function or a set of buttons) to demonstrate the animations for each.

0.2B model with 12 layers and 8 heads after fine-tuning

draft acceptance rate = 0.53575

0.6B model with 24 layers and 12 heads without fine-tuning

draft acceptance rate = 0.59366

but it is a bit too small based on these comparisons, considering the 0.6B model "without fine-tuning" is basically just using the qwen language distribution!


I've also realised that the most important thing to trim is the number of heads / hidden size - so that we can quantise it to use the non-legacy quants! In terms of raw throughput there is a huge gap between Q4_0 and Q8_0, but not being able to use anything in between is quite limiting...

So I'm now going to limit myself to models with 8 or 12 heads only, starting with the 0.33B you get when you trim the heads / hidden size to 512 (but leave the 24 layers and the 1:9.5 hidden:intermediate ratio), then the 0.5B you get when you trim the heads / hidden size to 768 (again leaving the 24 layers and intermediate size unchanged), and finally the untrimmed 0.6B which is getting 59% draft acceptance rate already (see above):

Output model configuration:
- Output vocabulary size    : 129280
- Output num layers         : 24 (tied embeddings = False)
- Output hidden size        : 512
- Output attention heads    : 8
- Output intermediate size  : 4864 (ratio = 1:9.5)
- Output total parameters   : 327461376 (0.33B)
-- Embedding parameters     : 132382720 (0.13B)
-- Non-embedding parameters : 195078656 (0.20B)
Output model configuration:
- Output vocabulary size    : 129280
- Output num layers         : 24 (tied embeddings = False)
- Output hidden size        : 768
- Output attention heads    : 12
- Output intermediate size  : 4864 (ratio = 1:6.3)
- Output total parameters   : 500626176 (0.50B)
-- Embedding parameters     : 198574080 (0.20B)
-- Non-embedding parameters : 302052096 (0.30B)
Output model configuration:
- Output vocabulary size    : 129280
- Output num layers         : 24 (tied embeddings = False)
- Output hidden size        : 896
- Output attention heads    : 14
- Output intermediate size  : 4864 (ratio = 1:5.4)
- Output total parameters   : 589567872 (0.59B)
-- Embedding parameters     : 231669760 (0.23B)
-- Non-embedding parameters : 357898112 (0.36B)

@saood06

saood06 commented Apr 4, 2025

@jukofyork

Are you familiar with this:

The initial curated dataset consisted of approximately 500 million tokens.. During the initial test runs, the total size of this dataset, with logits, was calculated to be 2.9 Petabytes, which was unsustainable for our hardware resources. To mitigate this, we developed a method of compressing logits, ultimately reducing the final dataset size to 50GB. This compression process is pivotal, enabling us to move forward with the distillation process without incurring extreme resource costs. We are currently exploring the possibility of releasing a formal paper on this distillation method, though internal discussions regarding how much to share publicly are ongoing. What we can confidently share is that you do not need all the logits per sample to perform effective distillation.

from https://www.arcee.ai/blog/arcee-supernova-training-pipeline-and-model-composition (more of the article might be relevant to you because what you are doing)

I think this just means take the logits from tokens that are above 0 probability from a sane sampler (which is how I store my local history).

and this newer distill article they have https://www.arcee.ai/blog/virtuoso-lite-virtuoso-medium-v2-distilling-deepseek-v3-into-10b-32b-small-language-models-slms

@jukofyork
Collaborator

jukofyork commented Apr 4, 2025

@jukofyork

Are you familiar with this:

The initial curated dataset consisted of approximately 500 million tokens.. During the initial test runs, the total size of this dataset, with logits, was calculated to be 2.9 Petabytes, which was unsustainable for our hardware resources. To mitigate this, we developed a method of compressing logits, ultimately reducing the final dataset size to 50GB. This compression process is pivotal, enabling us to move forward with the distillation process without incurring extreme resource costs. We are currently exploring the possibility of releasing a formal paper on this distillation method, though internal discussions regarding how much to share publicly are ongoing. What we can confidently share is that you do not need all the logits per sample to perform effective distillation.

from https://www.arcee.ai/blog/arcee-supernova-training-pipeline-and-model-composition (more of the article might be relevant to you because what you are doing)

I think this just means take the logits from tokens that are above 0 probability from a sane sampler (which is how I store my local history).

and this newer distill article they have https://www.arcee.ai/blog/virtuoso-lite-virtuoso-medium-v2-distilling-deepseek-v3-into-10b-32b-small-language-models-slms

Thanks, I'm not familiar with this exactly but already know how to do this :)

There's nothing that special about "distillation" and you can easily rearrange the softmax loss to show that training on the real-valued targets (instead of the one-hot targets) is equivalent to taking each of your n samples in your dataset, expanding it into m one-hot samples and then using the original real-valued targets to weight the m one-hot samples (ie: creating a new n*m sized weighted dataset).

This is actually quite a well known trick people have used in software to get round the limitations of only having one-hot training available (eg: https://stackoverflow.com/questions/46977313/scikit-learn-multinomial-logistic-regression-with-probabilities-as-a-target-va).

So once you see distillation like this, it should be clear that:

  • In the limit it will give the same results as training on one-hot encoded targets, but is just more sample efficient.
  • You can still get some (or most) of the sample efficiency gains without using the full set of expanded target values.

In practice you won't actually even be expanding and reweighting like this; you'll just use the top few targets and/or treat the rest as zero, or do something similar to label smoothing.
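A small numpy check of that rearrangement (toy vocabulary of 5, not any particular model): the cross-entropy against a real-valued teacher distribution is exactly the target-weighted sum of the one-hot cross-entropies, which is why "distillation" here is really just a sample-efficiency trick.

import numpy as np

rng = np.random.default_rng(0)
logits  = rng.normal(size=5)                   # student logits for one position
targets = rng.dirichlet(np.ones(5))            # real-valued teacher distribution

logp = logits - np.log(np.exp(logits).sum())   # student log-softmax

ce_soft   = -(targets * logp).sum()                           # loss on the soft targets
ce_onehot = sum(t * -logp[k] for k, t in enumerate(targets))  # weighted one-hot losses

print(np.isclose(ce_soft, ce_onehot))   # True: the two losses are identical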


But there are a few reasons why I'm not trying this yet:

  1. At 5 tokens per second I can only generate around 125M tokens per year per machine, and I've basically got 2B tokens of one-hot target data (generated from the full r1 model) off huggingface for free!

  2. I'm using qlora-pipe for the training as it was the only thing I've managed to get working over multiple nodes, and it uses the Unsloth cross-entropy kernel as the Pytorch version is really slow for large vocabs.

I actually adapted this to use label smoothing about a year ago, and it took me about 5 minutes to get the non-chunked version working, a day to get the derivatives for the chunked version working, and I gave up trying to get the loss calculated for the chunked version after several days and just accepted it would print the wrong loss lol!

  3. I think hinge loss or one of its smoothed variants may actually be closer to what we want out of a pure speculative decoding model (see this thread for more discussion).

@saood06

saood06 commented Apr 4, 2025

There's nothing that special about "distillation" and you can easily rearrange the softmax loss to show that training on the real-valued targets (instead of the one-hot targets) is equivalent to taking each of your n samples in your dataset, expanding it into m one-hot samples and then using the original real-valued targets to weight the m one-hot samples (ie: creating a new n*m sized weighted dataset).

That makes sense.

I agree with all you said except for this

At 5 tokens per second I can only generate around 125M tokens per year per machine

I am certain your machine can generate FAR more when batching, as your machine is a lot faster than mine (and mine is also CPU only), and I just finished testing batched performance and my machine peaked at 12.91 t/s (at a batch size of 12). But it still would be out of reach for one machine to do in a reasonable time.

@jukofyork
Collaborator

jukofyork commented Apr 4, 2025

There's nothing that special about "distillation" and you can easily rearrange the softmax loss to show that training on the real-valued targets (instead of the one-hot targets) is equivalent to taking each of your n samples in your dataset, expanding it into m one-hot samples and then using the original real-valued targets to weight the m one-hot samples (ie: creating a new n*m sized weighted dataset).

That makes sense.

I agree with all you said except for this

At 5 tokens per second I can only generate around 125M tokens per year per machine

I am certain your machine can generate FAR more when batching, as your machine is a lot faster than mine (and mine is also CPU only), and I just finished testing batched performance and my machine peaked at 12.91 t/s (at a batch size of 12). But it still would be out of reach for one machine to do in a reasonable time.

Yeah, I can get about 20 tokens per second batching, but sadly (I think?) for multiple sequences this requires the KV-cache to also be duplicated? If so, then I can't really do this, as I'm using a sequence length of 32k for the draft model training.

I agree I can probably get a bit more out of batching though, but even at 20 tokens per second it's only 500M tokens per year! 😦


This being said, my ultimate goal with all this is to try to "pre-trim" tiny models like qwen:0.5b and llama:1b (which I can generate a crazy amount of data for in a few days!), keeping the original vocabulary, and then use these for the actual "vocabulary transplant" for other models, using whatever data (general or model-generated) you can get your hands on.

The only reason it works the way it currently does is because the trimming was an afterthought and transplant_vocab.py has ended up a tangled mess 😬

@saood06

saood06 commented Apr 5, 2025

Yeah, I can get about 20 tokens per second batching,

That feels low; my results are below (but this is on ik_llama):

[benchmark results]

but sadly (I think?) for multiple sequences this requires the KV-cache to also be duplicated? If so, then I can't really do this, as I'm using a sequence length of 32k for the draft model training.

[graph: VRAM usage vs KV context length]

Source: https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF

Assuming your batching performance is similar to mine, a batch size of 6 would get you a significant amount of the potential improvement, and based on the VRAM to KV context graph on the huggingface page it would fit on your A6000.

I agree I can probably get a bit more out of batching though, but even at 20 tokens per second it's only 500M tokens per year! 😦

Yes, I agree that it isn't feasible, but I just knew your system could do more than 5.

@jukofyork
Collaborator

https://huggingface.co/jukofyork/DeepSeek-R1-DRAFT-0.5B-v1.0

https://huggingface.co/jukofyork/DeepSeek-R1-DRAFT-0.5B-v1.0-GGUF

I'm still waiting for somebody to show me a printout of the token IDs for the Unsloth quants as apparently they changed the <PAD> token ID for some reason, and these don't work because of that.

@jukofyork
Collaborator

I've got the results for a couple of perplexity runs with higher quality quants:

1. With token_embd/attn_k_b/attn_v_b using BF16, and all others Q8_0:

llama_model_loader: - type  f32:  361 tensors
llama_model_loader: - type q8_0:  663 tensors
llama_model_loader: - type bf16:  123 tensors
perplexity: calculating perplexity over 561 chunks, n_ctx=512, batch_size=2048, n_seq=4
.
.
.
Final estimate: PPL = 3.3470 +/- 0.01847

I've found something else that might be useful for all backends and not just llama.cpp or MLA-specific:

If we run with just 6 of the experts using --override-kv "deepseek2.expert_used_count=int:6" then we get a jump in perplexity (i.e. a drop in quality), as expected:

Final estimate:  PPL = 3.3952 +/- 0.01865

but if we look at llm_graph_context::build_moe_ffn(), we can see that for deepseek it is applying this scale operation to the weighting factors:

    if (scale_w) {
        weights = ggml_scale(ctx0, weights, w_scale);
        cb(weights, "ffn_moe_weights_scaled", il);
    }

Theory tells us that if we take a weighted average of i.i.d. vectors with the same norm (weights summing to one), the norm of the average scales like 1/sqrt(n), so the average of 6 such vectors will be ~1.15x larger than the average of 8 (ie: sqrt(8) / sqrt(6)).

Obviously the vectors aren't necessarily i.i.d with the same norm, nor are we taking an equally weighted arithmetic mean of them (ie: it's likely to be quite skewed due to the top-k operation before), but this is likely a good lower bound on what we need to change the scale-factor to (ie: 2.5 * sqrt(6/8) = ~2.165):

expert_weights_scale   PPL
2.5                    3.3952 +/- 0.01865
2.35                   3.3809 +/- 0.01867
2.3                    3.3825 +/- 0.01872
2.25                   3.3849 +/- 0.01880
2.165                  3.3921 +/- 0.01893

It's not a huge gain and just outside the confidence interval, but does match what we would expect from the theory and has a clear pattern.
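For reference, the scale factors above follow directly from that square-root argument (my arithmetic, not part of the PR):

import math

print(round(math.sqrt(8) / math.sqrt(6), 3))   # ~1.155: relative change in the averaged norm, 8 -> 6 experts
print(round(2.5 * math.sqrt(6 / 8), 3))        # 2.165: lower bound for the rescaled expert_weights_scale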

So using:

--override-kv "deepseek2.expert_used_count=int:6" --override-kv "deepseek2.expert_weights_scale=float:2.3"

looks to cost about as much PPL as going from Q8_0 to Q4_K for the experts (based on my tests earlier in this thread):

(3.3825 / 3.3470 −1) × 100 = ~1% PPL increase.

Anyway, I thought I'd share this as thought it was interesting and might help a little.

@fairydreaming
Collaborator Author

Obsoleted by #12801

Labels: python (python script changes)