Optimized DeepSeek V2/V3 implementation (MLA) #11446
Conversation
@fairydreaming do you have a converted model available or instructions for replicating your setup? I would like to run some benchmarks on these changes.
@wronkiew What model would you like to test?
V3/R1, Q4_K_S.
@wronkiew I don't have the model uploaded (my upload bandwidth is too low); you have to download, convert to bf16, convert to GGUF, and quantize the original model yourself (or download one that is already converted to bf16, which will save you one step).
I spent some time investigating this hint from the DeepSeek V2 paper:
At first glance it looks reasonable: each absorbed matrix allows us to replace two matrix multiplications with a single multiplication, thus reducing the number of operations. However, when we look at the dimensions of these matrices, this stops being reasonable. For example, in DeepSeek V2 Lite:
So (let's ignore the head dimension) this allows us to replace two multiplications, by a [2048, 128] matrix and a [512, 128] matrix, with a single multiplication by a [512, 2048] matrix. The combined matrix has over 3x the elements of the two individual matrices combined, so it will take more memory and will actually be slower to multiply by than doing the two multiplications with the smaller matrices.
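To make the size comparison concrete, here is the element-count arithmetic behind the "over 3x" claim (per head, using the dimensions quoted above):

$$2048 \times 128 + 512 \times 128 = 262144 + 65536 = 327680 \quad \text{vs.} \quad 512 \times 2048 = 1048576 \approx 3.2 \times 327680$$

So the absorbed matrix holds roughly 3.2x as many elements as the two factors it replaces, which is why the "optimization" ends up costing both memory and compute.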
I also found this blog post: https://github.com/xjdr-alt/mla_blog_translation where they mention:
So it looks like a dead end; it won't give us any speed gains.
I ran into an issue with DeepSeek-R1-UD-Q2_K_XL from unsloth/DeepSeek-R1-GGUF
As I wrote in the PR:
Existing GGUFs won't work; you have to convert and quantize one with the code from this PR.
Ohh hmm, should I re-quantize the ones in https://huggingface.co/unsloth/DeepSeek-R1-GGUF?
I think it's best to wait a bit until this is stable and merged; it's possible that there will be some changes that would cause them to stop working, and you'd have to repeat the conversion again.
I updated the token generation performance plots in the PR post, and also added some new ones showing the prompt processing performance. The optimized implementation generally performs WORSE in prompt processing - DeepSeek R1 671B Q4_K_S running on CPU performs only a little worse (~10% with a 4k prompt), but DeepSeek V2 Lite Q8_0 running on an RTX 4090 performs MUCH WORSE (~30% with a 16k prompt), and in both cases the gap widens as the prompt length increases. So it's not all sunshine and rainbows. Considering all these performance regressions, I think the best course of action would be to put the optimized implementation into a separate model architecture (
```cpp
// whether to use n_tokens as the matrix dimension during multiplication or n_head
// n_tokens is higher during prompt processing, this allows to optimize for this case
bool pp_opt = n_tokens > n_head;
```
I'm not really sure this is the right approach. Haven't followed through the logic yet, but it seems strange to involve so many permutes and conts.
I would first look into improving the FA kernels to support DeepSeek head sizes.
> I'm not really sure this is the right approach. Haven't followed through the logic yet, but it seems strange to involve so many permutes and conts.
Hmm? I'm quite sure there's only one ggml_cont() call (excluding the ones for CUDA compatibility that already existed in the previous implementation).
As for the permutes, the idea is to multiply by a matrix whose second dimension is the number of heads rather than the number of tokens (which is 1 during single-sequence token generation); that increased the performance on the CPU a bit.
So during prompt processing we have 2 permutes and 1 cont. During token generation we have 5 permutes (yeah, that may be a lot) and 0 conts.
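Purely as an illustrative sketch (not the actual graph code from this PR), the shape change behind that idea looks like this in ggml; the dimensions below are hypothetical values for a single-token decode:

```cpp
#include <cstdio>
#include "ggml.h"

int main() {
    // small scratch context; sizes here are arbitrary for the demo
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16u*1024*1024,
        /*.mem_buffer =*/ nullptr,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // hypothetical head configuration during single-sequence token generation
    const int n_embd_head = 128, n_head = 16, n_tokens = 1;

    // q as typically laid out in the attention graph: [n_embd_head, n_head, n_tokens]
    struct ggml_tensor * q = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, n_embd_head, n_head, n_tokens);

    // swap the head and token axes -> [n_embd_head, n_tokens, n_head], so a following
    // ggml_mul_mat() is batched over n_head instead of over the length-1 token dimension
    struct ggml_tensor * q_perm = ggml_permute(ctx, q, 0, 2, 1, 3);

    printf("q:      %lld x %lld x %lld\n", (long long) q->ne[0], (long long) q->ne[1], (long long) q->ne[2]);
    printf("q_perm: %lld x %lld x %lld\n", (long long) q_perm->ne[0], (long long) q_perm->ne[1], (long long) q_perm->ne[2]);

    ggml_free(ctx);
    return 0;
}
```

During prompt processing n_tokens is already large, so the token dimension is used as the batched dimension instead.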
Thanks for the correction - I did imagine the extra conts when I saw the permutes.
While this is possible to do, I think it has a lot of cons. It will make it difficult for everyone to know which model variation to use on which hardware for better performance. Ideally, we want to have a single implementation that is optimal in all use cases, which can be deprecated at some point for a better alternative. But having 2 alternatives, neither of which is optimal, is not great. Also, I'm not sure how this implementation fits with multiple parallel sequences, and it introduces extra KV cache logic specific to this type of arch. I know there is a lot of interest in the DeepSeek arch right now and such optimizations are really important for people. But I think that we have to keep this work in a PR for a while. It is much more important to fix the software architecture in
That may not be possible - IMHO an MLA attention implementation that caches "compressed" latent KV representations introduces unavoidable computational overhead due to the need to "decompress" these representations in order to calculate the attention scores and the attention output. So a "naive" attention implementation that caches full K/V vectors will always use less compute but more memory bandwidth, while caching latent representations uses more compute but less memory bandwidth. So there can't be a single implementation that is optimal in all use cases. I'd be happy to be proven wrong about this, though.
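To put rough numbers on the bandwidth side of that trade-off: the per-token, per-layer cache sizes below use the DeepSeek V2 Lite attention dimensions (16 heads, 128 "nope" + 64 RoPE key dims, 128 value dim, 512-dimensional latent); these figures come from the model's config rather than from this thread, so treat them as an assumption:

$$\text{naive K/V cache: } 16 \times (128 + 64 + 128) = 5120 \ \text{values per token per layer}$$

$$\text{MLA latent cache: } 512 + 64 = 576 \ \text{values per token per layer}$$

So the latent cache moves roughly 9x less data per token, paid for by the extra matrix multiplications needed to expand it back into per-head keys and values.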
I think there shouldn't be any problems with this, as there is a straightforward direct mapping between the cached representations and full K/V vectors.
That's fine with me. I'm taking a break from this anyway, got bored with tensor shuffling looking for 0.1 t/s more performance. 😉
@fairydreaming
I don't have a quant on hand that I can test without this branch, but this branch does give me a nice performance boost for TG at longer contexts; however, RPC to CUDA does not work.
If you want to quantify this, you might be able to modify the hellaswag or winogrande sections of the perplexity example to also measure how many thought tokens are generated; I haven't looked into it, so maybe another test harness would be easier. But I agree that perplexity numbers wouldn't really give much insight into that. Edit: this article https://mp.weixin.qq.com/s/vIrvbVJ6Nv00Ehre1zZwMw says that they found more thought tokens generated at Q4 than at Q8 (machine-translated part of the article):
Probably depends on the quantization method (how outliers are handled), not necessarily the degree of quantization.
@fairydreaming Sorry, I don't have a Reddit account, but just to say you can definitely order the RTX PRO 6000 in the UK: I've preordered a 300W Max-Q and they emailed me to say it was expected 21st April (scan.co.uk is usually pretty accurate with its restock dates). I'm surprised they don't have a queue for these like they do for the RTX 5090s: https://www.scan.co.uk/nvidia/rtx-50-series. I'm hoping most scalpers will go for the 600W version (which is expected 30th April supposedly), but that doesn't seem to have a queue either, surprisingly, and seems pretty good value compared to the 5090s if you can deal with the heat? (I'll likely be underwatting mine to 200W and only care about the 1.7TB/s memory bandwidth...)
I should hopefully have some good draft models for
I wasn't having much luck until someone linked me this paper: https://huggingface.co/rdsm/QwenPhi-4-0.5b-Draft/discussions/1#67e97e3c6887b70da5509715 and I think I've finally got it working properly by including 30% raw code in the mix - without this it seems to lose its ability to work well as a draft model altogether. I don't really understand why this would have such a huge effect (30% vs 65%+ acceptance rate!?), but it definitely seems super-important. I think I've also perfected "trimming" the
I've successfully removed 1/2 the layers and also halved the intermediate size (which was surprisingly large at 5.7x the hidden size in the
I will post the HF links when they are done later this week.
The MLA version is a bit strange as it's the
I've ended up just leaving this as
Because they are testing the accept rate while inferencing code; the table right below it (Table 9, pasted below) shows that code matters less for natural language tasks. This is still interesting because it shows that mostly code and a little bit of natural language in your dataset mix is better than mostly natural language and some code, even when the text is natural-language based. It might be too late, but maybe try having majority code in your dataset mix?
I've already restarted it (for a different reason I'll explain below), but I did see that and thought the differences weren't really enough justification to do this:
The key finding for this paper, which was very different to turboderp's original
So I've now restarted, as I've found a way to trim the hidden state / heads (as well as the layers and intermediate size), and this needs more data to recover. Here is a trimmed version of
Even though the final model appears to be
That's nearly 1/6th of the original size, and the "untied" copy of
So the real/effective size (ie: the parameters that actually get multiplied in GEMM operations) is:
vs (for the original
Like I said before, this needs more training data to recover, so I am now using ~5B tokens taken from: https://huggingface.co/datasets/agentlans/common-crawl-sample with the first two formatted just between
I'm also not splitting into 2 stages and just running 2 epochs (ie: 10B tokens total) with a large-ish batch size:
It's nearly done the first epoch and looks to have recovered most of the performance (
If this works then I will train up 4 more models of increasing sizes (
Thanks for the update, and also thanks for doing all this.
I should add (for anybody wondering why trimming the tensors like this has any chance of working):
As you can see, the model's final hidden state's magnitude is completely off, and this in turn corrupts the logits that get output from the (ie: the outputs will be almost uniform, similar to using a huge temperature at inference time). As soon as this is recovered, the model will have to adapt the remaining heads to take over the job of those we removed (but there is likely quite a lot of redundancy between different layers' heads anyway); most of the degradation, though, is due to the
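For the "huge temperature" analogy, a quick sketch of why mis-scaled logits flatten the output distribution: dividing the logits by a large temperature $T$ drives the softmax towards uniform over the vocabulary of size $V$,

$$p_i = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}} \;\longrightarrow\; \frac{1}{V} \quad \text{as } T \to \infty,$$

and a final hidden state with too small a magnitude has the same effect as a large $T$.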
No problem - I just love fiddling about with stuff like this and would really like to get
I'm also intrigued to find out exactly how small of a draft model we can use for all models (ie: even trimming
@jukofyork this is awesome work and I'm also very interested in this. Would you have code to reproduce all of this? Also, is there a way to increase the number of attention heads? For example, the Qwen 0.5B model has 14, whereas for vLLM tensor parallelism I actually need 16 to evenly split them over the GPUs.
The code for this is here: https://github.com/jukofyork/transplant-vocab but beware, it's all a bit cronky/experimental currently. I will split off the trimming code later this week, and also add a third program I'm hoping to use to sample distributionally accurate outputs for any model, as the only big model-specific datasets I can find are for
Yeah, you can quite easily do this, but it will likely need a very large amount of data to train on compared to reducing the heads. If you need powers of 2, then I will have the 8-headed version ready for testing by tomorrow or Friday, and it would be useful to get feedback on it (eg: someone said the old Unsloth versions are using a different pad token, etc).
I'm using https://github.com/tdrussell/qlora-pipe but am running into a few problems due to it not being intended to be used like this, nor written with multinode training in mind (I have 6x A6000s in 3 machines). So it may be better to use a different training engine like Unsloth or similar.
So now this PR has been merged:
I'm gonna revive my version of this PR. But the question is: is there actually any point in having the non-MLA version of this? Adding the
I'm also a bit stumped with what to do with the
I honestly don't know now what to do with it and probably have to lean towards leaving it as
I'll hopefully have this back working later today, so am open to input if anyone has a better idea for it?
One other advantage of leaving them as
This would probably help other backends that load the GGUF files (Ktransformers, etc) not have to deal with the missing
I see ikawrakow has added some code to his fork to allow this tensor to be sliced on loading, so as to save people from having to requantise the model like this PR does. I don't like the idea of slicing the quantised version of
Sounds good to me! I'm not sure a power-of-2 is needed, or if it needs to be the same tensor split on vLLM, so I believe it might need to be 16 (as I'm running with TP=16) or a multiple of 16 (I'm running 16 GPUs), but when you have the 8-headed version ready I'll test that and see if it works! Happy to test.
See #12725 for the new version.
The tiny draft model was surprisingly good:
0.2B model with 12 layers and 8 heads after fine-tuning
0.6B model with 24 layers and 12 heads without fine-tuning
but it is a bit too small based on these comparisons; considering the 0.6B model "without fine-tuning" is basically just using
I've also realised that the most important thing to trim is the number of heads / hidden size - so that we can quantise it to use the non-legacy quants! In terms of raw throughput there is a huge gap between
So I'm now going to limit myself to models with 8 or 12 heads only, starting with the
Are you familiar with this:
from https://www.arcee.ai/blog/arcee-supernova-training-pipeline-and-model-composition (more of the article might be relevant to you because of what you are doing). I think this just means taking the logits from tokens that are above 0 probability from a sane sampler (which is how I store my local history anyway). And there's this newer distillation article they have: https://www.arcee.ai/blog/virtuoso-lite-virtuoso-medium-v2-distilling-deepseek-v3-into-10b-32b-small-language-models-slms
Thanks, I'm not familiar with this exactly, but I already know how to do this :) There's nothing that special about "distillation", and you can easily rearrange the softmax loss to show that training on the real-valued targets (instead of the one-hot targets) is equivalent to taking each of your n samples in your dataset, expanding it into m one-hot samples, and then using the original real-valued targets to weight the m one-hot samples (ie: creating a new n*m-sized weighted dataset). This is actually quite a well-known trick people have used in software to get round the limitation of only having one-hot training available (eg: https://stackoverflow.com/questions/46977313/scikit-learn-multinomial-logistic-regression-with-probabilities-as-a-target-va). So once you see distillation like this, it should be clear that:
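To spell out the rearrangement in generic notation (a sketch of the standard argument, not code or notation from this thread): with soft targets $p$ over $m$ classes and model probabilities $q_\theta$,

$$\mathcal{L}(x, p) = -\sum_{k=1}^{m} p_k \log q_\theta(k \mid x) = \sum_{k=1}^{m} p_k \left[ -\log q_\theta(k \mid x) \right],$$

i.e. the cross-entropy against the real-valued targets is exactly a $p_k$-weighted sum of $m$ one-hot cross-entropy losses for the same input $x$, which is the "expand into weighted one-hot samples" view described above.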
In practice you won't actually even be expanding and reweighting like this, but just using the top few targets and/or treating the rest as zero, or something similar to label smoothing. But there are a few reasons why I'm not trying this yet:
I actually adapted this to use label smoothing about a year ago, and it took me about 5 minutes to get the non-chunked version working, a day to get the derivatives for the chunked version working, and I gave up trying to get the loss calculated for the chunked version after several days and just accepted it would print the wrong loss lol!
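For reference, the identity that makes the label-smoothing adaptation straightforward (stated here as a sketch, not the exact code used): with smoothing $\epsilon$ over a vocabulary of size $V$, the smoothed target is $\tilde p_k = (1-\epsilon)\,p_k + \epsilon/V$, and the gradient of the cross-entropy with respect to the logits is simply

$$\frac{\partial \mathcal{L}}{\partial z_j} = q_\theta(j \mid x) - \tilde p_j,$$

which is easy to drop into a non-chunked loss; the bookkeeping only gets painful once the loss and its gradients are computed over vocabulary chunks.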
That makes sense. I agree with all you said except for this
I am certain your machine can generate FAR more when batching, as your machine is a lot faster than mine (and mine is also CPU only), and I just finished testing batched performance and my machine peaked at 12.91 t/s (at a batch size of 12). But it still would be out of reach for one machine to do in a reasonable time.
Yeah, I can get about 20 tokens per second batching, but sadly (I think?) for multiple sequences this requires the KV-cache to also be duplicated? If so, then I can't really do this, as I'm using a sequence length of 32k for the draft model training. I agree I can probably get a bit more out of batching though, but even at 20 tokens per second it's only 500M tokens per year! 😦
This being said, my ultimate goal with all this is to try to "pre-trim" tiny models like
The only reason it's doing it the way it is currently is because the trimming was an afterthought and
That feels low; here are my results (but this is on ik_llama):
Source: https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF. Assuming your batching performance is similar to mine, a batch size of 6 would get you a significant amount of the potential improvement, and going by the VRAM vs KV context graph on the Hugging Face page it would fit on your A6000.
Yes. I agree that it isn't feasible, but I just knew your system could do more than 5.
https://huggingface.co/jukofyork/DeepSeek-R1-DRAFT-0.5B-v1.0
https://huggingface.co/jukofyork/DeepSeek-R1-DRAFT-0.5B-v1.0-GGUF
I'm still waiting for somebody to show me a printout of the token IDs for the Unsloth quants, as apparently they changed the
I've found something else that might be useful for all backends and not just
If we run with just 6 of the experts using
but if we look at:

```cpp
if (scale_w) {
    weights = ggml_scale(ctx0, weights, w_scale);
    cb(weights, "ffn_moe_weights_scaled", il);
}
```

Theory tells us that if we have a set of i.i.d. vectors with the same norms, then the norm of a sum of 8 such vectors will be ~1.15x larger than the norm of a sum of 6 (ie: a factor of sqrt(8/6)). Obviously the vectors aren't necessarily i.i.d. with the same norm, nor are we taking an equally weighted sum of them (ie: it's likely to be quite skewed due to the top-k operation before), but this is likely a good lower bound on how much we need to scale the scale-factor up by (ie: at least ~1.15x).
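Under the i.i.d. assumption above (zero-mean vectors $v_i$ with $\mathbb{E}\lVert v_i\rVert^2 = \sigma^2$, combined by summation), the expected squared norm grows linearly in the number of experts, so the norm grows with its square root:

$$\mathbb{E}\left\lVert \sum_{i=1}^{k} v_i \right\rVert^2 = k\,\sigma^2 \quad\Longrightarrow\quad \sqrt{\frac{8}{6}} \approx 1.155,$$

which is where the ~1.15x figure comes from, and suggests scaling the expert weights up by roughly that factor when running with 6 of the 8 experts.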
It's not a huge gain and just outside the confidence interval, but does match what we would expect from the theory and has a clear pattern. So using:
looks to gain about as much PPL as going from
Anyway, I thought I'd share this as I thought it was interesting and might help a little.
Obsoleted by #12801
This PR introduces various optimizations for the DeepSeek V2/V3 implementation:
Note that you need to reconvert the model to use this implementation.
Performance compared to the previous "naive" implementation:
CUDA performance is worse for short context lengths, but the curve is flatter:
TODO:
address regressions in prompt processing performance (different permutations of tensors?) - I don't think it's possible, as this implementation is more compute-intensive compared to the regular attention implementation