Add support for DeepseekV2ForCausalLM #7519
Conversation
- leading_dense_block_count => hparams.n_leading_dense_layer
- expert_feed_forward_length => hparams.n_expert_ff
- expert_shared_count => hparams.n_expert_shared
- attention.q_lora_rank => hparams.n_lora_q
- attention.kv_lora_rank => hparams.n_lora_kv
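For illustration, the hparams fields implied by the mapping above could be declared roughly like this (a sketch only - types, defaults and the exact placement inside llama_hparams are assumed, not taken from the actual diff):

uint32_t n_leading_dense_layer = 0; // leading_dense_block_count
uint32_t n_expert_ff           = 0; // expert_feed_forward_length
uint32_t n_expert_shared      = 0; // expert_shared_count
uint32_t n_lora_q              = 0; // attention.q_lora_rank
uint32_t n_lora_kv             = 0; // attention.kv_lora_rank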
Added missing scaling of kq_scale parameter.
…ts in the final mscale value equal to 1.0.
… multiplier of the ln(s) from the sqrt(1/t) = 0.1 ln(s) + 1 equation.
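For reference, the mscale helper used by DeepSeek-V2-style YaRN scaling is usually written as below, where scale is the context scaling factor s; the commit titles above suggest the multiplier of ln(s) is adjusted so that the final mscale works out to 1.0. This is a sketch of my reading of the commits, not the PR's exact code:

// 0.1 * mscale * ln(s) + 1, i.e. the sqrt(1/t) term from the YaRN paper
static float yarn_get_mscale(float scale, float mscale) {
    return scale <= 1.0f ? 1.0f : 0.1f * mscale * logf(scale) + 1.0f;
}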
Awesome work!
I've added a few name suggestions in some places, but feel free to choose something else if you prefer.
One thing I'm not sure about - shall I print all new parameter values in llm_load_print_meta() or print them only for LLM_ARCH_DEEPSEEK2 models (since they are specific to this architecture)? Or perhaps simply not print them at all?
Let's print them for the specific arch
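A rough sketch of what arch-specific printing in llm_load_print_meta() could look like (which fields to print and the format strings are assumed here, not taken from the final change):

if (model.arch == LLM_ARCH_DEEPSEEK2) {
    LLAMA_LOG_INFO("%s: n_leading_dense_layer = %u\n", __func__, hparams.n_leading_dense_layer);
    LLAMA_LOG_INFO("%s: n_lora_q              = %u\n", __func__, hparams.n_lora_q);
    LLAMA_LOG_INFO("%s: n_lora_kv             = %u\n", __func__, hparams.n_lora_kv);
    LLAMA_LOG_INFO("%s: n_expert_shared       = %u\n", __func__, hparams.n_expert_shared);
}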
llama.cpp
k_pe = ggml_repeat(ctx0, k_pe, q_pe);
cb(k_pe, "k_pe", il);

struct ggml_tensor * key_states = ggml_new_tensor_3d(ctx0, q_nope->type, hparams.n_embd_head_k, n_head, n_tokens);
cb(key_states, "key_states", il);
key_states = ggml_set_inplace(ctx0, key_states, k_nope, key_states->nb[1], key_states->nb[2], key_states->nb[3], 0);
key_states = ggml_set_inplace(ctx0, key_states, k_pe, key_states->nb[1], key_states->nb[2], key_states->nb[3], ggml_element_size(key_states) * qk_nope_head_dim);
Need to implement GGML_OP_REPEAT for Metal - I can try to do that tomorrow
And GGML_OP_SET as well. I am not quite sure what this op does. Could it be replaced with a ggml_cpy into views?
It should be possible, though I think it would require ggml_build_forward_expand calls, which is not very nice.
I think this OP does something like this:
k_pe: kkk
q_pe: qqq qqq qqq
k_nope: nnn
key_states: nnn kkk kkk kkk
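For reference, a minimal sketch of the ggml_cpy-into-views alternative discussed above (illustrative only, reusing the tensor names from the snippet; this is not the PR's code):

struct ggml_tensor * view_nope = ggml_view_3d(ctx0, key_states,
        n_embd_head_qk_nope, n_head, n_tokens,
        key_states->nb[1], key_states->nb[2], 0);
ggml_build_forward_expand(gf, ggml_cpy(ctx0, k_nope, view_nope));

struct ggml_tensor * view_pe = ggml_view_3d(ctx0, key_states,
        n_embd_head_qk_rope, n_head, n_tokens,
        key_states->nb[1], key_states->nb[2],
        ggml_element_size(key_states) * n_embd_head_qk_nope);
ggml_build_forward_expand(gf, ggml_cpy(ctx0, ggml_repeat(ctx0, k_pe, view_pe), view_pe));

The explicit ggml_build_forward_expand calls are exactly the part that makes this less attractive than a proper concat.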
Having ggml_new_tensor_3d is not great either: they will be treated as inputs in ggml-alloc and allocated at the beginning of the compute buffer, which will increase memory usage since there is one for each layer.
It might be better to spend the time to make a more general ggml_concat operation.
Hm, how would broadcasting work with ggml_concat? This is the current API:
// concat a and b along dim
// used in stable-diffusion
GGML_API struct ggml_tensor * ggml_concat(
struct ggml_context * ctx,
struct ggml_tensor * a,
struct ggml_tensor * b,
int dim);
If you rebase on master and apply the following patch, it should work with CUDA and Metal:
diff --git a/llama.cpp b/llama.cpp
index cef5bfdd..9c80a621 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -11305,18 +11305,11 @@ struct llm_build_context {
);
cb(k_pe, "k_pe", il);
- struct ggml_tensor * q_states = ggml_new_tensor_3d(ctx0, q_nope->type, hparams.n_embd_head_k, n_head, n_tokens);
+ struct ggml_tensor * q_states = ggml_concat(ctx0, q_nope, q_pe, 0);
cb(q_states, "q_states", il);
- q_states = ggml_set_inplace(ctx0, q_states, q_nope, q_states->nb[1], q_states->nb[2], q_states->nb[3], 0);
- q_states = ggml_set_inplace(ctx0, q_states, q_pe, q_states->nb[1], q_states->nb[2], q_states->nb[3], ggml_element_size(q_states) * n_embd_head_qk_nope);
- k_pe = ggml_repeat(ctx0, k_pe, q_pe);
- cb(k_pe, "k_pe", il);
-
- struct ggml_tensor * k_states = ggml_new_tensor_3d(ctx0, q_nope->type, hparams.n_embd_head_k, n_head, n_tokens);
+ struct ggml_tensor * k_states = ggml_concat(ctx0, k_nope, ggml_repeat(ctx0, k_pe, q_pe), 0);
cb(k_states, "k_states", il);
- k_states = ggml_set_inplace(ctx0, k_states, k_nope, k_states->nb[1], k_states->nb[2], k_states->nb[3], 0);
- k_states = ggml_set_inplace(ctx0, k_states, k_pe, k_states->nb[1], k_states->nb[2], k_states->nb[3], ggml_element_size(k_states) * n_embd_head_qk_nope);
cur = llm_build_kv(ctx0, model, hparams, cparams, kv_self, gf,
model.layers[il].wo, NULL,
Hm, how would broadcasting work with ggml_concat? This is the current API:
@ggerganov forget it, it seems that even in numpy and pytorch concat doesn't support broadcasting
If you rebase on master and apply the following patch, it should work with CUDA and Metal: (patch quoted above)
@ggerganov Done, looks much cleaner now, thank you! Works without issues (tested on CPU).
Nice, I also tested the Lite model with Metal. Let's merge after @slaren approval
@foldl While I agree that there may be some issues with the YaRN implementation for models using NeoX-style RoPE (at least by looking at the code), I think DeepSeek-V2 is not affected by them, as it falls into the preceding conditional block.
@fairydreaming Got it. Your PR uses RoPE type…
…_head_qk_rope, n_embd_head_qk_nope
…ace with single ggml_concat in build_deepseek2()
Do we expect this to not work on CUDA? Trying to make an imatrix for https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite-Chat results in: GGML_ASSERT: ggml-cuda/concat.cu:107: ggml_is_contiguous(src0)
I got the same issue.
I built with CMake:
Following is the log when running:
I'll try to fix this now - I didn't realize the concat tensors in DS2 are not contiguous
Please test #7610 to see if it works now
@ggerganov I think there is still something wrong with the CUDA implementation. The more layers I offload to the GPU, the more broken the model output is. Is there any way to print tensors from the CUDA code? I'd love to be able to add GGML graph nodes for printing tensors, something like…
You can already do that with the eval-callback example |
@ggerganov The first meaningful difference is in the…
@ggerganov Another difference in tensor values I found was in norm-26 = (f32) RMS_NORM(compressed_kv-26). So it looks like normalization is also affected by non-contiguous tensors. After adding the following changes:
I can offload the whole model to GPU and it works correctly.
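For illustration, a change of the kind described above could look roughly like this in the DS2 graph (a sketch under the assumption that ggml_cont is inserted before the affected ops; the actual patch may differ):

// make the strided views contiguous before they reach ops whose CUDA kernels assume contiguous input
compressed_kv = ggml_cont(ctx0, compressed_kv); // before the RMS norm
k_pe          = ggml_cont(ctx0, k_pe);          // before ggml_rope_ext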
Hm, it's strange that we are missing the proper asserts for these - I thought at least for… But anyway, it would probably be better to support non-contiguous rope and norm. I'll also have to double-check the Metal backend and add tests
I've started adding tests and fixing some of the RoPE issues in #7617. I think we should try to avoid…
k_pe = ggml_rope_ext(
    ctx0, ggml_view_3d(ctx0, k_pe, n_embd_head_qk_rope, 1, n_tokens, k_pe->nb[0], k_pe->nb[1], 0), inp_pos, nullptr,
    n_rot, rope_type, 0, n_orig_ctx, freq_base, freq_scale,
Is it correct here that we rotate just the first head of k_pe?
With DS2 Lite, there is a second head here which seems to be simply discarded, if I understand correctly
Edit: nvm got it
@fairydreaming Could you give #7617 a try and see if it works correctly? I ended up implementing your patch because the idea that I had is not going to work: the partial RoPE requires that the rotated dimensions are at the beginning of the head, but here they are at the end. There also seem to be other limitations, so I guess the current…

The CUDA kernels could be extended to support non-contiguous RoPE and norm, but I'm not sure if this is going to be worth it - I guess it might even hurt the performance. So instead, I added asserts to prevent using non-contiguous data with these ops.

I've also simplified some of the stride and offset computations in the DS2 compute graph by using the…
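The added guards presumably look like the assertion already seen in the concat kernel above (illustrative; the exact placement in the CUDA rope/norm code is assumed):

GGML_ASSERT(ggml_is_contiguous(src0)); // fail loudly instead of silently producing wrong results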
@ggerganov Sure, seems to work OK both on CPU and on GPU (CUDA).
Roger that, I checked the performance difference on the CPU caused by the added ggml_cont:
without ggml_cont:
with ggml_cont:
If needed I guess there could be separate implementations for contiguous and non-contiguous tensors. However, it looks like a lot of work and the efficiency gains won't be large (or, like you said, on CUDA the performance may be even worse), so IMHO it can stay this way.
Possibly relevant - #2445 (comment)
This pull request adds support for DeepseekV2ForCausalLM-based models. Both lite and non-lite models are supported. Fixes #7118
Changes included in this pull request:
This pull request also fixes #7331 by removing the failing assertion.
Note that I had to change the llm_build_moe_ffn() API to add scaling of MoE gate output - added a bool indicating whether to scale or not and a float scale value. Let me know if there is a better way.
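As a rough illustration of that API change (parameter and variable names here are placeholders, not necessarily the ones used in the PR):

// inside llm_build_moe_ffn(): optionally scale the gating output before it weights the experts
if (scale_w) {
    weights = ggml_scale(ctx0, weights, w_scale); // e.g. DeepSeek-V2's routed scaling factor
}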
The implementation also required a somewhat ugly workaround for the problems with the YaRN implementation that I described in detail in #7416.
One thing I'm not sure about - shall I print all new parameter values in llm_load_print_meta() or print them only for LLM_ARCH_DEEPSEEK2 models (since they are specific to this architecture)? Or perhaps simply not print them at all?
I'll correct the whitespace formatting after we agree on the parameter names.