
Bug: Decoding special tokens in T5 #8938

Closed
cyanic-selkie opened this issue Aug 8, 2024 · 6 comments · Fixed by #8951
Labels: bug-unconfirmed, high severity


@cyanic-selkie

What happened?

I have a T5/LoRA model trained to output text separated by the <extra_id_0> special token (the tokenizer works properly after following the instructions in #8872).

When running the model using Hugging Face's transformers/peft, it generates the expected output. However, when I use llama-cli, the moment the first such token is reached, it is decoded as an EOG (end-of-generation) token instead of the extra token, and generation stops.

I might simply be doing something wrong in using the library.
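
For readers unfamiliar with the stopping logic: llama-cli halts generation when the sampled token is classified as end-of-generation. Below is a minimal sketch of that check; the helper function is hypothetical and only illustrates where the public llama_token_is_eog() API comes in, it is not the actual llama-cli source.

#include "llama.h"

// Hypothetical helper (not the actual llama-cli code): if the GGUF metadata
// marks a token such as <extra_id_0> as EOG, this returns true and the
// caller breaks out of its sampling loop, ending generation early.
bool should_stop_generation(const llama_model * model, llama_token token) {
    return llama_token_is_eog(model, token);
}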

Name and Version

version: 3549 (afd27f0)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

No response

Relevant log output

No response

cyanic-selkie added the bug-unconfirmed and high severity labels on Aug 8, 2024
@fairydreaming
Collaborator

If you upload your transformers model somewhere, I can take a look.

@cyanic-selkie
Author

cyanic-selkie commented Aug 9, 2024

@fairydreaming Sure, thanks.

Base model: https://huggingface.co/repetitio/flan-t5-small

LoRA: https://huggingface.co/repetitio/distilled-simplifier

All of the GGUF variants give the same result. The input is just any Wikipedia-like paragraph.

@fairydreaming
Collaborator

The problem is that the current T5 model implementation ignores LoRA in the attention matrices. Fixing this is easy; try this patch:

diff --git a/src/llama.cpp b/src/llama.cpp
index a7b1c9eb..33b53e60 100644
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -13178,13 +13178,13 @@ struct llm_build_context {
 
                 // self-attention
                 {
-                    struct ggml_tensor * Qcur = ggml_mul_mat(ctx0, model.layers[il].wq_enc, cur);
+                    struct ggml_tensor * Qcur = llm_build_lora_mm(lctx, ctx0, model.layers[il].wq_enc, cur);
                     cb(Qcur, "Qcur", il);
 
-                    struct ggml_tensor * Kcur = ggml_mul_mat(ctx0, model.layers[il].wk_enc, cur);
+                    struct ggml_tensor * Kcur = llm_build_lora_mm(lctx, ctx0, model.layers[il].wk_enc, cur);
                     cb(Kcur, "Kcur", il);
 
-                    struct ggml_tensor * Vcur = ggml_mul_mat(ctx0, model.layers[il].wv_enc, cur);
+                    struct ggml_tensor * Vcur = llm_build_lora_mm(lctx, ctx0, model.layers[il].wv_enc, cur);
                     cb(Vcur, "Vcur", il);
 
                     Qcur = ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, n_tokens);
@@ -13218,7 +13218,7 @@ struct llm_build_context {
 
                     ggml_build_forward_expand(gf, cur);
 
-                    cur = ggml_mul_mat(ctx0, model.layers[il].wo_enc, cur);
+                    cur = llm_build_lora_mm(lctx, ctx0, model.layers[il].wo_enc, cur);
                     cb(cur, "kqv_out", il);
                 }
 
@@ -13292,13 +13292,13 @@ struct llm_build_context {
 
                 // self-attention
                 {
-                    struct ggml_tensor * Qcur = ggml_mul_mat(ctx0, model.layers[il].wq, cur);
+                    struct ggml_tensor * Qcur = llm_build_lora_mm(lctx, ctx0, model.layers[il].wq, cur);
                     cb(Qcur, "Qcur", il);
 
-                    struct ggml_tensor * Kcur = ggml_mul_mat(ctx0, model.layers[il].wk, cur);
+                    struct ggml_tensor * Kcur = llm_build_lora_mm(lctx, ctx0, model.layers[il].wk, cur);
                     cb(Kcur, "Kcur", il);
 
-                    struct ggml_tensor * Vcur = ggml_mul_mat(ctx0, model.layers[il].wv, cur);
+                    struct ggml_tensor * Vcur = llm_build_lora_mm(lctx, ctx0, model.layers[il].wv, cur);
                     cb(Vcur, "Vcur", il);
 
                     llm_build_kv_store(ctx0, hparams, cparams, kv_self, gf, Kcur, Vcur, n_tokens, kv_head, cb, il);
@@ -13345,7 +13345,7 @@ struct llm_build_context {
 
                     ggml_build_forward_expand(gf, cur);
 
-                    cur = ggml_mul_mat(ctx0, model.layers[il].wo, cur);
+                    cur = llm_build_lora_mm(lctx, ctx0, model.layers[il].wo, cur);
                     cb(cur, "kqv_out", il);
                 }
 
@@ -13362,13 +13362,13 @@ struct llm_build_context {
 
                 // cross-attention
                 {
-                    struct ggml_tensor * Qcur = ggml_mul_mat(ctx0, model.layers[il].wq_cross, cur);
+                    struct ggml_tensor * Qcur = llm_build_lora_mm(lctx, ctx0, model.layers[il].wq_cross, cur);
                     cb(Qcur, "Qcur", il);
 
-                    struct ggml_tensor * Kcur = ggml_mul_mat(ctx0, model.layers[il].wk_cross, embd_enc);
+                    struct ggml_tensor * Kcur = llm_build_lora_mm(lctx, ctx0, model.layers[il].wk_cross, embd_enc);
                     cb(Kcur, "Kcur", il);
 
-                    struct ggml_tensor * Vcur = ggml_mul_mat(ctx0, model.layers[il].wv_cross, embd_enc);
+                    struct ggml_tensor * Vcur = llm_build_lora_mm(lctx, ctx0, model.layers[il].wv_cross, embd_enc);
                     cb(Vcur, "Vcur", il);
 
                     Qcur = ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head,    n_tokens);
@@ -13397,7 +13397,7 @@ struct llm_build_context {
 
                     ggml_build_forward_expand(gf, cur);
 
-                    cur = ggml_mul_mat(ctx0, model.layers[il].wo_cross, cur);
+                    cur = llm_build_lora_mm(lctx, ctx0, model.layers[il].wo_cross, cur);
                     cb(cur, "kqv_out", il);
                 }

Basically, you have to replace every matrix multiplication ggml_mul_mat(ctx0, ...) whose weight has a LoRA adapter with the corresponding llm_build_lora_mm(lctx, ctx0, ...). After this (I think I covered all the places, but please check), the model generates special tokens as (hopefully) expected:

[1723207411] last: [ '<extra_id_0>':32099, ' Artificial':24714, ' intelligence':6123, ' is':19, ' ':3, 'exhibited':21102, ' by':57, ' machines':4096, ',':6, ' particularly':1989, ' computer':1218, ' systems':1002, '.':5, '<extra_id_0>':32099, ' Artificial':24714, ' intelligence':6123, ' is':19, ' ':3, 'a':9, ' field':1057, ' of':13, ' research':585, ' in':16, ' computer':1218, ' science':2056, '.':5, '<extra_id_0>':32099, ' Artificial':24714, ' intelligence':6123, ' is':19, ' ':3, 'a':9, ' field':1057, ' of':13, ' research':585, ' in':16, ' computer':1218, ' science':2056, '.':5, '<extra_id_0>':32099, ' Artificial':24714, ' intelligence':6123, ' is':19, ' ':3, 'a':9, ' field':1057, ' of':13, ' research':585, ' in':16, ' computer':1218, ' science':2056, '.':5, '<extra_id_0>':32099, ' Artificial':24714, ' intelligence':6123, ' is':19, ' ':3, 'a':9, ' field':1057, ' of':13, ' research':585, ' in':16, ' computer':1218, ' science':2056 ]
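
For context, llm_build_lora_mm wraps the plain matrix multiplication and adds the low-rank LoRA update on top, conceptually W·x + scale·(B·(A·x)). The sketch below is a paraphrase of that idea using ggml primitives for a single adapter; the function name and single-adapter signature are illustrative, not the exact src/llama.cpp implementation, which iterates over all loaded adapters.

#include "ggml.h"

// Simplified, single-adapter sketch of the idea behind llm_build_lora_mm
// (paraphrased). The base weight is applied as before, then the scaled
// low-rank update B·(A·x) is added to the result.
static ggml_tensor * lora_mm_sketch(
        ggml_context * ctx,
        ggml_tensor  * w,       // base weight matrix
        ggml_tensor  * a,       // LoRA A (projects the input down to the rank)
        ggml_tensor  * b,       // LoRA B (projects back up to the output size)
        ggml_tensor  * x,       // input activations
        float          scale) { // adapter scale, typically alpha / rank
    ggml_tensor * base = ggml_mul_mat(ctx, w, x);            // W·x
    ggml_tensor * ax   = ggml_mul_mat(ctx, a, x);            // A·x
    ggml_tensor * bax  = ggml_mul_mat(ctx, b, ax);           // B·(A·x)
    return ggml_add(ctx, base, ggml_scale(ctx, bax, scale)); // W·x + s·B·(A·x)
}

This also explains the symptom: with plain ggml_mul_mat the adapter's A and B tensors are never applied to those projections, so the fine-tuned behavior (emitting <extra_id_0>) is silently lost.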

fairydreaming added a commit that referenced this issue Aug 9, 2024
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
@cyanic-selkie
Author

@fairydreaming Sorry, could you share the command you used, please? I'm not able to reproduce your results. I'm running:

./llama-cli --model <path_to_flan_t5_small.gguf> -p <prompt> --lora <path_to_adapter.gguf>

@fairydreaming
Collaborator

fairydreaming commented Aug 9, 2024

Sure thing:

./llama-cli --numa distribute -t 32 -m models/repetitio/flan-t5-small-f32-2.gguf --lora models/repetitio/adapter_model-f32-2.gguf -p "Artificial intelligence (AI), in its broadest sense, is intelligence exhibited by machines, particularly computer systems. It is a field of research in computer science that develops and studies methods and software that enable machines to perceive their environment and use learning and intelligence to take actions that maximize their chances of achieving defined goals.[1] Such machines may be called AIs." --temp 0.01 -s 42 --special

I reconverted both GGUFs from safetensors:

./convert_hf_to_gguf.py /mnt/md0/huggingface/hub/models--repetitio--flan-t5-small/snapshots/d456a1448a27c26bfebf9cd864056e1b9a993576/ --outfile models/repetitio/flan-t5-small-f32-2.gguf --outtype "f32"

./convert_lora_to_gguf.py --outfile models/repetitio/adapter_model-f32-2.gguf --outtype f32 --base /mnt/md0/huggingface/hub/models--repetitio--flan-t5-small/snapshots/d456a1448a27c26bfebf9cd864056e1b9a993576/ /mnt/md0/huggingface/hub/models--repetitio--distilled-simplifier/snapshots/b54eff46f3a8405a5ec1ecc3d84ac4f4f3c69234/

With these I get the same layer output values as in transformers.

The output is:

<pad><extra_id_0> Artificial intelligence is a field of research in computer science.<extra_id_0> Artificial intelligence is a field of research in computer science.<extra_id_0> Artificial intelligence is a field of research in computer science.<extra_id_0> The research in computer science is a field of research in computer science.<extra_id_0> The research in computer science is a field of research in computer science.<extra_id_0> The research in computer science is a field of research in computer science.<extra_id_0> The research in computer science is a field of research in computer science.<extra_id_0> The research in computer science is a field of research in computer science.<extra_id_0> The research in computer science is a field of research in computer science.<extra_id_0> The research in computer science is a field of research in computer science.<extra_id_0> The research in computer science is a field of research in computer science.<extra_id_0> The research in computer science is a field of research in computer science.<extra_id_0> The research in computer science is a field of research in computer science.<extra_id_0> The research in computer science is a field of research in computer science.<extra_id_0> The research in computer science is a field of research in computer science.<extra_id_0> The research in computer science is a field of research in computer science.<extra_id_0> The research in computer science is a field of research in computer science.<extra_id_0> The research in computer science is a field of research in computer science.<extra_id_0> The research in computer science is a field of research in computer science.<extra_id_0> The research in computer science is a field of research in computer science.<extra_id_0> The research in computer science is a field of research in computer science.<extra_id_0> The research in computer science is a field of research in computer science.<extra_id_0> The research in computer science is a field of research in computer science.<extra_id_0> The research in computer science is a field of research in computer science.<extra_id_0> The research in computer science is a field of research in computer science.<extra_id_0> The research in computer science is a field of research in computer science.<extra_id_0> The research in computer science is a field of research in computer science.<extra_id_0> The research in computer science is a field of research in computer science.<extra_id_0> The research in computer science is a field of research in computer science.<extra_id_0> The research in computer science is a field of research in computer science.<extra_id_0> The research in computer science is a field of research in computer science.<extra_id_0> The research in computer science is a field of research in computer science.<extra_id_0> The research in computer science is a field of research in computer science.</s> [end of text]

@cyanic-selkie
Author

Awesome, thanks!

arthw pushed a commit to arthw/llama.cpp that referenced this issue Nov 15, 2024
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
arthw pushed a commit to arthw/llama.cpp that referenced this issue Nov 18, 2024
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>