
llama : add T5 (encoder-decoder) support #5763

Closed
ggerganov opened this issue Feb 28, 2024 · 46 comments · Fixed by #8141

@ggerganov
Member

Still not familiar with the details, but it seems it would be useful to support this architecture in llama.cpp. First, we need to decide on the API and see what changes would be necessary.

See discussion here: #247

@ggerganov ggerganov added the model Model specific label Feb 28, 2024
@ggerganov ggerganov moved this to Todo in ggml : roadmap Feb 28, 2024
@dranger003
Contributor

@ggerganov Does this mean llama.cpp could support something like the new GritLM model which can handle both text representations and text generation? I tried the embedding sample with gritlm but the resulting embeddings don't look right.

Some references:
https://github.com/ContextualAI/gritlm/blob/92025b16534712b31b3c4aaaf069350e222bd5f8/gritlm/gritlm.py#L93
https://huggingface.co/GritLM/GritLM-7B

@ggerganov
Member Author

This issue is about a different architecture (encoder + decoder). GritLM looks like a decoder-only Mistral fine-tune, so it should already work. If you think the results are not OK, you can open an issue with steps to reproduce.

@sorasoras

I am looking forward to this.
How much work would be needed to implement this?

@ngxson
Collaborator

ngxson commented Feb 28, 2024

@dranger003 That's probably because GritLM uses two prompt templates: one is used only for text generation and one only for embedding. Can you try embedding with the template specified by the author?

Feel free to open a dedicated issue to discuss this in detail.

@dranger003
Contributor

@ngxson thanks, I used the proper template. I opened an issue with a sample program.

@Mihaiii
Contributor

Mihaiii commented Mar 1, 2024

T5 support would be truly awesome, expanding opportunities for numerous enterprise use cases.

@github-actions github-actions bot added the stale label Apr 1, 2024
@github-actions

This issue was closed because it has been inactive for 14 days since being marked as stale.

@nooobkevin

Any update on this issue?

@Mihaiii
Contributor

Mihaiii commented May 20, 2024

Any news on this one? :)

@vladfaust

I'm also waiting for T5 (i.e. encoder-decoder) support in llama.cpp. Why? Because I could not find any embeddable (C++, C or Rust) T5 implementation with KV cache, out-of-the-box quantization and grammar support. I wish I could help with the development, but this is currently out of my league. 🥲

@kabachuha

kabachuha commented Jun 3, 2024

It would also be nice for fast encoding of text embeddings for image generation, as in PixArt and the upcoming Stable Diffusion 3 (due June 12th), since they use T5.

@fairydreaming
Collaborator

I have T5 working in llama.cpp, but the code needs to be cleaned up and it still uses an additional header file (darts.h - Double-ARray Trie System, MIT license) needed by the unigram tokenizer implementation. The git diff is 2.5k lines long ;_;

@fairydreaming fairydreaming self-assigned this Jun 9, 2024
@ggerganov
Member Author

What functionality does darts.h provide? If it is just for performant string searches, we can replace it with a basic naive implementation for a start.

@fairydreaming
Collaborator

What functionality does darts.h provide? If it is just for performant string searches, we can replace it with a basic naive implementation for a start.

@ggerganov It's a C++ header-only trie implementation. Currently it's used in three places:

  1. Finding user-defined tokens during input normalization, so they won't be normalized
  2. Normalization of input before tokenization
  3. Finding tokens during tokenization

While 1 and 3 could be replaced with a naive string search, for 2 the trie is created based on the precompiled_charsmap from the SentencePiece tokenizer model. It's basically a binary blob containing pre-tokenization normalization rules. Some information about it is here. I didn't examine it in detail, so I'm not sure yet whether the normalization rules can be applied without using the trie.

@fairydreaming
Collaborator

Things are going better than expected - I managed to get rid of the darts.h dependency and implement the necessary functionality. My naive trie implementation is 2x slower than darts.h and more memory-hungry, but I guess we can build from that. I still have some code to rewrite, but it shouldn't take as long as I initially thought.
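
For illustration, this is roughly what such a naive trie can look like (a hypothetical sketch, not the actual llama.cpp code): tokens are inserted once when the vocabulary is loaded, and during tokenization the longest matching token is looked up at each position.

// Illustrative naive trie sketch (hypothetical, not the llama.cpp implementation).
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

struct token_trie {
    std::unordered_map<char, std::unique_ptr<token_trie>> children;
    int token_id = -1;  // -1 means no token ends at this node

    void insert(const std::string & text, int id) {
        token_trie * node = this;
        for (char c : text) {
            auto & child = node->children[c];
            if (!child) child = std::make_unique<token_trie>();
            node = child.get();
        }
        node->token_id = id;
    }

    // Longest token starting at `pos`; returns {token_id, match_length} or {-1, 0}.
    std::pair<int, size_t> longest_match(const std::string & text, size_t pos) const {
        const token_trie * node = this;
        std::pair<int, size_t> best = {-1, 0};
        for (size_t i = pos; i < text.size(); ++i) {
            auto it = node->children.find(text[i]);
            if (it == node->children.end()) break;
            node = it->second.get();
            if (node->token_id != -1) best = {node->token_id, i - pos + 1};
        }
        return best;
    }
};

A double-array trie such as darts.h packs the same structure into flat arrays, which is why the map-based version above is slower and more memory-hungry.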

@fairydreaming
Collaborator

fairydreaming commented Jun 13, 2024

I added a branch with my T5 implementation: https://github.com/fairydreaming/llama.cpp/tree/t5
This is still a work in progress. For now I modified main.cpp to include a llama_encode() call and pass the computed encoder embeddings to llama_decode(), so you can test it with the llama-cli command if you want:

./llama-cli -m models/t5-small.gguf -p 'translate English to German: The house is wonderful.'

which should result in:

...
system_info: n_threads = 32 / 64 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 512, n_batch = 2048, n_predict = -1, n_keep = 0


 Das Haus ist wunderbar. [end of text]

llama_print_timings:        load time =      19.07 ms
llama_print_timings:      sample time =       0.10 ms /     6 runs   (    0.02 ms per token, 63157.89 tokens per second)
llama_print_timings: prompt eval time =       4.19 ms /    11 tokens (    0.38 ms per token,  2625.93 tokens per second)
llama_print_timings:        eval time =      12.62 ms /     6 runs   (    2.10 ms per token,   475.62 tokens per second)
llama_print_timings:       total time =      32.13 ms /    17 tokens
Log end

I tried T5-small, T5-base and T5-large; they all seem to work OK. I also compared the layer outputs of T5-small with the transformers implementation, and they look the same.

Edit: forgot to mention that tests/test-c.c currently doesn't compile in the branch since I added some default argument values in headers. This is normal. ;)

@ggerganov
Member Author

ggerganov commented Jun 14, 2024

Very cool!

I'm wondering about the extended llama_batch with the n_enc_output and enc_output members. Is there some way in which enc_output is never presented to the user and remains internal to the llama_context? I'm looking for ways to simplify the interface. If the encoded embeddings remain within the context, then we don't have to explicitly pass them to llama_decode later.

@fairydreaming
Collaborator

@ggerganov Good advice. I did that and it definitely simplified things; I also added an is_encoding flag in the context to avoid passing additional parameters. I still need to research how batches work in order to support them properly.

MagnusS0 pushed a commit to MagnusS0/llama.cpp-normistral-tokenizer that referenced this issue Jul 1, 2024
…milies (ggml-org#5763)

* llama : add T5 model architecture, tensors and model header parameters

* llama : add implementation of Unigram tokenizer with SentencePiece-like text normalization using precompiled charsmap

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
fairydreaming added a commit that referenced this issue Jul 4, 2024
* llama : add inference support and model types for T5 and FLAN-T5 model families

* llama : add new API functions to support encoder-decoder models: llama_encode(), llama_model_has_encoder(), llama_model_decoder_start_token()

* common, llama-cli, llama-batched : add support for encoder-decoder models

* convert-hf : handle shared token embeddings tensors in T5Model

* convert-hf : add support for SentencePiece BPE tokenizer in T5Model (for Pile-T5 models)

* convert-hf : add MT5ForConditionalGeneration and UMT5ForConditionalGeneration to architectures supported by T5Model

* convert : add t5 tokenizer tests, use "slow" HF tokenizer for t5

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@fairydreaming
Collaborator

The third (and final) PR is now merged.
TODO some day: add support for encoder-decoder models in llama-server.

@ggerganov ggerganov moved this from In Progress to Done in ggml : roadmap Jul 4, 2024
@Sadeghi85

The third (and final) PR is now merged. TODO some day: add support for encoder-decoder models in llama-server.

@abetlen
@fairydreaming

Hi,

Does this code land in llama.dll? I ask because I use llama-cpp-python, which uses llama.dll.

Great job btw, thanks

@fairydreaming
Collaborator

Does this code land in llama.dll? I ask because I use llama-cpp-python, which uses llama.dll.

@Sadeghi85 I suppose the new code is there, but to use encoder-decoder models like T5 you have to use the new API functions: llama_model_has_encoder(), llama_encode(), llama_model_decoder_start_token(). So I think you will have to wait until the llama-cpp-python author (@abetlen) adds support for them.
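
For reference, the calls fit together roughly like this (a simplified sketch modelled on llama-cli; helpers such as llama_batch_get_one() have changed signatures across llama.cpp versions, so treat it as illustrative rather than exact):

// Sketch: generation with an encoder-decoder model (error handling and sampling omitted).
std::vector<llama_token> inp = tokens;  // tokenized prompt, assumed already available

if (llama_model_has_encoder(model)) {
    // Run the encoder once over the whole prompt; the encoder output is kept
    // inside the llama_context and consumed by the following decode calls.
    if (llama_encode(ctx, llama_batch_get_one(inp.data(), (int32_t) inp.size(), 0, 0)) != 0) {
        fprintf(stderr, "llama_encode() failed\n");
        return 1;
    }

    // The decoder starts from a dedicated start token (falls back to BOS if unset).
    llama_token start = llama_model_decoder_start_token(model);
    if (start == -1) {
        start = llama_token_bos(model);
    }

    // Seed the decoder with the start token, then generate with llama_decode()
    // plus sampling, exactly like a decoder-only model.
    inp.clear();
    inp.push_back(start);
}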

@0x4139
Contributor

0x4139 commented Oct 30, 2024

any news regarding llama-server support?

@Mihaiii
Contributor

Mihaiii commented Nov 3, 2024

any news regarding llama-server support?

@0x4139 I can help you with that. I left you a message on Facebook Messenger.

@kaismh

kaismh commented Nov 7, 2024

any news regarding llama-server support?

@fairydreaming
Collaborator

@0x4139 @kaismh I started working on the server support in my branch: https://github.com/fairydreaming/llama.cpp/tree/t5-server

If you are interested check it out and let me know if you have any problems.

@maurlco

maurlco commented Dec 12, 2024

Any news on llama-cpp-python, @abetlen? There is a real need for support of the encoder-decoder T5 code.
Thanks a lot @fairydreaming for integrating the T5 architecture!

@corysus

corysus commented Dec 14, 2024

@0x4139 @kaismh I started working on the server support in my branch: https://github.com/fairydreaming/llama.cpp/tree/t5-server

If you are interested check it out and let me know if you have any problems.

@fairydreaming I compiled your branch and started it with llama-server -m /models/flan-t5-small-q4_k_m.gguf -c 2048 --host 0.0.0.0
Then I tried to test it with the prompt: translate English to German: The house is wonderful. It sometimes works and sometimes doesn't, returning "|im_start|>assistant" and similar text.

I also tried a better T5 model for translation, madlad400-3b-mt-q4_k_m.gguf; it works perfectly with the CLI, but not with llama-server :( I'm not sure, but it's probably because of the chat template; I get this message when loading llama-server with the madlad400 model: main: The chat template that comes with this model is not yet supported, falling back to chatml. This may cause the model to output suboptimal responses.

@fairydreaming
Collaborator

@corysus I forgot to mention that I only tried the /completion API endpoint; I guess there's no point in using /v1/chat/completions, as T5-like models are not chat models and don't use chat templates. So if your frontend uses the OpenAI-compatible Chat Completions API (/v1/chat/completions), it most likely won't work correctly.
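
For a quick test you can hit /completion directly, e.g. (assuming llama-server is listening on the default port 8080):

curl --request POST --url http://localhost:8080/completion --header "Content-Type: application/json" --data '{"prompt": "translate English to German: The house is wonderful.", "n_predict": 64}'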

@corysus

corysus commented Dec 14, 2024

@fairydreaming Ahhh yes, I totally forgot to test the completion API endpoint and now I've tested it and the madlad400 model works very well. Thank you for this update.

@GodComplexxx

GodComplexxx commented Feb 12, 2025

@fairydreaming Did you encounter a maximum token length when sending to the server? I can't go over roughly 50 tokens even though the context is larger and there is plenty of memory; I get a CUDA error. Other models seem to work fine!

@fairydreaming
Collaborator

fairydreaming commented Feb 13, 2025

@GodComplexxx It seems to be a CUDA limitation: I use ggml_get_rows() to get the relative position bias, and the CUDA implementation of ggml_get_rows() fails when src1 is longer than 2^16 (which happens for batches longer than 256 tokens). My pal DeepSeek says that's because CUDA's maximum grid dimensions for the y and z axes are 65535 (2^16-1) on most GPUs (compute capability ≤ 6.x).
@JohannesGaessler can you confirm that?

@JohannesGaessler
Collaborator

Yes, the CUDA implementation for GET_ROWS has dst->ne[1] blocks in the y dimension so if that number is > 65535 the kernel launch fails.
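
To see why a batch of a few hundred tokens already hits this: the relative-position-bias lookup uses one index per (query, key) position pair, so the {258, 258} case discussed below needs 258 * 258 = 66564 rows, which is above the 65535 limit.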

@fairydreaming
Collaborator

Yes, the CUDA implementation for GET_ROWS has dst->ne[1] blocks in the y dimension so if that number is > 65535 the kernel launch fails.

@JohannesGaessler I've been trying to switch the implementation to use a 2D src1 to avoid this limitation, but I just can't wrap my head around how ggml_get_rows() is supposed to work for more than one src1 dimension. Why is there an assert GGML_ASSERT(a->ne[2] == b->ne[1]); in the implementation? Is there any documentation, or are there examples of this?

I want to do something quite intuitive: ggml_get_rows({32, 32, 1, 1}, {258, 258, 1, 1}) = {32, 258, 258, 1} instead of ggml_get_rows({32, 32, 1, 1}, {66564, 1, 1, 1}) = {32, 66564, 1, 1}.

@JohannesGaessler
Collaborator

Just swap dimensions x and y in the CUDA grid (so just swap the numbers in the dim3 for the number of blocks, and blockIdx.x and blockIdx.y in the code). The x dimension uses 32 instead of 16 bits, and I don't think we currently have any cases where the size of ne0 is a limiting factor.

IIRC GET_ROWS takes rows from src0 with indices defined in src1.

@fairydreaming
Collaborator

Just swap dimensions x and y in the CUDA grid (so just swap the numbers in the dim3 for the number of blocks, and blockIdx.x and blockIdx.y in the code). The x dimension uses 32 instead of 16 bits, and I don't think we currently have any cases where the size of ne0 is a limiting factor.

IIRC GET_ROWS takes rows from src0 with indices defined in src1.

I guess that's one possible solution, but I'd like to understand what's going on in the implementation.
I've looked at the ggml_get_rows() tests and they look like this:

        ggml_tensor * in = ggml_new_tensor_3d(ctx, type, n, m, b);
        ggml_set_name(in, "in");

        ggml_tensor * rows = ggml_new_tensor_2d(ctx, GGML_TYPE_I32, r, b);
        ggml_set_name(rows, "rows");

        ggml_tensor * out = ggml_get_rows(ctx, in, rows);
        ggml_set_name(out, "out");

So I guess the ne02 == ne11 assert is there because it's a "batch" dimension of some sort (so that each batch has its own set of rows and its own set of indices). With that in mind, I tried to use the ne12 dimension for the second dimension of my 2D indices. But in the CPU implementation (ggml_compute_forward_get_rows_f32, for simplicity) there is:

        ...
        ggml_vec_cpy_f32(nc,
                (float *) ((char *)  dst->data + i10*nb1  + i11*nb2  + i12*nb3),
                (float *) ((char *) src0->data + i01*nb01 + i11*nb02 + i12*nb03));
        ...

If I understand correctly, we copy whole rows of size nc here, so i01*nb01 moves the src0->data pointer to the current row to copy, i11*nb02 moves the pointer to the current "batch", and i12*nb03... I have no idea what the purpose of this is. I mean, i12 is calculated based on the src1 dimensions, so why is it also used to calculate the src0 pointer? (I think it moves the address outside the src0 size, and that's why my code doesn't work.)

@ggerganov Help?

@ggerganov
Member Author

So I guess the ne02 == ne11 assert is there because it's a "batch" dimension of some sort (so that each batch has its own set of rows and its own set of indices).

Yes, that's the intent.

i01*nb01 moves the src0->data pointer to the current row to copy,

Yes.

i11*nb02 moves the pointer to the current "batch", and i12*nb03...

This is currently treated as a "batch-of-batches" dimension, which misaligns with your intent of ggml_get_rows({32, 32, 1, 1}, {258, 258, 1, 1}) = {32, 258, 258, 1}. You should instead use ggml_get_rows({32, 32, 1, 1}, {66564, 1, 1, 1}) = {32, 66564, 1, 1} and then reshape the result to {32, 258, 258, 1} if that is what you need. But AFAIU the CUDA implementation has a limit on the number of rows. It seems like we should fix that in the CUDA implementation instead of changing the behaviour of the operator.
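
In ggml terms, the suggested get-then-reshape pattern looks roughly like this (a sketch; bias_table and the sizes stand in for the relative-position-bias table and index tensor from the example above):

// Sketch of the suggested approach (illustrative names and sizes).
// bias_table: [32, 32] table of relative position biases.
// idx:        [66564]  flattened I32 indices, one per (query, key) position pair.
struct ggml_tensor * idx  = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, 258*258);
struct ggml_tensor * rows = ggml_get_rows(ctx, bias_table, idx);        // -> [32, 66564]
struct ggml_tensor * bias = ggml_reshape_3d(ctx, rows, 32, 258, 258);   // -> [32, 258, 258]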

@fairydreaming
Collaborator

@GodComplexxx Try this patch with the changes that Johannes suggested; it seems to work:

diff --git a/ggml/src/ggml-cuda/getrows.cu b/ggml/src/ggml-cuda/getrows.cu
index 4c370323..dad1e15f 100644
--- a/ggml/src/ggml-cuda/getrows.cu
+++ b/ggml/src/ggml-cuda/getrows.cu
@@ -10,8 +10,8 @@ static __global__ void k_get_rows(
             /*size_t nb00,*/ size_t nb01, size_t nb02, size_t nb03,
             size_t s10, size_t s11, size_t s12/*, size_t s13*/) {
 
-    const int i00 = (blockIdx.x*blockDim.x + threadIdx.x)*2;
-    const int i10 = blockDim.y*blockIdx.y + threadIdx.y;
+    const int i00 = (blockIdx.y*blockDim.x + threadIdx.x)*2;
+    const int i10 = blockDim.y*blockIdx.x + threadIdx.y;
     const int i11 = (blockIdx.z*blockDim.z + threadIdx.z)/ne12;
     const int i12 = (blockIdx.z*blockDim.z + threadIdx.z)%ne12;
 
@@ -46,8 +46,8 @@ static __global__ void k_get_rows_float(
             /*size_t nb00,*/ size_t nb01, size_t nb02, size_t nb03,
             size_t s10, size_t s11, size_t s12/*, size_t s13*/) {
 
-    const int i00 = blockIdx.x*blockDim.x + threadIdx.x;
-    const int i10 = blockDim.y*blockIdx.y + threadIdx.y;
+    const int i00 = blockIdx.y*blockDim.x + threadIdx.x;
+    const int i10 = blockDim.y*blockIdx.x + threadIdx.y;
     const int i11 = (blockIdx.z*blockDim.z + threadIdx.z)/ne12;
     const int i12 = (blockIdx.z*blockDim.z + threadIdx.z)%ne12;
 
@@ -71,7 +71,7 @@ static void get_rows_cuda(const ggml_tensor * src0, const ggml_tensor * src1, gg
 
     const dim3 block_dims(CUDA_GET_ROWS_BLOCK_SIZE, 1, 1);
     const int block_num_x = (ne00 + 2*CUDA_GET_ROWS_BLOCK_SIZE - 1) / (2*CUDA_GET_ROWS_BLOCK_SIZE);
-    const dim3 block_nums(block_num_x, ne10, ne11*ne12);
+    const dim3 block_nums(ne10, block_num_x, ne11*ne12);
 
     // strides in elements
     //const size_t s0 = nb0 / ggml_element_size(dst);
@@ -105,7 +105,7 @@ static void get_rows_cuda_float(const ggml_tensor * src0, const ggml_tensor * sr
 
     const dim3 block_dims(CUDA_GET_ROWS_BLOCK_SIZE, 1, 1);
     const int block_num_x = (ne00 + CUDA_GET_ROWS_BLOCK_SIZE - 1) / CUDA_GET_ROWS_BLOCK_SIZE;
-    const dim3 block_nums(block_num_x, ne10, ne11*ne12);
+    const dim3 block_nums(ne10, block_num_x, ne11*ne12);
 
     // strides in elements
     //const size_t s0 = nb0 / ggml_element_size(dst);

Also, I wonder if there is any T5-based model capable of generating very long outputs; it would be useful for testing.

@GodComplexxx

Thanks, this worked perfectly!
