
llama : add T5 (encoder-decoder) support #5763

Closed
ggerganov opened this issue Feb 28, 2024 · 46 comments · Fixed by #8141

@ggerganov
Member

Still not familiar with the details, but it seems it would be useful to support this architecture in llama.cpp. First, we need to decide on the API and see what changes would be necessary.

See discussion here: #247

@ggerganov ggerganov added the model Model specific label Feb 28, 2024
@ggerganov ggerganov moved this to Todo in ggml : roadmap Feb 28, 2024
@dranger003
Contributor

@ggerganov Does this mean llama.cpp could support something like the new GritLM model which can handle both text representations and text generation? I tried the embedding sample with gritlm but the resulting embeddings don't look right.

Some references:
https://github.com/ContextualAI/gritlm/blob/92025b16534712b31b3c4aaaf069350e222bd5f8/gritlm/gritlm.py#L93
https://huggingface.co/GritLM/GritLM-7B

@ggerganov
Member Author

This issue is about a different architecture (encoder + decoder). GritLM looks like a decoder-only Mistral fine-tune, so it should already work. If you think the results are not OK, you can open an issue with steps to reproduce.

@sorasoras

I am looking forward to this.
How much work would be needed to implement this?

@ngxson
Collaborator

ngxson commented Feb 28, 2024

@dranger003 That's probably because GritLM uses two prompt templates: one is used only for text generation and one only for embedding. Can you try embedding with the template specified by the author?

Feel free to open a dedicated issue to discuss this in detail.

@dranger003
Contributor

@ngxson thanks, I used the proper template. I opened an issue with a sample program.

@Mihaiii
Contributor

Mihaiii commented Mar 1, 2024

T5 support would be truly awesome, expanding opportunities for numerous enterprise use cases.

@github-actions github-actions bot added the stale label Apr 1, 2024
@github-actions

This issue was closed because it has been inactive for 14 days since being marked as stale.

@nooobkevin

Any update on this issue?

@Mihaiii
Contributor

Mihaiii commented May 20, 2024

Any news on this one? :)

@vladfaust

I'm also waiting for T5 (i.e. encoder-decoder) support in llama.cpp. Why? Because I could not find any embeddable (C++, C or Rust) T5 implementation with KV cache, out-of-the-box quantization and grammar support. I wish I could help with the development, but this is currently out of my league. 🥲

@kabachuha

kabachuha commented Jun 3, 2024

It would also be nice for fast encoding of text embeddings for image generation, as in PixArt and the upcoming Stable Diffusion 3 (due June 12th), since they use T5.

@fairydreaming
Collaborator

I have T5 working in llama.cpp, but the code needs to be cleaned up and it still uses an additional header file (darts.h - Double-ARray Trie System, MIT license) needed by the unigram tokenizer implementation. The git diff is 2.5k lines long ;_;

@fairydreaming fairydreaming self-assigned this Jun 9, 2024
@ggerganov
Member Author

What functionality does darts.h provide? If it is just for performant string searches, we can replace it with a basic naive implementation for a start.

@fairydreaming
Collaborator

What functionality does darts.h provide? If it is just for performant string searches, we can replace it with a basic naive implementation for a start.

@ggerganov It's a C++ header-only trie implementation. Currently it's used in three places:

  1. Finding user-defined tokens during input normalization, so they won't be normalized
  2. Normalization of input before tokenization
  3. Finding tokens during tokenization

While 1 and 3 could be replaced with a naive string search, for 2 the trie is created based on the precompiled_charsmap from the SentencePiece tokenizer model. It's basically a binary blob containing pre-tokenization normalization rules. Some information about it is here. I didn't examine it in detail, so I'm not sure yet whether the normalization rules can be applied without using the trie.

@fairydreaming
Collaborator

Things are going better than expected - I managed to get rid of the darts.h dependency and implement the necessary functionality. My naive trie implementation is 2x slower than darts.h and more memory-hungry, but I guess we can build from that. I still have some code to rewrite, but it shouldn't take as long as I initially thought.
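
For illustration, this is roughly what such a naive trie can look like (a hypothetical sketch, not the actual llama.cpp code): tokens are inserted once when the vocabulary is loaded, and during tokenization the longest matching token is looked up at each position.

// Illustrative naive trie sketch (hypothetical, not the llama.cpp implementation).
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

struct token_trie {
    std::unordered_map<char, std::unique_ptr<token_trie>> children;
    int token_id = -1;  // -1 means no token ends at this node

    void insert(const std::string & text, int id) {
        token_trie * node = this;
        for (char c : text) {
            auto & child = node->children[c];
            if (!child) child = std::make_unique<token_trie>();
            node = child.get();
        }
        node->token_id = id;
    }

    // Longest token starting at `pos`; returns {token_id, match_length} or {-1, 0}.
    std::pair<int, size_t> longest_match(const std::string & text, size_t pos) const {
        const token_trie * node = this;
        std::pair<int, size_t> best = {-1, 0};
        for (size_t i = pos; i < text.size(); ++i) {
            auto it = node->children.find(text[i]);
            if (it == node->children.end()) break;
            node = it->second.get();
            if (node->token_id != -1) best = {node->token_id, i - pos + 1};
        }
        return best;
    }
};

A double-array trie such as darts.h packs the same structure into flat arrays, which is why the map-based version above is slower and more memory-hungry.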

@fairydreaming
Collaborator

fairydreaming commented Jun 13, 2024

I added a branch with my T5 implementation: https://github.com/fairydreaming/llama.cpp/tree/t5
This is still a work in progress. For now I modified main.cpp to include a llama_encode() call and pass the computed encoder embeddings to llama_decode(), so you can test it with the llama-cli command if you want:

./llama-cli -m models/t5-small.gguf -p 'translate English to German: The house is wonderful.'

which should result in:

...
system_info: n_threads = 32 / 64 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 512, n_batch = 2048, n_predict = -1, n_keep = 0


 Das Haus ist wunderbar. [end of text]

llama_print_timings:        load time =      19.07 ms
llama_print_timings:      sample time =       0.10 ms /     6 runs   (    0.02 ms per token, 63157.89 tokens per second)
llama_print_timings: prompt eval time =       4.19 ms /    11 tokens (    0.38 ms per token,  2625.93 tokens per second)
llama_print_timings:        eval time =      12.62 ms /     6 runs   (    2.10 ms per token,   475.62 tokens per second)
llama_print_timings:       total time =      32.13 ms /    17 tokens
Log end

I tried T5-small, T5-base and T5-large; they all seem to work OK. I also compared the layer outputs of T5-small with the transformers implementation, and they look the same.

Edit: forgot to mention that tests/test-c.c currently doesn't compile in the branch since I added some default argument values in headers. This is normal. ;)

@ggerganov
Member Author

ggerganov commented Jun 14, 2024

Very cool!

I'm wondering about the extended llama_batch with the n_enc_output and enc_output members. Is there some way in which enc_output is never presented to the user and remains internal to the llama_context? I'm looking for ways to simplify the interface. If the encoded embeddings remain within the context, then we don't have to explicitly pass them to llama_decode later.

@fairydreaming
Collaborator

@ggerganov Good advice. I did that and it definitely simplified things; I also added an is_encoding flag in the context to avoid passing additional parameters. I still need to research how batches work in order to support them properly.

MagnusS0 pushed a commit to MagnusS0/llama.cpp-normistral-tokenizer that referenced this issue Jul 1, 2024
…milies (ggml-org#5763)

* llama : add T5 model architecture, tensors and model header parameters

* llama : add implementation of Unigram tokenizer with SentencePiece-like text normalization using precompiled charsmap

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
fairydreaming added a commit that referenced this issue Jul 4, 2024
* llama : add inference support and model types for T5 and FLAN-T5 model families

* llama : add new API functions to support encoder-decoder models: llama_encode(), llama_model_has_encoder(), llama_model_decoder_start_token()

* common, llama-cli, llama-batched : add support for encoder-decoder models

* convert-hf : handle shared token embeddings tensors in T5Model

* convert-hf : add support for SentencePiece BPE tokenizer in T5Model (for Pile-T5 models)

* convert-hf : add MT5ForConditionalGeneration and UMT5ForConditionalGeneration to architectures supported by T5Model

* convert : add t5 tokenizer tests, use "slow" HF tokenizer for t5

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@fairydreaming
Collaborator

The third (and final) PR is now merged.
TODO some day: add support for encoder-decoder models in llama-server.

@ggerganov ggerganov moved this from In Progress to Done in ggml : roadmap Jul 4, 2024
@Sadeghi85

The third (and final) PR is now merged. TODO some day: add support for encoder-decoder models in llama-server.

@abetlen
@fairydreaming

Hi,

Does this code land in llama.dll? I ask because I use llama-cpp-python, which uses llama.dll.

Great job btw, thanks

@fairydreaming
Collaborator

Does this code land in llama.dll? I ask because I use llama-cpp-python, which uses llama.dll.

@Sadeghi85 I suppose the new code is there, but to use encoder-decoder models like T5 you have to use the new API functions: llama_model_has_encoder(), llama_encode(), llama_model_decoder_start_token(). So I think you will have to wait until the llama-cpp-python author (@abetlen) adds support for them.
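
For reference, the calls fit together roughly like this (a simplified sketch modelled on llama-cli; helpers such as llama_batch_get_one() have changed signatures across llama.cpp versions, so treat it as illustrative rather than exact):

// Sketch: generation with an encoder-decoder model (error handling and sampling omitted).
std::vector<llama_token> inp = tokens;  // tokenized prompt, assumed already available

if (llama_model_has_encoder(model)) {
    // Run the encoder once over the whole prompt; the encoder output is kept
    // inside the llama_context and consumed by the following decode calls.
    if (llama_encode(ctx, llama_batch_get_one(inp.data(), (int32_t) inp.size(), 0, 0)) != 0) {
        fprintf(stderr, "llama_encode() failed\n");
        return 1;
    }

    // The decoder starts from a dedicated start token (falls back to BOS if unset).
    llama_token start = llama_model_decoder_start_token(model);
    if (start == -1) {
        start = llama_token_bos(model);
    }

    // Seed the decoder with the start token, then generate with llama_decode()
    // plus sampling, exactly like a decoder-only model.
    inp.clear();
    inp.push_back(start);
}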

@0x4139
Contributor

0x4139 commented Oct 30, 2024

any news regarding llama-server support?

@Mihaiii
Contributor

Mihaiii commented Nov 3, 2024

any news regarding llama-server support?

@0x4139 I can help you with that. I left you a message on Facebook Messenger.

@kaismh

kaismh commented Nov 7, 2024

any news regarding llama-server support?

@fairydreaming
Collaborator

@0x4139 @kaismh I started working on the server support in my branch: https://github.com/fairydreaming/llama.cpp/tree/t5-server

If you are interested check it out and let me know if you have any problems.

@maurlco

maurlco commented Dec 12, 2024

Any news on llama-cpp-python, @abetlen? There is a real need for support of the encoder-decoder T5 code.
Thanks a lot @fairydreaming for integrating the T5 architecture!

@corysus

corysus commented Dec 14, 2024

@0x4139 @kaismh I started working on the server support in my branch: https://github.com/fairydreaming/llama.cpp/tree/t5-server

If you are interested check it out and let me know if you have any problems.

@fairydreaming I compiled your branch and started it with llama-server -m /models/flan-t5-small-q4_k_m.gguf -c 2048 --host 0.0.0.0
Then I tried to test it with the prompt: translate English to German: The house is wonderful. It sometimes works and sometimes doesn't, returning "|im_start|>assistant" and similar text.

I also tried a better T5 model for translation, madlad400-3b-mt-q4_k_m.gguf; it works perfectly with the CLI, but not with llama-server :( I'm not sure, but it's probably because of the chat template; I get this message when loading llama-server with the madlad400 model: main: The chat template that comes with this model is not yet supported, falling back to chatml. This may cause the model to output suboptimal responses.

@fairydreaming
Collaborator

@corysus I forgot to mention that I only tried the /completion API endpoint; I guess there's no point in using /v1/chat/completions, as T5-like models are not chat models and don't use chat templates. So if your frontend uses the OpenAI-compatible Chat Completions API (/v1/chat/completions), it most likely won't work correctly.
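
For a quick test you can hit /completion directly, e.g. (assuming llama-server is listening on the default port 8080):

curl --request POST --url http://localhost:8080/completion --header "Content-Type: application/json" --data '{"prompt": "translate English to German: The house is wonderful.", "n_predict": 64}'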

@corysus

corysus commented Dec 14, 2024

@fairydreaming Ahhh yes, I totally forgot to test the completion API endpoint and now I've tested it and the madlad400 model works very well. Thank you for this update.

@GodComplexxx

GodComplexxx commented Feb 12, 2025

@fairydreaming Did you encounter a maximum token length when sending to the server? I can't go over roughly 50 tokens even though the context is larger and there is plenty of memory; I get a CUDA error. Other models seem to work fine!

@fairydreaming
Collaborator

fairydreaming commented Feb 13, 2025

@GodComplexxx It seems to be a CUDA limitation: I use ggml_get_rows() to get the relative position bias, and the CUDA implementation of ggml_get_rows() fails when src1 is longer than 2^16 (which happens for batches longer than 256 tokens). My pal DeepSeek says that's because CUDA's maximum grid dimensions for the y and z axes are 65535 (2^16-1) on most GPUs (compute capability ≤ 6.x).
@JohannesGaessler can you confirm that?

@JohannesGaessler
Collaborator

Yes, the CUDA implementation for GET_ROWS has dst->ne[1] blocks in the y dimension so if that number is > 65535 the kernel launch fails.
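
To see why a batch of a few hundred tokens already hits this: the relative-position-bias lookup uses one index per (query, key) position pair, so the {258, 258} case discussed below needs 258 * 258 = 66564 rows, which is above the 65535 limit.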

@fairydreaming
Collaborator

Yes, the CUDA implementation for GET_ROWS has dst->ne[1] blocks in the y dimension so if that number is > 65535 the kernel launch fails.

@JohannesGaessler I've been trying to switch the implementation to use a 2D src1 to avoid this limitation, but I just can't wrap my head around how ggml_get_rows() is supposed to work for more than one src1 dimension. Why is there an assert GGML_ASSERT(a->ne[2] == b->ne[1]); in the implementation? Is there any documentation, or are there examples of this?

I want to do something quite intuitive: ggml_get_rows({32, 32, 1, 1}, {258, 258, 1, 1}) = {32, 258, 258, 1} instead of ggml_get_rows({32, 32, 1, 1}, {66564, 1, 1, 1}) = {32, 66564, 1, 1}.

@JohannesGaessler
Collaborator

Just swap dimensions x and y in the CUDA grid (so just swap the numbers in the dim3 for the number of blocks, and blockIdx.x and blockIdx.y in the code). The x dimension uses 32 instead of 16 bits, and I don't think we currently have any cases where the size of ne0 is a limiting factor.

IIRC GET_ROWS takes rows from src0 with indices defined in src1.

@fairydreaming
Collaborator

Just swap dimensions x and y in the CUDA grid (so just swap the numbers in the dim3 for the number of blocks, and blockIdx.x and blockIdx.y in the code). The x dimension uses 32 instead of 16 bits, and I don't think we currently have any cases where the size of ne0 is a limiting factor.

IIRC GET_ROWS takes rows from src0 with indices defined in src1.

I guess that's one possible solution, but I'd like to understand what's going on in the implementation.
I've looked at the ggml_get_rows() tests and they look like this:

        ggml_tensor * in = ggml_new_tensor_3d(ctx, type, n, m, b);
        ggml_set_name(in, "in");

        ggml_tensor * rows = ggml_new_tensor_2d(ctx, GGML_TYPE_I32, r, b);
        ggml_set_name(rows, "rows");

        ggml_tensor * out = ggml_get_rows(ctx, in, rows);
        ggml_set_name(out, "out");

So I guess the ne02 == ne11 assert is there because it's a "batch" dimension of some sort (so that each batch has its own set of rows and its own set of indices). With that in mind, I tried to use the ne12 dimension for the second dimension of my 2D indices. But in the CPU implementation (ggml_compute_forward_get_rows_f32, for simplicity) there is:

        ...
        ggml_vec_cpy_f32(nc,
                (float *) ((char *)  dst->data + i10*nb1  + i11*nb2  + i12*nb3),
                (float *) ((char *) src0->data + i01*nb01 + i11*nb02 + i12*nb03));
        ...

If I understand correctly, we copy whole rows of size nc here, so i01*nb01 moves the src0->data pointer to the current row to copy, i11*nb02 moves the pointer to the current "batch", and i12*nb03... I have no idea what the purpose of this is. I mean, i12 is calculated based on the src1 dimensions, so why is it also used to calculate the src0 pointer? (I think it moves the address outside the src0 size, and that's why my code doesn't work.)

@ggerganov Help?

@ggerganov
Member Author

So I guess the ne02 == ne11 assert is there because it's a "batch" dimension of some sort (so that each batch has its own set of rows and its own set of indices).

Yes, that's the intent.

i01*nb01 moves the src0->data pointer to the current row to copy,

Yes.

i11*nb02 moves the pointer to the current "batch", and i12*nb03...

This is currently treated as a "batch-of-batches" dimension, which misaligns with your intent of ggml_get_rows({32, 32, 1, 1}, {258, 258, 1, 1}) = {32, 258, 258, 1}. You should instead use ggml_get_rows({32, 32, 1, 1}, {66564, 1, 1, 1}) = {32, 66564, 1, 1} and then reshape the result to {32, 258, 258, 1} if that is what you need. But AFAIU the CUDA implementation has a limit on the number of rows. It seems like we should fix that in the CUDA implementation instead of changing the behaviour of the operator.
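
In ggml terms, the suggested get-then-reshape pattern looks roughly like this (a sketch; bias_table and the sizes stand in for the relative-position-bias table and index tensor from the example above):

// Sketch of the suggested approach (illustrative names and sizes).
// bias_table: [32, 32] table of relative position biases.
// idx:        [66564]  flattened I32 indices, one per (query, key) position pair.
struct ggml_tensor * idx  = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, 258*258);
struct ggml_tensor * rows = ggml_get_rows(ctx, bias_table, idx);        // -> [32, 66564]
struct ggml_tensor * bias = ggml_reshape_3d(ctx, rows, 32, 258, 258);   // -> [32, 258, 258]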

@fairydreaming
Collaborator

@GodComplexxx Try this patch with the changes that Johannes suggested; it seems to work:

diff --git a/ggml/src/ggml-cuda/getrows.cu b/ggml/src/ggml-cuda/getrows.cu
index 4c370323..dad1e15f 100644
--- a/ggml/src/ggml-cuda/getrows.cu
+++ b/ggml/src/ggml-cuda/getrows.cu
@@ -10,8 +10,8 @@ static __global__ void k_get_rows(
             /*size_t nb00,*/ size_t nb01, size_t nb02, size_t nb03,
             size_t s10, size_t s11, size_t s12/*, size_t s13*/) {
 
-    const int i00 = (blockIdx.x*blockDim.x + threadIdx.x)*2;
-    const int i10 = blockDim.y*blockIdx.y + threadIdx.y;
+    const int i00 = (blockIdx.y*blockDim.x + threadIdx.x)*2;
+    const int i10 = blockDim.y*blockIdx.x + threadIdx.y;
     const int i11 = (blockIdx.z*blockDim.z + threadIdx.z)/ne12;
     const int i12 = (blockIdx.z*blockDim.z + threadIdx.z)%ne12;
 
@@ -46,8 +46,8 @@ static __global__ void k_get_rows_float(
             /*size_t nb00,*/ size_t nb01, size_t nb02, size_t nb03,
             size_t s10, size_t s11, size_t s12/*, size_t s13*/) {
 
-    const int i00 = blockIdx.x*blockDim.x + threadIdx.x;
-    const int i10 = blockDim.y*blockIdx.y + threadIdx.y;
+    const int i00 = blockIdx.y*blockDim.x + threadIdx.x;
+    const int i10 = blockDim.y*blockIdx.x + threadIdx.y;
     const int i11 = (blockIdx.z*blockDim.z + threadIdx.z)/ne12;
     const int i12 = (blockIdx.z*blockDim.z + threadIdx.z)%ne12;
 
@@ -71,7 +71,7 @@ static void get_rows_cuda(const ggml_tensor * src0, const ggml_tensor * src1, gg
 
     const dim3 block_dims(CUDA_GET_ROWS_BLOCK_SIZE, 1, 1);
     const int block_num_x = (ne00 + 2*CUDA_GET_ROWS_BLOCK_SIZE - 1) / (2*CUDA_GET_ROWS_BLOCK_SIZE);
-    const dim3 block_nums(block_num_x, ne10, ne11*ne12);
+    const dim3 block_nums(ne10, block_num_x, ne11*ne12);
 
     // strides in elements
     //const size_t s0 = nb0 / ggml_element_size(dst);
@@ -105,7 +105,7 @@ static void get_rows_cuda_float(const ggml_tensor * src0, const ggml_tensor * sr
 
     const dim3 block_dims(CUDA_GET_ROWS_BLOCK_SIZE, 1, 1);
     const int block_num_x = (ne00 + CUDA_GET_ROWS_BLOCK_SIZE - 1) / CUDA_GET_ROWS_BLOCK_SIZE;
-    const dim3 block_nums(block_num_x, ne10, ne11*ne12);
+    const dim3 block_nums(ne10, block_num_x, ne11*ne12);
 
     // strides in elements
     //const size_t s0 = nb0 / ggml_element_size(dst);

Also, I wonder if there is any T5-based model capable of generating very long outputs; it would be useful for testing.

@GodComplexxx

Thanks, this worked perfectly!
