llama : add T5 (encoder-decoder) support #5763
@ggerganov Does this mean llama.cpp could support something like the new GritLM model which can handle both text representations and text generation? I tried the embedding sample with gritlm but the resulting embeddings don't look right. Some references: |
The issue is about a different architecture (encoder + decoder). GritLM looks like a decoder-only Mistral fine-tune, so it should already work. If you think the results are not OK, you can open an issue with steps to reproduce |
I am looking forward to this. |
@dranger003 Probably that's because GritLM uses 2 prompt templates: one is used only for text generation and one only for embedding. Can you try embedding with the template specified by the author? Feel free to open a dedicated issue to discuss in detail. |
T5 support would be truly awesome, expanding opportunities for numerous enterprise use cases. |
This issue was closed because it has been inactive for 14 days since being marked as stale. |
Any update on this issue? |
Any news on this one? :) |
I'm also waiting for T5 (i.e. encoder-decoder) support in llama.cpp. Why? Because I could not find any embeddable (C++, C or Rust) T5 implementation with KV cache, out-of-the-box quantization and grammar support. I wish I could help with the development, but this is currently out of my league. 🥲 |
Would also be nice for fast image-generation prompt encoding, as in PixArt and the soon-upcoming Stable Diffusion 3 on the 12th of June (as they utilize T5). |
I have T5 working in llama.cpp, but the code needs to be cleaned up and it still uses an additional header file (darts.h - Double-ARray Trie System, MIT license) needed by the Unigram tokenizer implementation. The git diff is 2.5k lines long ;_; |
What functionality does darts.h provide? |
@ggerganov It's a C++ header-only trie implementation. Currently it's used in three places:
While 1 and 3 could be replaced with a naive string search, for 2 the trie is created based on precompiled_charsmap from the SentencePiece tokenizer model. It's basically a binary blob containing pre-tokenization normalization rules. Some information about it is here. I didn't examine it in detail, so I'm not sure yet whether the normalization rules can be applied without using the trie. |
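For readers unfamiliar with what the trie buys here: the normalization step repeatedly looks for the longest rule key matching at the current position and substitutes its replacement. Below is a rough C++ sketch of that longest-match loop, using a std::map with made-up rules as a stand-in for the Double-Array trie and for the real precompiled_charsmap format (both are assumptions for illustration only, not the actual tokenizer code).

```cpp
// Longest-match normalization sketch (assumes a UTF-8 build).
#include <iostream>
#include <map>
#include <string>

// Return the longest rule key that is a prefix of text[pos..], or "" if none.
// A trie such as darts.h answers this query without scanning every rule.
static std::string longest_match(const std::map<std::string, std::string> & rules,
                                 const std::string & text, size_t pos) {
    std::string best;
    for (const auto & [key, value] : rules) {
        if (key.size() > best.size() && text.compare(pos, key.size(), key) == 0) {
            best = key;
        }
    }
    return best;
}

int main() {
    // Hypothetical normalization rules (stand-in for precompiled_charsmap entries).
    const std::map<std::string, std::string> rules = {
        {"\u00A0", " "},   // non-breaking space -> plain space
        {"\uFF41", "a"},   // fullwidth 'a' -> 'a'
    };
    const std::string text = "a\u00A0b";
    std::string out;
    for (size_t i = 0; i < text.size(); ) {
        const std::string key = longest_match(rules, text, i);
        if (!key.empty()) { out += rules.at(key); i += key.size(); }
        else              { out += text[i];       i += 1;          }
    }
    std::cout << out << "\n";   // prints "a b"
    return 0;
}
```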
Things are going better than expected - I managed to get rid of the darts.h dependency. |
I added a branch with my T5 implementation: https://github.com/fairydreaming/llama.cpp/tree/t5
shall result in:
I tried T5-small, T5-base and T5-large, and they all seem to work OK. I also compared layer outputs of T5-small with the transformers implementation; they look the same. Edit: forgot to mention that tests/test-c.c currently doesn't compile in the branch since I added some default argument values in headers. This is normal. ;) |
Very cool! I'm wondering about the extended |
@ggerganov Good advice, I did that and it definitely simplified things, also added is_encoding flag in the context to avoid passing additional parameters. I still need to research how batches work to properly support that. |
…milies (ggml-org#5763)
* llama : add T5 model architecture, tensors and model header parameters
* llama : add implementation of Unigram tokenizer with SentencePiece-like text normalization using precompiled charsmap
---------
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
* llama : add inference support and model types for T5 and FLAN-T5 model families
* llama : add new API functions to support encoder-decoder models: llama_encode(), llama_model_has_encoder(), llama_model_decoder_start_token()
* common, llama-cli, llama-batched : add support for encoder-decoder models
* convert-hf : handle shared token embeddings tensors in T5Model
* convert-hf : add support for SentencePiece BPE tokenizer in T5Model (for Pile-T5 models)
* convert-hf : add MT5ForConditionalGeneration and UMT5ForConditionalGeneration to architectures supported by T5Model
* convert : add t5 tokenizer tests, use "slow" HF tokenizer for t5
---------
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
The third (and final) PR is now merged. |
Hi, does this code land in llama.dll? I ask because I use llama-cpp-python, which uses llama.dll. Great job btw, thanks |
@Sadeghi85 I suppose the new code is there, but to use encoder-decoder models like T5 you have to use the new API functions: llama_model_has_encoder(), llama_encode(), llama_model_decoder_start_token(). So I think you have to wait until the llama-cpp-python author (@abetlen) adds support for this. |
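For anyone wiring this up through the C API, here is a minimal sketch of the encode-then-decode flow built around llama_model_has_encoder(), llama_encode() and llama_model_decoder_start_token(). The model path, prompt and buffer sizes are placeholders, exact helper signatures (llama_tokenize, llama_token_bos, ...) differ between llama.cpp versions, and sampling is reduced to a greedy argmax, so treat it as an outline rather than a drop-in program.

```cpp
#include "llama.h"
#include <cstdio>
#include <string>
#include <vector>

int main() {
    llama_backend_init();

    llama_model   * model = llama_load_model_from_file("t5-small.gguf", llama_model_default_params());
    llama_context * ctx   = llama_new_context_with_model(model, llama_context_default_params());

    // T5 expects a task prefix in the prompt.
    const std::string prompt = "translate English to German: The house is wonderful.";
    std::vector<llama_token> inp(prompt.size() + 8);
    const int n_inp = llama_tokenize(model, prompt.c_str(), (int32_t) prompt.size(),
                                     inp.data(), (int32_t) inp.size(),
                                     /*add_special=*/true, /*parse_special=*/false);

    llama_batch batch = llama_batch_init(512, 0, 1);  // assumes the prompt fits in 512 tokens

    // 1) Run the encoder once over the whole prompt.
    if (llama_model_has_encoder(model)) {
        batch.n_tokens = n_inp;
        for (int i = 0; i < n_inp; ++i) {
            batch.token[i]     = inp[i];
            batch.pos[i]       = i;
            batch.n_seq_id[i]  = 1;
            batch.seq_id[i][0] = 0;
            batch.logits[i]    = false;
        }
        llama_encode(ctx, batch);
    }

    // 2) Seed the decoder with the decoder start token (fall back to BOS if unset).
    llama_token tok = llama_model_decoder_start_token(model);
    if (tok == -1) {
        tok = llama_token_bos(model);
    }

    // 3) Greedy decode loop; a real program would use the sampling API instead of argmax.
    const int n_vocab = llama_n_vocab(model);
    for (int pos = 0; pos < 64; ++pos) {
        batch.n_tokens     = 1;
        batch.token[0]     = tok;
        batch.pos[0]       = pos;
        batch.n_seq_id[0]  = 1;
        batch.seq_id[0][0] = 0;
        batch.logits[0]    = true;
        llama_decode(ctx, batch);

        const float * logits = llama_get_logits_ith(ctx, 0);
        tok = 0;
        for (llama_token t = 1; t < n_vocab; ++t) {
            if (logits[t] > logits[tok]) tok = t;
        }
        if (tok == llama_token_eos(model)) break;
        printf("%d ", tok);  // token ids only; detokenization omitted to keep the sketch short
    }
    printf("\n");

    llama_batch_free(batch);
    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```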
any news regarding llama-server support? |
@0x4139 I can help you with that. I left you a message on Facebook Messenger. |
any news regarding llama-server support? |
@0x4139 @kaismh I started working on the server support in my branch: https://github.com/fairydreaming/llama.cpp/tree/t5-server. If you are interested, check it out and let me know if you have any problems. |
Any news on llama-cpp-python, @abetlen? There's a real need for support of the encoder-decoder T5 code. |
@fairydreaming I compiled your branch and started it with I also try to use a better T5 model for translation, |
@corysus I forgot to mention that I only tried the /completion API endpoint. I guess there's no point in using /v1/chat/completions, as T5-like models are not chat models and don't use chat templates. So if your frontend uses the OpenAI-compatible Chat Completions API (/v1/chat/completions), then it most likely won't work correctly. |
@fairydreaming Ahhh yes, I totally forgot to test the completion API endpoint. Now I've tested it and the |
@fairydreaming Did you encounter a maximum token length when sending to the server? I can't go over about 50 tokens even though the context is larger and there's plenty of memory; I get a CUDA error. Other models seem to work fine! |
@GodComplexxx It seems to be a CUDA limitation. I use ggml_get_rows() to get the relative position bias, and the CUDA implementation of ggml_get_rows() fails when src1 is longer than 2^16 (that happens for batches longer than 256 tokens). My pal DeepSeek says that's because CUDA's maximum grid dimensions for the y and z axes are 65535 (2^16-1) on most GPUs (compute capability ≤ 6.x). |
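To make the arithmetic concrete: the relative position bias needs one row lookup per (query, key) token pair, so an n-token batch asks ggml_get_rows() for n*n rows. A tiny stand-alone check (not ggml code; 65535 and 2^31-1 are the standard CUDA grid maxima for the y/z and x dimensions) shows where the 258-token example discussed further down crosses the limit:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    const int64_t max_grid_yz = 65535;       // y and z grid dimensions
    const int64_t max_grid_x  = 2147483647;  // x grid dimension (2^31 - 1)

    for (int64_t n_tokens : {255, 258, 512}) {
        const int64_t n_rows = n_tokens * n_tokens;  // one bias lookup per (query, key) pair
        printf("n_tokens = %3lld -> %6lld rows: grid.y/z %-10s grid.x %s\n",
               (long long) n_tokens, (long long) n_rows,
               n_rows <= max_grid_yz ? "ok," : "too large,",
               n_rows <= max_grid_x  ? "ok"  : "too large");
    }
    // Once a batch needs more than 65535 row lookups, the row count has to be
    // mapped to grid.x (as suggested in the replies below) or split differently.
    return 0;
}
```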
Yes, the CUDA implementation for |
@JohannesGaessler I've been trying to switch the implementation to use 2D src1 to avoid this limitation, but I just can't wrap my head around how ggml_get_rows() is supposed to work for more than 1 src1 dimension. Why is there an assert: I want to do something quite intuitive: ggml_get_rows({32, 32, 1, 1}, {258, 258, 1, 1}) = {32, 258, 258, 1} instead of ggml_get_rows({32, 32, 1, 1}, {66564, 1, 1, 1}) = {32, 66564, 1, 1}. |
Just swap dimensions x and y in the CUDA grid (so just swap the numbers in the IIRC |
I guess that's one possible solution, but I'd like to understand what's going on in the implementation.
So I guess the ne02 == ne11 assert is there because it's a "batch" dimension of some sort (so that each batch has its own set of rows and its own set of indices). Considering this, I tried to use the ne12 dimension for my second dimension of 2D indices. But in the CPU implementation (
If I understand correctly, we copy whole rows of size nc here, so @ggerganov Help? |
Yes, that's the intent.
Yes.
This is currently treated as a "batch-of-batches" dimension. This misaligns with your intent of |
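As a rough illustration of the shapes being discussed (using the 32-bucket bias table and the 258-token batch from the question above, and assuming the current ggml_get_rows() behaviour where dst = {a->ne[0], b->ne[0], b->ne[1], b->ne[2]} and ne02 must equal ne11), the flattened 1D-index form currently used looks like this:

```cpp
// Minimal ggml snippet showing the flattened form used for the T5 position bias.
// Sizes are taken from the example above; this only builds the tensors, it does
// not run a compute graph.
#include "ggml.h"

int main() {
    struct ggml_init_params params = {
        /* mem_size   */ 32u * 1024 * 1024,
        /* mem_buffer */ NULL,
        /* no_alloc   */ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // Relative position bias table: rows of 32 values, 32 buckets.
    struct ggml_tensor * bias    = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 32, 32);
    // Flattened indices for a 258-token batch: 258*258 = 66564 lookups.
    struct ggml_tensor * indices = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, 258*258);

    // Result shape is {32, 66564, 1, 1}; a 2D index tensor {258, 258} would
    // instead trip the ne02 == ne11 "batch" assert discussed above.
    struct ggml_tensor * dst = ggml_get_rows(ctx, bias, indices);
    (void) dst;

    ggml_free(ctx);
    return 0;
}
```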
@GodComplexxx Try this patch with the changes that Johannes suggested; it seems to work:
Also I wonder if there's any T5-based model that is capable of generating very long outputs - would be useful for testing. |
Thanks this worked perfectly! |
Still not familiar with the details, but it seems it would be useful to support this architecture in llama.cpp. First, need to decide on the API and see what changes would be necessary. See discussion here: #247