
How to use ggml for Flan-T5 #247

Closed
i-am-neo opened this issue Mar 17, 2023 · 36 comments
Labels: enhancement, generation quality, model, stale

Comments

@i-am-neo

@ggerganov Thanks for sharing llama.cpp. As usual, great work.

Question rather than issue. How difficult would it be to make ggml.c work for a Flan checkpoint, like T5-xl/UL2, then quantized?

Would love to be able to have those models run on a browser, much like what you did with whisper.cpp wasm.

Thanks again. (I can move this post somewhere else if you prefer since it's not technically about Llama. Just let me know where.)
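
For anyone curious about the shape of the work, a rough sketch of the export half — dumping the Flan-T5 tensors from the Hugging Face checkpoint into a flat binary. The layout below is purely illustrative, not the actual ggml file format, and the C-side loader and compute graph would still need to be written:

# Hypothetical export step: dump Flan-T5 weights into a simple binary layout.
# NOTE: illustrative only -- this is NOT the real ggml format.
import struct
import torch
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl")

with open("flan-t5-xl.bin", "wb") as fout:
    for name, tensor in model.state_dict().items():
        data = tensor.to(torch.float16).numpy()
        name_bytes = name.encode("utf-8")
        # per-tensor header: name length, number of dims, the dims, then the name
        fout.write(struct.pack("ii", len(name_bytes), data.ndim))
        fout.write(struct.pack("i" * data.ndim, *data.shape))
        fout.write(name_bytes)
        data.tofile(fout)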

@alexconstant9108

@i-am-neo just curious, have you found T5-xl/UL2 to be in any way superior to FB's llama models (even the smallest 7b model) for certain tasks? IIRC T5-xxl used to be the best open source model before llama, but now...?

@michaelbogdan

Since LLaMA wasn't published under a FOSS license and explicitly does not allow commercial use, it shouldn't be considered an open source model.

@gjmulder added the model, generation quality, and enhancement labels on Mar 19, 2023
@i-am-neo
Author

@i-am-neo just curious, have you found T5-xl/UL2 to be in any way superior to FB's llama models (even the smallest 7b model) for certain tasks? IIRC T5-xxl used to be the best open source model before llama, but now...?

@alexconstant9108 I have found Flan-T5 performant when one needs accurate answers to questions (no inventions allowed). This is from real-life data, details disguised for privacy. Flan-T5 was used in its recipe.

Flan-UL2 looks to be more "fluent"/expressive than Flan-T5, but I've just started to look.

I won't comment on text generation nor "general knowledge queries" since I don't use these models to write novels, nor do I care if they know about Batman and Robin.

What I found beautiful about Stanford Alpaca was that a 7B autoregressive model can be coaxed to follow instructions. The Llama paper mentions "Llama-I", which compared favorably to other instructed models at around 65B parameters. As to how well Alpaca does at 7B, we'll have to see (I've asked the Stanford team to run stats, and hopefully they're willing to respond).

What I find beautiful about the idea of llama.cpp or the possibility of a Flan-T5.cpp is running a model locally, hopefully at some point on the browser. If one could shrink Flan-UL2 down to xB parameters and it still performs well... (Leche Flan anyone?).

I'm of the thinking that a sequence-to-sequence model and an autoregressive one are each better harnessed for what they do best, depending on what one needs to accomplish, rather than pitted against each other in a one-on-one comparison.

In the end, as our field evolves, we all need to ask the question touched on in the Llama paper, italics mine - "The focus of this work is to train a series of language models that achieve the best possible performance at various inference budgets, by ..."
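
For concreteness, the context-grounded question answering I have in mind looks roughly like this with the standard transformers API (a minimal sketch; the prompt wording is illustrative, not the exact recipe used on the real data):

# Minimal sketch of context-grounded QA with Flan-T5 (prompt wording illustrative).
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl")

context = "..."   # the source text the answer must come from
question = "..."  # the user's question

prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context: {context}\n\nQuestion: {question}"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))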

@alexconstant9108

alexconstant9108 commented Mar 20, 2023

@i-am-neo
Very interesting. So in that setup, did you fine-tune Flan-T5 on your own dataset so that it can answer those questions accurately? Also, can you give any example question/answer pairs (with any sensitive data masked for privacy reasons, of course)?

There is an alpaca quantized version available for download somewhere on github if you look close enough :)

BTW, regarding running flan-t5 (and llama) locally (not in the browser, though), you will be very excited to try https://bellard.org/ts_server/. It supports both CPU and GPU inference.

@MarkSchmidty

MarkSchmidty commented Mar 21, 2023

Since people in this thread are interested in Instruct models, I recommend checking out chatGLM-6B.

I believe it is more capable than Flan-UL2 in just 6B parameters. I have a one-click web demo of the 4bit GPU version here: Launch In Colab

It's not getting much attention due to being the product of a Chinese University. But in my testing it outperforms every other open source instruct model while coming in at just 4GB, even smaller than Alpaca-7B.

Full repo is here: https://github.com/THUDM/ChatGLM-6B/blob/main/README_en.md
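
If it helps anyone get started, usage of the int4 checkpoint is roughly the following (from memory of their README; double-check the repo for the exact invocation):

# Roughly the ChatGLM-6B int4 usage from the repo's README (check the repo for exact details).
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b-int4", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4", trust_remote_code=True).half().cuda()
model = model.eval()

# The custom modeling code exposes a chat() helper that carries the conversation history.
response, history = model.chat(tokenizer, "Hello, what can you do?", history=[])
print(response)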

@alexconstant9108

alexconstant9108 commented Mar 21, 2023

@MarkSchmidty useful reference. Thanks
From what I have observed GLM is mostly ignored due to it being weaker with English prompts. But it may turn out to be better when conversing in Chinese for those who can speak it...
If you could provide a small comparison table of English prompts when using Alpaca, Flan-UL2 and GLM-6B, that would open the eyes of many people to that particular Instruct model.

@MarkSchmidty

@MarkSchmidty useful reference. Thanks From what I have observed GLM is mostly ignored due to it being weaker with English prompts. But it may turn out to be better when conversing in Chinese for those who can speak it...

I have not had this experience. It seems excellent for English.
It is mostly ignored because the team behind it does very little to market it outside of China or to non-Chinese speakers.

> If you could provide a small comparison table of English prompts when using Alpaca, Flan-UL2 and GLM-6B, that would open the eyes of many people to that particular Instruct model.
It takes 30 seconds to launch and test my free web demo linked above.

@alexconstant9108

alexconstant9108 commented Mar 21, 2023

But in my testing it outperforms every other open source instruct model while coming in at just 4GB, even smaller than Alpaca-7B.

BTW, the readme says:

Quantization Level        GPU Memory
FP16 (no quantization)    13 GB
INT8                      10 GB
INT4                      6 GB

How did you manage to shrink it to 4GB? Also, is it using GPTQ or RTN quantization for 4-bit?
Edit: I saw that in another place the authors mention 5.2GB of CPU memory
and GPTQ quantization. It would help if their docs had more English examples. Right now, it seems that they are targeting mostly the Chinese audience, which is in line with what you said regarding how they "market" it...

@MarkSchmidty

That is the GPU memory required to run inference, not the model size.

The official int4 model is 4.06GB on HuggingFace before any pruning.

> It would help if their docs had more English examples. Right now, it seems that they are targeting mostly the Chinese audience, which is in line with what you said regarding how they "market" it...

Yes, that's why I'm trying to "market" it to English speakers who want instruct models right now. If more people dig into it, then we can collectively work out answers to questions like these. I actually have an all-English fork in the works which I will publish soon.

@alexconstant9108

alexconstant9108 commented Mar 21, 2023

I actually have an all-English fork in the works which I will publish soon.

Great! You seem to be moving really fast! They only published the 4-bit model 2 days ago :)
Hopefully, we'll also see support for GLM-6B within llama.cpp or end up with a glm.cpp fork :)
One more thing - please also include more info about what datasets the model has been trained (and RLHF fine-tuned) on (was it similar to LLaMA, e.g. on the Pile, etc.).
Usually, with software / models coming out of China, people are wary because of any potential (ab)use of party propaganda materials / censorship and so on.

@i-am-neo
Author

@i-am-neo Very interesting. So in that setup, did you fine-tune Flan-T5 on your own dataset so that it can answer those questions accurately? Also, can you give any example question/answer pairs (with any sensitive data masked for privacy reasons, of course)?

There is an alpaca quantized version available for download somewhere on github if you look close enough :)

BTW, regarding running flan-t5 (and llama) locally (not in the browser, though), you will be very excited to try https://bellard.org/ts_server/. It supports both CPU and GPU inference.

@alexconstant9108 Thanks for sharing about bellard.org. Very good to know!

Re Flan-T5, no special fine-tuning was done in this case. Unfortunately, I can't expose question-answer pairs, as that would violate user privacy (plus it takes a lot of work to disguise details). I think you'd feel similarly if it were your data. You could, if so inclined, try it for yourself, though hopefully because you have a real use case rather than just wanting to kick the tires (limited resources). (ggerganov/whisper.cpp#625)

And yes, I saw those Alpaca weights floating around, thanks. :)

@i-am-neo
Author

@MarkSchmidty useful reference. Thanks From what I have observed GLM is mostly ignored due to it being weaker with English prompts. But it may turn out to be better when conversing in Chinese for those who can speak it...

I have not had this experience. It seems excellent for English. It is mostly ignored because the team behind it does very little to market it outside of China or to non-Chinese speakers.

If you could provide a small comparison table of English prompts when using Alpaca, Flan-UL2 and GLM-6B, that would open the eyes of many people to that particular Instruct model.

It takes 30 seconds to launch and test my free web demo linked above.

@MarkSchmidty Thanks for sharing. I was excited enough to try it right away (both using your notebook and another Gradio demo off HF). The model doesn't seem as adept at answering questions drawn from the context I give it. Maybe it's the way I gave prompts? In your tests, how did you ask it to answer from specific text?

@i-am-neo
Author

i-am-neo commented Mar 22, 2023

@MarkSchmidty But! chatglm seems quite fluent. Great to see. P.S. You probably also know this - it can't be used for commercial purposes.

Prompt:

Write a paragraph about the most beautiful sunset you have ever seen.

Response:

The most beautiful sunset I have ever seen was in a remote corner of the world.� It was a moment of pure magic and wonder, a vision of beauty and tranquility. The sky above was a rich, vibrant pink and orange, a vibrant shade that could be seen even from a great distance. The sun was setting over the horizon, casting long shadows across the landscape, and the sound of the wind rustling through the leaves could be heard clearly.逄

The setting sun was the perfect contrast to the rest of the day.

@TokenBender

Since people in this thread are interested in Instruct models, I recommend checking out chatGLM-6B.

I believe it is more capable than Flan-UL2 in just 6B parameters. I have a one-click web demo of the 4bit GPU version here: Launch In Colab

It's not getting much attention due to being the product of a Chinese University. But in my testing it outperforms every other open source instruct model while coming in at just 4GB, even smaller than Alpaca-7B.

Full repo is here: https://github.com/THUDM/ChatGLM-6B/blob/main/README_en.md

Hi,

I tried to run your colab demo locally on an M2 Pro and it failed with the error below:

File "/opt/anaconda3/lib/python3.9/site-packages/cpm_kernels/library/nvrtc.py", line 5, in <module>
    nvrtc = Lib("nvrtc")
File "/opt/anaconda3/lib/python3.9/site-packages/cpm_kernels/library/base.py", line 59, in __init__
    raise RuntimeError("Unknown platform: %s" % sys.platform)
RuntimeError: Unknown platform: darwin

Considering the RAM requirements are so low, I supposed it could work on the M2 directly, but it looks like there is still something that needs to be changed for it?

@MarkSchmidty

The Colab demo is meant to run on a free Google Colab GPU, not on a local runtime (and definitely not on CPU).

If you want to run chatGLM on a local CPU you should follow the instructions in the official chatGLM repository for running on CPU. It is very fast on GPU but very slow on CPU currently.
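
For reference, the CPU path in their README is (roughly, from memory; see the official repo for the exact instructions and memory requirements) just loading the model in float32 instead of half precision on CUDA:

# Approximate CPU setup from the ChatGLM-6B README (unquantized, so RAM-hungry and slow).
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).float()
model = model.eval()

response, _ = model.chat(tokenizer, "Hello", history=[])
print(response)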

@iliemihai92

To be clear: Flan-UL2 is the best open source model trained on an instruction dataset. Other models such as LLaMA or ChatGLM are only open for research. ChatGLM may be worse at instruction tuning due to its Chinese vocabulary.

@MarkSchmidty

MarkSchmidty commented Mar 27, 2023

I get awful results with Flan-UL2. Its responses tend to be extremely short and it hallucinates more than most models when it doesn't know something. I have had no issues with chatGLM's English abilities.

But for fully open source models, I have had good results with the newer OpenAssistant SFT-1: https://huggingface.co/OpenAssistant/oasst-sft-1-pythia-12b (Note: the HuggingFace inference API version is broken and returns results nothing like what the actual model does in practice.)

@i-am-neo
Author

Maybe we can all share the prompts we're talking about so our baselines for comparison aren't moving around. I have a feeling some folks here are trying models for text generation vs. question answering, which are different use cases...

@MarkSchmidty

MarkSchmidty commented Mar 27, 2023

Here's an example of a question to Flan-UL2 where it is both wrong and characteristically short, even when asked to explain. (Gears 1 and 6 spin in opposite directions, as they are odd and even numbered gears respectively.) This shortness is highly typical of my experiences with UL2.
image

@ghost

ghost commented Mar 27, 2023 via email

@i-am-neo
Author

@MarkSchmidty Since I don't know much about gears... Also I believe the Flan models activate their "explain" reasoning via "Let's think step by step."

image
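
Concretely, the pattern is just appending that phrase to the prompt, something like the sketch below (model name and wording illustrative):

# Sketch: zero-shot "let's think step by step" prompting with a Flan model.
from transformers import pipeline

generate = pipeline("text2text-generation", model="google/flan-t5-xl")
prompt = (
    "If gear 1 turns clockwise, in which direction does gear 6 turn? "
    "Let's think step by step."
)
print(generate(prompt, max_new_tokens=128)[0]["generated_text"])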

@av

av commented Apr 8, 2023

Hey, @i-am-neo 👋🏻
Sorry to bother you, but I'm curious whether you've had a chance to discover anything interesting regarding conversion of the T5 family of models to ggml since the last activity in this thread.
Thank you!

@i-am-neo
Author

Hi @av! I haven't yet. Want to share something you're working on?

@av

av commented Apr 10, 2023

Hi @av! I haven't yet. Want to share something you're working on?

Nothing too specific, but looking for a way to reduce runtime costs for the T5 family with the same context from your message above:

I have found Flan-T5 performant when one needs accurate answers to questions (no inventions allowed). This is from real-life data, details disguised for privacy. Flan-T5 was used in its recipe.

I tried to follow the ggml examples, only to discover later that there's an open branch for T5 integration. Unfortunately, finishing it is out of reach for my level of C++ (and my understanding of LLM inference mechanics, to be fair).

Not sure if the following is of any interest to you, but I also tried running it with ONNX (only a minor boost for small batches, worse than PyTorch), TensorRT (the setup process is quite demanding and, unfortunately, completely excludes CPU inference), and CTranslate2 (a better boost than ONNX for CPU inference). Currently leaning towards Flan T5 Alpaca robo-camelid; a rough sketch of the CTranslate2 route is below. There's also an Alpaca fine-tuned UL2 version, but I haven't tried it yet.
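
In case it saves someone time, the CTranslate2 route looks roughly like this (API details from memory, worth double-checking their docs; the converted model directory name is just an example):

# Rough sketch of CPU inference for Flan-T5 via CTranslate2, after converting the
# checkpoint (e.g. with their ct2-transformers-converter tool, int8 quantization).
import ctranslate2
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("google/flan-t5-xl")
translator = ctranslate2.Translator("flan-t5-xl-ct2", device="cpu")  # path to converted model

prompt = "Answer the question using the context. Context: ... Question: ..."
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
results = translator.translate_batch([tokens])
output_ids = tokenizer.convert_tokens_to_ids(results[0].hypotheses[0])
print(tokenizer.decode(output_ids, skip_special_tokens=True))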

@i-am-neo
Author

@av Thanks for sharing.

Nothing too specific, but looking for a way to reduce runtime costs for the T5 family with the same context from your message above:

It may be some time before we're able to run a good LLM in the browser, as even whisper.wasm can only run its small model (244M parameters) in 1 GB of memory.
Or is your goal reducing inference time?

I am, however, able to run Flan-T5 inference without a GPU. Not using quantization, as I've seen odd results with LLM.int8.

I'm not sold on the results of Flan-T5 finetuned on Alpaca data.
With declare-lab/flan-alpaca-xl:

Screen Shot 2023-04-10 at 10 56 37 AM

With quantized 8-bit https://huggingface.co/spaces/joaogante/transformers_streaming:

Screen Shot 2023-04-10 at 10 58 20 AM

Plain vanilla Flan-T5-xl:

Screen Shot 2023-04-10 at 11 03 39 AM
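
For anyone who wants to reproduce the comparison, the two setups are roughly the plain full-precision load versus the bitsandbytes LLM.int8() load (a sketch; flags as in recent transformers releases):

# Plain fp32 load -- runs on CPU without a GPU (slow, but the results I trust):
from transformers import T5ForConditionalGeneration

model_fp32 = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl")

# LLM.int8() load via bitsandbytes -- needs a GPU; this is the variant where
# I've seen odd generations:
model_int8 = T5ForConditionalGeneration.from_pretrained(
    "google/flan-t5-xl", device_map="auto", load_in_8bit=True
)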

@ghost

ghost commented Apr 11, 2023

I am also restricted by the license, and prefer truly open source models like flan-t5.

In my case, I'm particularly interested in question answering with context. I find flan-t5 provides extractive answers, probably because the prompt triggers its instruction tuning on the SQuAD dataset. I am looking for more abstractive answers.

@av @i-am-neo Do you guys have any suggestions? Thank you.

@i-am-neo
Author

@jasontian6666 say more about what you mean by "question answering with context." Context from where? your own data? from the web? Maybe share some examples.

@ghost

ghost commented Apr 12, 2023

@jasontian6666 say more about what you mean by "question answering with context." Context from where? your own data? from the web? Maybe share some examples.

Just typical open-book QA. You provide a question and context, and the model generates an answer.
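
To make the extractive vs. abstractive distinction concrete, the two prompt framings look roughly like this (wording illustrative):

# Illustrative prompt styles for open-book QA with Flan-T5.
context = "..."   # the passage the model should read
question = "..."

# SQuAD-style framing -- tends to come back as a short, extractive span:
extractive_prompt = (
    f"Read the passage and answer the question.\n"
    f"Passage: {context}\nQuestion: {question}"
)

# Instruction framing that asks for a rewrite -- nudges the model toward a more
# abstractive, full-sentence answer:
abstractive_prompt = (
    f"Using the context below, answer the question in your own words, "
    f"in one or two complete sentences.\nContext: {context}\nQuestion: {question}"
)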

@i-am-neo
Author

I hear you @jasontian6666. I don't quite grasp what you meant by "with context" - just guessing, but maybe you want the answers paraphrased?

If so, try Flan-UL2. It's more fluent than the Flan-T5 series.

@turian

turian commented Dec 31, 2023

I am also curious if there are plans to include flan-t5 in llama.cpp

@ggerganov
Member

Definitely, though it is not high prio atm, so will hope for a community contribution in the meantime

@jorirsan

jorirsan commented Feb 18, 2024

Going to comment that, not T5 specifically, but more support for the available encoder-decoder models in llama.cpp would be amazing. From what I've seen, enc-dec support in most optimized model inference projects seems almost non-existent. It is really a shame, since enc-dec models currently tend to offer the best performance in tasks such as machine translation and automatic speech recognition.

@maziyarpanahi

Speaking of T5 support, the Aya model is based on mT5 and it's trending now. It's 13B, so it does require quantization.

Would be great if we could support it now that the T5/mT5 architectures are interesting again for text generation (they were before, but they're now trending next to Llama-2 and Mistral): https://huggingface.co/CohereForAI/aya-101/discussions/9

@ggerganov
Member

In terms of API, how do these models work?

I imagine something like:

prompt_enc = llama_encode(prompt_text);

while (true) {
    token = llama_decode(prompt_enc);
    
    ? update prompt_enc with sampled token ?
}

Does that make sense? It would be helpful to get a short summary of how these models differ from decoder-only models.
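
For reference, here is roughly how the flow looks for a T5-style model in Hugging Face transformers: the encoder runs exactly once over the prompt, and the decoder is then sampled token by token while cross-attending to the fixed encoder output, so prompt_enc never changes and only the decoder-side tokens grow (a greedy-decoding sketch, no KV cache):

# Sketch of encoder-decoder inference with a T5-style model (greedy, no KV cache).
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl")

enc_inputs = tokenizer("Translate to German: The house is wonderful.", return_tensors="pt")

# 1) The encoder runs once over the whole prompt.
encoder_outputs = model.get_encoder()(**enc_inputs)

# 2) The decoder starts from the decoder start token and grows autoregressively,
#    cross-attending to the fixed encoder states; the prompt is never re-fed.
decoder_ids = torch.tensor([[model.config.decoder_start_token_id]])
for _ in range(64):
    logits = model(
        encoder_outputs=encoder_outputs,
        attention_mask=enc_inputs["attention_mask"],
        decoder_input_ids=decoder_ids,
    ).logits
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
    decoder_ids = torch.cat([decoder_ids, next_token], dim=-1)
    if next_token.item() == model.config.eos_token_id:
        break

print(tokenizer.decode(decoder_ids[0], skip_special_tokens=True))

So in API terms, the result of llama_encode would stay fixed, and only the decoder's token sequence (plus its KV cache) is updated between sampling steps.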

@lukestanley

lukestanley commented Feb 28, 2024

Sorry if I misunderstood your request, but encoder-decoder models perform "translations", like Whisper, or Google Translate. They can do summarisation, Q&A, and even completion too, if so trained.

Encoder-decoder models might be exposed to an end user with a similar GUI. Rather than it being a completion task, it's all about one specific input, and one specific output. That is a separate thing, not a continuation. It's "translate" vs "autocomplete" (even though they can do completion, it's not done as a continuation).

The code you sketched seems to be a completion, which is quite different. The API would probably need to be a bit different to support these sorts of models.

Some encoder-decoder models are inherently less at risk of instruction prompt injection and can more easily be made to run really fast on mobile devices because of their constrained task focus. Both are really important features for solving problems quickly and reliably.

I imagine it would be more like this:

prompt_enc = llama_encode(prompt_text);

// Empty string to collect output tokens
result_output = "";

for (int i = 0; i < max_output_length; ++i) {
    string token = llama_decode_step(prompt_enc, result_output);
    result_output += token;
    if (token == END_TOKEN) {  
        break;
    }
}

Encoder-Decoder models in llama.cpp could be amazing.
I hope this comment is useful. Much respect - I'm using llama.cpp all the time, you're a hero, @ggerganov!


This issue was closed because it has been inactive for 14 days since being marked as stale.
