
How to use ggml for Flan-T5 #247

Closed
i-am-neo opened this issue Mar 17, 2023 · 36 comments
Labels: enhancement, generation quality, model, stale

Comments

@i-am-neo

@ggerganov Thanks for sharing llama.cpp. As usual, great work.

Question rather than issue. How difficult would it be to make ggml.c work for a Flan checkpoint, like T5-xl/UL2, then quantized?

Would love to be able to have those models run on a browser, much like what you did with whisper.cpp wasm.

Thanks again. (I can move this post somewhere else if you prefer since it's not technically about Llama. Just let me know where.)
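
For anyone curious about the shape of the work, a rough sketch of the export half — dumping the Flan-T5 tensors from the Hugging Face checkpoint into a flat binary. The layout below is purely illustrative, not the actual ggml file format, and the C-side loader and compute graph would still need to be written:

# Hypothetical export step: dump Flan-T5 weights into a simple binary layout.
# NOTE: illustrative only -- this is NOT the real ggml format.
import struct
import torch
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl")

with open("flan-t5-xl.bin", "wb") as fout:
    for name, tensor in model.state_dict().items():
        data = tensor.to(torch.float16).numpy()
        name_bytes = name.encode("utf-8")
        # per-tensor header: name length, number of dims, the dims, then the name
        fout.write(struct.pack("ii", len(name_bytes), data.ndim))
        fout.write(struct.pack("i" * data.ndim, *data.shape))
        fout.write(name_bytes)
        data.tofile(fout)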

@alexconstant9108

@i-am-neo just curious, have you found T5-xl/UL2 to be in any way superior to FB's llama models (even the smallest 7b model) for certain tasks? IIRC T5-xxl used to be the best open source model before llama, but now...?

@michaelbogdan

Since LLaMA wasn't published under a FOSS license and explicitly does not allow commercial use, it shouldn't be considered an open source model.

@gjmulder added the model, generation quality, and enhancement labels on Mar 19, 2023
@i-am-neo
Author

@i-am-neo just curious, have you found T5-xl/UL2 to be in any way superior to FB's llama models (even the smallest 7b model) for certain tasks? IIRC T5-xxl used to be the best open source model before llama, but now...?

@alexconstant9108 I have found Flan-T5 performant when one needs accurate answers to questions (no inventions allowed). This is from real-life data, details disguised for privacy. Flan-T5 was used in its recipe.

Flan-UL2 looks to be more "fluent"/expressive than Flan-T5, but I've just started to look.

I won't comment on text generation nor "general knowledge queries" since I don't use these models to write novels, nor do I care if they know about Batman and Robin.

What I found beautiful about Stanford Alpaca was that a 7B autoregressive model can be coaxed to follow instructions. The Llama paper mentions "Llama-I", which compared favorably to other instructed models at around 65B parameters. As to how well Alpaca does at 7B, we'll have to see (I've asked the Stanford team to run stats, and hopefully they're willing to respond).

What I find beautiful about the idea of llama.cpp or the possibility of a Flan-T5.cpp is running a model locally, hopefully at some point on the browser. If one could shrink Flan-UL2 down to xB parameters and it still performs well... (Leche Flan anyone?).

I'm of the thinking that a sequence-to-sequence model and an autoregressive one are each better harnessed for what they do best, depending on what one needs to accomplish, rather than pitted against each other in a one-on-one comparison.

In the end, as our field evolves, we all need to ask the question touched on in the Llama paper, italics mine - "The focus of this work is to train a series of language models that achieve the best possible performance at various inference budgets, by ..."
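
For concreteness, the context-grounded question answering I have in mind looks roughly like this with the standard transformers API (a minimal sketch; the prompt wording is illustrative, not the exact recipe used on the real data):

# Minimal sketch of context-grounded QA with Flan-T5 (prompt wording illustrative).
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl")

context = "..."   # the source text the answer must come from
question = "..."  # the user's question

prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context: {context}\n\nQuestion: {question}"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))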

@alexconstant9108

alexconstant9108 commented Mar 20, 2023

@i-am-neo
Very interesting. So in that setup, did you fine-tune Flan-T5 on your own dataset so that it can answer those questions accurately? Also, can you give any example question/answer pairs (with any sensitive data masked for privacy reasons, of course)?

There is an alpaca quantized version available for download somewhere on github if you look close enough :)

BTW, regarding running flan-t5 (and llama) locally (not in the browser, though), you will be very excited to try https://bellard.org/ts_server/. It supports both CPU and GPU inference.

@MarkSchmidty

MarkSchmidty commented Mar 21, 2023

Since people in this thread are interested in Instruct models, I recommend checking out chatGLM-6B.

I believe it is more capable than Flan-UL2 in just 6B parameters. I have a one-click web demo of the 4bit GPU version here: Launch In Colab

It's not getting much attention due to being the product of a Chinese University. But in my testing it outperforms every other open source instruct model while coming in at just 4GB, even smaller than Alpaca-7B.

Full repo is here: https://github.com/THUDM/ChatGLM-6B/blob/main/README_en.md
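
If it helps anyone get started, usage of the int4 checkpoint is roughly the following (from memory of their README; double-check the repo for the exact invocation):

# Roughly the ChatGLM-6B int4 usage from the repo's README (check the repo for exact details).
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b-int4", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4", trust_remote_code=True).half().cuda()
model = model.eval()

# The custom modeling code exposes a chat() helper that carries the conversation history.
response, history = model.chat(tokenizer, "Hello, what can you do?", history=[])
print(response)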

@alexconstant9108

alexconstant9108 commented Mar 21, 2023

@MarkSchmidty useful reference. Thanks
From what I have observed GLM is mostly ignored due to it being weaker with English prompts. But it may turn out to be better when conversing in Chinese for those who can speak it...
If you could provide a small comparison table of English prompts when using Alpaca, Flan-UL2 and GLM-6B, that would open the eyes of many people to that particular Instruct model.

@MarkSchmidty

@MarkSchmidty useful reference. Thanks From what I have observed GLM is mostly ignored due to it being weaker with English prompts. But it may turn out to be better when conversing in Chinese for those who can speak it...

I have not had this experience. It seems excellent for English.
It is mostly ignored because the team behind it does very little to market it outside of China or to non-Chinese speakers.

> If you could provide a small comparison table of English prompts when using Alpaca, Flan-UL2 and GLM-6B, that would open the eyes of many people to that particular Instruct model.
It takes 30 seconds to launch and test my free web demo linked above.

@alexconstant9108

alexconstant9108 commented Mar 21, 2023

But in my testing it outperforms every other open source instruct model while coming in at just 4GB, even smaller than Alpaca-7B.

BTW, the readme says:

Quantization Level        GPU Memory
FP16 (no quantization)    13 GB
INT8                      10 GB
INT4                      6 GB

How did you manage to shrink it to 4GB? Also, is it using GPTQ or RTN quantization for 4-bit?
Edit: I saw that in another place the authors mention 5.2GB of CPU memory
and GPTQ quantization. It would help if their docs had more English examples. Right now, it seems that they are targeting mostly the Chinese audience, which is in line with what you said regarding how they "market" it...

@MarkSchmidty

That is the GPU memory required to run inference, not the model size.

The official int4 model is 4.06GB on HuggingFace before any pruning.

> It would help if their docs had more English examples. Right now, it seems that they are targeting mostly the Chinese audience, which is in line with what you said regarding how they "market" it...

Yes, that's why I'm trying to "market" it to English speakers who want instruct models right now. If more people dig into it, then we can collectively work out answers to questions like these. I actually have an all-English fork in the works which I will publish soon.

@alexconstant9108

alexconstant9108 commented Mar 21, 2023

I actually have an all-English fork in the works which I will publish soon.

Great! You seem to be moving really fast! They only published the 4-bit model 2 days ago :)
Hopefully, we'll also see support for GLM-6B within llama.cpp or end up with a glm.cpp fork :)
One more thing - please also include more info about what datasets the model has been trained (and RLHF fine-tuned) on (was it similar to LLaMA, e.g. on the Pile, etc.).
Usually, with software / models coming out of China, people are wary because of any potential (ab)use of party propaganda materials / censorship and so on.

@i-am-neo
Author

@i-am-neo Very interesting. So in that setup, did you fine-tune Flan-T5 on your own dataset so that it can answer those questions accurately? Also, can you give any example question/answer pairs (with any sensitive data masked for privacy reasons, of course)?

There is an alpaca quantized version available for download somewhere on github if you look close enough :)

BTW, regarding running flan-t5 (and llama) locally (not in the browser, though), you will be very excited to try https://bellard.org/ts_server/. It supports both CPU and GPU inference.

@alexconstant9108 Thanks for sharing about bellard.org. Very good to know!

Re Flan-T5, no special fine-tuning was done in this case. Unfortunately, I can't expose question-answer pairs, as that would violate user privacy (plus it takes a lot of work to disguise details). I think you'd feel similarly if it were your data. You could, if so inclined, try it for yourself, though hopefully because you have a real use case rather than just wanting to kick the tires (limited resources). (ggerganov/whisper.cpp#625)

And yes, I saw those Alpaca weights floating around, thanks. :)

@i-am-neo
Author

@MarkSchmidty useful reference. Thanks From what I have observed GLM is mostly ignored due to it being weaker with English prompts. But it may turn out to be better when conversing in Chinese for those who can speak it...

I have not had this experience. It seems excellent for English. It is mostly ignored because the team behind it does very little to market it outside of China or to non-Chinese speakers.

If you could provide a small comparison table of English prompts when using Alpaca, Flan-UL2 and GLM-6B, that would open the eyes of many people to that particular Instruct model.

It takes 30 seconds to launch and test my free web demo linked above.

@MarkSchmidty Thanks for sharing. I was excited enough to try it right away (both using your notebook and another Gradio demo off HF). The model doesn't seem as adept at answering questions drawn from the context I give it. Maybe it's the way I gave prompts? In your tests, how did you ask it to answer from specific text?

@i-am-neo
Author

i-am-neo commented Mar 22, 2023

@MarkSchmidty But! chatglm seems quite fluent. Great to see. P.S. You probably also know this - it can't be used for commercial purposes.

Prompt:

Write a paragraph about the most beautiful sunset you have ever seen.

Response:

The most beautiful sunset I have ever seen was in a remote corner of the world.� It was a moment of pure magic and wonder, a vision of beauty and tranquility. The sky above was a rich, vibrant pink and orange, a vibrant shade that could be seen even from a great distance. The sun was setting over the horizon, casting long shadows across the landscape, and the sound of the wind rustling through the leaves could be heard clearly.逄

The setting sun was the perfect contrast to the rest of the day.

@TokenBender

Since people in this thread are interested in Instruct models, I recommend checking out chatGLM-6B.

I believe it is more capable than Flan-UL2 in just 6B parameters. I have a one-click web demo of the 4bit GPU version here: Launch In Colab

It's not getting much attention due to being the product of a Chinese University. But in my testing it outperforms every other open source instruct model while coming in at just 4GB, even smaller than Alpaca-7B.

Full repo is here: https://github.com/THUDM/ChatGLM-6B/blob/main/README_en.md

Hi,

I tried to run your colab demo locally on an M2 Pro and it failed with the error below:

File "/opt/anaconda3/lib/python3.9/site-packages/cpm_kernels/library/nvrtc.py", line 5, in <module>
    nvrtc = Lib("nvrtc")
File "/opt/anaconda3/lib/python3.9/site-packages/cpm_kernels/library/base.py", line 59, in __init__
    raise RuntimeError("Unknown platform: %s" % sys.platform)
RuntimeError: Unknown platform: darwin

Considering the RAM requirements are so low, I supposed it could work on the M2 directly, but it looks like there is still something that needs to be changed for it?

@MarkSchmidty

The Colab demo is meant to run on a free Google Colab GPU, not on a local runtime (and definitely not on CPU).

If you want to run chatGLM on a local CPU you should follow the instructions in the official chatGLM repository for running on CPU. It is very fast on GPU but very slow on CPU currently.
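
For reference, the CPU path in their README is (roughly, from memory; see the official repo for the exact instructions and memory requirements) just loading the model in float32 instead of half precision on CUDA:

# Approximate CPU setup from the ChatGLM-6B README (unquantized, so RAM-hungry and slow).
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).float()
model = model.eval()

response, _ = model.chat(tokenizer, "Hello", history=[])
print(response)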

@iliemihai92

To be clear: Flan-UL2 is the best open source model trained on an instruction dataset. Other models such as LLaMA or ChatGLM are only open for research. ChatGLM may be worse at instruction tuning due to its Chinese vocabulary.

@MarkSchmidty

MarkSchmidty commented Mar 27, 2023

I get awful results with Flan-UL2. Its responses tend to be extremely short and it hallucinates more than most models when it doesn't know something. I have had no issues with chatGLM's English abilities.

But for fully open source models, I have had good results with the newer OpenAssistant SFT-1: https://huggingface.co/OpenAssistant/oasst-sft-1-pythia-12b (Note: the HuggingFace inference API version is broken and returns results nothing like what the actual model does in practice.)

@i-am-neo
Author

Maybe we can all share the prompts we're talking about so our baselines for comparison aren't moving around. I have a feeling some folks here are trying models for text generation vs. question answering, which are different use cases...

@MarkSchmidty

MarkSchmidty commented Mar 27, 2023

Here's an example of a question to Flan-UL2 where it is both wrong and characteristically short, even when asked to explain. (Gears 1 and 6 spin in opposite directions, as they are odd and even numbered gears respectively.) This shortness is highly typical of my experiences with UL2.
image

@ghost

ghost commented Mar 27, 2023 via email

@i-am-neo
Author

@MarkSchmidty Since I don't know much about gears... Also I believe the Flan models activate their "explain" reasoning via "Let's think step by step."

image
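
Concretely, the pattern is just appending that phrase to the prompt, something like the sketch below (model name and wording illustrative):

# Sketch: zero-shot "let's think step by step" prompting with a Flan model.
from transformers import pipeline

generate = pipeline("text2text-generation", model="google/flan-t5-xl")
prompt = (
    "If gear 1 turns clockwise, in which direction does gear 6 turn? "
    "Let's think step by step."
)
print(generate(prompt, max_new_tokens=128)[0]["generated_text"])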

@av

av commented Apr 8, 2023

Hey, @i-am-neo 👋🏻
Sorry to bother you, but I'm curious whether you've had a chance to discover anything interesting regarding conversion of the T5 family of models to ggml since the last activity in this thread.
Thank you!

@i-am-neo
Author

Hi @av! I haven't yet. Want to share something you're working on?

@av

av commented Apr 10, 2023

Hi @av! I haven't yet. Want to share something you're working on?

Nothing too specific, but looking for a way to reduce runtime costs for the T5 family with the same context from your message above:

I have found Flan-T5 performant when one needs accurate answers to questions (no inventions allowed). This is from real-life data, details disguised for privacy. Flan-T5 was used in its recipe.

I tried to follow the ggml examples, only to discover later that there's an open branch for T5 integration. Unfortunately, finishing it is out of reach for my level of C++ (and my understanding of LLM inference mechanics, to be fair).

Not sure if the following is of any interest to you, but I also tried running it with ONNX (only a minor boost for small batches, worse than PyTorch), TensorRT (the setup process is quite demanding and, unfortunately, completely excludes CPU inference), and CTranslate2 (a better boost than ONNX for CPU inference). Currently leaning towards Flan T5 Alpaca robo-camelid; a rough sketch of the CTranslate2 route is below. There's also an Alpaca fine-tuned UL2 version, but I haven't tried it yet.
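
In case it saves someone time, the CTranslate2 route looks roughly like this (API details from memory, worth double-checking their docs; the converted model directory name is just an example):

# Rough sketch of CPU inference for Flan-T5 via CTranslate2, after converting the
# checkpoint (e.g. with their ct2-transformers-converter tool, int8 quantization).
import ctranslate2
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("google/flan-t5-xl")
translator = ctranslate2.Translator("flan-t5-xl-ct2", device="cpu")  # path to converted model

prompt = "Answer the question using the context. Context: ... Question: ..."
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
results = translator.translate_batch([tokens])
output_ids = tokenizer.convert_tokens_to_ids(results[0].hypotheses[0])
print(tokenizer.decode(output_ids, skip_special_tokens=True))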

@i-am-neo
Author

@av Thanks for sharing.

Nothing too specific, but looking for a way to reduce runtime costs for the T5 family with the same context from your message above:

It may be some time before we're able to run a good LLM in the browser, as even whisper.wasm can only run its small model (244M parameters) in 1 GB of memory.
Or is your goal reducing inference time?

I am, however, able to run Flan-T5 inference without a GPU. Not using quantization, as I've seen odd results with LLM.int8.

I'm not sold on the results of Flan-T5 finetuned on Alpaca data.
With declare-lab/flan-alpaca-xl:

Screen Shot 2023-04-10 at 10 56 37 AM

With quantized 8-bit https://huggingface.co/spaces/joaogante/transformers_streaming:

Screen Shot 2023-04-10 at 10 58 20 AM

Plain vanilla Flan-T5-xl:

Screen Shot 2023-04-10 at 11 03 39 AM
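
For anyone who wants to reproduce the comparison, the two setups are roughly the plain full-precision load versus the bitsandbytes LLM.int8() load (a sketch; flags as in recent transformers releases):

# Plain fp32 load -- runs on CPU without a GPU (slow, but the results I trust):
from transformers import T5ForConditionalGeneration

model_fp32 = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl")

# LLM.int8() load via bitsandbytes -- needs a GPU; this is the variant where
# I've seen odd generations:
model_int8 = T5ForConditionalGeneration.from_pretrained(
    "google/flan-t5-xl", device_map="auto", load_in_8bit=True
)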

@ghost

ghost commented Apr 11, 2023

I am also restricted by the license, and prefer truly open source models like flan-t5.

In my case, I'm particularly interested in question answering with context. I find flan-t5 provides extractive answers, probably because the prompt triggers its instruction tuning on the SQuAD dataset. I am looking for more abstractive answers.

@av @i-am-neo Do you guys have any suggestions? Thank you.

@i-am-neo
Author

@jasontian6666 say more about what you mean by "question answering with context." Context from where? your own data? from the web? Maybe share some examples.

@ghost

ghost commented Apr 12, 2023

@jasontian6666 say more about what you mean by "question answering with context." Context from where? your own data? from the web? Maybe share some examples.

Just typical open-book QA. You provide a question and context, and the model generates an answer.
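
To make the extractive vs. abstractive distinction concrete, the two prompt framings look roughly like this (wording illustrative):

# Illustrative prompt styles for open-book QA with Flan-T5.
context = "..."   # the passage the model should read
question = "..."

# SQuAD-style framing -- tends to come back as a short, extractive span:
extractive_prompt = (
    f"Read the passage and answer the question.\n"
    f"Passage: {context}\nQuestion: {question}"
)

# Instruction framing that asks for a rewrite -- nudges the model toward a more
# abstractive, full-sentence answer:
abstractive_prompt = (
    f"Using the context below, answer the question in your own words, "
    f"in one or two complete sentences.\nContext: {context}\nQuestion: {question}"
)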

@i-am-neo
Author

I hear you @jasontian6666. I don't quite grasp what you meant by "with context" - just guessing, but maybe you want the answers paraphrased?

If so, try Flan-UL2. It's more fluent than the Flan-T5 series.

@turian

turian commented Dec 31, 2023

I am also curious if there are plans to include flan-t5 in llama.cpp

@ggerganov
Member

Definitely, though it is not high prio atm, so will hope for a community contribution in the meantime

@jorirsan

jorirsan commented Feb 18, 2024

Going to comment that, not T5 specifically, but more support for the available encoder-decoder models in llama.cpp would be amazing. From what I've seen, enc-dec support in most optimized model inference projects seems almost non-existent. It is really a shame, since enc-dec models currently tend to offer the best performance in tasks such as machine translation and automatic speech recognition.

@maziyarpanahi

Speaking of T5 support, the Aya model is based on mT5 and it's trending now. It's 13B, so it does require quantization.

Would be great if we could support it now that the T5/mT5 architectures are interesting again for text generation (they were before, but they're now trending next to Llama-2 and Mistral): https://huggingface.co/CohereForAI/aya-101/discussions/9

@ggerganov
Member

In terms of API, how do these models work?

I imagine something like:

prompt_enc = llama_encode(prompt_text);

while (true) {
    token = llama_decode(prompt_enc);
    
    ? update prompt_enc with sampled token ?
}

Does that make sense? It would be helpful to get a short summary of how these models differ from decoder-only models.
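
For reference, here is roughly how the flow looks for a T5-style model in Hugging Face transformers: the encoder runs exactly once over the prompt, and the decoder is then sampled token by token while cross-attending to the fixed encoder output, so prompt_enc never changes and only the decoder-side tokens grow (a greedy-decoding sketch, no KV cache):

# Sketch of encoder-decoder inference with a T5-style model (greedy, no KV cache).
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl")

enc_inputs = tokenizer("Translate to German: The house is wonderful.", return_tensors="pt")

# 1) The encoder runs once over the whole prompt.
encoder_outputs = model.get_encoder()(**enc_inputs)

# 2) The decoder starts from the decoder start token and grows autoregressively,
#    cross-attending to the fixed encoder states; the prompt is never re-fed.
decoder_ids = torch.tensor([[model.config.decoder_start_token_id]])
for _ in range(64):
    logits = model(
        encoder_outputs=encoder_outputs,
        attention_mask=enc_inputs["attention_mask"],
        decoder_input_ids=decoder_ids,
    ).logits
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
    decoder_ids = torch.cat([decoder_ids, next_token], dim=-1)
    if next_token.item() == model.config.eos_token_id:
        break

print(tokenizer.decode(decoder_ids[0], skip_special_tokens=True))

So in API terms, the result of llama_encode would stay fixed, and only the decoder's token sequence (plus its KV cache) is updated between sampling steps.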

@lukestanley

lukestanley commented Feb 28, 2024

Sorry if I misunderstood your request, but encoder-decoder models perform "translations", like Whisper, or Google Translate. They can do summarisation, Q&A, and even completion too, if so trained.

Encoder-decoder models might be exposed to an end user with a similar GUI. Rather than it being a completion task, it's all about one specific input, and one specific output. That is a separate thing, not a continuation. It's "translate" vs "autocomplete" (even though they can do completion, it's not done as a continuation).

The code you sketched seems to be a completion, which is quite different. The API would probably need to be a bit different to support these sorts of models.

Some encoder-decoder models are inherently less at risk of instruction prompt injection and can more easily be made to run really fast on mobile devices because of their constrained task focus. Both are really important features for solving problems quickly and reliably.

I imagine it would be more like this:

prompt_enc = llama_encode(prompt_text);

// Empty string to collect output tokens
result_output = "";

for (int i = 0; i < max_output_length; ++i) {
    string token = llama_decode_step(prompt_enc, result_output);
    result_output += token;
    if (token == END_TOKEN) {  
        break;
    }
}

Encoder-Decoder models in llama.cpp could be amazing.
I hope this comment is useful. Much respect - I'm using llama.cpp all the time, you're a hero, @ggerganov!


This issue was closed because it has been inactive for 14 days since being marked as stale.
