Please support gemma arch #1627

Closed
NeonBohdan opened this issue Feb 21, 2024 · 8 comments
Comments

@NeonBohdan

NeonBohdan commented Feb 21, 2024

It's a unique model.
It has a 256K tokenizer like mT5, but is decoder-only,
so I'm hoping for good multilingual capabilities compared to the Llama tokenizer versions.

Hoping it will be easy enough to add (like Llama -> Mistral).
As I see it, this project is now harder to maintain,
but it's better than llama.cpp or vLLM in my opinion.

https://huggingface.co/google/gemma-7b-it

Maybe these will help:
ggerganov/llama.cpp#5631
vllm-project/vllm#2960
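
For reference, once the architecture is supported, the conversion and generation flow would presumably look roughly like this (a minimal sketch; the output directory name and generation options are just examples):

```python
import ctranslate2
import transformers

# Convert the Hugging Face checkpoint to the CTranslate2 format
# (assumes the converter has gained support for the Gemma architecture).
converter = ctranslate2.converters.TransformersConverter("google/gemma-7b-it")
converter.convert("gemma-7b-it-ct2", quantization="int8_float16")

# Load the converted model and generate from a prompt.
generator = ctranslate2.Generator("gemma-7b-it-ct2", device="cuda")
tokenizer = transformers.AutoTokenizer.from_pretrained("google/gemma-7b-it")

prompt = "Translate to French: Hello, how are you?"
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

results = generator.generate_batch([tokens], max_length=128, sampling_topk=1)
print(tokenizer.decode(results[0].sequences_ids[0]))
```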

@minhthuc2502
Collaborator

Gemma will be supported soon with #1631.

@ymoslem

ymoslem commented Mar 7, 2024

Hi @NeonBohdan Have you tried Gemma with CTranslate2? Does it generate the same output as Transformers? It seems to start with good generation and then continues with repeated words. However, I might be missing something.
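
For context, the Transformers reference output I'm comparing against comes from roughly this (a sketch; greedy decoding so the two runs are comparable):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-7b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Translate to French: Hello, how are you?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding to make the comparison with CTranslate2 deterministic.
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```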

@NeonBohdan
Author

@ymoslem, I haven't compiled ctranslate2 to test it yet, and I'm waiting for the release.
However, there seems to be an issue with Gemma-it, compared to Mistral.
The problem gets worse with quantization.

You can try using a repetition penalty, but overall, I've observed this problem as well.
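
If you want to try it, the penalty is a generation option in CTranslate2, roughly like this (a sketch reusing the generator and tokens from the earlier example; 1.3 is just an arbitrary starting value):

```python
# Values > 1.0 penalize tokens that already appeared in the output.
results = generator.generate_batch(
    [tokens],
    max_length=128,
    sampling_topk=1,
    repetition_penalty=1.3,
)
print(tokenizer.decode(results[0].sequences_ids[0]))
```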

@ymoslem

ymoslem commented Mar 7, 2024

Thanks, @NeonBohdan, for your response! I tried repetition_penalty, but it does not seem to help.
I suspect it may be an issue with quantization. I will try without it and see.
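
Concretely, I plan to compare a non-quantized conversion against the int8 one, roughly like this (a sketch; directory names are just examples):

```python
import ctranslate2

converter = ctranslate2.converters.TransformersConverter("google/gemma-7b-it")

# Baseline conversion keeping float16 weights (no int8 quantization).
converter.convert("gemma-7b-it-ct2-fp16", quantization="float16")

# Quantized variant to compare against.
converter.convert("gemma-7b-it-ct2-int8", quantization="int8_float16")
```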

@NeonBohdan
Author

NeonBohdan commented Apr 3, 2024

@ymoslem you definitely want to leave repetition_penalty at its default (disabled).

It's the most stable generation setting (based on information from llama.cpp issues).

The problem is with quantization.

@ymoslem

ymoslem commented Apr 3, 2024

Thanks @NeonBohdan! Are you aware of any quantization implementation that solves this issue, or is it something to do with the model itself?

@NeonBohdan
Author

NeonBohdan commented Apr 3, 2024

@ymoslem read this

Right now Gemma isn't fully fixed.
The two biggest problems are:

  1. exact GELU vs. its tanh approximation (see the sketch after this list)
  2. different dtypes are needed for different layers for best stability
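
To illustrate point 1, here is a small sketch of the gap between exact GELU and its tanh approximation (whether this gap matters depends on which variant the checkpoint was trained with; the magnitude shown is only illustrative):

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-6, 6, steps=10_001)

exact = F.gelu(x)                       # erf-based "exact" GELU
approx = F.gelu(x, approximate="tanh")  # tanh approximation used by some ports

# The difference is small per call, but it is applied in every MLP block,
# so mismatches can accumulate over many layers and long generations.
print(f"max abs difference: {(exact - approx).abs().max():.2e}")
```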

So even at bfloat16 or float16 you may see problems (maybe float16 is less problematic, because of the embedding scaling).

And if you add quantization for this model, it may be critical to leave some layers unquantized (which CTranslate2 doesn't support).
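
For comparison, outside CTranslate2, the Transformers bitsandbytes integration can keep selected modules unquantized; a sketch (which modules actually need to stay unquantized for Gemma is an assumption here, the names are just examples):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load in 8-bit but keep some modules in higher precision.
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=["lm_head"],  # example of a module left unquantized
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b-it",
    quantization_config=bnb_config,
    torch_dtype=torch.float16,
    device_map="auto",
)
```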


So I don't really see a solution except waiting, or trying:
vLLM (may work better without quantization; with it, it may not work)
llama.cpp (quantization works, but there is no beam search)

But I like CTranslate2 the most.
It's the most production-ready package.

@ymoslem

ymoslem commented Apr 4, 2024

Thanks @NeonBohdan for the explanations.
