Please support gemma arch #1627

Closed
NeonBohdan opened this issue Feb 21, 2024 · 8 comments
Comments

@NeonBohdan

NeonBohdan commented Feb 21, 2024

It's a unique model.
It has a 256K tokenizer like mT5, but is decoder-only,
so I'm hoping for good multilingual capabilities compared to the Llama tokenizer versions.

Hoping it will be easy enough to add (like Llama -> Mistral).
As I see it, this project is now harder to maintain,
but it's better than llama.cpp or vLLM in my opinion.

https://huggingface.co/google/gemma-7b-it

Maybe these will help:
ggerganov/llama.cpp#5631
vllm-project/vllm#2960
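
For reference, once the architecture is supported, the conversion and generation flow would presumably look roughly like this (a minimal sketch; the output directory name and generation options are just examples):

```python
import ctranslate2
import transformers

# Convert the Hugging Face checkpoint to the CTranslate2 format
# (assumes the converter has gained support for the Gemma architecture).
converter = ctranslate2.converters.TransformersConverter("google/gemma-7b-it")
converter.convert("gemma-7b-it-ct2", quantization="int8_float16")

# Load the converted model and generate from a prompt.
generator = ctranslate2.Generator("gemma-7b-it-ct2", device="cuda")
tokenizer = transformers.AutoTokenizer.from_pretrained("google/gemma-7b-it")

prompt = "Translate to French: Hello, how are you?"
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

results = generator.generate_batch([tokens], max_length=128, sampling_topk=1)
print(tokenizer.decode(results[0].sequences_ids[0]))
```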

@minhthuc2502
Collaborator

Gemma will be supported soon with #1631.

@ymoslem

ymoslem commented Mar 7, 2024

Hi @NeonBohdan Have you tried Gemma with CTranslate2? Does it generate the same output as Transformers? It seems to start with good generation and then continues with repeated words. However, I might be missing something.
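
For context, the Transformers reference output I'm comparing against comes from roughly this (a sketch; greedy decoding so the two runs are comparable):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-7b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Translate to French: Hello, how are you?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding to make the comparison with CTranslate2 deterministic.
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```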

@NeonBohdan
Author

@ymoslem, I haven't compiled ctranslate2 to test it yet, and I'm waiting for the release.
However, there seems to be an issue with Gemma-it, compared to Mistral.
The problem gets worse with quantization.

You can try using a repetition penalty, but overall, I've observed this problem as well.
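
If you want to try it, the penalty is a generation option in CTranslate2, roughly like this (a sketch reusing the generator and tokens from the earlier example; 1.3 is just an arbitrary starting value):

```python
# Values > 1.0 penalize tokens that already appeared in the output.
results = generator.generate_batch(
    [tokens],
    max_length=128,
    sampling_topk=1,
    repetition_penalty=1.3,
)
print(tokenizer.decode(results[0].sequences_ids[0]))
```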

@ymoslem

ymoslem commented Mar 7, 2024

Thanks, @NeonBohdan, for your response! I tried repetition_penalty, but it does not seem to help.
I suspect it may be an issue with quantization. I will try without it and see.
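
Concretely, I plan to compare a non-quantized conversion against the int8 one, roughly like this (a sketch; directory names are just examples):

```python
import ctranslate2

converter = ctranslate2.converters.TransformersConverter("google/gemma-7b-it")

# Baseline conversion keeping float16 weights (no int8 quantization).
converter.convert("gemma-7b-it-ct2-fp16", quantization="float16")

# Quantized variant to compare against.
converter.convert("gemma-7b-it-ct2-int8", quantization="int8_float16")
```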

@NeonBohdan
Author

NeonBohdan commented Apr 3, 2024

@ymoslem you definitely want to leave repetition_penalty at its default (disabled).

It's the most stable generation setting (based on information from llama.cpp issues).

The problem is with quantization.

@ymoslem

ymoslem commented Apr 3, 2024

Thanks @NeonBohdan! Are you aware of any quantization implementation that solves this issue, or is it something to do with the model itself?

@NeonBohdan
Author

NeonBohdan commented Apr 3, 2024

@ymoslem read this

Right now Gemma isn't fully fixed.
The two biggest problems are:

  1. exact GELU vs. its tanh approximation (see the sketch after this list)
  2. different dtypes are needed for different layers for best stability
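
To illustrate point 1, here is a small sketch of the gap between exact GELU and its tanh approximation (whether this gap matters depends on which variant the checkpoint was trained with; the magnitude shown is only illustrative):

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-6, 6, steps=10_001)

exact = F.gelu(x)                       # erf-based "exact" GELU
approx = F.gelu(x, approximate="tanh")  # tanh approximation used by some ports

# The difference is small per call, but it is applied in every MLP block,
# so mismatches can accumulate over many layers and long generations.
print(f"max abs difference: {(exact - approx).abs().max():.2e}")
```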

So even at bfloat16 or float16 you may see problems (maybe float16 is less problematic, because of the embedding scaling).

And if you add quantization for this model, it may be critical to leave some layers unquantized (which CTranslate2 doesn't support).
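
For comparison, outside CTranslate2, the Transformers bitsandbytes integration can keep selected modules unquantized; a sketch (which modules actually need to stay unquantized for Gemma is an assumption here, the names are just examples):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load in 8-bit but keep some modules in higher precision.
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=["lm_head"],  # example of a module left unquantized
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b-it",
    quantization_config=bnb_config,
    torch_dtype=torch.float16,
    device_map="auto",
)
```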


So I don't really see a solution except waiting, or trying:
vLLM (may work better without quantization; with it, it may not work)
llama.cpp (quantization works, but there is no beam search)

But I like CTranslate2 the most.
It's the most production-ready package.

@ymoslem

ymoslem commented Apr 4, 2024

Thanks @NeonBohdan for the explanations.
