
Support Gemma #940

Closed
carmocca opened this issue Feb 21, 2024 · 14 comments · Fixed by #941

Comments

@carmocca
Contributor

carmocca commented Feb 21, 2024

Announcement: https://blog.google/technology/developers/gemma-open-models/
Technical report: https://storage.googleapis.com/deepmind-media/gemma/gemma-report.pdf
HF Hub weights: https://huggingface.co/google/gemma-7b
HF Transformers PR: huggingface/transformers#29167 with https://github.com/huggingface/transformers/blob/main/src/transformers/models/gemma/modeling_gemma.py as the model implementation

From a brief skim, I think we just need to add geglu as the activation

@rasbt
Collaborator

rasbt commented Feb 21, 2024

Based on the paper:

  1. We already support multi-query attention (through Llama)
  2. Yes, GeGLU is one of the "novelties"
  3. The other novelty is that they apply RMSNorm both before and after attention, instead of only one or the other as in Post-/Pre-Norm

@rasbt
Collaborator

rasbt commented Feb 21, 2024

There doesn't seem to be an official GeGLU implementation in PyTorch yet, but this looks good: https://github.com/pfnet-research/deep-table/blob/237c8be8a405349ce6ab78075234c60d9bfe60b7/deep_table/nn/layers/activation.py#L22
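
For reference, a minimal sketch of that kind of activation (assuming the usual GeGLU formulation: split the input in half along the last dimension and gate one half with GELU of the other; an illustration, not the linked code verbatim):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GeGLU(nn.Module):
    """Gated GELU: chunks the last dimension in two and gates one half with GELU of the other."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x, gate = x.chunk(2, dim=-1)  # note: the output width is half the input width
        return x * F.gelu(gate)
```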

@carmocca
Contributor Author

The other novelty is that they apply RMSNorm before and after attention instead only one or the other as in Post/Pre-Norm

Can you be specific? It seems to be exactly what we implement: https://github.com/Lightning-AI/lit-gpt/blob/main/lit_gpt/model.py#L154-L166 and seems to match what's in HF: https://github.com/huggingface/transformers/blob/main/src/transformers/models/gemma/modeling_gemma.py#L621-L640

@carmocca
Contributor Author

BTW this is up for grabs if one of you wants to add it quickly

@rasbt
Collaborator

rasbt commented Feb 21, 2024

Can you be specific? It seems to be exactly what we implement: https://github.com/Lightning-AI/lit-gpt/blob/main/lit_gpt/model.py#L154-L166 and seems to match what's in HF: https://github.com/huggingface/transformers/blob/main/src/transformers/models/gemma/modeling_gemma.py#L621-L640

Hm, dunno what they mean. Even GPT-2 had a LayerNorm before and after each multihead attention module, so I thought they meant something different since they specifically highlighted that. From the paper:

" We normalize both the input and the output of each transformer sub-layer,
a deviation from the standard practice of solely
normalizing one or the other."

Maybe they mean that they added an additional normalization after the feedforward module.

@Andrei-Aksionov
Collaborator

Andrei-Aksionov commented Feb 21, 2024

Maybe something is wrong with the HF implementation? They don't even use geglu, only a standard gelu.

I think it's better to check how it's implemented in Keras.

Update:

There doesn't seem to be an official GeGLU implementation in PyTorch, yet, but this looks good: https://github.com/pfnet-research/deep-table/blob/237c8be8a405349ce6ab78075234c60d9bfe60b7/deep_table/nn/layers/activation.py#L22

Oh, I thought GeGLU was some smart variant that improves GELU, but it's just that weird thing as in OLMo.
I expected a lot of math, but not this.
I don't like that the activation function changes the size of the input; it's unexpected behavior IMO, but if it's a thing I guess I just have to get used to it.
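
To illustrate the size change, a self-contained functional version of the same idea:

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 8)            # last dimension 8
a, gate = x.chunk(2, dim=-1)     # split in half
out = a * F.gelu(gate)
print(out.shape)                 # torch.Size([2, 4]) -- half the input width
```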

Update 2:
In their convert_gemma_weights.py code I haven't noticed anything related to the MLP weights, and they use the regular LLaMAMLP code. Maybe Google provided already-prepared weights, where the up and gate weights aren't merged 🤷‍♂️?

In theory, the code should work with just updating the config file.

@Andrei-Aksionov
Collaborator

We normalize both the input and the output of each transformer sub-layer

When I read it I imagined something like this:

  • norm
  • attention
  • norm
  • MLP
  • norm

@rasbt
Collaborator

rasbt commented Feb 21, 2024

Yes, same. It's weird. I agree we should check the Keras code.

@Andrei-Aksionov
Collaborator

Ok, here is their TransformerBlock from KerasNLP.
Not sure what is going on, but:

  1. They use only 2 norm layers: before attention and before MLP
  2. They don't use the geglu thing either 😆

@carmocca
Contributor Author

carmocca commented Feb 21, 2024

https://github.com/keras-team/keras-nlp/blob/dd3ceb7957ab993fc7cc53890a72239e7a566218/keras_nlp/models/gemma/gemma_decoder_block.py#L167

Keras does implement geglu by having two fc layers - as in our LLaMAMLP class - but their size is halved compared to the size in LLaMAMLP

It's also interesting that they use approximate=True which maps to tanh in PyTorch: https://github.com/keras-team/keras/blob/2ad117c44c5346691a646512ded3d25a5e3cb322/keras/backend/torch/nn.py#L88-L89. This is as in our GptNeoxMLP class.

In conclusion, Gemma needs a new MLP class that is a mix of both
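
A rough sketch of what such a mixed MLP class could look like (a hypothetical GemmaMLP: gating as in LLaMAMLP, tanh-approximate GELU as in GptNeoxMLP; the constructor arguments are placeholders rather than the lit-gpt Config API):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GemmaMLP(nn.Module):
    def __init__(self, n_embd: int, intermediate_size: int) -> None:
        super().__init__()
        self.fc_1 = nn.Linear(n_embd, intermediate_size, bias=False)  # gate projection
        self.fc_2 = nn.Linear(n_embd, intermediate_size, bias=False)  # up projection
        self.proj = nn.Linear(intermediate_size, n_embd, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # gated GELU: tanh-approximate GELU on the gate branch, elementwise product with the up branch
        x_fc_1 = self.fc_1(x)
        x_fc_2 = self.fc_2(x)
        return self.proj(F.gelu(x_fc_1, approximate="tanh") * x_fc_2)
```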

@rasbt
Collaborator

rasbt commented Feb 21, 2024

Thanks for checking. In that case, let me submit a PR. Almost done.

@carmocca
Contributor Author

geglu is gelu but only applied on half of the input. I agree that the HF impl doesn't look equivalent to the Keras one

@rasbt rasbt mentioned this issue Feb 21, 2024
@Andrei-Aksionov
Collaborator

Keras does implement geglu by having two fc layers - as in our LLaMAMLP class - but their size is halved compared to the size in LLaMAMLP

Didn't know this thing is called geglu. I seriously expected something more math-heavy.
I definitely need to read the paper.

What confuses me is why you need to specify an intermediate size, which is used only in the DecoderBlock, just to halve it in the process. Maybe the 4x scaling factor is hard-coded? 🤷‍♂️

In conclusion, Gemma needs a new MLP class that is a mix of both

Yep.

@rasbt
Collaborator

rasbt commented Feb 21, 2024

We normalize both the input and the output of each transformer sub-layer

When I read it I imagined something like this:

  • norm
  • attention
  • norm
  • MLP
  • norm

The weird thing, though, is that they don't seem to have a third layernorm weight there in the HF checkpoint. That's where the paper and the implementation seem to differ.
