
Extra linear scaling in LlamaRotaryEmbedding classes #29765

Closed
gtebbutt opened this issue Mar 20, 2024 · 3 comments · Fixed by #29808

@gtebbutt

I've been doing some work around NTK and YaRN scaling, and I noticed that PR #29198 adds linear scaling (`t = t / self.scaling_factor`) to the constructor of the `LlamaRotaryEmbedding` base class, possibly pulled in from `GPTNeoXLinearScalingRotaryEmbedding` (which uses a different style of forward function from `LlamaLinearScalingRotaryEmbedding`). If I'm reading this correctly, it's a behavioural change: the Llama RoPE subclasses are now scaled twice, once linearly in the base class constructor and again in their respective forward functions.

Should be a simple fix to just take out this one line, assuming I haven't missed anything: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L108
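
To make the concern concrete, here's a trimmed-down sketch of the structure I mean (illustrative only, not the exact transformers source; signatures and caching details are simplified):

```python
import torch


class LlamaRotaryEmbedding:
    def __init__(self, dim, max_position_embeddings=2048, base=10000, scaling_factor=1.0):
        self.scaling_factor = scaling_factor
        self.inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        t = torch.arange(max_position_embeddings).float()
        t = t / self.scaling_factor  # <- the line added in #29198 (linear scaling in the base constructor)
        freqs = torch.outer(t, self.inv_freq)
        self._cos_cached, self._sin_cached = freqs.cos(), freqs.sin()


class LlamaLinearScalingRotaryEmbedding(LlamaRotaryEmbedding):
    def forward(self, position_ids):
        # The subclass divides by the scaling factor again in forward(), which is
        # where the worry about double scaling comes from.
        position_ids = position_ids.float() / self.scaling_factor
        freqs = torch.outer(position_ids, self.inv_freq)
        return freqs.cos(), freqs.sin()
```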

@amyeroberts
Collaborator

cc @ArthurZucker @gante

@gante
Member

gante commented Mar 22, 2024

Hi @gtebbutt 👋

It is not a bug; the line is there for backward compatibility. If you look at the code, `t = t / self.scaling_factor` is used exclusively to build the backward-compatible sin/cos caches, which are currently unused in the model's forward pass. As such, there is no duplication :)

In any case, it's a fair question -- I had to double-check the code myself, despite being familiar with it! To avoid regressions, I'm opening a PR that adds a test confirming the correctness of RoPE scaling; a duplicated scaling factor would cause that test to fail.
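
A minimal sketch of the distinction (illustrative only, not the actual transformers code or test): the constructor-scaled `t` only feeds the backward-compatible caches, while the forward pass recomputes the angles from `position_ids` and applies the factor exactly once, so nothing compounds.

```python
import torch

dim, base, max_pos, scaling_factor = 8, 10000, 16, 2.0
inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))

# Path 1: backward-compatible caches built in the constructor (unused by the model forward).
t = torch.arange(max_pos).float() / scaling_factor
cos_cached = torch.outer(t, inv_freq).cos()

# Path 2: the forward pass, which scales the incoming position_ids once.
position_ids = torch.arange(4).float()
cos_forward = torch.outer(position_ids / scaling_factor, inv_freq).cos()

# Each path applies the factor exactly once, so they agree -- no double scaling.
print(torch.allclose(cos_forward, cos_cached[:4]))  # True
```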

@gtebbutt
Author

Thanks @gante - I appreciate you taking the time to check; I'd missed that subtlety around the caching!
