GradientAI Auto ROPE Base calculation #910
https://gradient.ai/blog/scaling-rotational-embeddings-for-long-context-language-models has a formula that better fits the ideal rope scaling. Tested with Llama3, and checked that the calculation is correct for Llama2. Retains the logic for not scaling rope if under the trained CTX.
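For anyone skimming, here is a minimal sketch of the log-ratio calculation the blog post describes, as I read it; the function name, the early-out guard, and the exact form are my paraphrase rather than the merged code:

```python
import math

def gradient_rope_base(base: float, train_ctx: int, target_ctx: int) -> float:
    """Sketch of the Gradient AI rope base scaling:
    new_base = base ** (log(target_ctx / (2*pi)) / log(train_ctx / (2*pi)))
    """
    if target_ctx <= train_ctx:
        return base  # don't scale rope when under the trained CTX
    chi_train = train_ctx / (2 * math.pi)
    chi_target = target_ctx / (2 * math.pi)
    return base ** (math.log(chi_target) / math.log(chi_train))

# Example: a Llama2-style model (4k trained context, base 10000) run at 16k
print(gradient_rope_base(10000.0, 4096, 16384))
```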
Have done some benchmarks using perplexity. Results are as follows:

For Llama3, definitely a better fit. Hope that helps. EDIT: I'm an idiot and used my manual tuning result on Llama3 and didn't actually include the formula result. The table has been updated now.
Changing to draft. Discovered some scaling issues with Solar models (Mistral 7B 0.1 with a sliding window). Discovered SWA models require the CTX figures to be 8x to get close to a suitable rope base.

Solar-based models require the context values to be multiplied by 8. This is (I'm guessing) because the positions are based on a 32k context, but with a sliding window of 4k, and 32768 / 4096 = 8 (see the sketch below).
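To make the 8x concrete, a hedged sketch of how the multiplier could fold into the calculation above (`effective_ctx` and `is_solar` are illustrative names, not the PR's actual identifiers):

```python
def effective_ctx(n_ctx: int, is_solar: bool) -> int:
    # Assumption: Solar/SWA models place positions over a ~32k range while
    # attending through a 4k sliding window, hence the 32768 / 4096 = 8 factor.
    ctx_multiplier = 8 if is_solar else 1
    return n_ctx * ctx_multiplier

# Both the trained and target context figures would be multiplied this way
# before being fed into the rope base formula above.
```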
adding in tensor count to identify Solar models based on a tensor count of 435
add in n_tensor count for Solar identification
OK, I managed to get a workable solution for Solar-based models (like fimb). Had to use the total tensor count of 435 with the base of 10000 to identify them. I haven't figured out a decent way to identify original Mistral 7B v0.1 models, but in theory, it should use the previous logic, as it 'thinks' it has a context of 32k. If GGUF had metadata for "sliding window" then it would be easy.
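A sketch of that detection heuristic as described (the function and parameter names are illustrative; the 435 / 10000 values come straight from the comment above):

```python
def looks_like_solar(total_tensor_count: int, rope_freq_base_train: float) -> bool:
    # Heuristic from this PR: Solar-family models expose a total of 435
    # tensors with a trained rope base of 10000.
    return total_tensor_count == 435 and rope_freq_base_train == 10000.0
```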
Is this PR ready for review yet, or still in development?
LGTM, please confirm and I'll merge.
Looks good to me :)