
GradientAI Auto ROPE Base calculation #910

Merged (5 commits, Jun 13, 2024)

Conversation

askmyteapot

https://gradient.ai/blog/scaling-rotational-embeddings-for-long-context-language-models describes a formula that better fits the ideal RoPE scaling.

Tested with Llama 3, and checked that the calculation is also correct for Llama 2. Retains the logic of not scaling RoPE when the requested context is below the trained context.

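For reference, the relationship in the Gradient post works out to raising the original base to the ratio of the log-scaled context lengths. Below is a minimal sketch of that calculation in C++ (my own reading of the blog post, cross-checked against the benchmark table further down; the function name is illustrative and this is not necessarily the exact code merged into gpttype_adapter.cpp):

```cpp
#include <cmath>
#include <cstdio>

// Sketch of a Gradient-style auto RoPE base calculation:
//   new_base = old_base ^ ( ln(n_ctx_desired / 2*pi) / ln(n_ctx_train / 2*pi) )
// The existing behaviour of leaving the base untouched when the requested
// context does not exceed the trained context is preserved.
static float gradient_rope_base(float original_rope_base, int n_ctx_train, int n_ctx_desired)
{
    if (n_ctx_desired <= n_ctx_train) {
        return original_rope_base; // no scaling needed at or below the trained context
    }
    const float two_pi = 2.0f * 3.14159265f;
    float exponent = logf((float)n_ctx_desired / two_pi) / logf((float)n_ctx_train / two_pi);
    return powf(original_rope_base, exponent);
}

int main()
{
    // Llama 3 style example: base 500000, trained 8192, requested 16384 -> ~1776948
    printf("%.1f\n", gradient_rope_base(500000.0f, 8192, 16384));
    // Llama 2 style example: base 10000, trained 4096, requested 16384 -> ~71738
    printf("%.1f\n", gradient_rope_base(10000.0f, 4096, 16384));
    return 0;
}
```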
askmyteapot (Author) commented Jun 11, 2024

Have done some benchmarks using perplexity.

Results are as follows:

| Model | Base RoPE | Tested RoPE | Method | CTX_train | CTX_test | PPL Result |
|---|---|---|---|---|---|---|
| L3-15B-q8 | 500000 | 1638400.0 | Kobo | 8192 | 16384 | 6.3957 +/- 0.04083 |
| L3-15B-q8 | 500000 | 1776948.1 | Gradient | 8192 | 16384 | 6.0832 +/- 0.03830 |
| L3-15B-q8 | 500000 | 1843000.0 | Manual | 8192 | 16384 | 6.1221 +/- 0.03865 |
| L2-13B-q4 | 10000 | 65536 | Kobo | 4096 | 16384 | 6.8271 +/- 0.04360 |
| L2-13B-q4 | 10000 | 71738 | Gradient | 4096 | 16384 | 6.9586 +/- 0.04421 |
| L2-13B-q4 | 10000 | 49152 | Kobo | 4096 | 12288 | 6.0357 +/- 0.03804 |
| L2-13B-q4 | 10000 | 47661 | Gradient | 4096 | 12288 | 6.0041 +/- 0.03785 |
| L2-13B-q4 | 10000 | 32768 | Kobo | 4096 | 8192 | 6.0434 +/- 0.03913 |
| L2-13B-q4 | 10000 | 26784 | Gradient | 4096 | 8192 | 5.9039 +/- 0.03831 |

For Llama 3, definitely a better fit.
For Llama 2, a better fit for doubling and tripling context, but worse for quadrupling. (However, 4x context on Llama 2 is 12 GB just for KV, so I doubt most will use it.)

Hope that helps.

EDIT: I accidentally used my manual tuning result for Llama 3 and didn't actually include the formula result. The table has been updated now.
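
As a cross-check on the Gradient column (my own arithmetic, using the base^(ln(ctx/2π)/ln(ctx_train/2π)) reading sketched above): the Llama 3 row gives 500000^(ln(16384/2π)/ln(8192/2π)) ≈ 500000^1.0966 ≈ 1,777,000, and the Llama 2 row at 16384 gives 10000^(ln(16384/2π)/ln(4096/2π)) ≈ 10000^1.2139 ≈ 71,700, both matching the tested bases to within rounding.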

askmyteapot marked this pull request as draft on June 12, 2024 04:03
askmyteapot (Author) commented Jun 12, 2024

Changing to draft. Discovered some scaling issues with Solar models (Mistral 7B v0.1 with a sliding window).

Discovered that sliding-window attention (SWA) models require the context figures to be multiplied by 8 to get close to a suitable RoPE base.

Solar-based models require the context values to be multiplied by 8. This is (I'm guessing) because the positions are based on a 32k context, but the sliding window is 4k.
adding in tensor count to identify solar models based on tensor count of 435.
add in n_tensor count for solar identification
askmyteapot marked this pull request as ready for review on June 12, 2024 10:07
askmyteapot (Author) commented Jun 12, 2024
OK, I managed to get a workable solution for Solar-based models (like fimb). Had to use the total tensor count of 435 together with the base of 10000 to identify them; see the sketch below.

I haven't figured out a decent way to identify original Mistral 7B v0.1 models, but in theory they should use the previous logic, as they 'think' they have a context of 32k. If GGUF had metadata for "sliding window" it would be easy.
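
A rough sketch of how that heuristic could slot into the calculation above, with the 8x context multiplier applied for models detected as Solar-based (the detection rule and names here are illustrative, drawn from this thread, and not necessarily the exact code merged into gpttype_adapter.cpp):

```cpp
#include <cmath>

// Illustrative sketch: Solar/SWA models are detected by total tensor count
// (435) plus a trained RoPE base of 10000, and their context figures are
// multiplied by 8 before the Gradient-style base calculation, since positions
// behave as if trained on 32k context while the sliding window is only 4k.
static bool looks_like_solar(int total_tensor_count, float original_rope_base)
{
    return total_tensor_count == 435 && fabsf(original_rope_base - 10000.0f) < 1.0f;
}

static float auto_rope_base(float original_rope_base, int n_ctx_train, int n_ctx_desired,
                            int total_tensor_count)
{
    if (n_ctx_desired <= n_ctx_train) {
        return original_rope_base; // keep the existing no-scaling behaviour
    }
    float ctx_multiplier = looks_like_solar(total_tensor_count, original_rope_base) ? 8.0f : 1.0f;
    const float two_pi = 2.0f * 3.14159265f;
    float train_ctx   = (n_ctx_train   * ctx_multiplier) / two_pi;
    float desired_ctx = (n_ctx_desired * ctx_multiplier) / two_pi;
    return powf(original_rope_base, logf(desired_ctx) / logf(train_ctx));
}
```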

askmyteapot marked this pull request as draft on June 12, 2024 14:13
LostRuins (Owner) commented:
Is this PR ready for review yet, or still in development?

Two review comments on gpttype_adapter.cpp (outdated, resolved).
askmyteapot marked this pull request as ready for review on June 13, 2024 08:33
LostRuins added the enhancement label on Jun 13, 2024
LostRuins (Owner) left a comment:


LGTM, please confirm and I'll merge.

askmyteapot (Author) left a comment:


Looks good to me :)

LostRuins merged commit 1e72b65 into LostRuins:concedo_experimental on Jun 13, 2024