GradientAI Auto ROPE Base calculation #910
https://gradient.ai/blog/scaling-rotational-embeddings-for-long-context-language-models has a formula that better fits the ideal rope scaling. Tested with Llama3, and checked that the calculation is correct for Llama2. Retains the logic for not scaling rope if under the trained CTX.
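For anyone skimming, here is a minimal sketch of the log-ratio calculation the blog post describes, as I read it; the function name, the early-out guard, and the exact form are my paraphrase rather than the merged code:

```python
import math

def gradient_rope_base(base: float, train_ctx: int, target_ctx: int) -> float:
    """Sketch of the Gradient AI rope base scaling:
    new_base = base ** (log(target_ctx / (2*pi)) / log(train_ctx / (2*pi)))
    """
    if target_ctx <= train_ctx:
        return base  # don't scale rope when under the trained CTX
    chi_train = train_ctx / (2 * math.pi)
    chi_target = target_ctx / (2 * math.pi)
    return base ** (math.log(chi_target) / math.log(chi_train))

# Example: a Llama2-style model (4k trained context, base 10000) run at 16k
print(gradient_rope_base(10000.0, 4096, 16384))
```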
Have done some benchmarks using perplexity. Results are as follows:

For Llama3, definitely a better fit. Hope that helps. EDIT: I'm an idiot and used my manual tuning result on Llama3 and didn't actually include the formula result. The table has been updated now.
Changing to draft. Discovered some scaling issues with Solar models (Mistral 7B 0.1 with a sliding window). Discovered SWA models require the CTX figures to be 8x to get close to a suitable rope base.

Solar-based models require the context values to be multiplied by 8. This is (I'm guessing) because the positions are based on a 32k context, but with a sliding window of 4k, and 32768 / 4096 = 8 (see the sketch below).
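To make the 8x concrete, a hedged sketch of how the multiplier could fold into the calculation above (`effective_ctx` and `is_solar` are illustrative names, not the PR's actual identifiers):

```python
def effective_ctx(n_ctx: int, is_solar: bool) -> int:
    # Assumption: Solar/SWA models place positions over a ~32k range while
    # attending through a 4k sliding window, hence the 32768 / 4096 = 8 factor.
    ctx_multiplier = 8 if is_solar else 1
    return n_ctx * ctx_multiplier

# Both the trained and target context figures would be multiplied this way
# before being fed into the rope base formula above.
```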
adding in tensor count to identify Solar models based on a tensor count of 435
add in n_tensor count for Solar identification
OK, I managed to get a workable solution for Solar-based models (like fimb). Had to use the total tensor count of 435 with the base of 10000 to identify them. I haven't figured out a decent way to identify original Mistral 7B v0.1 models, but in theory, it should use the previous logic, as it 'thinks' it has a context of 32k. If GGUF had metadata for "sliding window" then it would be easy.
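A sketch of that detection heuristic as described (the function and parameter names are illustrative; the 435 / 10000 values come straight from the comment above):

```python
def looks_like_solar(total_tensor_count: int, rope_freq_base_train: float) -> bool:
    # Heuristic from this PR: Solar-family models expose a total of 435
    # tensors with a trained rope base of 10000.
    return total_tensor_count == 435 and rope_freq_base_train == 10000.0
```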
Is this PR ready for review yet, or still in development?
LGTM, please confirm and I'll merge.
Looks good to me :)