
Add NTK-Aware interpolation "by parts" correction #1

Merged
2 commits, merged Jul 9, 2023

Conversation

bloc97 (Collaborator) commented Jul 7, 2023

This PR adds the new and improved "by parts" correction to the NTK-aware interpolation method.

This corrected method improves on the previous method in four ways:

  1. Decreases PPL at all context lengths on non-finetuned models compared to the previous NTK-Aware method, especially at higher context sizes, since the equivalent alpha value can be set much lower for the same context size.
  2. Removes the alpha parameter, which did not accurately predict the effective context length and varied across models. The method now uses the same scale parameter as linear interpolation, which is much more intuitive and less prone to mistakes/misuse. (This was made possible by fixing the alpha scale "drift" found in all LLaMA models.)
  3. Fixes the extrapolation regime that was breaking many fine-tunes when alpha was set to a non-optimal value. Fine-tuning should be much easier, and performance should in theory improve significantly, since there is no longer a need to search for an optimal alpha.
  4. This method generalizes extrapolation, NTK-Aware interpolation, and linear interpolation. For example, setting ntk_factor and extrapolation_factor to 0 yields results identical to linear interpolation (a minimal sketch illustrating this follows the list).
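
For reference, here is a minimal PyTorch sketch of the per-dimension "by parts" blending idea. The names (`find_correction_dim`, `linear_ramp`, `by_parts_inv_freq`) and the rotation-count cutoffs are illustrative assumptions rather than this PR's exact code, and the ntk_factor blending branch is omitted for brevity: high-frequency RoPE dimensions (short wavelengths that rotate many times within the original context) are extrapolated as-is, low-frequency dimensions are linearly interpolated, and a linear ramp blends the two regimes.

```python
import math
import torch

def find_correction_dim(num_rotations, dim, base=10000.0,
                        original_max_position_embeddings=2048):
    # Invert the RoPE frequency formula theta_i = base^(-2i/dim): return the
    # (fractional) pair index whose frequency completes `num_rotations` full
    # periods over the original pretrained context length.
    return (dim * math.log(original_max_position_embeddings /
                           (num_rotations * 2 * math.pi))) / (2 * math.log(base))

def linear_ramp(low, high, n):
    # Mask over pair indices: 0 below `low`, 1 above `high`, linear in between.
    if low == high:
        high += 1e-3  # avoid division by zero
    ramp = (torch.arange(n, dtype=torch.float32) - low) / (high - low)
    return torch.clamp(ramp, 0.0, 1.0)

def by_parts_inv_freq(dim, base=10000.0, scale=1.0, extrapolation_factor=1.0,
                      original_max_position_embeddings=2048):
    # Pure extrapolation (plain RoPE) and pure linear interpolation frequencies.
    inv_freq_extrapolation = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    inv_freq_interpolation = inv_freq_extrapolation / scale

    # Dimensions that rotate many times within the original context are safe
    # to extrapolate; dimensions completing less than roughly one rotation
    # must be interpolated. The cutoffs (32 and 1 rotations) are illustrative.
    low = max(math.floor(find_correction_dim(
        32, dim, base, original_max_position_embeddings)), 0)
    high = min(math.ceil(find_correction_dim(
        1, dim, base, original_max_position_embeddings)), dim // 2 - 1)

    # Mask is 1 on high-frequency dimensions (extrapolate), 0 on low-frequency
    # dimensions (interpolate), with a linear ramp in between.
    extrapolation_mask = (1 - linear_ramp(low, high, dim // 2)) * extrapolation_factor
    return (inv_freq_interpolation * (1 - extrapolation_mask)
            + inv_freq_extrapolation * extrapolation_mask)
```

Note that with extrapolation_factor=0 the mask vanishes and the result collapses to plain linear interpolation, which is the equivalence claimed in point 4 above.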

The scale parameter should be used the same way as in linear interpolation (e.g. scale=2 extends a 2048 base context to 4096).
extrapolation_factor and ntk_factor are used for validation purposes and should not be changed unless necessary.
Edit: max_position_embeddings was previously assumed to be the original pretrained model context size (it had to be left at 2048 for LLaMA models, and changing it would break the code). This has been fixed by adding an original_max_position_embeddings parameter to avoid any confusion.
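
A usage sketch with the illustrative by_parts_inv_freq above (again, assumed names, not necessarily this PR's exact API):

```python
import torch

# scale behaves like linear interpolation's scale: scale=2 extends a 2048
# base context to 4096. original_max_position_embeddings stays at the
# pretrained size (2048 for LLaMA).
inv_freq = by_parts_inv_freq(
    dim=128,                                # per-head dimension of LLaMA-7B
    scale=2.0,                              # 2048 base ctx extended to 4096
    original_max_position_embeddings=2048,  # leave at the pretrained size
)

# Sanity check of the generalization claim: extrapolation_factor=0 recovers
# plain linear interpolation (all RoPE frequencies divided by `scale`).
plain_rope = 1.0 / (10000.0 ** (torch.arange(0, 128, 2).float() / 128))
linear_only = by_parts_inv_freq(dim=128, scale=2.0, extrapolation_factor=0.0)
assert torch.allclose(linear_only, plain_rope / 2.0)
```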

[Comparison graph: the new corrected NTK-Aware method versus the previous non-corrected NTK-Aware method. Note that the new scale factor is still labeled alpha in this graph.]

Now all that is left is to validate this by fine-tuning!
