Default for RMS epsilon #2384
Conversation
Nice!
I was messing around with my own benchmarks for this too, though I took a few shortcuts with the stuff I had on hand. My results are mostly similar to these findings, although, at least in my extremely limited tests, Llama 1 handled `1e-5` fine too. It is interesting that Llama 2 seems to explode earlier despite having been trained with a higher `rms_norm_eps`; I don't know what the significance of that is.
(Do note, this is a selected 1k-token excerpt of wikitext rather than the full corpus, which would be far too slow.)
Thanks for the comment. Yes, I can confirm that LLaMA-1 handles contexts up to 2048 tokens pretty well with `epsilon = 1e-5`.
* Add LLAMA_DEFAULT_RMS_EPS so we can change the default (#2384)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* WIP: python class to write GGUF, incomplete C API for reading
---------
Co-authored-by: Kawrakow <48489457+ikawrakow@users.noreply.github.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
PR #2374 made the `epsilon` used in the `rms_norm` operation a parameter with a default value of `1e-6` that can be changed via a command line option. This is to account for the fact that LLaMA-1 was trained with `epsilon = 1e-6`, while LLaMA-2 was trained with `epsilon = 1e-5`. Using `epsilon = 1e-6` for LLaMA-2 results in significantly higher perplexity scores (see #2373, #2352, and the graphs below), so being able to change the RMS `epsilon` via the command line is great. Until one has to run many perplexity calculations, as I'm currently doing in an attempt to improve the `Q4_K` performance for LLaMA-2. In this process I'm making a change to the `Q4_K` quantization, then running perplexity for LLaMA-1 and LLaMA-2, frequently forgetting to modify the `epsilon` from the last run, and hence wasting a lot of time waiting for the calculation to finish only to then realize that I have used the wrong `epsilon`.
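For reference, the `epsilon` in question is the small constant added under the square root in the RMS normalization, so it mainly matters when the mean squared activation is small. A minimal sketch of the operation, just to show where the parameter enters (this is not the actual ggml kernel, and the learned scale that follows the normalization is omitted):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative RMS norm over one row of activations. `eps` is the value whose
// default this PR makes configurable via LLAMA_DEFAULT_RMS_EPS.
static void rms_norm_row(const std::vector<float> & x, std::vector<float> & y, float eps) {
    double sum_sq = 0.0;
    for (float v : x) {
        sum_sq += (double) v * v;
    }
    const float scale = 1.0f / std::sqrt((float)(sum_sq / x.size()) + eps);
    y.resize(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        y[i] = x[i] * scale;
    }
}
```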
So, I decided to see if there is an `epsilon` that works equally well for both LLaMA models. It turns out `epsilon = 5e-6` is this magic value. If anything, `epsilon = 5e-6` slightly improves perplexity scores for LLaMA-2. For LLaMA-1, perplexity scores are the same as with `epsilon = 1e-6` within 1.5× the maximum training context (so, up to and including 3072 tokens), and are only very slightly higher for context lengths of 4096 and beyond. This can be seen in the graphs below.

Given this finding, this PR adds a macro `LLAMA_DEFAULT_RMS_EPS`, set to `5e-6f`. All `rms_norm` default values are set via this macro, which can be changed at build time (`make clean && LLAMA_DEFAULT_RMS_EPS=YourPreferredChoice make`) instead of being hard-coded to `1e-6`. One can of course still change `epsilon` via the command line.
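The build-time override presumably follows the usual guarded-define pattern: the macro only supplies a value if one was not already passed to the compiler. A short sketch of that pattern, under the assumption that the Makefile forwards `LLAMA_DEFAULT_RMS_EPS` as a `-D` define; the struct below is hypothetical, and only the name `LLAMA_DEFAULT_RMS_EPS` and the value `5e-6f` come from this PR:

```cpp
// Fallback default; `make clean && LLAMA_DEFAULT_RMS_EPS=1e-5f make` would define
// the symbol on the compiler command line and skip this branch.
#ifndef LLAMA_DEFAULT_RMS_EPS
#define LLAMA_DEFAULT_RMS_EPS 5e-6f
#endif

// Hypothetical parameter struct: the macro seeds the default, and the existing
// command-line option can still overwrite the value at runtime.
struct example_hparams {
    float rms_norm_eps = LLAMA_DEFAULT_RMS_EPS;
};
```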
Figures show perplexity as a function of context size for the 7B and 13B LLaMA models. The black line+circles represent LLaMA-1 results with `epsilon = 1e-6`. The orange squares depict LLaMA-1 with `epsilon = 5e-6` (the proposed default value for both models). Red line+circles are for LLaMA-2 with `epsilon = 1e-6`; they were computed before we had realized the `rms_norm` `epsilon` issue and are included in the graph to illustrate the magnitude of the perplexity loss due to using the wrong `epsilon` value. The blue line/circles show LLaMA-2 results for `epsilon = 1e-5`, and the magenta ones are for LLaMA-2 with `epsilon = 5e-6`. The calculations beyond the maximum training context were run with the base RoPE frequency selected to minimize the perplexity score.