Default for RMS epsilon #2384

Merged: 1 commit merged into master from ik/llama_dfault_rms_eps on Jul 25, 2023
Conversation

ikawrakow (Contributor)

PR #2374 made the epsilon used in the rms_norm operation a parameter with a default value of 1e-6 that can be changed via a command line option. This accounts for the fact that LLaMA-1 was trained with epsilon = 1e-6, while LLaMA-2 was trained with epsilon = 1e-5. Using epsilon = 1e-6 for LLaMA-2 results in significantly worse (higher) perplexity scores (see #2373, #2352, and the graphs below), so being able to change RMS epsilon via the command line is great, until one has to run many perplexity calculations, as I'm currently doing in an attempt to improve the Q4_K performance for LLaMA-2. In this process I make a change to the Q4_K quantization, then run perplexity for LLaMA-1 and LLaMA-2, frequently forgetting to change the epsilon from the last run, and hence waste a lot of time waiting for the calculation to finish only to realize that I used the wrong epsilon.

So, I decided to see if there is an epsilon that works equally well for both LLaMA models. It turns out epsilon = 5e-6 is that magic value. If anything, epsilon = 5e-6 slightly improves perplexity scores for LLaMA-2. For LLaMA-1, perplexity scores are the same as with epsilon = 1e-6 up to 1.5× the maximum training context (so, up to and including 3072 tokens), and are only very slightly higher for context lengths of 4096 and beyond. This can be seen in the graphs below.

Given this finding, this PR adds a macro LLAMA_DEFAULT_RMS_EPS, set to 5e-6f. All rms_norm default values are set via this macro, which can be changed at build time (make clean && LLAMA_DEFAULT_RMS_EPS=YourPreferredChoice make) instead of being hard-coded to 1e-6. One can of course still change epsilon via the command line.
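As a minimal sketch of the pattern (the macro name and the 5e-6f value are from this PR; the struct and field names below are placeholders for illustration, not the actual llama.cpp definitions):

```cpp
// Build-time default for the RMS norm epsilon. Overridable at compile time,
// e.g. make clean && LLAMA_DEFAULT_RMS_EPS=1e-5f make
#ifndef LLAMA_DEFAULT_RMS_EPS
#define LLAMA_DEFAULT_RMS_EPS 5e-6f
#endif

// Placeholder struct: shows how the macro seeds the default, which can then
// still be overridden at runtime via the command line option.
struct example_model_params {
    float rms_norm_eps = LLAMA_DEFAULT_RMS_EPS;
};
```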

[Figure: ppl_vs_ctx_7B]

[Figure: ppl_vs_ctx_13B]

The figures show perplexity as a function of context size for the 7B and 13B LLaMA models. The black line+circles represent LLaMA-1 results with epsilon = 1e-6. The orange squares depict LLaMA-1 with epsilon = 5e-6 (the proposed default value for both models). The red line+circles are for LLaMA-2 with epsilon = 1e-6; they were computed before we had realized the rms_norm epsilon issue and are included in the graph to illustrate the magnitude of the perplexity loss due to using the wrong epsilon value. The blue line+circles show LLaMA-2 results for epsilon = 1e-5, and the magenta ones are for LLaMA-2 with epsilon = 5e-6. The calculations beyond the maximum training context were run with the base RoPE frequency selected to minimize the perplexity score.

ikawrakow requested a review from slaren on July 25, 2023 at 12:42
slaren (Member) left a comment

Nice!

LostRuins (Collaborator) commented Jul 25, 2023

I was messing around with my own benchmarks for this too, though I took a few shortcuts with the stuff I had on hand.
Here are perplexities for two very different models, but the thing to note is the relative perplexity values within the same model.

These results are mostly consistent with your findings, although, at least from my extremely limited tests, Llama 1 handled 1e-5 fine too.

It is interesting to note that Llama 2 seems to explode at a smaller epsilon despite having been trained with a higher rms_norm_eps. I don't know what the significance of that is.

Perplexity Comparison (1k tokens excerpt of wikitext, text is identical for all runs)
==================================================
Llama 1, Airoboros 7B,  rms_eps=1e-6		5.7252
Llama 1, Airoboros 7B,  rms_eps=2e-6		5.7330
Llama 1, Airoboros 7B,  rms_eps=5e-6		5.7305
Llama 1, Airoboros 7B,  rms_eps=1e-5		5.7119
Llama 1, Airoboros 7B,  rms_eps=2e-5		5.7840
Llama 1, Airoboros 7B,  rms_eps=5e-5		5.8257
Llama 1, Airoboros 7B,  rms_eps=1e-4		5.8801
Llama 1, Airoboros 7B,  rms_eps=2e-4		6.1153
Llama 1, Airoboros 7B,  rms_eps=5e-4		6.7065
Llama 1, Airoboros 7B,  rms_eps=1e-3		16.6464
Llama 1, Airoboros 7B,  rms_eps=2e-3		4659.6008
Llama 1, Airoboros 7B,  rms_eps=5e-3		4385.4315

Llama 2, Base Model 7B, rms_eps=1e-6		3.5891
Llama 2, Base Model 7B, rms_eps=2e-6		3.5718
Llama 2, Base Model 7B, rms_eps=5e-6		3.4989
Llama 2, Base Model 7B, rms_eps=1e-5		3.5022
Llama 2, Base Model 7B, rms_eps=2e-5		3.7621
Llama 2, Base Model 7B, rms_eps=5e-5		3.6156
Llama 2, Base Model 7B, rms_eps=1e-4		3.8805
Llama 2, Base Model 7B, rms_eps=2e-4		4.1537
Llama 2, Base Model 7B, rms_eps=5e-4		336.7881
Llama 2, Base Model 7B, rms_eps=1e-3		772.1516
Llama 2, Base Model 7B, rms_eps=2e-3		14316.2521
Llama 2, Base Model 7B, rms_eps=5e-3		12873.5310

(Do note, this is a selected 1k-token excerpt of wikitext rather than the full corpus, which would be far too slow.)
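For reference, here is a minimal sketch of an RMS norm computation (the generic formula, not the actual ggml kernel) showing where epsilon enters: it guards the division when the mean square of the activations is tiny, and a value that is large relative to that mean square distorts the normalization itself, which matches the blow-up at large epsilon in the listing above.

```cpp
#include <cmath>
#include <cstddef>

// Reference RMS normalization: y = x / sqrt(mean(x^2) + eps).
// The learned per-channel weight multiply is applied as a separate step.
void rms_norm(const float * x, float * y, std::size_t n, float eps) {
    float sum_sq = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum_sq += x[i] * x[i];
    }
    const float scale = 1.0f / std::sqrt(sum_sq / n + eps);
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = x[i] * scale;
    }
}
```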

ikawrakow (Contributor, Author)

@LostRuins

Thanks for the comment. Yes, I can confirm that LLaMA-1 handles contexts up to 2048 tokens pretty well with epsilon = 1e-5. But when we go beyond 2048 tokens we start seeing larger differences. E.g., for the full wikitext, at 4096 tokens we have ppl = 5.4028 for epsilon = 1e-6, ppl = 5.4179 for epsilon = 5e-6, and ppl = 5.4354 for epsilon = 1e-5. At 8192 tokens we get ppl = 6.3578 for epsilon = 1e-6, ppl = 6.3968 for epsilon = 5e-6, and ppl = 6.4459 for epsilon = 1e-5. I.e., compared to using epsilon = 1e-5, making epsilon = 5e-6 the default roughly halves LLaMA-1's perplexity gap to the 1e-6 baseline and improves LLaMA-2, so it is a win-win.
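Spelling out the arithmetic behind "roughly halves the gap", using the full-wikitext numbers above (perplexity gap relative to the epsilon = 1e-6 baseline for LLaMA-1):

At 4096 tokens:  eps = 5e-6 gives 5.4179 - 5.4028 = 0.0151;  eps = 1e-5 gives 5.4354 - 5.4028 = 0.0326
At 8192 tokens:  eps = 5e-6 gives 6.3968 - 6.3578 = 0.0390;  eps = 1e-5 gives 6.4459 - 6.3578 = 0.0881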

ikawrakow merged commit eb542d3 into master on Jul 25, 2023
ikawrakow deleted the ik/llama_dfault_rms_eps branch on July 25, 2023 at 15:35
oobabooga added a commit to oobabooga/text-generation-webui that referenced this pull request Jul 25, 2023
LostRuins referenced this pull request in LostRuins/koboldcpp Jul 26, 2023
ggerganov pushed a commit that referenced this pull request Jul 26, 2023
* Add LLAMA_DEFAULT_RMS_EPS so we can change the default (#2384)

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

* WIP: python class to write GGUF, incomplete C API for reading

---------

Co-authored-by: Kawrakow <48489457+ikawrakow@users.noreply.github.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
@cebtenzzre cebtenzzre mentioned this pull request Nov 6, 2023