Default for RMS epsilon #2384
Conversation
Nice!
I was messing around with my own benchmarks for this too, though I took a few shortcuts with the stuff I had on hand. My results are mostly similar to these findings, although, at least in my extremely limited tests, Llama 1 handled `1e-5` fine too. It is interesting that Llama 2 seems to explode earlier despite having been trained with a higher `rms_norm_eps`; I don't know what the significance of that is.
(Do note, this is a selected 1k-token excerpt of wikitext rather than the full corpus, which would be far too slow.)
Thanks for the comment. Yes, I can confirm that LLaMA-1 handles contexts up to 2048 tokens pretty well with `epsilon = 1e-5`.
* Add LLAMA_DEFAULT_RMS_EPS so we can change the default (#2384)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* WIP: python class to write GGUF, incomplete C API for reading
---------
Co-authored-by: Kawrakow <48489457+ikawrakow@users.noreply.github.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
PR #2374 made the `epsilon` used in the `rms_norm` operation a parameter with a default value of `1e-6` that can be changed via a command line option. This is to account for the fact that LLaMA-1 was trained with `epsilon = 1e-6`, while LLaMA-2 was trained with `epsilon = 1e-5`. Using `epsilon = 1e-6` for LLaMA-2 results in significantly higher perplexity scores (see #2373, #2352, and the graphs below), so being able to change the RMS `epsilon` via the command line is great. Until one has to run many perplexity calculations, as I'm currently doing in an attempt to improve the `Q4_K` performance for LLaMA-2. In this process I'm making a change to the `Q4_K` quantization, then running perplexity for LLaMA-1 and LLaMA-2, frequently forgetting to modify the `epsilon` from the last run, and hence wasting a lot of time waiting for the calculation to finish only to then realize that I have used the wrong `epsilon`.
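For reference, the `epsilon` in question is the small constant added under the square root in the RMS normalization, so it mainly matters when the mean squared activation is small. A minimal sketch of the operation, just to show where the parameter enters (this is not the actual ggml kernel, and the learned scale that follows the normalization is omitted):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative RMS norm over one row of activations. `eps` is the value whose
// default this PR makes configurable via LLAMA_DEFAULT_RMS_EPS.
static void rms_norm_row(const std::vector<float> & x, std::vector<float> & y, float eps) {
    double sum_sq = 0.0;
    for (float v : x) {
        sum_sq += (double) v * v;
    }
    const float scale = 1.0f / std::sqrt((float)(sum_sq / x.size()) + eps);
    y.resize(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        y[i] = x[i] * scale;
    }
}
```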
So, I decided to see if there is an `epsilon` that works equally well for both LLaMA models. It turns out `epsilon = 5e-6` is this magic value. If anything, `epsilon = 5e-6` slightly improves perplexity scores for LLaMA-2. For LLaMA-1, perplexity scores are the same as with `epsilon = 1e-6` within 1.5× the maximum training context (so, up to and including 3072 tokens), and are only very slightly higher for context lengths of 4096 and beyond. This can be seen in the graphs below.

Given this finding, this PR adds a macro `LLAMA_DEFAULT_RMS_EPS`, set to `5e-6f`. All `rms_norm` default values are set via this macro, which can be changed at build time (`make clean && LLAMA_DEFAULT_RMS_EPS=YourPreferredChoice make`) instead of being hard-coded to `1e-6`. One can of course still change `epsilon` via the command line.
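The build-time override presumably follows the usual guarded-define pattern: the macro only supplies a value if one was not already passed to the compiler. A short sketch of that pattern, under the assumption that the Makefile forwards `LLAMA_DEFAULT_RMS_EPS` as a `-D` define; the struct below is hypothetical, and only the name `LLAMA_DEFAULT_RMS_EPS` and the value `5e-6f` come from this PR:

```cpp
// Fallback default; `make clean && LLAMA_DEFAULT_RMS_EPS=1e-5f make` would define
// the symbol on the compiler command line and skip this branch.
#ifndef LLAMA_DEFAULT_RMS_EPS
#define LLAMA_DEFAULT_RMS_EPS 5e-6f
#endif

// Hypothetical parameter struct: the macro seeds the default, and the existing
// command-line option can still overwrite the value at runtime.
struct example_hparams {
    float rms_norm_eps = LLAMA_DEFAULT_RMS_EPS;
};
```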
Figures show perplexity as a function of context size for the 7B and 13B LLaMA models. The black line+circles represent LLaMA-1 results with `epsilon = 1e-6`. The orange squares depict LLaMA-1 with `epsilon = 5e-6` (the proposed default value for both models). Red line+circles are for LLaMA-2 with `epsilon = 1e-6`; they were computed before we had realized the `rms_norm` `epsilon` issue and are included in the graph to illustrate the magnitude of the perplexity loss due to using the wrong `epsilon` value. The blue line/circles show LLaMA-2 results for `epsilon = 1e-5`, and the magenta ones are for LLaMA-2 with `epsilon = 5e-6`. The calculations beyond the maximum training context were run with the base RoPE frequency selected to minimize the perplexity score.