Quantitative measurement of model perplexity for different models and model quantization modes #129
Comments
With quantization the result is also bad:
|
You might not be comparing apples to apples. e.g. are the |
I'm using the default settings, so for the Python code it is:
And for llama.cpp:
So I think only the repeat penalty and |
If I disable the repeat penalty (I assume
(These results are from the quantized 7B model with seed 0 and default parameters except for the repeat penalty) |
It does then look like |
I haven't tested the Python implementation extensively, because Facebook's implementation takes very long to run on my CPU. But I generally feel that running 7B and even 13B with llama.cpp gives results that are below the quality that Facebook has claimed. |
Try the following parameters, gives me good quality output:
Also repeat_penalty = 1.0 means disable. Maybe it's not named as it should be 😇 |
If 1 means disable, what's the point of values higher than 1? Also, it's good to let it repeat itself a little, since sometimes that makes sense in conversation, but a tighter setting lets it break loops before they begin. |
Still gives me a wrong result with the quantized model:
With the fp16 model it is also wrong:
I think the problem is more fundamental than just a change of the parameters. |
It may be simply a case of the project management triangle, i.e. choose any two of:
|
That might be so, but I don't see an obvious reason why the quality would be lower. Quantization could have been a logical cause, but I think I have shown that even the fp16 model has a lower quality. |
If it's simply a straight-up C++ implementation then it should be the same, but an install step on the GitHub page states it must be quantized, which means even if you are running it in fp16 it has still been reduced in precision to run better, which naturally means its outputs will differ slightly. You wouldn't expect a mile-long road at 18.2 degrees to end up at the same place as one rebuilt at 18.0 degrees, right? As you just said while I was typing this, quantization made its brain just that little more crispy, and that clearly affects it slightly. That's probably not solvable. |
I don't think that step is required. The model runs fine without the quantization step. And the readme also claims llama.cpp has "Mixed F16/F32 precision". Edit: there's an example of running without quantization here: #2 (comment) |
@Urammar higher than 1 starts to penalize the predicted next token if it occurred in previous N tokens. It will multiply the likelihood by 1/penalty. |
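A minimal sketch of the penalty as described in the comment above, assuming scores are non-negative likelihoods held in a plain list. The function name `apply_repeat_penalty` and its arguments are hypothetical and are not taken from the llama.cpp source, which may differ in detail.

```python
def apply_repeat_penalty(scores, recent_tokens, penalty=1.0):
    """Return a copy of `scores` where every token id that appears in
    `recent_tokens` (the previous N tokens) is multiplied by 1/penalty.

    penalty == 1.0 leaves the scores unchanged (i.e. "disabled").
    penalty > 1.0 makes recently seen tokens less likely.
    """
    if penalty <= 0:
        raise ValueError("penalty must be positive")
    penalized = list(scores)
    for token_id in set(recent_tokens):
        # Ignore ids that fall outside the vocabulary.
        if 0 <= token_id < len(penalized):
            penalized[token_id] = penalized[token_id] / penalty
    return penalized
```

With `penalty=1.0` the input comes back unchanged, which matches the "1.0 means disable" remark earlier in the thread.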
Try like so:
EDIT: Love how my brain failed at interpreting this. Let me try a larger model. |
For me it consistently answers incorrectly every time
|
Haha, that question about ducks is also interesting. Using this prompt:
The Python implementation outputs a plausible answer:
But llama.cpp 7B FP16 outputs garbage:
|
I get consistently non-garbage output. Can you try using the settings I had above? I am on a different branch. Wonder if that has anything to do with it. |
I mean it even explains itself
I'm not sure if this is the best way to objectively tell the quality of the output :) EDIT cmd line params:
|
Actually, after playing around a bit with the quantized model, I now believe that the problem is only in running the FP16 model. The quantized model seems to work much better for me. |
Thanks for sharing your parameters guys, I definitely get better results than with the default ones. I ran the same prompt with 5 different models: 7B/Q4, 7B/F16, 13B/Q4, 13B/F16, 30B/Q4:
See the results below. I ran each one multiple times, and the results of the individual runs with one model are comparable in quality. I don't see a major quality difference between Q4 and F16. Interestingly, 13B gave me the weirdest results. It also always attempted to return some kind of LaTeX code, which I could also observe with other prompts.
Results
7B / Q4
7B / F16
13B / Q4
13B / F16
30B / Q4
|
@sburnicki I think it is better to include the Also, in hindsight I think I should have worded it slightly differently:
|
One bug I found is #173 llama.cpp seems to use a different norm method. |
So I think the quality of the generations is difficult to evaluate. We need a more quantitative metric. |
This blog post did a quantitative measurement of quality to compare different quantization methods, though I don't know how well it corresponds to subjective quality. Code is here, though I'm not sure whether it includes the quality measurement code. There is also this project, which does include measurement code. I expect one of these could serve as a starting point for building an objective metric for quality. |
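For reference, perplexity (the metric in this issue's title) is the exponential of the average negative log-likelihood per token. A minimal sketch, assuming the per-token natural-log probabilities of the evaluation text have already been obtained from the model; the function name is illustrative only:

```python
import math


def perplexity(token_logprobs):
    """Perplexity from natural-log probabilities of each observed token.

    perplexity = exp(-(1/N) * sum(log p(token_i | previous tokens)))
    Lower is better; a model that always assigns probability 1/4 to the
    correct token has a perplexity of 4.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)
```

Because it is computed over a fixed text, the same number can be compared across implementations and quantization modes without judging individual generations by eye.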
Here's more evidence for the low quality. Prompt:
(The last line has two spaces)
Python LLaMA 7B
seed 1:
seed 2:
seed 3:
llama.cpp 7B F16
seed 1:
seed 2:
(This one is pretty good)
seed 3:
llama.cpp 7B Q4_0
seed 1:
seed 2:
(This one would be correct if not for that seed 3:
results
|
Can we rule out the tokenizer? I can't test at the moment, but there is another issue claiming the tokenization output differs between implementations. |
I get these.
So only 2/5 are correct programs. I've also done the first 20 seeds of which it got 10/20 (50%) correct. I've just reran this prompt on the Python implementation and it got 14/20 seeds (70%) correct. |
@ggerganov I actually get better results with the default parameters:
That produces correct fibonacci programs for 13/20 seeds. For good measure the Q4_0 with default parameters and |
@noughtmare did you try --repeat_penalty 1.176 (~ 1.0/0.85)? I have seen that value multiple times and have been using it myself, and it seems to be the best for conversations. |
@Green-Sky if I use I suspect it doesn't want to do indentation because that is repeated on every line and it doesn't want to do recursive calls because that is also a repetition of the function name. Here's the full output:
|
@noughtmare very interesting. I suspect the raw models are just very bad at following a structure in zero-shot tasks like this. |
I am testing this as well. I have the following invocation. I built the Q4_1 files out of interest because they quantize the matrices to all 16 different values whereas the Q4_0 only uses 15 possible quantizations, so I figured it might work better. I think the key difference is not that _1 has more values but that Q4_1 has no representation for zero, whereas Q4_0's 4-bit value 8 encodes a 0 in the weight matrix. This sort of thing obviously has massive implications for the model. As an example:

$ ./main -s 400 --repeat_penalty 1.0 -m models/7B/ggml-model-q4_1.bin --top_p 0.7 --top_k 40 --temp 0.1 -p "Here's a Python program that computes fibonacci numbers:
def fib(n):
"
Here's a Python program that computes fibonacci numbers:
def fib(n):
  if n == 0:
    return 0
  if n == 1:
    return 1
  return fib(n-1) + fib(n-2)
print(fib(100000))

I'd say that Q4_1 always writes this answer if temperature is set low, regardless of the seed. I tried a few dozen and couldn't get anything different to come out. At most the example at print(fib(...)) appears to vary sometimes, as does the "discussion" that follows the example print() invocation. Interestingly, Q4_0 prefers this version, which won't have fib(0) = 0:

$ ./main -s 400 --repeat_penalty 1.0 -m models/7B/ggml-model-q4_0.bin --top_p 0.7 --top_k 1000 --temp 0.1 -p "Here's a Python program that computes fibonacci numbers:
def fib(n):
"
Here's a Python program that computes fibonacci numbers:
def fib(n):
  if n == 0 or n == 1:
    return 1
  else:
    return fib(n-1) + fib(n-2)

As a general observation, I would say that both top_p=0.9 and high temperature tend to take the model's output off the rails, and it usually begins to prattle something completely nonsensical. That being said, my rather strong impression is that Q4_1 does produce higher quality output than Q4_0, though this is not proven by any kind of actual perplexity analysis. It's just my observation from using the same arguments and asking it to do various creative exercises. Q4_0 often seems to ignore the instruction and writes something else, whereas Q4_1 can be on topic. Still, this sort of claim should be more rigorously proven.

As to the picture question, seed 1, seed 2, seed 3, seed 4 results all say:
My thinking is that this question is known to the model and using a low temperature allows predicting the answer correctly every time. It makes no difference whether this is Q4_0 or Q4_1 answering the question. |
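The Q4_0 / Q4_1 difference described in this comment can be illustrated with a rough sketch. It assumes Q4_0 stores one scale per block with codes centred on 8, and Q4_1 stores a scale plus a minimum per block. The function names, and the use of a plain list as a "block", are illustrative assumptions; the real ggml block layout and rounding are not reproduced here.

```python
def quantize_q4_0(block):
    """Symmetric 4-bit sketch: one scale d, codes 1..15, code 8 <-> 0.0."""
    amax = max(abs(x) for x in block)
    d = amax / 7 if amax > 0 else 1.0
    # round(x / d) lies in -7..7, so only 15 of the 16 codes are used.
    # Note: Python's round() rounds halves to even, which may differ
    # from the rounding used in C code.
    codes = [max(1, min(15, int(round(x / d)) + 8)) for x in block]
    return d, codes


def dequantize_q4_0(d, codes):
    return [(c - 8) * d for c in codes]


def quantize_q4_1(block):
    """Asymmetric 4-bit sketch: scale d plus minimum lo, codes 0..15."""
    lo, hi = min(block), max(block)
    d = (hi - lo) / 15 if hi > lo else 1.0
    codes = [max(0, min(15, int(round((x - lo) / d)))) for x in block]
    return d, lo, codes


def dequantize_q4_1(d, lo, codes):
    return [lo + c * d for c in codes]
```

In this sketch a weight of exactly 0.0 always survives a Q4_0 round trip (code 8), whereas under Q4_1 it lands on whichever of the 16 evenly spaced levels between the block minimum and maximum is nearest, which is generally not exactly zero.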
There is an ongoing discussion about perplexity in #270. Did you cross-reference with the base f16 model? |
I am unable to run the f16 base model because my laptops have at most 16 GB of memory. It seems that it would take about 14 GB of usable memory to run the model. Still, your comment in #270 appears to show that perplexity is improved by Q4_1, which I think is quite evident from the output of the model, especially once the output's randomness is reduced.

I am not able to explain the higher quality of facebookresearch's version over this one, which is what started this thread. Taking a look, they also use a simple top_p sampling strategy, which can be approximated by setting repeat_penalty=1.0 and top_k to some high value like 10000. But in my experience, quality generally drops below acceptable once temperature exceeds about 0.3. This might be due to the damage done by the 4-bit quantization, as far as I can tell.

There is also some nondeterminism in the program. This is likely due to how ggml farms the computation out across threads, probably splitting the matrices by rows or similar, which results in different accumulation and rounding errors that appear to affect the results slightly. For precisely repeatable results, I think it is not enough to use the same sampling seed; the input batching and thread count must also be the same. |
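A simple top_p strategy with temperature, as mentioned above, can be sketched as follows. This is a generic nucleus-sampling illustration under the assumption that the model yields one logit per vocabulary entry; it is not code from either implementation, and the function name and arguments are hypothetical.

```python
import math
import random


def sample_top_p(logits, top_p=0.9, temperature=1.0, rng=None):
    """Sample a token id using temperature scaling followed by top-p."""
    rng = rng or random.Random()
    if temperature <= 0:
        # Treat non-positive temperature as greedy decoding.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Lower temperature sharpens the distribution (less random output).
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]
```

Passing a seeded `random.Random` makes the draw repeatable, which only covers the sampling step and not any numerical differences in the logits themselves.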
@alankila you can reduce memory using 16bit floats for memory |
Your observations and analysis correspond very well to mine. Your analysis about the determinism is correct - I am thinking about how to fix this, but probably a major change in
I just realized my intuition about temperature was wrong somehow -- was thinking low temp means more random. It's the other way around 🤦 @glinscott |
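The rounding effect behind the nondeterminism discussed here can be reproduced in isolation: floating-point addition is not associative, so splitting one accumulation into partial sums (as a different thread count might) can change the last bits of the result. A small sketch with hypothetical helper names, unrelated to the actual ggml threading code:

```python
def sequential_sum(values):
    """Accumulate left to right, as a single thread would."""
    total = 0.0
    for v in values:
        total += v
    return total


def split_sum(values, split_at):
    """Accumulate two partial sums separately, then combine them,
    mimicking work divided between two threads."""
    return sequential_sum(values[:split_at]) + sequential_sum(values[split_at:])


values = [0.1, 0.2, 0.3]
one_thread = sequential_sum(values)   # (0.1 + 0.2) + 0.3
two_threads = split_sum(values, 1)    # 0.1 + (0.2 + 0.3)
print(one_thread == two_threads)      # False: the results differ in the last bit
```

The difference is tiny for a single sum, but it can in principle flip which token ends up with the highest score when two candidates are nearly tied.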
@ggerganov to clarify my comments - #252 was a gigantic improvement in perplexity :). It went from 10.4625 to 5.9565 using 16f, which is huge. Some concrete results comparing f16 to q4_0 are in the updated PR description on #270. q4_0 seems to hurt perplexity a bit, but it's certainly not disastrous. I'm doing a q4_1 run now to compare.
|
Btw, q4_1 is quite a lot slower for me at 32 threads on 512 context window size:
vs, q4_0:
|
Just a FYI that people have experienced optimal performance by running with a number of threads equal to the number of cores, i.e. hyperthreading doesn't seem to help as with > 4 cores memory performance starts to become the performance bottleneck. Also, I have 16 cores, 128GB of RAM (just enough to run 65B at fp16) and all the latest models sitting idle under my desk, so if someone needs some quality or performance benchmarking run please point me to a release and specify the test suite you would like me to run. |
@gjmulder I must say it would be fun to have a big model with the AI assistant prompt cooperate on some kind of creative exercise. I think that the 7B model can already do quite well, and I am dying to see how the 30B model does -- and I have absolutely no way to run it myself.

I know that my way of going about this is almost painfully unscientific, and probably not what you were offering to do. However, my excuse is that this is all pretty new to me and the novelty of these language models has not worn off for me yet. In fact, I have mostly ignored my real job today in favour of chatting with an AI.

To whet your appetite -- if you are anything like me -- here is a little chat transcription that I just had with "Jane" using the current master branch version.

exec ./main -t 4 --memory_f16 -m ./models/7B/ggml-model-q4_1.bin --ctx_size 2048 --temp 0.7 --top_k 1000 --top_p 0.75 --repeat_last_n 30 --repeat_penalty 1.2 -n 2048 --color -r "Alan:" -r "Alan (" -p "$prompt"

The prompt runs up to the first 2 Jane dialog turns, like this:
I don't know what kind of results other people are getting, but I am just astonished to see this and a hundred other chats like it coming out of some 5 GB Linux process at human typing speed. Unfortunately, it is fairly obvious to me that the 7B model has to repeat a lot of what I say, and I am hoping that this is not an issue with the bigger models. I use "Jane (emotional state or actions here):" to serve as writeable memory of the model's state, if you will. My thinking is that this helps maintain coherence in the hallucinated persona. When the model opts to use these parentheses, Jane is frequently angry, in tears, laughing, excited, or smiling -- all appropriate human-like responses to what I say. As an example, if I insult her for no reason, she gets upset, then angry, and even quits the chat on me by generating the end of stream token! Unfortunately, sometimes it generates an open parenthesis on my prompt and I just close it, and the language model then repeats it. |
@glinscott might be a while - or a week - considering the below estimate with the 30B model. Perhaps we need to be very specific in the A-B tests we need answered?
|
@gjmulder yeah, if it's going to take too long, we could run up to 250 chunks or so, that's a pretty good approximation of the final perplexity so far. |
@glinscott I have not investigated yet, but running perplexity increases memory usage by more than 2x the model size. |
Would be good to test the perplexity with the GPTQ quantization and compare with the usual RTN quantization. https://github.com/ggerganov/llama.cpp/blob/master/convert-gptq-to-ggml.py |
Perplexity scores for various model sizes and quantizations are now being collected in Discussion #406. |
An interesting comparison with alpaca models is available on reddit |
It looks like the difference between f16 and 4-bit is exactly the same after 50 chunks. That would mean that if we know the perplexity of f16 (after a full run), we can know the perplexity of 4-bit after just 50 chunks. Just my 2 cents 😅 |
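If that observation holds, the shortcut amounts to simple arithmetic. A sketch under the assumption that the f16-to-4-bit gap really does stay constant after the partial run; the function name and the numbers in the usage example are placeholders, not measurements from this thread:

```python
def estimate_full_run_perplexity(f16_full, f16_partial, quant_partial):
    """Estimate the full-run perplexity of a quantized model.

    Assumes the gap measured after a partial run (e.g. 50 chunks)
    stays the same through the full run:
        quant_full ~= f16_full + (quant_partial - f16_partial)
    """
    return f16_full + (quant_partial - f16_partial)


# Placeholder values for illustration only.
estimate = estimate_full_run_perplexity(
    f16_full=6.0, f16_partial=6.5, quant_partial=6.75
)
print(estimate)
```

The estimate is only as good as the constant-gap assumption, so it would still need to be checked against at least one full quantized run.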
This issue has been moved to #406 |
llama.cpp seems to give bad results compared to Facebook's implementation.
Here's an example simple reading comprehension prompt:
LLaMA 7B with Facebook's implementation yields:
Seed 1:
Seed 2 (to show that the above is not just a fluke):
While llama.cpp without quantization (so still float16) generates (with --seed 0 -t 8):
It even has a grammatical error at the end: "one out [of] three"
As you can see the quality of 7B is higher in Facebook's implementation. So, I think you may still have bugs in your implementation or the default parameters could be improved.