Sending in tokens one at a time vs all at once gives different logits #4479
Comments
Same issue with 0 GPU layers? Does setting the batch size to 1 result in the same logits?
The issue only seems to occur when GPU layers >= 1 and n_batch >= 2. No GPU layers or n_batch = 1 makes the issue go away (results in the same logits).
There has been some discussion in the past in this regard: #3014
Generally speaking, different logits can be expected between batch size 1 and batch size 2 because they use different matrix multiplication kernels that are going to result in differences in rounding error. I am not observing significant differences in perplexity between batch size 1 and batch size 2. Furthermore, if you actually calculate the token probabilities that arise from the logits posted here, you find that they are essentially the same:
The biggest differences are on the first and last token, which are not going to be sampled anyway because they get assigned ~0.01% probability, barely more than the ~0.003% probability that each token would have on average. The most likely token goes from 99.5% to 99.4%. My interpretation of what's happening is that tokens that would make no sense anyway are going to be close to 0 in terms of activations and as such are more susceptible to noise from rounding error. Those tokens can therefore get large relative changes to their probabilities, but because they are not going to be sampled from anyway, it doesn't matter.
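A minimal sketch of that check, assuming numpy: convert each run's logits into probabilities with a softmax and compare them. The logit values below are placeholders, not the actual numbers from the issue.

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Placeholder logits standing in for the one-at-a-time and all-at-once runs.
logits_one_at_a_time = np.array([2.1, 9.8, -1.3, 0.4])
logits_all_at_once   = np.array([2.3, 9.8, -1.4, 0.4])

p1 = softmax(logits_one_at_a_time)
p2 = softmax(logits_all_at_once)

# Large relative changes on near-zero-probability tokens translate into
# tiny absolute differences in probability.
print("max abs probability difference:", np.abs(p1 - p2).max())
```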
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Expected Behavior
Evaluating the tokens one at a time:
context [2], n_past 0
context [2, 3], n_past 1
should result in the same logits as evaluating them all at once:
context [2, 3], n_past 0
(sending all tokens in at the same time)
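For intuition on why the two orders should agree, here is a toy, self-contained numpy sketch (not llama.cpp code; the single attention layer, sizes, and weights are invented): causal attention computed token by token against a growing K/V cache matches a single full-sequence pass up to floating-point error, so any discrepancy in practice comes down to rounding differences, not the math.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8   # head dimension
T = 2   # sequence length, standing in for tokens [2, 3]
Wq, Wk, Wv = [rng.standard_normal((d, d)) for _ in range(3)]
x = rng.standard_normal((T, d))   # toy token embeddings

def attn_full(x):
    """Causal attention over the whole sequence at once (n_past = 0)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    mask = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def attn_incremental(x):
    """One token at a time, appending to a K/V cache (n_past = 0, then 1)."""
    k_cache, v_cache, out = [], [], []
    for t in range(T):
        q = x[t] @ Wq
        k_cache.append(x[t] @ Wk)
        v_cache.append(x[t] @ Wv)
        K, V = np.stack(k_cache), np.stack(v_cache)
        scores = K @ q / np.sqrt(d)
        w = np.exp(scores - scores.max())
        w /= w.sum()
        out.append(w @ V)
    return np.stack(out)

# Agrees to ~1e-15 in float64; only rounding separates the two orders.
print(np.max(np.abs(attn_full(x) - attn_incremental(x))))
```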
Current Behavior
They produce different logits. They are similar, but not identical. I'm assuming this is due to the KV cache; maybe it can't be fixed because of that. Feel free to close this issue, I just thought it was worth documenting. It may be related to this issue.
Steps to Reproduce
Using llama-cpp-python (probably easy enough to do in C++ as well), just run the two evaluation orders and compare the logits they give.
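A rough sketch of such a reproduction (not the original poster's exact script), assuming the llama-cpp-python high-level Llama API (eval(), reset(), scores, n_tokens); the model path is a placeholder and attribute details may vary between versions.

```python
import numpy as np
from llama_cpp import Llama

llm = Llama(
    model_path="model.gguf",  # placeholder; any GGUF model
    n_gpu_layers=1,           # issue reportedly appears with >= 1 offloaded layer
    n_batch=2,                # and n_batch >= 2
    logits_all=True,
)

def last_logits(llm, chunks):
    """Reset the context, eval the token chunks in order, return last-token logits."""
    # Assumes reset() clears the context state; if it does not clear the KV cache
    # in your version, construct a fresh Llama instance per run instead.
    llm.reset()
    for chunk in chunks:
        llm.eval(chunk)
    return llm.scores[llm.n_tokens - 1].copy()

one_at_a_time = last_logits(llm, [[2], [3]])   # n_past 0, then n_past 1
all_at_once   = last_logits(llm, [[2, 3]])     # n_past 0

print("max abs logit difference:", np.abs(one_at_a_time - all_at_once).max())
```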
Failure Log:
I think this behavior has existed for a while, but here are my full outputs:
Full llama-cpp output log
Git log: