`q4_matmul_cuda` kernel does not yield reproducible results #153
Comments:
The q4 matmul kernel isn't strictly deterministic, due to the non-associativity of floating-point addition and the fact that CUDA provides no guarantees about the order in which blocks in a grid are processed. It's essentially an artifact of relying on atomicAdd to accumulate partial results. Ways to mitigate it would be either switching to a reduction method (as used by cuBLAS in the reconstruction version) or computing an intermediate result in FP32 before downcasting to FP16. Both of those methods add VRAM and compute overhead, though, so it wouldn't make sense without first establishing that a difference on the order of 0.04% actually matters. I would question the use of quantization at all for applications where it does, given that there is a significantly greater loss going from FP16 to GPTQ in the first place.
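As an illustration of the non-associativity point above, here is a minimal PyTorch sketch using plain tensor ops standing in for the kernel (`fp16_running_sum` is an invented helper): summing the same FP16 values in a different order can change the result, while accumulating in FP32 and downcasting once at the end is stable at FP16 output precision.

```python
import torch

torch.manual_seed(0)
x = torch.rand(2048, dtype=torch.float16)
perm = torch.randperm(x.numel())

def fp16_running_sum(t):
    # Sequential accumulation that never leaves half precision,
    # mimicking an accumulator kept in FP16.
    acc = torch.zeros((), dtype=torch.float16)
    for v in t:
        acc = acc + v
    return acc

# Same values, different order: FP16 rounding typically makes the sums disagree.
print(fp16_running_sum(x).item(), fp16_running_sum(x[perm]).item())

# Accumulate in FP32 and downcast once at the end: both orders agree
# at FP16 output precision.
print(x.float().sum().half().item(), x[perm].float().sum().half().item())
```

This mirrors the trade-off described above: an FP32 accumulator (or a tree reduction) removes the order sensitivity, at the cost of extra memory and compute.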
Thanks, indeed it looks to be the atomicAdd: https://forums.developer.nvidia.com/t/get-different-results-for-every-running-with-atomicadd/229649/2

I haven't seen a case where such a small difference matters; I just caught it in my CI and wondered why my logits differed slightly. Thanks!
Original issue:

Hi,

I see a slight deviation in the output of `q4_matmul_cuda` between different calls with the same input. Is it expected? If so, why?

The absolute deviation is on the order of 0.04%, and from what I've seen it does not influence the generated output; only the logits differ.
The issue does not happen when calling `q4_matmul_recons_cuda` instead (just change `inp = torch.rand(1, 1, hidden_size, dtype=torch.float16).to(device)` to `inp = torch.rand(1, 9, hidden_size, dtype=torch.float16).to(device)` in the reproduction below).

Related: #73
Reproduction: download https://huggingface.co/TheBloke/WizardLM-7B-uncensored-GPTQ to a local repository, then run the same random FP16 input through `q4_matmul_cuda` twice and compare the outputs (a sketch of this kind of check is given below).

Result: the two outputs differ slightly, with an absolute deviation on the order of 0.04%.
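A minimal sketch of the kind of check involved, assuming a `forward` callable that wraps the quantized matmul; `check_determinism` is an illustrative helper, not the original reproduction script.

```python
import torch

def check_determinism(forward, hidden_size, device="cuda", trials=5):
    # Feed the identical FP16 input through `forward` several times and
    # report the largest deviation between runs.
    torch.manual_seed(0)
    inp = torch.rand(1, 1, hidden_size, dtype=torch.float16).to(device)
    ref = forward(inp).float()
    max_abs, max_rel = 0.0, 0.0
    for _ in range(trials):
        out = forward(inp).float()
        diff = (out - ref).abs()
        max_abs = max(max_abs, diff.max().item())
        max_rel = max(max_rel, (diff / ref.abs().clamp_min(1e-6)).max().item())
    print(f"max abs deviation: {max_abs:.6g}  max rel deviation: {max_rel:.4%}")

# Stand-in: a plain FP16 linear layer (an ordinary GEMM, typically
# reproducible run-to-run, so the deviation should print as zero).
# Routing the same input through q4_matmul_cuda instead is what exposes
# the ~0.04% difference described above.
if torch.cuda.is_available():
    lin = torch.nn.Linear(4096, 4096, bias=False).half().cuda()
    check_determinism(lin, hidden_size=4096)
```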
Edit: could it be because of a different `atomicAdd` order?
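To make the atomicAdd hypothesis concrete, here is a small GPU-only sketch using a different operation as an analogy, not the exllama kernel itself: `torch.Tensor.index_add_` on CUDA also accumulates with atomic adds, so repeating the same scatter-style sum can disagree in the low-order bits from run to run.

```python
import torch

if torch.cuda.is_available():
    torch.manual_seed(0)
    # Many values scattered into a few output slots via atomic adds; the
    # accumulation order inside the kernel is not fixed, so repeated runs
    # can disagree slightly -- the effect suspected above.
    vals = torch.rand(1_000_000, dtype=torch.float32, device="cuda")
    idx = torch.randint(0, 16, (1_000_000,), device="cuda")

    def scatter_sum():
        out = torch.zeros(16, dtype=torch.float32, device="cuda")
        out.index_add_(0, idx, vals)
        return out

    a, b = scatter_sum(), scatter_sum()
    print("bitwise identical:", torch.equal(a, b),
          "max abs diff:", (a - b).abs().max().item())
```

Any run-to-run differences here come purely from the order in which the atomic additions land, which is the same mechanism the comments above point to for `q4_matmul_cuda`.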