Fix cublas NaNs in Falcon #2765
Conversation
Don't know how to solve this. Any suggestions?

without blas: all seems to work fine
cublas:
These changes do not avoid NaNs completely when using cublas, but they decrease the probability of getting them.
We should understand fully where the NaNs come from.
The NaNs come from the input to the gelu function. I tried limiting the input to the suggested range -10 to 10 and even lower, but there were still NaNs. My guess is that this is compounding precision errors somewhere else; it is hard to debug this. The original Falcon is BF16, and when converted to F32 all outliers will be there, but they will get cut off if converted to F16. If converted to F16 instead of F32 prior to quantization, the probability of NaNs seems to be lower. The NaN problem here seems related to cublas. Without blas it seems to work.
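For reference, a minimal sketch of the clamping experiment described above, assuming ggml's tanh-based GELU approximation (the helper name and exact constants here are illustrative, not the actual patch):

```c
#include <math.h>

// Hypothetical helper: clamp the GELU argument to [-10, 10] before the
// tanh-based approximation. Per the comment above, this experiment did
// not eliminate the NaNs.
static inline float gelu_clamped(float x) {
    if (x < -10.0f) x = -10.0f;
    if (x >  10.0f) x =  10.0f;
    const float a = 0.044715f;      // GELU cubic coefficient
    const float s = 0.7978845608f;  // sqrt(2/pi)
    return 0.5f*x*(1.0f + tanhf(s*x*(1.0f + a*x*x)));
}
```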
The only difference is using the

Falcon-7b HellaSwag output:
cublas with mmq:
cublas without mmq:
without blas:
Tested using

This may be because with mmq
@klosax Are you offloading tensors? We know that offloading KV tensors currently does not work with Falcon. If you are not offloading, then I don't think mmq makes any difference.
The CUDA backend will automatically copy the weights when processing large prompts, even if not offloaded. Like the original cuBLAS implementation did. |
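A hypothetical sketch (not the actual ggml code; the function name and the 32-token threshold are assumptions) of the kind of batch-size heuristic this refers to:

```c
#include <stdbool.h>

// Illustrative only: for large enough batches the CUDA backend performs
// the matrix multiplication on the GPU, copying the weights over as
// needed, even when the tensor itself has not been offloaded.
static bool should_use_gpu_mul_mat(int n_batch_tokens, bool tensor_offloaded) {
    const int min_batch = 32; // assumed threshold, for illustration
    return tensor_offloaded || n_batch_tokens >= min_batch;
}
```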
No, I do not offload any tensors. I will do more tests comparing cublas with and without mmq.
Falcon-7b HellaSwag output:

Using cublas with mmq
Using cublas without mmq (
NaNs seem to appear with MMQ only when Q4_0 tensors are involved.
I will close this.
To better avoid getting NaN, this changes the single-precision `tanhf` to the double-precision `tanh`.
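A minimal sketch of what such a change could look like, assuming ggml's tanh-approximation GELU (the function and constant names follow ggml's conventions but should be treated as illustrative):

```c
#include <math.h>

#define GELU_COEF_A    0.044715f
#define SQRT_2_OVER_PI 0.79788456080286535587989211986876f

// Before: single-precision tanhf throughout.
static inline float gelu_f32(float x) {
    return 0.5f*x*(1.0f + tanhf(SQRT_2_OVER_PI*x*(1.0f + GELU_COEF_A*x*x)));
}

// After: the tanh argument is widened to double and evaluated with the
// double-precision tanh, reducing the impact of accumulated rounding
// error before narrowing back to float.
static inline float gelu_f32_dbl(float x) {
    const double xd = (double) x;
    return (float)(0.5*xd*(1.0 + tanh(SQRT_2_OVER_PI*xd*(1.0 + GELU_COEF_A*xd*xd))));
}
```

Note that this trades a small amount of throughput for numerical headroom, and since the result is narrowed back to float at the end it mitigates rather than eliminates precision loss.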