Fix cublas NaNs in Falcon #2765

Closed

klosax wants to merge 3 commits

Conversation

@klosax (Contributor) commented Aug 24, 2023

To better avoid NaNs, this changes the single-precision tanhf in the GELU computation to double-precision tanh.
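For reference, a minimal sketch of the change, written around the tanh-based GELU approximation that ggml uses (the constant and function names below are simplified, not the exact ggml.c identifiers):

```c
#include <math.h>

#define GELU_COEF_A    0.044715f
#define SQRT_2_OVER_PI 0.79788456080286535588f

// Before: the inner hyperbolic tangent is evaluated in single precision.
static float gelu_f32_before(float x) {
    return 0.5f*x*(1.0f + tanhf(SQRT_2_OVER_PI*x*(1.0f + GELU_COEF_A*x*x)));
}

// After: tanhf is replaced by double-precision tanh; the float argument is
// promoted to double and the result is converted back to float.
static float gelu_f32_after(float x) {
    return 0.5f*x*(1.0f + (float) tanh(SQRT_2_OVER_PI*x*(1.0f + GELU_COEF_A*x*x)));
}
```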

@klosax (Contributor Author) commented Aug 24, 2023

Don't know how to solve this. Any suggestions?

Without blas: all seems to work fine.

With cublas:
- master: the F32 output tensor works fine, but not Q4_0
- this PR: the Q4_0 output tensor works fine, but not F32

(All other tensors are Q4_0 in both cases.)

@klosax (Contributor Author) commented Aug 25, 2023

These changes do not avoid NaNs completely when using cublas, but they decrease the probability of getting them.

klosax requested a review from ggerganov on August 25, 2023 09:38
@ggerganov (Owner) commented:

We should understand fully where the NaNs come from.
If you keep //#define GGML_GELU_FP16 as you have proposed and also apply @ikawrakow's suggestions, do you observe NaNs in any case?
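(For context, a simplified sketch of what that define controls, assuming the table-lookup pattern ggml uses; the `//` in front of the define means it is commented out, i.e. the lookup path is disabled. Details are abbreviated from ggml.c and the helper names here are illustrative:)

```c
#include <math.h>
#include <stdint.h>
#include "ggml.h"   // ggml_fp16_t, ggml_fp16_to_fp32, ggml_fp32_to_fp16

// Reference FP32 GELU (the tanh approximation).
static float gelu_ref(float x) {
    return 0.5f*x*(1.0f + tanhf(0.79788456080286535588f*x*(1.0f + 0.044715f*x*x)));
}

// One precomputed entry per fp16 bit pattern (~128 KiB).
static ggml_fp16_t table_gelu_f16[1 << 16];

static void init_gelu_table(void) {
    for (uint32_t i = 0; i < (1u << 16); ++i) {
        table_gelu_f16[i] = ggml_fp32_to_fp16(gelu_ref(ggml_fp16_to_fp32((ggml_fp16_t) i)));
    }
}

// Path used when GGML_GELU_FP16 is defined: the input is rounded to fp16
// before the lookup, so the result is only fp16-accurate. With the define
// commented out, the FP32 formula is evaluated directly for every element.
static float gelu_lookup(float x) {
    return ggml_fp16_to_fp32(table_gelu_f16[ggml_fp32_to_fp16(x)]);
}
```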

@klosax (Contributor Author) commented Aug 25, 2023

The NaNs come from the input to the gelu function. I tried limiting the input to the suggested range -10 to 10 and even narrower, but there were still NaNs. My guess is that precision errors are compounding somewhere else; it is hard to debug. The original Falcon is BF16, and when converted to F32 all outliers are preserved, but they get cut off if converted to F16. If converted to F16 instead of F32 prior to quantization, the probability of NaNs seems to be lower.

The NaN problem here seems related to cublas. Without blas it seems to work.
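(A minimal sketch of the clamping experiment mentioned above, using the [-10, 10] range from the review suggestion; the function name is illustrative:)

```c
#include <math.h>

#define GELU_COEF_A    0.044715f
#define SQRT_2_OVER_PI 0.79788456080286535588f

// Saturate the input to [-10, 10] before the tanh-based GELU approximation.
// Note that GELU(x) is approximately x for large positive x, so clamping the
// whole expression also caps those outputs at roughly 10.
static float gelu_f32_clamped(float x) {
    const float xc = x > 10.0f ? 10.0f : (x < -10.0f ? -10.0f : x);
    return 0.5f*xc*(1.0f + tanhf(SQRT_2_OVER_PI*xc*(1.0f + GELU_COEF_A*xc*xc)));
}
```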

@klosax (Contributor Author) commented Aug 25, 2023

The only difference between the runs is the LLAMA_CUBLAS compile parameter.

Falcon-7b
HF model converted to F32
All 2D tensors quantized to q4_0

HellaSwag output:

cublas with mmq:

309	72.49190939
310	72.25806452
311	72.34726688
312	72.43589744
NAN: task 313 ending 0 logprob nan

cublas without mmq:

309	73.46278317
310	73.22580645
311	73.31189711
312	73.3974359
313	73.48242812
314	73.56687898
315	73.65079365
..
400	74.5

without blas:

309	73.78640777
310	73.54838710
311	73.63344051
312	73.71794872
313	73.80191693
314	73.88535032
315	73.96825397
..
400	74.5

@klosax (Contributor Author) commented Aug 25, 2023

Tested using -nommq and it seems to work as well as without blas. Are there bugs in the custom cuda mul mat q kernels?

@slaren (Collaborator) commented Aug 25, 2023

> Are there bugs in the custom cuda mul mat q kernels?

This may be because with mmq src1 is quantized to q8_1, so there is some loss of precision. With CPU it is also quantized to q8_0 or q8_1, so the result should be similar.
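(A rough illustration of that precision loss: an 8-bit block quantization round-trip in the style of q8_1, which stores 32 values per block with one shared scale. The real q8_1 block also carries a per-block sum term, omitted here, and the helper name is illustrative:)

```c
#include <math.h>
#include <stdint.h>

#define QK8 32  // values per quantization block

// Quantize a block of 32 floats to 8-bit integers with one shared scale,
// then dequantize again. The difference y[i] - x[i] is the error introduced
// into src1 before the matrix multiplication when the MMQ path is used.
static void q8_roundtrip(const float * x, float * y) {
    float amax = 0.0f;
    for (int i = 0; i < QK8; ++i) {
        const float ax = fabsf(x[i]);
        if (ax > amax) amax = ax;
    }
    const float d  = amax / 127.0f;                  // per-block scale
    const float id = d != 0.0f ? 1.0f/d : 0.0f;
    for (int i = 0; i < QK8; ++i) {
        const int8_t q = (int8_t) roundf(x[i]*id);   // 8-bit representation
        y[i] = (float) q * d;                        // dequantized value
    }
}
```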

@ggerganov (Owner) commented:

@klosax Are you offloading tensors? We know that offloading KV tensors currently does not work with Falcon.

If you are not offloading, then I don't think -nommq should make any difference, no?

@slaren (Collaborator) commented Aug 25, 2023

> If you are not offloading, then I don't think -nommq should make any difference, no?

The CUDA backend will automatically copy the weights when processing large prompts, even if they are not offloaded, just like the original cuBLAS implementation did.

@klosax (Contributor Author) commented Aug 25, 2023

No, I do not offload any tensors.

I will do more tests comparing cublas with and without -nommq. Maybe it is just me, but when testing things out it would have helped a lot to have a CPU-only, full float precision mode as a reference, even if the source tensors are quantized.

@klosax (Contributor Author) commented Aug 25, 2023

Falcon-7b
HF model converted to F32
Output tensor F32
All other tensors quantized to q4_0

HellaSwag output:

Using cublas with mmq

118	75.42372881
119	75.63025210
120	75.00000000
121	75.20661157
122	75.40983607
123	75.60975610
124	75.00000000
125	74.40000000
126	74.60317460
NAN: task 127 ending 3 logprob nan

Using cublas without mmq (-nommq)

121	74.38016529
122	74.59016393
123	74.79674797
124	74.19354839
125	73.60000000
126	73.80952381
127	74.01574803
128	73.43750000
..
400	77.0

@klosax (Contributor Author) commented Aug 26, 2023

NaNs seem to appear with MMQ only when Q4_0 tensors are involved.
If there is no bug in the Q4_0 MMQ, then maybe disable Q4_0 MMQ when using Falcon?

klosax changed the title from "Fix gelu NaNs" to "Fix cublas NaNs in Falcon" on Aug 26, 2023
@klosax (Contributor Author) commented Aug 27, 2023

I will close this.
The NaN problems disappear when disabling CUDA MMQ using -nommq, so they are not directly related to the precision of tanh. MMQ is too inaccurate for the Falcon model, which leads to accumulating precision errors that result in NaNs. My suggestion is to disable MMQ when a Falcon model is loaded.
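(A hypothetical sketch of that suggestion; the flag, field, and architecture names below are illustrative only, not actual llama.cpp identifiers:)

```c
// Hypothetical: when a Falcon model is loaded, turn off the custom
// mul-mat-q kernels and fall back to the cuBLAS path, mirroring what
// passing -nommq does. The names here are made up for illustration.
if (model.arch == MODEL_ARCH_FALCON) {
    params.mul_mat_q = false;
}
```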

klosax closed this on Aug 27, 2023
klosax deleted the fix-gelu-nans branch on August 27, 2023 12:13