
ggml : testing GPU FP precision via quantized CPY #4698

Closed

ggerganov (Owner)

I wanted to find out why the Metal test-backend-ops run fails from time to time. This PR applies just the GGML_OP_CPY operation with an F32 src and a Q4_1 dst, i.e. it performs quantization.

Running this long enough will eventually generate an error:

make -j tests && while ./tests/test-backend-ops -o CPY -b Metal ; do date ; done
# example (scroll down)

  CPY(type_src=f32,type_dst=q4_1,ne=[32,1,1,1]): ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     0.03 MiB, (    1.66 / 147456.00)
d =  0.127808
m = -0.954590
d =  0.127808
m = -0.954590
OK
  CPY(type_src=f32,type_dst=q4_1,ne=[32,1,1,1]): ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     0.03 MiB, (    1.66 / 147456.00)
d =  0.126709
m = -0.922852
d =  0.126709
m = -0.922852
OK
  CPY(type_src=f32,type_dst=q4_1,ne=[32,1,1,1]): ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     0.03 MiB, (    1.66 / 147456.00)
d =  0.115479
m = -0.740723
d =  0.115479
m = -0.740723
OK
  CPY(type_src=f32,type_dst=q4_1,ne=[32,1,1,1]): ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     0.03 MiB, (    1.66 / 147456.00)
d =  0.119568                             # <------ GPU result (notice d is different here)
m = -0.834961
d =  0.119507                             # <------ CPU result (reference)
m = -0.834961
[CPY] NMSE = 0.000001 
    0 -0.834961 -0.834961, diff =  0.000000
    1 -0.834961 -0.834961, diff =  0.000000
    2 -0.476257 -0.476440, diff =  0.000183
    3  0.599854  0.599121, diff =  0.000732
    4  0.002014  0.001587, diff =  0.000427
    5  0.958557  0.957642, diff =  0.000916
    6 -0.117554 -0.117920, diff =  0.000366
    7  0.002014  0.001587, diff =  0.000427
    8  0.719421  0.718628, diff =  0.000793
    9  0.241150  0.240601, diff =  0.000549
   10  0.121582  0.121094, diff =  0.000488
   11  0.599854  0.599121, diff =  0.000732
   12  0.121582  0.121094, diff =  0.000488
   13  0.241150  0.240601, diff =  0.000549
   14  0.121582  0.121094, diff =  0.000488
   15  0.719421  0.718628, diff =  0.000793
   16 -0.356689 -0.356934, diff =  0.000244
   17  0.002014  0.001587, diff =  0.000427
   18  0.121582  0.121094, diff =  0.000488
   19  0.002014  0.001587, diff =  0.000427
   20  0.360718  0.360107, diff =  0.000610
   21 -0.117554 -0.117920, diff =  0.000366
   22  0.241150  0.240601, diff =  0.000549
   23  0.838989  0.838135, diff =  0.000854
   24  0.599854  0.599121, diff =  0.000732
   25 -0.476257 -0.476440, diff =  0.000183
   26 -0.117554 -0.117920, diff =  0.000366
   27  0.241150  0.240601, diff =  0.000549
   28 -0.595825 -0.595947, diff =  0.000122
   29  0.002014  0.001587, diff =  0.000427
   30  0.719421  0.718628, diff =  0.000793
   31 -0.237122 -0.237427, diff =  0.000305

It looks like the floating-point operation for computing d can produce different results between the CPU and the GPU:

const float d  = (max - min) / ((1 << 4) - 1);

Not sure how to fix this - ideas?
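For reference, the d and m values printed above come from quantizing one Q4_1 block of 32 floats. A minimal sketch of that routine, roughly following ggml's CPU reference implementation (the real code stores d and m as FP16 and differs in details):

#include <stdint.h>

#define QK4_1 32

// sketch: quantize one block of 32 floats to Q4_1 (4-bit values plus per-block d and m)
static void quantize_block_q4_1(const float * x, float * dp, float * mp, uint8_t qs[QK4_1/2]) {
    float min = x[0];
    float max = x[0];
    for (int i = 1; i < QK4_1; ++i) {
        if (x[i] < min) min = x[i];
        if (x[i] > max) max = x[i];
    }

    const float d  = (max - min) / ((1 << 4) - 1); // the value that differs between CPU and GPU
    const float id = d != 0.0f ? 1.0f/d : 0.0f;

    for (int i = 0; i < QK4_1/2; ++i) {
        const float v0 = (x[2*i + 0] - min)*id + 0.5f;
        const float v1 = (x[2*i + 1] - min)*id + 0.5f;

        const uint8_t q0 = v0 < 15.0f ? (uint8_t) v0 : 15; // truncate and clamp to 4 bits
        const uint8_t q1 = v1 < 15.0f ? (uint8_t) v1 : 15;

        qs[i] = q0 | (q1 << 4);
    }

    *dp = d;   // printed as "d = ..." above
    *mp = min; // printed as "m = ..." above
}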


While looking into this issue, I found that CUDA can also fail the CPY test sometimes.
On master, run the following and it will eventually fail, though I haven't investigated the source of the error in that case:

LLAMA_CUBLAS=1 make -j tests && while ./tests/test-backend-ops -o CPY -b CUDA0 ; do date ; done
Backend 2/2 (CUDA0)
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2060 SUPER, compute capability 7.5, VMM: yes
  Backend name: CUDA
  CPY(type_src=f32,type_dst=f32,ne=[256,10,10,1]): OK
  CPY(type_src=f32,type_dst=f16,ne=[256,10,10,1]): OK
  CPY(type_src=f32,type_dst=q4_0,ne=[256,10,10,1]): OK
  CPY(type_src=f32,type_dst=q4_1,ne=[256,10,10,1]): [CPY] NMSE = 0.000004 FAIL
  CPY(type_src=f32,type_dst=q5_0,ne=[256,10,10,1]): not supported [CUDA] 
  CPY(type_src=f32,type_dst=q5_1,ne=[256,10,10,1]): not supported [CUDA] 
  CPY(type_src=f32,type_dst=q8_0,ne=[256,10,10,1]): OK
  CPY(type_src=f32,type_dst=q2_K,ne=[256,10,10,1]): not supported [CUDA] 
  CPY(type_src=f32,type_dst=q3_K,ne=[256,10,10,1]): not supported [CUDA] 
  CPY(type_src=f32,type_dst=q4_K,ne=[256,10,10,1]): not supported [CUDA] 
  CPY(type_src=f32,type_dst=q5_K,ne=[256,10,10,1]): not supported [CUDA] 
  CPY(type_src=f32,type_dst=q6_K,ne=[256,10,10,1]): not supported [CUDA] 
  876/877 tests passed
  Backend CUDA: FAIL

1/2 backends passed
FAIL

Sometimes it can take a while. Reproduced on an RTX 2060 and a V100.

slaren (Collaborator) commented Dec 30, 2023

It may be due to the floating-point rounding mode. On the CPU it should be round-to-nearest by default. CUDA also supports round-to-nearest, but because we use the -use_fast_math flag, correct rounding is disabled:

[screenshot: CUDA fast math documentation]

After removing -use_fast_math from the Makefile, I couldn't find any errors in several minutes of running test-backend-ops.

With Metal it is not clear to me how to configure the rounding mode, but regardless it doesn't seem to support round-to-nearest:

[screenshot: Metal Shading Language documentation on rounding]

JohannesGaessler (Collaborator)

What specifically does the test assert? That the results after quantization are exactly equal?

slaren (Collaborator) commented Dec 30, 2023

It uses a normalized MSE to compare the results between CPU and GPU, but the allowed error is only 1e-7. There isn't anything special about that value, it's just that most tests pass with that error.
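Roughly, the comparison looks like this (a sketch only; the actual helper in test-backend-ops may normalize slightly differently):

#include <stddef.h>

// sketch: normalized mean squared error between the backend output and the CPU reference
static double nmse(const float * out, const float * ref, size_t n) {
    double sum_err = 0.0;
    double sum_ref = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double diff = (double) out[i] - (double) ref[i];
        sum_err += diff*diff;
        sum_ref += (double) ref[i] * (double) ref[i];
    }
    return sum_err/sum_ref;
}

// a CPY test case then fails when nmse(gpu, cpu, n) exceeds the allowed error (1e-7)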

ggerganov (Owner, Author) commented Dec 30, 2023

It seems that with Metal, -ffast-math is always on and I can't figure out how to disable it. At least this is my guess, because I enabled -ffast-math on the CPU and now the d value always matches:

diff --git a/Makefile b/Makefile
index 28c6d79b..cc578371 100644
--- a/Makefile
+++ b/Makefile
@@ -111,8 +111,8 @@ MK_CFLAGS     += -Ofast
 HOST_CXXFLAGS += -Ofast
 MK_NVCCFLAGS  += -O3
 else
-MK_CFLAGS     += -O3
-MK_CXXFLAGS   += -O3
+MK_CFLAGS     += -O3 -ffast-math
+MK_CXXFLAGS   += -O3 -ffast-math
 endif
 
 # clock_gettime came in POSIX.1b (1993)

However, the rounding issue remains. Sometimes one of the Metal quants will get rounded in the wrong direction (index 244 below):

  224  0.838623  0.838623, diff =  0.000000
  225 -0.835571 -0.835571, diff =  0.000000
  226 -0.835571 -0.835571, diff =  0.000000
  227 -0.320435 -0.320435, diff =  0.000000
  228  0.323486  0.323486, diff =  0.000000
  229 -0.835571 -0.835571, diff =  0.000000
  230 -0.062866 -0.062866, diff =  0.000000
  231 -0.578003 -0.578003, diff =  0.000000
  232  0.065918  0.065918, diff =  0.000000
  233  0.065918  0.065918, diff =  0.000000
  234 -0.578003 -0.578003, diff =  0.000000
  235  0.065918  0.065918, diff =  0.000000
  236 -0.449219 -0.449219, diff =  0.000000
  237  0.452271  0.452271, diff =  0.000000
  238  0.323486  0.323486, diff =  0.000000
  239 -0.578003 -0.578003, diff =  0.000000
  240 -0.449219 -0.449219, diff =  0.000000
  241 -0.062866 -0.062866, diff =  0.000000
  242  0.967407  0.967407, diff =  0.000000
  243 -0.706787 -0.706787, diff =  0.000000
  244  0.194702  0.065918, diff =  0.128784
  245  0.581055  0.581055, diff =  0.000000
  246  0.323486  0.323486, diff =  0.000000
  247  0.194702  0.194702, diff =  0.000000
  248  0.452271  0.452271, diff =  0.000000
  249  0.709839  0.709839, diff =  0.000000
  250  0.323486  0.323486, diff =  0.000000
  251 -0.964355 -0.964355, diff =  0.000000
  252 -0.578003 -0.578003, diff =  0.000000
  253  0.194702  0.194702, diff =  0.000000
  254  0.065918  0.065918, diff =  0.000000
  255 -0.191650 -0.191650, diff =  0.000000
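Note that for this block d = (0.967407 + 0.964355)/15 ≈ 0.128784, so the diff at index 244 is exactly one quantization step, i.e. the GPU picked the adjacent 4-bit level.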

I'm also not able to find an option to control the rounding mode.

> On the CPU it should be round-to-nearest by default.

Isn't it "round-to-zero" on the CPU? Casting float to int gives:

 0.4 -> 0
 0.6 -> 0
-0.4 -> 0
-0.6 -> 0

JohannesGaessler (Collaborator) commented Dec 30, 2023

If my understanding of the code is correct, the tolerance is relative and applied to the mean squared error. This effectively tests whether the results are equal with a precision of ~11.6 bits. For reference, FP32 has a precision of 24 bits while FP16 has a precision of 11 bits. Considering how robust neural networks are to quantization, I don't think this is cause for concern.
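To spell out where the 11.6 bits come from: a tolerance of $10^{-7}$ on the normalized mean squared error corresponds to a typical per-element relative error of about $\sqrt{10^{-7}} \approx 3.2 \cdot 10^{-4}$, and $-\log_2 \sqrt{10^{-7}} \approx 11.6$ bits.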

Also keep in mind that relative errors have a poor condition number if the original value is small. Another project that I have worked on has had similar issues when approximating functions via interpolation. An alternative metric that could be used is the asymmetry $\mathrm{As}$ for two values $x_1$ and $x_2$:

$\mathrm{As}(x_1, x_2) = \frac{x_1 - x_2}{x_1 + x_2} .$

This metric is more robust when $x_1$ is small. Of course, if relative fluctuations of $x_1$ are problematic, this is a bad metric. But for neural networks I would assume that relatively large errors on small values have negligible effects.
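Purely as an illustration (this is not what test-backend-ops does today), an element-wise check based on this metric could look like:

#include <math.h>

// hypothetical element-wise check based on the asymmetry metric described above;
// the sign is dropped since only the magnitude of the disagreement matters here
static int close_by_asymmetry(float x1, float x2, float tol) {
    const float denom = x1 + x2;
    if (denom == 0.0f) {
        return x1 == x2; // degenerate case: only accept an exact match
    }
    return fabsf((x1 - x2)/denom) <= tol;
}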

ggerganov (Owner, Author)

@JohannesGaessler I will write more info about the test later - I need to be AFK for a few hours.

slaren (Collaborator) commented Dec 30, 2023

> Isn't it "round-to-zero" on the CPU? Casting float to int gives:

Conversion from float to int always truncates the value, discarding the fractional part; the rounding mode only applies to floating-point operations.
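A quick illustration of the difference (rintf honours the current rounding mode, which is round-to-nearest unless changed via fesetround):

#include <math.h>
#include <stdio.h>

int main(void) {
    // casting to int truncates toward zero, regardless of the rounding mode
    printf("(int)  0.6f = %d\n", (int)  0.6f);  // 0
    printf("(int) -0.6f = %d\n", (int) -0.6f);  // 0

    // rintf() rounds according to the current rounding mode (round-to-nearest by default)
    printf("rintf( 0.6f) = %.1f\n", rintf( 0.6f));  // 1.0
    printf("rintf(-0.6f) = %.1f\n", rintf(-0.6f));  // -1.0

    return 0;
}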

To disable fast math with Metal, it should be possible to pass an MTLCompileOptions to newLibraryWithSource and set fastMathEnabled to false.
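A sketch of what that could look like (device and src stand in for the existing MTLDevice and shader source string; how this gets wired into ggml-metal.m is a separate question):

#import <Metal/Metal.h>

// sketch: compile the Metal source with fast math disabled
static id<MTLLibrary> compile_metal_library(id<MTLDevice> device, NSString * src) {
    MTLCompileOptions * options = [MTLCompileOptions new];
    options.fastMathEnabled = NO;

    NSError * error = nil;
    id<MTLLibrary> library = [device newLibraryWithSource:src options:options error:&error];
    if (library == nil) {
        NSLog(@"failed to compile Metal library: %@", error);
    }
    return library;
}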

JohannesGaessler (Collaborator)

You could also consider doing something similar to numpy.allclose, where both an absolute and a relative tolerance are provided, and false is only returned if the observed difference is larger than the sum of the absolute and the relative tolerance.

ggerganov (Owner, Author)

@JohannesGaessler For some of the more complicated ops we expect the results of the CPU and the GPU to be numerically different, and in those cases the question of choosing the correct metric for measuring the error is relevant.

However, for the ggml_cpy op that I'm looking into here (specifically the F32 -> Q4_1 quantization), the expectation is that the results should be numerically identical.

I've finally found a way to disable -ffast-math for the Metal code (see #4705) and now the results between the CPU and the GPU always match.

Note that we don't plan to disable -ffast-math on the GPU during regular usage of the library. It's only important while performing tests and running the CI to guarantee that there are no bugs in the implementation.
