
ggml : testing GPU FP precision via quantized CPY #4698

Closed

ggerganov (Owner)

I wanted to find out why the Metal test-backend-ops run fails from time to time. This PR applies just the GGML_OP_CPY operation with an F32 src and a Q4_1 dst, i.e. it performs quantization.

Running this long enough will eventually generate an error:

make -j tests && while ./tests/test-backend-ops -o CPY -b Metal ; do date ; done
# example (scroll down)

  CPY(type_src=f32,type_dst=q4_1,ne=[32,1,1,1]): ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     0.03 MiB, (    1.66 / 147456.00)
d =  0.127808
m = -0.954590
d =  0.127808
m = -0.954590
OK
  CPY(type_src=f32,type_dst=q4_1,ne=[32,1,1,1]): ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     0.03 MiB, (    1.66 / 147456.00)
d =  0.126709
m = -0.922852
d =  0.126709
m = -0.922852
OK
  CPY(type_src=f32,type_dst=q4_1,ne=[32,1,1,1]): ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     0.03 MiB, (    1.66 / 147456.00)
d =  0.115479
m = -0.740723
d =  0.115479
m = -0.740723
OK
  CPY(type_src=f32,type_dst=q4_1,ne=[32,1,1,1]): ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     0.03 MiB, (    1.66 / 147456.00)
d =  0.119568                             # <------ GPU result (notice d is different here)
m = -0.834961
d =  0.119507                             # <------ CPU result (reference)
m = -0.834961
[CPY] NMSE = 0.000001 
    0 -0.834961 -0.834961, diff =  0.000000
    1 -0.834961 -0.834961, diff =  0.000000
    2 -0.476257 -0.476440, diff =  0.000183
    3  0.599854  0.599121, diff =  0.000732
    4  0.002014  0.001587, diff =  0.000427
    5  0.958557  0.957642, diff =  0.000916
    6 -0.117554 -0.117920, diff =  0.000366
    7  0.002014  0.001587, diff =  0.000427
    8  0.719421  0.718628, diff =  0.000793
    9  0.241150  0.240601, diff =  0.000549
   10  0.121582  0.121094, diff =  0.000488
   11  0.599854  0.599121, diff =  0.000732
   12  0.121582  0.121094, diff =  0.000488
   13  0.241150  0.240601, diff =  0.000549
   14  0.121582  0.121094, diff =  0.000488
   15  0.719421  0.718628, diff =  0.000793
   16 -0.356689 -0.356934, diff =  0.000244
   17  0.002014  0.001587, diff =  0.000427
   18  0.121582  0.121094, diff =  0.000488
   19  0.002014  0.001587, diff =  0.000427
   20  0.360718  0.360107, diff =  0.000610
   21 -0.117554 -0.117920, diff =  0.000366
   22  0.241150  0.240601, diff =  0.000549
   23  0.838989  0.838135, diff =  0.000854
   24  0.599854  0.599121, diff =  0.000732
   25 -0.476257 -0.476440, diff =  0.000183
   26 -0.117554 -0.117920, diff =  0.000366
   27  0.241150  0.240601, diff =  0.000549
   28 -0.595825 -0.595947, diff =  0.000122
   29  0.002014  0.001587, diff =  0.000427
   30  0.719421  0.718628, diff =  0.000793
   31 -0.237122 -0.237427, diff =  0.000305

It looks like the floating-point operation for computing d can produce different results between the CPU and the GPU:

const float d  = (max - min) / ((1 << 4) - 1);

Not sure how to fix this - ideas?
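For reference, the d and m values printed above come from quantizing one Q4_1 block of 32 floats. A minimal sketch of that routine, roughly following ggml's CPU reference implementation (the real code stores d and m as FP16 and differs in details):

#include <stdint.h>

#define QK4_1 32

// sketch: quantize one block of 32 floats to Q4_1 (4-bit values plus per-block d and m)
static void quantize_block_q4_1(const float * x, float * dp, float * mp, uint8_t qs[QK4_1/2]) {
    float min = x[0];
    float max = x[0];
    for (int i = 1; i < QK4_1; ++i) {
        if (x[i] < min) min = x[i];
        if (x[i] > max) max = x[i];
    }

    const float d  = (max - min) / ((1 << 4) - 1); // the value that differs between CPU and GPU
    const float id = d != 0.0f ? 1.0f/d : 0.0f;

    for (int i = 0; i < QK4_1/2; ++i) {
        const float v0 = (x[2*i + 0] - min)*id + 0.5f;
        const float v1 = (x[2*i + 1] - min)*id + 0.5f;

        const uint8_t q0 = v0 < 15.0f ? (uint8_t) v0 : 15; // truncate and clamp to 4 bits
        const uint8_t q1 = v1 < 15.0f ? (uint8_t) v1 : 15;

        qs[i] = q0 | (q1 << 4);
    }

    *dp = d;   // printed as "d = ..." above
    *mp = min; // printed as "m = ..." above
}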


While looking into this issue, I found that CUDA can also fail the CPY test sometimes.
On master, run the following and it will eventually fail, though I haven't investigated the source of the error in that case:

LLAMA_CUBLAS=1 make -j tests && while ./tests/test-backend-ops -o CPY -b CUDA0 ; do date ; done
Backend 2/2 (CUDA0)
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2060 SUPER, compute capability 7.5, VMM: yes
  Backend name: CUDA
  CPY(type_src=f32,type_dst=f32,ne=[256,10,10,1]): OK
  CPY(type_src=f32,type_dst=f16,ne=[256,10,10,1]): OK
  CPY(type_src=f32,type_dst=q4_0,ne=[256,10,10,1]): OK
  CPY(type_src=f32,type_dst=q4_1,ne=[256,10,10,1]): [CPY] NMSE = 0.000004 FAIL
  CPY(type_src=f32,type_dst=q5_0,ne=[256,10,10,1]): not supported [CUDA] 
  CPY(type_src=f32,type_dst=q5_1,ne=[256,10,10,1]): not supported [CUDA] 
  CPY(type_src=f32,type_dst=q8_0,ne=[256,10,10,1]): OK
  CPY(type_src=f32,type_dst=q2_K,ne=[256,10,10,1]): not supported [CUDA] 
  CPY(type_src=f32,type_dst=q3_K,ne=[256,10,10,1]): not supported [CUDA] 
  CPY(type_src=f32,type_dst=q4_K,ne=[256,10,10,1]): not supported [CUDA] 
  CPY(type_src=f32,type_dst=q5_K,ne=[256,10,10,1]): not supported [CUDA] 
  CPY(type_src=f32,type_dst=q6_K,ne=[256,10,10,1]): not supported [CUDA] 
  876/877 tests passed
  Backend CUDA: FAIL

1/2 backends passed
FAIL

Sometimes it can take a while. Reproduced on an RTX 2060 and a V100.

slaren (Collaborator) commented Dec 30, 2023

It may be due to the floating-point rounding mode. On the CPU it should be round-to-nearest by default. CUDA also supports round-to-nearest, but because we use the -use_fast_math flag, correct rounding is disabled:

[screenshot: CUDA fast math documentation]

After removing -use_fast_math from the Makefile, I couldn't find any errors in several minutes of running test-backend-ops.

With Metal it is not clear to me how to configure the rounding mode, but regardless it doesn't seem to support round-to-nearest:

[screenshot: Metal Shading Language documentation on rounding]

JohannesGaessler (Collaborator)

What specifically does the test assert? That the results after quantization are exactly equal?

slaren (Collaborator) commented Dec 30, 2023

It uses a normalized MSE to compare the results between CPU and GPU, but the allowed error is only 1e-7. There isn't anything special about that value, it's just that most tests pass with that error.
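Roughly, the comparison looks like this (a sketch only; the actual helper in test-backend-ops may normalize slightly differently):

#include <stddef.h>

// sketch: normalized mean squared error between the backend output and the CPU reference
static double nmse(const float * out, const float * ref, size_t n) {
    double sum_err = 0.0;
    double sum_ref = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double diff = (double) out[i] - (double) ref[i];
        sum_err += diff*diff;
        sum_ref += (double) ref[i] * (double) ref[i];
    }
    return sum_err/sum_ref;
}

// a CPY test case then fails when nmse(gpu, cpu, n) exceeds the allowed error (1e-7)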

ggerganov (Owner, Author) commented Dec 30, 2023

It seems that with Metal, -ffast-math is always on and I can't figure out how to disable it. At least this is my guess, because I enabled -ffast-math on the CPU and now the d value always matches:

diff --git a/Makefile b/Makefile
index 28c6d79b..cc578371 100644
--- a/Makefile
+++ b/Makefile
@@ -111,8 +111,8 @@ MK_CFLAGS     += -Ofast
 HOST_CXXFLAGS += -Ofast
 MK_NVCCFLAGS  += -O3
 else
-MK_CFLAGS     += -O3
-MK_CXXFLAGS   += -O3
+MK_CFLAGS     += -O3 -ffast-math
+MK_CXXFLAGS   += -O3 -ffast-math
 endif
 
 # clock_gettime came in POSIX.1b (1993)

However, the rounding issue remains. Sometimes one of the Metal quants will get rounded in the wrong direction (index 244 below):

  224  0.838623  0.838623, diff =  0.000000
  225 -0.835571 -0.835571, diff =  0.000000
  226 -0.835571 -0.835571, diff =  0.000000
  227 -0.320435 -0.320435, diff =  0.000000
  228  0.323486  0.323486, diff =  0.000000
  229 -0.835571 -0.835571, diff =  0.000000
  230 -0.062866 -0.062866, diff =  0.000000
  231 -0.578003 -0.578003, diff =  0.000000
  232  0.065918  0.065918, diff =  0.000000
  233  0.065918  0.065918, diff =  0.000000
  234 -0.578003 -0.578003, diff =  0.000000
  235  0.065918  0.065918, diff =  0.000000
  236 -0.449219 -0.449219, diff =  0.000000
  237  0.452271  0.452271, diff =  0.000000
  238  0.323486  0.323486, diff =  0.000000
  239 -0.578003 -0.578003, diff =  0.000000
  240 -0.449219 -0.449219, diff =  0.000000
  241 -0.062866 -0.062866, diff =  0.000000
  242  0.967407  0.967407, diff =  0.000000
  243 -0.706787 -0.706787, diff =  0.000000
  244  0.194702  0.065918, diff =  0.128784
  245  0.581055  0.581055, diff =  0.000000
  246  0.323486  0.323486, diff =  0.000000
  247  0.194702  0.194702, diff =  0.000000
  248  0.452271  0.452271, diff =  0.000000
  249  0.709839  0.709839, diff =  0.000000
  250  0.323486  0.323486, diff =  0.000000
  251 -0.964355 -0.964355, diff =  0.000000
  252 -0.578003 -0.578003, diff =  0.000000
  253  0.194702  0.194702, diff =  0.000000
  254  0.065918  0.065918, diff =  0.000000
  255 -0.191650 -0.191650, diff =  0.000000
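Note that for this block d = (0.967407 + 0.964355)/15 ≈ 0.128784, so the diff at index 244 is exactly one quantization step, i.e. the GPU picked the adjacent 4-bit level.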

I'm also not able to find an option to control the rounding mode.

> On the CPU it should be round-to-nearest by default.

Isn't it "round-to-zero" on the CPU? Casting float to int gives:

 0.4 -> 0
 0.6 -> 0
-0.4 -> 0
-0.6 -> 0

JohannesGaessler (Collaborator) commented Dec 30, 2023

If my understanding of the code is correct, the tolerance is relative and applied to the mean squared error. This effectively tests whether the results are equal with a precision of ~11.6 bits. For reference, FP32 has a precision of 24 bits while FP16 has a precision of 11 bits. Considering how robust neural networks are to quantization, I don't think this is cause for concern.
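To spell out where the 11.6 bits come from: a tolerance of $10^{-7}$ on the normalized mean squared error corresponds to a typical per-element relative error of about $\sqrt{10^{-7}} \approx 3.2 \cdot 10^{-4}$, and $-\log_2 \sqrt{10^{-7}} \approx 11.6$ bits.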

Also keep in mind that relative errors have a poor condition number if the original value is small. Another project that I have worked on has had similar issues when approximating functions via interpolation. An alternative metric that could be used is the asymmetry $\mathrm{As}$ for two values $x_1$ and $x_2$:

$\mathrm{As}(x_1, x_2) = \frac{x_1 - x_2}{x_1 + x_2} .$

This metric is more robust when $x_1$ is small. Of course, if relative fluctuations of $x_1$ are problematic, this is a bad metric. But for neural networks I would assume that relatively large errors on small values have negligible effects.
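Purely as an illustration (this is not what test-backend-ops does today), an element-wise check based on this metric could look like:

#include <math.h>

// hypothetical element-wise check based on the asymmetry metric described above;
// the sign is dropped since only the magnitude of the disagreement matters here
static int close_by_asymmetry(float x1, float x2, float tol) {
    const float denom = x1 + x2;
    if (denom == 0.0f) {
        return x1 == x2; // degenerate case: only accept an exact match
    }
    return fabsf((x1 - x2)/denom) <= tol;
}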

ggerganov (Owner, Author)

@JohannesGaessler I will write more info about the test later - I need to be AFK for a few hours.

slaren (Collaborator) commented Dec 30, 2023

> Isn't it "round-to-zero" on the CPU? Casting float to int gives:

Conversion from float to int always truncates the value, discarding the fractional part; the rounding mode only applies to floating-point operations.
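A quick illustration of the difference (rintf honours the current rounding mode, which is round-to-nearest unless changed via fesetround):

#include <math.h>
#include <stdio.h>

int main(void) {
    // casting to int truncates toward zero, regardless of the rounding mode
    printf("(int)  0.6f = %d\n", (int)  0.6f);  // 0
    printf("(int) -0.6f = %d\n", (int) -0.6f);  // 0

    // rintf() rounds according to the current rounding mode (round-to-nearest by default)
    printf("rintf( 0.6f) = %.1f\n", rintf( 0.6f));  // 1.0
    printf("rintf(-0.6f) = %.1f\n", rintf(-0.6f));  // -1.0

    return 0;
}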

To disable fast math with Metal, it should be possible to pass an MTLCompileOptions to newLibraryWithSource and set fastMathEnabled to false.
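A sketch of what that could look like (device and src stand in for the existing MTLDevice and shader source string; how this gets wired into ggml-metal.m is a separate question):

#import <Metal/Metal.h>

// sketch: compile the Metal source with fast math disabled
static id<MTLLibrary> compile_metal_library(id<MTLDevice> device, NSString * src) {
    MTLCompileOptions * options = [MTLCompileOptions new];
    options.fastMathEnabled = NO;

    NSError * error = nil;
    id<MTLLibrary> library = [device newLibraryWithSource:src options:options error:&error];
    if (library == nil) {
        NSLog(@"failed to compile Metal library: %@", error);
    }
    return library;
}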

JohannesGaessler (Collaborator)

You could also consider doing something similar to numpy.allclose, where both an absolute and a relative tolerance are provided, and false is only returned if the observed difference is larger than the sum of the absolute and the relative tolerance.

ggerganov (Owner, Author)

@JohannesGaessler For some of the more complicated ops we expect the results of the CPU and the GPU to be numerically different, and in those cases the question of choosing the correct metric for measuring the error is relevant.

However, for the ggml_cpy op that I'm looking into here (specifically the F32 -> Q4_1 quantization), the expectation is that the results should be numerically identical.

I've finally found a way to disable -ffast-math for the Metal code (see #4705) and now the results between the CPU and the GPU always match.

Note that we don't plan to disable -ffast-math on the GPU during regular usage of the library. It's only important while performing tests and running the CI to guarantee that there are no bugs in the implementation.
