cufinufft block_gather method can't handle large non-uniform inputs #498

blackwer · 2024-07-19T17:06:21Z

@DiamonDinoia pointed this out to me.

./perftest/cuda/cuperftest --prec f --n_runs 5 --method 4 --sort 1 --N1 256 --N2 256 --N3 256 --kerevalmethod 1 --M 1E8 --tol 1E-1

will exit out during the spread_3d_block_gather routine due to a memory access violation. This failure seems to be due to numerous integer overflow problems resulting in a wrong idxnupts.

finufft/src/cuda/3d/spreadinterp3d.cuh

Lines 330 to 347 in 22c4f9a

    
           cuda_complex<T> cnow; 
        
           for (int i = threadIdx.x; i < nupts; i += blockDim.x) { 
        
               int nidx = idxnupts[ptstart + i]; 
        
               int b = nidx / M; 
        
               int box[3]; 
        
               for (int d = 0; d < 3; d++) { 
        
                   box[d] = b % 3; 
        
                   if (box[d] == 1) 
        
                       box[d] = -1; 
        
                   if (box[d] == 2) 
        
                       box[d] = 1; 
        
                   b = b / 3; 
        
               } 
        
               int ii = nidx % M; 
        
               x_rescaled = fold_rescale(x[ii], nf1) + box[0] * nf1; 
        
               y_rescaled = fold_rescale(y[ii], nf2) + box[1] * nf2; 
        
               z_rescaled = fold_rescale(z[ii], nf3) + box[2] * nf3; 
        
               cnow = c[ii];

The code consistently crashes at cnow = c[ii]; above, because ii can become negative.

(cuda-gdb) p *&idxnupts[ptstart+i - 25]@10
$122 = {87055551, 88024706, 90598106, 93176824, 97683858, 98297635, -2093236795, -2090778381, -2090615681, -2087638239}

Given the pervasive use of int in this code for various indexes and intermediate calculations, it's hard to nail down a specific line of code as the problem. The fix, using int64 everywhere, may have performance problems. In the short, it might be useful to pin down exactly which routine is causing the overflow and check for it, returning if there's an issue with a more useful error message.

The text was updated successfully, but these errors were encountered:

ahbarnett · 2024-07-21T21:29:52Z

If you read
https://finufft.readthedocs.io/en/latest/c_gpu.html
you'll see meth>2 is unsupported.

I just verified meth=1,2 run fine for this test:

(base) ahb@workergpu094 /mnt/home/ahb/numerics/finufft/build> perftest/cuda/cuperftest --prec f --n_runs 5 --method 1 --sort 1 --N1 256 --N2 256 --N3 256 --kerevalmethod 1 --M 1E8 --tol 1E-1
# prec = f
# type = 1
# n_runs = 5
# N1 = 256
# N2 = 256
# N3 = 256
# M = 100000000
# ntransf = 1
# method = 1
# kerevalmethod = 1
# sort = 1
# tol = 0.1
event,count,tot(ms),mean(ms),std(ms),nupts/s,ns/nupt
host_to_device,1,433.910034,433.910034,0.000000,0.0,0.0
makeplan,1,17.662464,17.662464,0.000000,0.0,0.0
setpts,5,278.904816,55.780964,1.838613,1.79273e+09,0.557810
execute,5,634.794006,126.958801,0.327698,7.87657e+08,1.269588
device_to_host,1,28.293217,28.293217,0.000000,0.0,0.0
amortized,1,1393.605591,1393.605591,0.000000,3.58782e+08,2.787211
(base) ahb@workergpu094 /mnt/home/ahb/numerics/finufft/build> perftest/cuda/cuperftest --prec f --n_runs 5 --method 2 --sort 1 --N1 256 --N2 256 --N3 256 --kerevalmethod 1 --M 1E8 --tol 1E-1
# prec = f
# type = 1
# n_runs = 5
# N1 = 256
# N2 = 256
# N3 = 256
# M = 100000000
# ntransf = 1
# method = 2
# kerevalmethod = 1
# sort = 1
# tol = 0.1
event,count,tot(ms),mean(ms),std(ms),nupts/s,ns/nupt
host_to_device,1,436.325928,436.325928,0.000000,0.0,0.0
makeplan,1,18.078913,18.078913,0.000000,0.0,0.0
setpts,5,277.110779,55.422157,0.739676,1.80433e+09,0.554222
execute,5,585.890808,117.178162,0.604296,8.53401e+08,1.171782
device_to_host,1,29.232191,29.232191,0.000000,0.0,0.0
amortized,1,1346.681641,1346.681641,0.000000,3.71283e+08,2.693363

So I propose we ignore this. Unless someone wants to invest time benchmarking and debugging meth=4 (is it much faster?).

blackwer · 2024-07-22T12:17:52Z

@DiamonDinoia Has method 4 shown to be faster in any of the tests? Wondering if we should commit this method to the graveyard or if we think it's salvageable and potentially worthwhile.

DiamonDinoia · 2024-07-22T18:26:45Z

BlockGather can be faster than SM (optimized):

tolerance	BlockGather	SM
1E-1	2.949568e+08	2.973108e+09
1E-2	2.503649e+08	9.414288e+08
1E-3	2.113974e+08	4.285620e+08
1E-4	1.643028e+08	1.251205e+08
1E-5	1.260375e+08	7.427078e+07
1E-6	8.393369e+07	3.152199e+07

ahbarnett added the wontfix label Jul 21, 2024

DiamonDinoia removed the wontfix label Jul 23, 2024

ahbarnett added this to the 2.4 milestone Jul 23, 2024

ahbarnett mentioned this issue Aug 20, 2024

Benchmark blockgather (method 4) #309

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cufinufft block_gather method can't handle large non-uniform inputs #498

cufinufft block_gather method can't handle large non-uniform inputs #498

blackwer commented Jul 19, 2024

ahbarnett commented Jul 21, 2024 •

edited

Loading

blackwer commented Jul 22, 2024

DiamonDinoia commented Jul 22, 2024

cufinufft block_gather method can't handle large non-uniform inputs #498

cufinufft block_gather method can't handle large non-uniform inputs #498

Comments

blackwer commented Jul 19, 2024

ahbarnett commented Jul 21, 2024 • edited Loading

blackwer commented Jul 22, 2024

DiamonDinoia commented Jul 22, 2024

ahbarnett commented Jul 21, 2024 •

edited

Loading