
Benchmark blockgather (method 4) #309

Open
janden opened this issue Jul 5, 2023 · 18 comments
@janden
Collaborator

janden commented Jul 5, 2023

Currently we have two methods, GM (non-uniform point driven) and SM (subproblems), in dimensions one and two, while dimension three has a third method, blockgather. We should benchmark it against the subproblem method to see where it is useful.

@ImanHosseini

ImanHosseini commented Sep 6, 2023

For M = 2000000 (on an RTX 3060 Ti):

# for different n1/n2/n3 and m in {1,2,4}
./cuperftest --N1=n1 --N2=n2 --N3=n3 --M=2000000 --method=m

[plot: runtime vs. N = N1*N2*N3 for methods 1, 2, 4 at M = 2e6]
m=4 is much slower. 'spread_3d_block_gather' achieves very low (< 3%) memory throughput.

@janden
Collaborator Author

janden commented Sep 6, 2023

@ImanHosseini Thanks for looking at this. Interesting that even at small sizes it performs much worse.

@MelodyShih Do you recall what the purpose of the block gather method was? Is there some other benefit to using it?

@ahbarnett
Collaborator

ahbarnett commented Sep 6, 2023 via email

@lu1and10
Member

lu1and10 commented Sep 7, 2023

Hi Iman, Thanks for testing this - do you have an application in mind? I think method=4 is experimental (not discussed in our paper) and not recommended - can others comment? I'm looping in @blackwer, who has recently improved cufinufft and wrote cuperftest.

However, I'm alarmed that meth=2 is 5x slower than meth=1, when it is supposed to be faster. If you study them, there are two effects you are seeing in your graphs:

i) M-dependent spreading cost (the intercept of the graphs). This appears to be 0.03 s for meth=1,2, consistent with about 7e7 NU pts/s (if you are timing the average over 10 runs - are you?). Compare our results in the paper (Fig. 4 of https://arxiv.org/abs/2102.08463). The default opts for cuperftest are single precision (FP32), for which the V100 (our test) and your 3060 Ti appear to be similar on paper (16 Tflops). Our Fig. 4 (top-right panel) then gives around 8 ns (for SM, i.e. meth=2) or 14 ns (for GM-sort, i.e. meth=1) per NU pt at tol=1e-5. In either case it's around 1e8 NU pts/sec at N=100^3. So you are at 70% of this speed. Maybe you want to verify timings vs our paper? But a 0.4 s intercept for meth=4 is clearly unusably slow per NU pt.

ii) Linear-in-N growth, presumably due to cuFFT calls on a grid of size upsampfac^3 * N = 8N, where N := N1*N2*N3 (# uniform pts). This should be independent of method, but you can see it is not. This is a real mystery to me - can others comment? Taking meth=1, the slope is about 0.07 s for 8N = 1.3e8, i.e. 2e9 pts/sec for the FFT. This is for 3D complex fp32 of size 2^27, which seems about right. But I find it very hard to find wall-clock timings (not TFlops!) for 3D cuFFT online for, say, the V100 - Melody?

meth=2 is 5x slower, but I don't get why, since the method doesn't affect the cuFFT call, right?

I encourage you also to run, e.g.

cufinufft3d_test 1 1 256 256 256 2000000 1e-5 2e-4 f

Robert, maybe you can comment on the use of cuperftest? (Does it report an average time or total time for the default n_runs = 10?) I see kerevalmeth=0 is set as the default here:

kerevalmethod = std::stoi(get_or(options_map, "kerevalmethod", "0"));

Should that be 1 (Horner)? (Although that shouldn't affect the N-growth in ii) above.) Best, Alex

I might have some idea.
cuFFT 3D single-precision complex-to-complex on an H100 for a 256^3 problem size may take about 4 ms for exec only.
For type 1, both method 1 and method 2 should spend the same amount of time on FFT, deconvolve, and shuffle.

The time difference between method 1 and method 2 should come from the spread kernel.
Here the number of NU pts per cell (M/(upsampfac^3 * N)) makes the difference between method 1 and method 2.

Loosely speaking, for method 1 we can think of each CUDA thread as representing an NU point. If the number of NU pts per cell is large, there will be more contention on the global-memory atomic updates. As N grows, the number of NU pts per cell decreases, so there is less atomic contention.

On the other hand, for method 2 we can roughly think of each CUDA thread as representing a cell. As N grows, the number of NU pts per cell decreases; when it drops below 1, some threads (cells) have no points, and thread divergence appears - both intra-warp and inter-warp divergence (since we use shared memory and synchronize the CUDA block).

I would guess that in the uniform-random-points case, if the number of NU pts per cell is > 2, method 2 wins; if it is < 1, method 1 wins. In the plot, with M=2e6 and N1=N2=N3=256, the number of NU pts per cell is much less than 1, so there is much more thread divergence for method 2.

I guess that if N1=N2=N3=10, there will be a lot of global-memory atomic contention for method 1, since each thread (NU point) tries to atomically update cells, while method 2 has much less contention since each thread represents a cell, and shared-memory atomic updates should also be faster than global-memory ones.
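
To make the two thread models concrete, here is a minimal, self-contained CUDA sketch (the names, the 1D setting, nearest-cell assignment, and unit weights are all illustrative assumptions, not the actual cufinufft kernels): method 1 style gives one thread per NU point and resolves collisions with global atomics, while method 2 style gives one block per tile of cells, accumulates in shared memory, and writes back once.

// Sketch only, not cufinufft code.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Method-1 style (GM): one thread per NU point, global-memory atomics.
// Contention grows with the number of points landing in the same cell.
__global__ void spread_point_driven(const float *x, int M, float *grid, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= M) return;
    int cell = min(int(x[i] * N), N - 1);   // nearest cell of this point, x in [0,1)
    atomicAdd(&grid[cell], 1.0f);           // race resolved by a global atomic
}

// Method-2 style (SM): one block per tile of cells, accumulate in shared memory,
// then one global write per cell. Threads whose cells receive no points idle,
// which is where the divergence at low density (M/8N << 1) comes from.
__global__ void spread_cell_driven(const float *x, int M, float *grid, int N, int tile) {
    extern __shared__ float s[];
    int cell0 = blockIdx.x * tile;
    for (int c = threadIdx.x; c < tile; c += blockDim.x) s[c] = 0.0f;
    __syncthreads();
    // every block scans all points here; a real implementation bins/sorts first
    for (int i = threadIdx.x; i < M; i += blockDim.x) {
        int cell = min(int(x[i] * N), N - 1);
        if (cell >= cell0 && cell < cell0 + tile)
            atomicAdd(&s[cell - cell0], 1.0f);   // cheap shared-memory atomic
    }
    __syncthreads();
    for (int c = threadIdx.x; c < tile; c += blockDim.x)
        if (cell0 + c < N) grid[cell0 + c] += s[c];
}

int main() {
    const int N = 1 << 14, M = 1 << 20, tile = 256;
    float *x, *g1, *g2;
    cudaMallocManaged(&x, M * sizeof(float));
    cudaMallocManaged(&g1, N * sizeof(float));
    cudaMallocManaged(&g2, N * sizeof(float));
    for (int i = 0; i < M; ++i) x[i] = float(rand()) / float(RAND_MAX);
    cudaMemset(g1, 0, N * sizeof(float));
    cudaMemset(g2, 0, N * sizeof(float));
    spread_point_driven<<<(M + 255) / 256, 256>>>(x, M, g1, N);
    spread_cell_driven<<<(N + tile - 1) / tile, 256, tile * sizeof(float)>>>(x, M, g2, N, tile);
    cudaDeviceSynchronize();
    printf("grid[0]: point-driven %.0f, cell-driven %.0f\n", g1[0], g2[0]);
    cudaFree(x); cudaFree(g1); cudaFree(g2);
}

The real kernels are 3D, bin-sort the points, and spread a finite-width kernel rather than unit weights, but the contention-vs-divergence trade-off is the same.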

@ImanHosseini

I do not have an application in mind, just wanted to help with the issue.
The numbers are totals over 10 runs (which is what the test prints out as "total").
The difference between methods 1 and 2 comes down to the 'spread_3d_subprob' kernel, as @lu1and10 suggested, and not the cufft calls. I have checked that for high M, method 2 (the 'spread_3d_subprob' kernel) indeed fares much better with 8x8x8 than with 256x256x256, directly due to less time spent waiting on barriers. Looking at Warp States, at N=8^3 only 1.2% of warp time was stalled on a barrier; this jumps to 76% for N=256^3, tanking performance. I ran new tests in the small-N regime (N = 4^3 to 20^3) for M = 2K, 20K, 200K, 2M:
[plots: runtimes for N = 4^3 to 20^3 at M = 2K, 20K, 200K, 2M]
At high N, the time is dominated by 'spread_3d_subprob', hence the worse slope compared to method 1. Method 1 is dominated by the 'spread_3d_nupts_driven' kernel, which has much better memory throughput (27% at N=256^3) and no syncs.

@blackwer
Member

blackwer commented Sep 7, 2023

m=4 is much slower. 'spread_3d_block_gather' achieves very low (< 3%) memory throughput.

Method 4 should be viewed as experimental, but I can provide some insight here. This only happens when using naive evaluation via kerevalmethod=0, rather than Horner's rule, kerevalmethod=1. Looking at the code, this is because, as written, the kernel evaluations are not memoized in the 0 case. That is, in the width-n cube around the point being spread, n^3 kernel evaluations are made, whereas the Horner path rightly does only n evaluations per dimension (3n in total in 3D). I've updated the default to 1, as it should be. There is little reason other than testing to use 0.
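
To illustrate the cost gap, here is a small hedged C++ sketch (a separable Gaussian stands in for the actual ES kernel; the function names and width w are illustrative only): the naive path re-evaluates the 1D kernel inside the w^3 triple loop, while memoizing three length-w tables first needs only 3w kernel evaluations, with the cube filled by cheap products.

// Sketch only: a separable stand-in kernel, not the cufinufft ES kernel.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

static double ker1d(double z) { return std::exp(-z * z); }   // placeholder 1D kernel

// Naive: the 1D kernel is re-evaluated inside the w^3 loop (what kerevalmethod=0 was doing).
void spread_weights_naive(double dx, double dy, double dz, int w, std::vector<double> &out) {
    out.resize(size_t(w) * w * w);
    for (int k = 0; k < w; ++k)
        for (int j = 0; j < w; ++j)
            for (int i = 0; i < w; ++i)
                out[(size_t(k) * w + j) * w + i] = ker1d(dx + i) * ker1d(dy + j) * ker1d(dz + k);
}

// Memoized: 3*w 1D evaluations per NU point, then w^3 multiplies
// (effectively what the Horner path does).
void spread_weights_memoized(double dx, double dy, double dz, int w, std::vector<double> &out) {
    std::vector<double> kx(w), ky(w), kz(w);
    for (int i = 0; i < w; ++i) { kx[i] = ker1d(dx + i); ky[i] = ker1d(dy + i); kz[i] = ker1d(dz + i); }
    out.resize(size_t(w) * w * w);
    for (int k = 0; k < w; ++k)
        for (int j = 0; j < w; ++j)
            for (int i = 0; i < w; ++i)
                out[(size_t(k) * w + j) * w + i] = kx[i] * ky[j] * kz[k];
}

int main() {
    std::vector<double> a, b;
    spread_weights_naive(-0.3, 0.1, 0.4, 7, a);
    spread_weights_memoized(-0.3, 0.1, 0.4, 7, b);
    double m = 0;
    for (size_t i = 0; i < a.size(); ++i) m = std::max(m, std::abs(a[i] - b[i]));
    printf("max |naive - memoized| = %g (should be ~0)\n", m);
}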

I'll investigate the rest as the day goes on.

@ImanHosseini

ImanHosseini commented Sep 8, 2023

Using Horner, method 4 is now better than method 2:
[plot: runtime vs. N with Horner evaluation]
And concentrating more on the low-N regime for fixed M:
[plot: low-N regime, fixed M]
And for scaled M (M = 2*(N1xN2xN3)):
[plot: low-N regime, scaled M]
An interesting view is this 2D map [of 40x40 data points] showing the best method for each (N, M):
[heat map: fastest method for each (N, M), 40x40 grid]

@ahbarnett
Collaborator

ahbarnett commented Sep 8, 2023 via email

@lu1and10
Member

lu1and10 commented Sep 8, 2023

Libin: Thanks for the timing. So a 512^3 (fine grid FFT) should be 32ms?

Yes, I tried it once on an H100 card.

Also for the detailed explanation - I agree, except for the fact that "shared memory" (on each core, used in meth 2, "SM" in paper) is supposed to be much faster than global memory (meth 1, "GM-sort" in paper). In our paper this is why SM was faster even though the density (another word for your NU pts per grid pt = M/8N) was around 1, not that high.

Yes, I agree that shared memory is much faster than global memory. I mean that when M/8N is much smaller than 1, method 2 has much more thread divergence, which causes the slowdown. If M/8N is much smaller than 1, method 1's global-memory atomic writes are still slow, but there is much less contention (an atomic write is very slow when many threads try to update the same memory location; if threads update different locations, it should be about the same speed as a regular global-memory write for method 1).
I'm curious whether, for some medium-size N, say 128x128x128, there is a critical ratio M/8N at which method 2 becomes slower than method 1, though I'm not sure whether this ratio is GPU-device dependent or not.
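
As a strawman for such a rule, here is a tiny hedged sketch of a density-based default chooser (the threshold of 2 NU pts per fine-grid cell comes from the guess above; the function name and cutoff are illustrative, not cufinufft API, and the real crossover is GPU- and N-dependent):

// Hedged sketch only: pick a spread method from the NU-point density M/(upsampfac^3 * N).
#include <cstdint>
#include <cstdio>

int pick_spread_method(int64_t M, int64_t N1, int64_t N2, int64_t N3,
                       double upsampfac = 2.0) {
    double fine_grid = upsampfac * upsampfac * upsampfac
                     * double(N1) * double(N2) * double(N3);
    double density = double(M) / fine_grid;  // NU pts per fine-grid cell (M/8N for upsampfac=2)
    return density > 2.0 ? 2 : 1;            // guess: SM wins at high density, GM at low density
}

int main() {
    printf("M=2e6, 256^3 -> method %d\n", pick_spread_method(2000000, 256, 256, 256));
    printf("M=2e6,  32^3 -> method %d\n", pick_spread_method(2000000, 32, 32, 32));
}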

@lu1and10
Member

lu1and10 commented Sep 8, 2023

So a 512^3 (fine grid FFT) should be 32ms? Seems about right.

Sorry, I missed the upsampling factor; I was timing the FFT inside the cufinufft code.
To correct: cuFFT 3D single-precision complex-to-complex on an H100 takes about 4 ms for a 512^3 size and about 32 ms for 1024^3.

@ahbarnett
Collaborator

wow, that's 33 G pts/sec for complex single-prec cuFFT. nice. But H100 is a recent card and costs $30k :)

In general we need to set up some more informative benchmark code (ie, report NU pts/sec for some std problem sizes), and explain (eg in docs/devnotes.rst) how to run them via cmake.

@blackwer
Member

blackwer commented Sep 9, 2023

Robert: thanks for setting default to horner (1). The ns^3 ker evals was an old bug of Melody's that should be totally removed - I thought there was no way to trigger that any more. (ie even kereval=0 should do 3*ns evals in 3D...)

Yep. The fix for this is already on master now. I actually merged the two code paths so that spread/interp uses the exact same code except for the kernel evaluator. I've also merged in some improvements to cuperftest, though it still doesn't break down the execute step. I use nsys for those kinds of breakdowns:

rm -f report*; nsys profile ./perftest/cuda/cuperftest --N1=128 --N2=128 --N3=128 --M=$((10 * 128**3)) --m=4; nsys stats report1.nsys-rep --report gpukernsum --timeunit milliseconds

there is a critical ratio M/8N at which method 2 becomes slower than method 1, though I'm not sure whether this ratio is GPU-device dependent or not.

I was finding on my A6000 a crossover at around M = 5N iirc

@ImanHosseini

I guess that (a) the crossover is GPU-dependent, and (b) it also depends on N. Here is what I have from 40x40 data points for various N and M/8N (on the 3060 Ti):
[heat map: fastest method vs. N and M/8N, RTX 3060 Ti]
On the V100, it seems more in line with what you see on the A6000:
[heat map: fastest method vs. N and M/8N, V100]

@blackwer
Member

I guess that (a) the crossover is GPU-dependent, and (b) it also depends on N. Here is what I have from 40x40 data points for various N and M/8N (on the 3060 Ti):

Thanks for this! It could be useful as groundwork for a heuristic method selector. I have one point, though: if you're using the total field of the output, this can be problematic at small N, since it includes the makeplan call. Most users interested in high throughput will only makeplan once, and makeplan is also subject to considerably more noise in timings. You might want to consider timing just setpts+execute, or just execute, as both of those are more common workflows.
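
For reference, here is a minimal hedged CUDA-event timing pattern that keeps the one-time setup outside the timed region (the cufinufft makeplan/setpts/execute calls are indicated only as comments; a dummy kernel stands in for the execute phase):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_execute(float *buf, int n) {   // stand-in for the execute phase
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = buf[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 20, n_runs = 10;
    float *buf;
    cudaMalloc(&buf, n * sizeof(float));
    // ... makeplan would happen once, here, outside the timed region ...
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventRecord(t0);
    for (int r = 0; r < n_runs; ++r) {
        // ... setpts + execute would go here ...
        dummy_execute<<<(n + 255) / 256, 256>>>(buf, n);
    }
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("avg per run: %.3f ms (plan creation excluded)\n", ms / n_runs);
    cudaFree(buf);
}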

@janden
Collaborator Author

janden commented Sep 16, 2023

@ImanHosseini Thanks for these timings. Perhaps there is something to be gained from a smarter selection of method here (between 1 and 2). That being said, it does seem very GPU-dependent, so the question here is whether we would need some sort of calibration step. But when would we run this? Or should we have something akin to FFTW_MEASURE here? It sounds like it could get quite complicated, and it might just be better to let the user make a (well-informed) decision about these options.

Getting back to the main topic here, method 4. From what I can tell, it's (a) experimental and (b) not necessarily faster than the standard methods (1 and 2). The natural thing would be just to remove it in that case (same as we did with method 3).

@blackwer
Member

Getting back to the main topic here, method 4. From what I can tell, it's (a) experimental and (b) not necessarily faster than the standard methods (1 and 2). The natural thing would be just to remove it in that case (same as we did with method 3).

It is actually faster in many instances, so we shouldn't remove it. I've also since fixed the codepath issues with it. I think it could stand to have a bit more rigorous testing before the next release though.

Or should we have something akin to FFTW_MEASURE here?

I'm not against it -- I think it's a fine idea -- but should probably be put on the backburner for a while. Seems like a fun "rainy day" project.

@janden
Collaborator Author

janden commented Sep 28, 2023

It is actually faster in many instances, so we shouldn't remove it. I've also since fixed the codepath issues with it. I think it could stand to have a bit more rigorous testing before the next release though.

Ok got it. Would be interesting to see the numbers on this when you have time.

I'm not against it -- I think it's a fine idea -- but should probably be put on the backburner for a while. Seems like a fun "rainy day" project.

Agreed.

@ahbarnett
Collaborator

blockgather can be 2-3x faster than either SM or GM; see the recent tests in the discussion of #498.
