Benchmark blockgather (method 4) #309
@ImanHosseini Thanks for looking at this. Interesting that even at small sizes it performs much worse. @MelodyShih Do you recall what the purpose of the block gather method was? Is there some other benefit to using it?
Hi Iman,

Thanks for testing this - do you have an application in mind?

I think method=4 is experimental (not discussed in our paper) and not recommended - can others comment? I'm looping in @blackwer, who has recently improved cufinufft and wrote cuperftest.

However, I'm alarmed that meth=2 is 5x slower than meth=1, when it is supposed to be faster. If you study them, there are two effects you are seeing in your graphs:

i) M-dependent spreading cost (the intercept of the graphs). This appears to be 0.03s for meth=1,2, consistent with about 7e7 NU pts/s (if you are timing the average over 10 runs - are you?). Compare our results in the paper (Fig. 4): https://arxiv.org/abs/2102.08463
The default opts for cuperftest are single-prec (FP32), for which the V100 (our test) and your 3060 Ti appear similar on paper (16 Tflops). Our Fig. 4 (top-right panel) then gives around 8 ns per NU pt (for SM, i.e. meth=2) or 14 ns (for GM-sort, i.e. meth=1) at tol=1e-5. In either case that's around 1e8 NU pts/sec, at N=100^3. So you are at 70% of this speed - maybe you want to verify your timings against our paper? But a 0.4s intercept for meth=4 is clearly unusably slow per NU pt.

ii) linear-in-N growth, presumably due to cuFFT calls on a fine grid of size upsampfac^3*N = 8N, where N := N1*N2*N3 (# uniform pts). This should be independent of method, but you can see it is not. This is a real mystery to me - can others comment? Taking meth=1, the slope is about 0.07s for 8N = 1.3e8, i.e. 2e9 pts/sec for the FFT.
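Spelling out that slope arithmetic for the 256^3 case:

$$
8N = (2\cdot 256)^3 = 2^{27} \approx 1.34\times 10^8 \ \text{grid pts}, \qquad \frac{1.34\times 10^8\ \text{pts}}{0.07\ \text{s}} \approx 1.9\times 10^9\ \text{pts/s}.
$$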
That is a 3D complex fp32 transform of size 2^27, which seems about right. But I find it very hard to find wall-clock timings (not TFlops!) for 3D cuFFT online for, say, the V100 - Melody?

meth=2 is 5x slower, but I don't get why, since the method doesn't affect the cuFFT call, right?

I encourage you also to run, e.g.,
cufinufft3d_test 1 1 256 256 256 2000000 1e-5 2e-4 f

Robert, maybe you can comment on the use of cuperftest? (Does it report an average time or a total time for the default n_runs = 10?)

I see kerevalmeth=0 is set as the default here:
https://github.com/flatironinstitute/finufft/blob/0e2371e7ad692eaf6c40fe31993b8dc3cdbdf54d/perftest/cuda/cuperftest.cu#L91
Should that be 1 (Horner)? (Although that shouldn't affect the N-growth of ii) above.)

Best, Alex
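For context on the kerevalmeth question above, here is a minimal sketch contrasting the two evaluation modes. The kernel form, width, degree, and names are illustrative assumptions, not cufinufft's actual internals:

```cuda
// Sketch only: kerevalmeth=0 evaluates the "exponential of semicircle" (ES)
// kernel directly (one exp and one sqrt per value), while kerevalmeth=1
// evaluates a precomputed piecewise-polynomial fit by Horner's rule
// (multiply-adds only, no transcendentals in the inner loop).
constexpr int DEG = 9;  // hypothetical fit degree

// kerevalmeth=0: direct ES kernel, roughly phi(x) = exp(beta*(sqrt(1-c*x^2)-1)).
__device__ float es_kernel_direct(float x, float beta, float c) {
    return expf(beta * (sqrtf(1.0f - c * x * x) - 1.0f));
}

// kerevalmeth=1: Horner evaluation of the fitted polynomial in a local coord z.
__device__ float es_kernel_horner(float z, const float coeff[DEG + 1]) {
    float v = coeff[DEG];
    for (int k = DEG - 1; k >= 0; --k)
        v = v * z + coeff[k];  // DEG fused multiply-adds per kernel value
    return v;
}
```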
I might have some idea. The time difference between method 1 and method 2 should come from the spread kernel. Loosely speaking, for method 1 we can think of each CUDA thread as representing an NU point. If the number of NU pts per cell is large, there are more global-memory atomic updates racing for the same locations; as N grows, the number of NU pts per cell decreases, so there are fewer atomic collisions. On the other hand, for method 2 we can approximately think of each CUDA thread as representing a cell. As N grows and the number of NU pts per cell drops below 1, some threads (cells) have no points at all - here comes the thread divergence, both intra-warp and inter-warp (since we use shared memory and synchronize the CUDA block).

I would guess that in the uniform random points case, if the number of NU pts per cell is > 2, method 2 wins; if it is < 1, method 1 wins. In the plot, with M=2e6 and N1=N2=N3=256, the number of NU pts per cell is much less than 1, so there is much more thread divergence for method 2. I guess if N1=N2=N3=10, there would be many global-memory atomic races for method 1, since each thread (NU point) tries to atomically update cells, while method 2 would have far fewer, since each thread represents a cell - and shared-memory atomic updates should also be faster than global-memory ones.
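To make this picture concrete, here is a toy 1D sketch of the two parallelization strategies. It uses nearest-cell spreading and illustrative names only; the real kernels spread over a width-ns window, and method 2 additionally accumulates into shared memory per block:

```cuda
// Method-1 style ("GM"): one thread per NU point; threads that hit the same
// cell collide and are serialized by global-memory atomics.
__global__ void spread_point_driven(const float *x, const float *str, int M,
                                    float *grid, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= M) return;
    int cell = (int)floorf(x[i]);        // nearest-cell stand-in for the kernel
    if (cell >= 0 && cell < N)
        atomicAdd(&grid[cell], str[i]);  // contended when many pts share a cell
}

// Method-2 style ("SM"): one thread per grid cell; assumes the points have been
// presorted into bins (bin_start/bin_count index a bin-ordered strength array).
__global__ void spread_cell_driven(const float *str, const int *bin_start,
                                   const int *bin_count, float *grid, int N) {
    int cell = blockIdx.x * blockDim.x + threadIdx.x;
    if (cell >= N) return;
    float acc = 0.0f;
    for (int j = 0; j < bin_count[cell]; ++j)  // loop length varies per thread:
        acc += str[bin_start[cell] + j];       //   empty bins -> divergence
    grid[cell] += acc;                         // no atomics needed
}
```

The first kernel's cost grows with contention (high density), the second's with idle threads and the longest bin in each warp (low density) - exactly the crossover described above.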
I do not have an application in mind, just wanted to help with the issue.
Method 4 should be viewed as experimental, but I can provide some insight here. This only happens when using naive evaluation via kerevalmeth=0. I'll investigate the rest as the day goes on.
Some responses:

Robert: thanks for setting the default to Horner (1). The ns^3 kernel evals were an old bug of Melody's that should be totally removed - I thought there was no way to trigger that any more. (I.e., even kereval=0 should do 3*ns evals in 3D...)
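The separable structure behind that 3*ns count, sketched (ker1d stands in for whichever 1D evaluator kerevalmeth selects; this is an illustration, not the actual spreader):

```cuda
// Why 3D spreading needs only 3*ns kernel evaluations, not ns^3: the kernel is
// separable, phi(x,y,z) = phi(x)*phi(y)*phi(z), so evaluate each axis once and
// combine with cheap multiplies.
__device__ float ker1d(float x);  // 1D kernel evaluator (direct or Horner)

__device__ void spread_one_point(float dx, float dy, float dz, float strength,
                                 float *subgrid, int ldx, int ldy) {
    constexpr int ns = 7;  // illustrative kernel width
    float kx[ns], ky[ns], kz[ns];
    for (int i = 0; i < ns; ++i) {  // 3*ns = 21 kernel evaluations in total
        kx[i] = ker1d(dx + i);
        ky[i] = ker1d(dy + i);
        kz[i] = ker1d(dz + i);
    }
    for (int k = 0; k < ns; ++k)      // ns^3 = 343 multiply-adds follow,
        for (int j = 0; j < ns; ++j)  //   but no further kernel evaluations
            for (int i = 0; i < ns; ++i)
                subgrid[(k * ldy + j) * ldx + i] += strength * kx[i] * ky[j] * kz[k];
}
```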
Libin: Thanks for the timing. So a 512^3 (fine-grid) FFT should be 32 ms? Seems about right.

Also thanks for the detailed explanation - I agree, except for the fact that "shared memory" (on each core, used in meth 2, "SM" in the paper) is supposed to be much faster than global memory (meth 1, "GM-sort" in the paper). In our paper this is why SM was faster even though the _density_ (another word for your NU pts per grid pt, M/8N) was around 1, i.e. not that high.
Iman: Thanks for the new timings - looks like meth=2 only helps at N << M? That was not our finding on the V100, where in the paper meth=2 was much faster. You will want to make sure kereval=1 is always used, for a fair comparison - it should now be the default as of cf516fb.

It seems that you're reporting the *total*, not the *mean*, time from cuperftest; that is a huge 10x speedup relative to what I was saying in my last post to this issue. I.e., you are getting 6e8 NU pts/sec? Do you agree with Libin's analysis?

However, as you can see from this confusion re total vs mean, it would be more useful for us (& users) to phrase your timings in terms of throughput (NU pts/sec), as in my last post, for various densities (M/(2^d N)). For many applications the density is around 0.1-10; just giving the timing in secs for a problem is less useful.
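A tiny helper showing the kind of bookkeeping meant here (hypothetical names; a sketch, not part of the library - note time_s must be the mean over runs, not the total):

```cuda
#include <cstdio>

// Hypothetical reporting helper: phrase a benchmark as density and throughput
// (NU pts/sec) instead of raw seconds.
void report_throughput(double time_s, long long M,
                       long long N1, long long N2, long long N3, int dim = 3) {
    long long N = N1 * N2 * N3;                               // # uniform modes
    double density = (double)M / (double)((1LL << dim) * N);  // M / (2^d N)
    std::printf("density=%.3g  throughput=%.3g NU pts/s\n", density, M / time_s);
}

// e.g. report_throughput(0.003, 2000000, 256, 256, 256) for the plots above
```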
I think in order to resolve the above we need to split out the "execute" timings for spread/interp vs FFT vs deconvolve vs H2D and D2H transfer. Over in the legacy repo there are spread{1,2,3}d testers which I found useful. Maybe we could bring them over and tweak them for the new interface?

Thanks for your ongoing work on this library. Best, Alex
On Fri, Sep 8, 2023 at 4:52 PM Iman Hosseini wrote:

> Using horner fixed the perf issue; now it's better than method 2:
> [image: plot_h] <https://user-images.githubusercontent.com/12172889/266699346-d8ac2a42-9428-468e-b05e-29d4a49e2520.png>
> And concentrating more on the low-N regime for fixed M:
> [image: plot_hLR] <https://user-images.githubusercontent.com/12172889/266700964-b212da76-bb2c-464b-a284-857e86ccd23a.png>
> And for scaled M (M = 2*N1*N2*N3):
> [image: plot_hLRsM] <https://user-images.githubusercontent.com/12172889/266700985-7aca419e-a361-4a18-83dd-7ec2f4a5140d.png>
> An interesting view is this 2D map showing the best method for each <N, M>:
> [image: cmap_40] <https://user-images.githubusercontent.com/12172889/266728264-8fea705e-8b46-4f1c-9c76-316b3d61e943.png>
Yes, I tried it once on an H100 card.

Yes, I agree that shared memory is much faster than global memory. My point is that when M/8N is much smaller than 1, method 2 has much more thread divergence, which causes the slowdown. When M/8N is much smaller than 1, method 1's global-memory atomic writes are still slow, but there are far fewer atomic collisions (an atomic write is very slow when many threads try to update the same memory location; if threads update different locations, an atomic write should be about the same speed as an ordinary global-memory write for method 1).
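A toy kernel isolating that contention effect, for micro-benchmarking intuition only (illustrative, not library code):

```cuda
// With 'contended' set, every thread atomically updates the same location and
// the atomics serialize; with it unset, each thread updates its own location
// and runs at roughly ordinary global-memory write speed.
__global__ void atomic_contention(float *out, int n, bool contended) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    atomicAdd(contended ? &out[0] : &out[i], 1.0f);
}
```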
Sorry, I missed the upsample factor; I was timing the FFT in the cufinufft code.
Wow, that's 33 G pts/sec for complex single-prec cuFFT - nice. But the H100 is a recent card and costs $30k :) In general we need to set up some more informative benchmark code (i.e., report NU pts/sec for some standard problem sizes) and explain (e.g. in docs/devnotes.rst) how to run it via cmake.
Yep. The fix for this is already on …
I was finding on my …
Thanks for this! Could be useful as groundwork for a heuristic method selector. I have one point though: if you're using the …
@ImanHosseini Thanks for these timings. Perhaps there is something to be gained from a smarter selection of method here (between 1 and 2). That being said, it does seem very GPU-dependent, so the question is whether we would need some sort of calibration step. But when would we run this? Or should we have something akin to …?

Getting back to the main topic here: method 4. From what I can tell, it's (a) experimental and (b) not necessarily faster than the standard methods (1 and 2). The natural thing in that case would be just to remove it (same as we did with method 3).
It is actually faster in many instances, so we shouldn't remove it. I've also since fixed the codepath issues with it. I think it could stand to have a bit more rigorous testing before the next release, though.

I'm not against it -- I think it's a fine idea -- but it should probably be put on the back burner for a while. Seems like a fun "rainy day" project.
OK, got it. Would be interesting to see the numbers on this when you have time.

Agreed.
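For reference, a strawman of the density-based selector discussed a few comments up, using Libin's guessed crossover. The threshold is hypothetical and, per the discussion, GPU-dependent, so it would need calibration:

```cuda
// Strawman host-side method selector based on the density heuristic above.
// Not calibrated library logic; nf1..nf3 are the fine (upsampled) grid dims.
int choose_spread_method(long long M, long long nf1, long long nf2, long long nf3) {
    double density = (double)M / (double)(nf1 * nf2 * nf3);  // NU pts per fine-grid cell
    return density > 1.0 ? 2   /* SM / subproblem */
                         : 1;  /* GM-sort / point-driven */
}
```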
blockgather can be 2-3x faster than either SM or GM; see the recent tests in the discussion of #498.
Currently we have two methods for dimensions one and two, GM (non-uniform-point driven) and SM (subproblems), while dimension three has a third method, blockgather. We should test it against subproblems to see where it is useful.