Large slowdown for global memory segmented histogram on GPU #2024
Comments
The slowdown appears when going from 18K bins (shared memory) to 20K bins (global memory).
While some slowdown is expected for this operator, as it requires a spinlock, three orders of magnitude is too much. I expect the segmented case is incorrectly tuned in code generation.
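For readers unfamiliar with the operator, here is a minimal sequential sketch in Python of what an argmin-like segmented histogram computes. It is not the Futhark implementation; the function names and the (value, index) pair representation are assumptions made for illustration. Each bin holds a pair rather than a single number, which is presumably why a plain single-word atomic does not suffice and a lock is needed on the GPU.

```python
# Sequential reference for an argmin-like segmented histogram.
# All names are illustrative; none are taken from the Futhark source.

NEUTRAL = (float("inf"), float("inf"))  # neutral element for argmin_op


def argmin_op(a, b):
    """Combine two (value, index) pairs, keeping the smaller value.

    Tuple comparison looks at the value first and the index second,
    so ties are broken by the smaller index.
    """
    return min(a, b)


def segmented_argmin_hist(num_bins, segments):
    """Compute one histogram of (value, index) pairs per segment.

    `segments` is a list of segments, each a list of (bin, value) updates.
    """
    result = []
    for updates in segments:
        hist = [NEUTRAL] * num_bins
        for i, (b, v) in enumerate(updates):
            if 0 <= b < num_bins:  # out-of-range bins are ignored
                hist[b] = argmin_op(hist[b], (v, i))
        result.append(hist)
    return result
```

On a GPU many threads apply `argmin_op` to the same bin concurrently, so the read-combine-write in the inner loop has to be made atomic in some way.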
I think the problem is that we are computing the wrong index for the lock to use, meaning lock contention will be extremely high.
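To illustrate why a wrong lock index can be so costly, the following Python sketch counts how many updates end up competing for the same lock under two indexing schemes. Both schemes are hypothetical: the thread does not say what the actual faulty index was, and `per_bin` and `per_segment` are invented here purely to show the effect.

```python
from collections import Counter


def lock_load(updates, lock_index):
    """Return the largest number of updates that map to a single lock."""
    counts = Counter(lock_index(seg, b) for seg, b in updates)
    return max(counts.values())


NUM_SEGMENTS, NUM_BINS = 4, 1000

# One update per (segment, bin) pair: perfectly uniform input.
updates = [(s, b) for s in range(NUM_SEGMENTS) for b in range(NUM_BINS)]


def per_bin(seg, b):
    # Assumed intended scheme: a distinct lock for every bin of every segment.
    return seg * NUM_BINS + b


def per_segment(seg, b):
    # Hypothetical faulty scheme: the bin is dropped from the index,
    # so all updates within a segment share one lock.
    return seg


good = lock_load(updates, per_bin)      # every lock sees a single update
bad = lock_load(updates, per_segment)   # every lock sees NUM_BINS updates
```

With uniform input the per-bin scheme never has two updates waiting on the same lock, while the faulty scheme serializes every update in a segment. A slowdown factor of that order would be consistent with the numbers reported in this issue, although this toy model says nothing about the real kernel.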
nhey pushed a commit to nhey/futhark that referenced this issue on Oct 24, 2023 (cherry picked from commit a329cbb)
CKuke pushed a commit to CKuke/futhark-seq that referenced this issue on Nov 8, 2023
I'm observing slowdowns of three orders of magnitude for argmin-like segmented histograms on GPU when computed in global memory:
Inspecting runs with different numbers of bins using -D -P on the executable shows that the global memory version of the seghist kernel is the culprit. To generate data corresponding to 18, 20 and 50 thousand bins with uniformly distributed updates:

The first two should illustrate the difference between shared memory and global memory on an A100. I get similar slowdowns on my desktop GPU.
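As a rough picture of what input with uniformly distributed updates looks like, here is a small Python sketch that draws bin indices uniformly at random for the three bin counts mentioned above. This is not the command used in the report; the function name, the number of updates, and the seed are all assumptions chosen for illustration.

```python
import random


def uniform_updates(num_bins, num_updates, seed=0):
    """Draw `num_updates` (bin, value) pairs with bins uniform in [0, num_bins)."""
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    bins = [rng.randrange(num_bins) for _ in range(num_updates)]
    values = [rng.random() for _ in range(num_updates)]
    return bins, values


# Hypothetical update count; the report does not state the input size.
NUM_UPDATES = 100_000

datasets = {k: uniform_updates(k, NUM_UPDATES) for k in (18_000, 20_000, 50_000)}
```

Because the bins are drawn uniformly, no single bin is a hot spot in the input itself, so any heavy contention observed on such data would have to come from how updates are mapped to locks rather than from skew.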