Use threadIdx instead of %%warpid. #149
Conversation
@matthias-springer thank you for the PR and for using mallocMC! ✨ Do you mind adding some details on where this fails for you? (CUDA version, …) I took the liberty to change the PR from …
```diff
@@ -89,7 +89,7 @@ namespace DistributionPolicies{

     uint32 collect(uint32 bytes){

         can_use_coalescing = false;
-        warpid = mallocMC::warpid();
+        warpid = threadIdx.x >> 5;
```
Please use threadIdx.x / 32; the compiler will turn that into the bit shift. As @ax3l wrote, we need to have a deeper look into it.
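For reference, a minimal CUDA sketch (illustrative only, not mallocMC code; it assumes a 1D block) of the two indices being discussed, both derived purely from threadIdx:

```cpp
// Illustrative device code, not taken from mallocMC.
__device__ void index_example()
{
    // index of this thread's warp within the block (0 .. blockDim.x/32 - 1);
    // the compiler lowers the division by 32 to a right shift (>> 5)
    const unsigned warpIdInBlock = threadIdx.x / 32u;

    // index of this thread within its warp, i.e. the lane id (0 .. 31);
    // the compiler lowers the modulo by 32 to a mask (& 31)
    const unsigned laneId = threadIdx.x % 32u;

    // both values are stable for the lifetime of the thread, unlike the
    // hardware %warpid register, which names the warp slot on the SM
    (void)warpIdInBlock;
    (void)laneId;
}
```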
Could you please also provide the versions of the CUDA driver, the CUDA library, and the operating system you used.
I hope this helps! CUDA version 9.0, Titan Xp, driver version 390.87, running on Ubuntu 16.04.1. Basically, I noticed that warpid() sometimes returns values greater than 31. I am not sure if this is the right way to fix it, but there is a similar workaround in Scatter_impl.hpp line 936. Right now I don't have a minimal example, just a few large benchmarks that do a pretty large number of allocations and deallocations. I need a day or so to clean up my code first...
Could it be that we should add asm volatile("mov.u32 %0, %%warpid;" : "=r"(mywarpid));? See Table 120 in "PTX: Parallel Thread Execution ISA Version 2.3".
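For completeness, a hedged sketch of the difference between reading the hardware registers via inline PTX and computing a block-relative warp index (the function names are mine, not mallocMC's):

```cpp
// Illustrative only. The PTX ISA documents %warpid as volatile: it reflects
// the warp slot on the SM at the moment it is read and may change during a
// thread's lifetime, so it is not a stable index.
__device__ unsigned hw_warpid()
{
    unsigned w;
    asm volatile("mov.u32 %0, %%warpid;" : "=r"(w));   // hardware warp slot
    return w;
}

__device__ unsigned hw_laneid()
{
    unsigned l;
    asm volatile("mov.u32 %0, %%laneid;" : "=r"(l));   // 0..31 within the warp
    return l;
}

__device__ unsigned warpid_in_block()
{
    return threadIdx.x >> 5;                           // stable, assumes a 1D block
}
```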
btw, didn't know that …
It still crashes with the … What I don't quite understand is the meaning of the value …
That looks like a mismatch to me as well. The way I understood its use in the algorithm is to get a "thread index inside a warp".
@slizzered pinging you just in case you want to chime in :)
Maybe you want to use …
does …
No no, …
Uhhh, the maximum number of warps per multiprocessor is: … (see https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities). This means we need to fix it. Nevertheless, I will first have a look at it tomorrow with fresh eyes.
Yep, the question really is whether the algorithm here just needs a thread index inside a warp (laneid) or whether it really cares about the index of the warp (warpid) running on the SM. This could also just be a mismatch in naming.
It is a warp-collective allocation. Instead of each thread searching for a free memory slot in the heap, all threads in a warp aggregate the requested amount of memory and one thread of the warp searches for a free memory slot. I will fix it by adding a larger shared memory array for the intermediate offset per warp. After that, I need to review this code again, since from Volta (sm_70) each thread in a warp has its own program counter and it could be that we need to add some warp synchronizations to work correctly.
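As a rough illustration of that pattern, here is my own sketch using warp shuffles rather than the shared-memory array mentioned above; it is not the mallocMC/Scatter implementation and assumes CUDA 9+, a 1D block, a fully active warp, and a hypothetical device-side allocator alloc_bytes:

```cpp
// Hedged sketch of a warp-collective allocation, illustrative only.
__device__ void* alloc_bytes(unsigned bytes);            // hypothetical allocator

__device__ void* warp_collective_alloc(unsigned bytes)
{
    const unsigned lane = threadIdx.x & 31u;             // lane id, 0..31

    // inclusive prefix sum of the requested sizes across the warp
    unsigned offset = bytes;
    for (unsigned d = 1; d < 32; d <<= 1) {
        const unsigned n = __shfl_up_sync(0xffffffffu, offset, d);
        if (lane >= d) offset += n;
    }
    const unsigned total = __shfl_sync(0xffffffffu, offset, 31); // warp total
    offset -= bytes;                                      // exclusive offset

    // exactly one lane searches for / reserves a slot for the whole warp
    unsigned long long base = 0;
    if (lane == 0)
        base = reinterpret_cast<unsigned long long>(alloc_bytes(total));
    base = __shfl_sync(0xffffffffu, base, 0);             // broadcast base pointer

    return base ? reinterpret_cast<char*>(base) + offset : nullptr;
}
```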
I updated #149 (comment) to show the maximal number of warps for all past architectures.
But this sounds a lot like we should use … Otherwise it's one thread per warp among the currently active warps on a specific SM, if that matters. In that case, proceed as described. It is probably just confusingly described, because "warp collective operation" is not well defined. Warps per SM: careful, this must not be a compile-time constant, as PTX code is forward-compatible and compiling for sm_20 while running on Kepler+ will break the assumption. Probably use …
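One hedged way to avoid a compile-time constant (my sketch, not mallocMC code) is to query the limit at runtime from the device properties:

```cpp
// Illustrative host-side query of the maximum number of resident warps per SM.
#include <cuda_runtime.h>
#include <cstdio>

int maxWarpsPerSM(int device = 0)
{
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, device);
    return prop.maxThreadsPerMultiProcessor / prop.warpSize;  // e.g. 2048 / 32 = 64
}

int main()
{
    std::printf("max resident warps per SM: %d\n", maxWarpsPerSM());
    return 0;
}
```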
No, it is definitely not …
That is true. @matthias-springer Could you please check #150? This should solve your issue.
@psychocoderHPC #150 fixes this issue. Can be closed.
@matthias-springer thanks again for the bug report and for the help in solving the issue.
@matthias-springer I can only second René's words, thanks a lot for your report and help!
Segfaults in my code without this change.