
Use threadIdx instead of %%warpid. #149

Closed
matthias-springer wants to merge 6 commits into dev

Conversation

matthias-springer

Segfaults in my code without this change.

@ax3l
Member

ax3l commented Jan 2, 2019

@matthias-springer thank you for the PR and using mallocMC! ✨

Do you mind adding some details on where this fails for you? (CUDA version, nvcc or clang -x cuda, GPU used, maybe a mini-example?)

I took the liberty of changing the PR's base from master to dev, where new updates go first :) Feel free to rebase to make the change set cleaner if needed.

@ax3l ax3l changed the base branch from master to dev January 2, 2019 23:50
@ax3l ax3l added the bug label Jan 2, 2019
@@ -89,7 +89,7 @@ namespace DistributionPolicies{
uint32 collect(uint32 bytes){

can_use_coalescing = false;
-    warpid = mallocMC::warpid();
+    warpid = threadIdx.x >> 5;
Member

Please use threadIdx.x / 32; the compiler will turn it into the bit shift.

@psychocoderHPC
Member

psychocoderHPC commented Jan 3, 2019

As @ax3l wrote, we need to have a deeper look into it.

  • We need to check whether the kernel in which this function is used is always called with a one-dimensional CUDA block size. If not, we need to take care of all CUDA block dimensions.
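For illustration, a minimal sketch (not mallocMC code; the helper name is made up) of a block-local linear thread index that is valid for 1D, 2D, and 3D block sizes, and from which a block-local warp index could then be derived:

```cuda
// Hypothetical helper, shown only to illustrate the point above:
// a block-local linear thread index that is valid for 1D, 2D and 3D blocks.
// Dividing it by the warp size (32) would give a block-local warp index.
__device__ __forceinline__ unsigned int linearThreadIdxInBlock()
{
    return threadIdx.x
        + threadIdx.y * blockDim.x
        + threadIdx.z * blockDim.x * blockDim.y;
}
```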

@psychocoderHPC
Member

Could you please also provide the versions of the CUDA driver, the CUDA library, and the operating system?

@matthias-springer
Author

I hope this helps! CUDA version 9.0, Titan Xp, driver version 390.87 running on Ubuntu 16.04.1.

Basically, I noticed that warpid() sometimes returns values greater than 31. I am not sure if this is the right way to fix it, but there is a similar workaround in Scatter_impl.hpp, line 936.

https://github.com/ComputationalRadiationPhysics/mallocMC/blob/4b779a34cd8ba073b24f69435d71022f3988d42e/src/include/mallocMC/creationPolicies/Scatter_impl.hpp#L936

Right now I don't have a minimal example, just a few large benchmarks that do a pretty large number of allocations and deallocations. I need a day or so to clean up my code first...

@ax3l
Member

ax3l commented Jan 3, 2019

Could it be that we should add volatile to these device asm functions?

https://github.com/ComputationalRadiationPhysics/mallocMC/blob/1ca54d6572cb3f74d2df2cec79f6a59565da7771/src/include/mallocMC/mallocMC_utils.hpp#L125-L130

asm volatile("mov.u32 %0, %%warpid;"  : "=r"(mywarpid));

https://devtalk.nvidia.com/default/topic/518634/cuda-programming-and-performance/execution-id/post/3687295/#3687295

Table 120 in "PTX: Parallel Thread Execution ISA Version 2.3"

Note that %warpid is volatile and returns the location of a thread at the moment when read,
but its value may change during execution, e.g. due to rescheduling of threads following
preemption.  For this reason, %ctaid and %tid should be used to compute a virtual warp
index if such a value is needed in kernel code; %warpid is intended mainly to enable
profiling and diagnostic code to sample and log information such as work place mapping and
load distribution. 

btw, didn't know that %warpid is also quite expensive (besides being wrong for our case): NvForum and SO

@matthias-springer
Author

matthias-springer commented Jan 3, 2019

It still crashes with the volatile in place. I think volatile affects only memory accesses, so reading a register should not be affected by it.

What I don't quite understand is the meaning of the value %%warpid. Is it the ID of a warp on an SM? In any case, the Nvidia blog post that you linked says that it can have a value between 0 and 47 on Fermi. But then %%warpid is used as an index into an array of size 32 a few lines below.

https://github.com/ComputationalRadiationPhysics/mallocMC/blob/1ca54d6572cb3f74d2df2cec79f6a59565da7771/src/include/mallocMC/distributionPolicies/XMallocSIMD_impl.hpp#L98

@ax3l
Member

ax3l commented Jan 3, 2019

That looks like a mismatch to me as well. What I thought it was used for in the algorithm is to get a "thread index inside a warp" [0-31], and the asm %%warpid is indeed something completely different.

@ax3l
Member

ax3l commented Jan 3, 2019

@slizzered pinging you just in case you want to chime in :)

@matthias-springer
Author

Maybe you want to use %%laneid then.

@ax3l
Member

ax3l commented Jan 3, 2019

does mallocMC::laneid() work for you?

A predefined, read-only special register that returns the thread’s lane within the warp.  
The lane identifier ranges from zero to WARP_SZ-1.

The predefined integer constant WARP_SZ specifies the number of threads per warp for 
the target platform; the sm_1x and sm_20 targets have a WARP_SZ value of 32. 
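For reference, a minimal sketch of how such a lane id can be read via inline PTX, mirroring the warpid() asm quoted earlier in this thread (illustrative only, not necessarily mallocMC's actual implementation):

```cuda
// Illustrative only: read the %laneid special register (0 .. WARP_SZ-1).
__device__ __forceinline__ unsigned int laneid_sketch()
{
    unsigned int mylaneid;
    asm("mov.u32 %0, %%laneid;" : "=r"(mylaneid));
    return mylaneid;
}
```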

@psychocoderHPC
Member

No no, warpid looks correct in this place. It is not laneid. Let me check if Pascal can have more than 32 warps per multiprocessor.

@psychocoderHPC
Member

psychocoderHPC commented Jan 3, 2019

Uhhh, the maximum number of warps per multiprocessor is:

  • 64 for sm_30 - sm_70
  • 48 for sm_20
  • 32 for sm_75
  • 32 for sm_12 and sm_13
  • 24 for sm_10 and sm_11

This means we need to fix it. Nevertheless, I will first have another look at it tomorrow with fresh eyes.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities

@ax3l
Member

ax3l commented Jan 3, 2019

Yep, the question really is if the algorithm here just needs a thread index inside a warp (laneid) or if it really cares about the index of the warp (warpid) running on the SM. This could also just be a mismatch in naming.

@ax3l ax3l added this to the 2.4.0crp milestone Jan 3, 2019
@psychocoderHPC
Member

Yep, the question really is if the algorithm here just needs a thread index inside a warp (laneid) or if it really cares about the index of the warp (warpid) running on the SM. This could also just be a mismatch in naming.

It is a warp-collective allocation. Instead of each thread searching for a free memory slot in the heap, all threads in a warp aggregate the requested amount of memory and one thread of the warp searches for a free memory slot.

I will fix it by adding a larger shared memory array for the intermediate offset per warp. After that, I need to review this code again: since sm_70 (Volta), each thread in the warp has its own program counter, and it could be that we need to add some warp synchronizations to work correctly.
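To make the idea above concrete, here is a simplified sketch of a warp-collective allocation. It is not mallocMC's implementation; it assumes all 32 lanes of the warp are active, requires CUDA 9+ for the *_sync shuffles, and uses a hypothetical heapAlloc() placeholder for the actual heap search:

```cuda
// Hypothetical device-side heap search; placeholder, not a real mallocMC API.
__device__ void* heapAlloc(unsigned int totalBytes);

// Simplified illustration of a warp-collective allocation: the warp aggregates
// its requests and a single lane performs the heap search for the whole warp.
__device__ void* warpCollectiveAlloc(unsigned int bytes)
{
    const unsigned int fullMask = 0xffffffffu;     // assumes a full warp
    const unsigned int lane = threadIdx.x & 31u;   // lane id within the warp

    // Inclusive prefix sum of the requested bytes across the warp.
    unsigned int incl = bytes;
    for (unsigned int delta = 1; delta < 32; delta <<= 1)
    {
        const unsigned int n = __shfl_up_sync(fullMask, incl, delta);
        if (lane >= delta)
            incl += n;
    }
    const unsigned int myOffset = incl - bytes;                 // exclusive prefix
    const unsigned int total = __shfl_sync(fullMask, incl, 31); // warp total

    // Only one thread of the warp searches the heap, for the aggregated size.
    unsigned long long base = 0;
    if (lane == 0)
        base = reinterpret_cast<unsigned long long>(heapAlloc(total));
    base = __shfl_sync(fullMask, base, 0);  // broadcast the result to all lanes

    return base != 0 ? reinterpret_cast<char*>(base) + myOffset : nullptr;
}
```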

@psychocoderHPC
Member

I updated #149 (comment) to show the maximum number of warps for all past architectures.
The current number of 32 dates back to the sm_13 era.

@ax3l
Member

ax3l commented Jan 4, 2019

It is a warp-collective allocation. Instead of each thread searching for a free memory slot in the heap, all threads in a warp aggregate the requested amount of memory and one thread of the warp searches for a free memory slot.

But this sounds a lot like we should use laneid to me, no? We want to allocate with one thread per warp.

Otherwise it's one thread per currently active warp on a specific SM, if that matters. In that case, proceed as described. It's probably just confusingly described, because "warp collective operation" is not well defined.

Warps per SM: careful, this must not be a compile-time constant as PTX code is forward-compatible and compiling for sm_20 and running on Kepler+ will break the assumption. Probably use %%nwarpid.
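For completeness, a sketch of how %%nwarpid could be queried via inline PTX (illustrative only); it returns the maximum number of warp identifiers per multiprocessor as a runtime value, not a compile-time constant:

```cuda
// Illustrative only: read the %nwarpid special register at run time.
__device__ __forceinline__ unsigned int nwarpid_sketch()
{
    unsigned int n;
    asm("mov.u32 %0, %%nwarpid;" : "=r"(n));
    return n;
}
```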

@psychocoderHPC
Member

psychocoderHPC commented Jan 4, 2019

Warps per SM: careful, this must not be a compile-time constant as PTX code is forward-compatible and compiling for sm_20 and running on Kepler+ will break the assumption. Probably use %%nwarpid.

No, it is definitely not laneid. It is a warp operation.

But this sounds a lot like we should use laneid to me, no? We want to allocate with one thread per warp.

That is true. %%nwarpid is not compile time, therefore it cannot be used. Since we are using shared memory, which is only visible within the block, we need to know the warp id within the block instead of on the SM. I solved the problem in #150 by creating a helper to get the warpid within the block.

@matthias-springer Could you please check #150? This should solve your issue.
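As a rough illustration of the direction described above (the actual fix lives in #150 and is not reproduced here; the names and constants below are made up), a block-local warp id plus a shared memory array sized for the maximum number of warps per block could look like this:

```cuda
// Illustrative sketch only; see #150 for the real fix.
constexpr unsigned int warpSize32 = 32;
constexpr unsigned int maxThreadsPerBlock = 1024;   // sm_20 and newer
constexpr unsigned int maxWarpsPerBlock = maxThreadsPerBlock / warpSize32;

// Warp id within the block, valid for 1D, 2D and 3D block sizes.
__device__ __forceinline__ unsigned int warpIdWithinBlock()
{
    const unsigned int linearTid = threadIdx.x
        + threadIdx.y * blockDim.x
        + threadIdx.z * blockDim.x * blockDim.y;
    return linearTid / warpSize32;   // the compiler turns this into a shift
}

__global__ void exampleKernel()
{
    // Sized for the maximum number of warps per block, not for 32 warps per SM.
    __shared__ unsigned int warpOffsets[maxWarpsPerBlock];
    warpOffsets[warpIdWithinBlock()] = 0u;
    // ...
}
```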

@matthias-springer
Author

@psychocoderHPC #150 fixes this issue. Can be closed.

@psychocoderHPC
Member

@matthias-springer thanks again for the bug report and the help in solving the issue.

@ax3l
Member

ax3l commented Jan 8, 2019

@matthias-springer I can only second René's words, thanks a lot for your report and help!
