
Use threadIdx instead of %%warpid. #149

Closed
matthias-springer wants to merge 6 commits into dev

Conversation

matthias-springer

Segfaults in my code without this change.

@ax3l
Member

ax3l commented Jan 2, 2019

@matthias-springer thank you for the PR and using mallocMC! ✨

Do you mind adding some details on where this fails for you? (CUDA version, nvcc or clang -x cuda, GPU used, maybe a mini-example?)

I took the liberty of changing the PR's base from master to dev, where new updates go first :) Feel free to rebase to make the change set cleaner if needed.

@ax3l ax3l changed the base branch from master to dev January 2, 2019 23:50
@ax3l ax3l added the bug label Jan 2, 2019
@@ -89,7 +89,7 @@ namespace DistributionPolicies{
uint32 collect(uint32 bytes){

can_use_coalescing = false;
-    warpid = mallocMC::warpid();
+    warpid = threadIdx.x >> 5;
Member

Please use threadIdx.x / 32; the compiler will turn it into the bit shift.

@psychocoderHPC
Member

psychocoderHPC commented Jan 3, 2019

As @ax3l wrote, we need to have a deeper look into it.

  • We need to check whether the kernel in which this function is used is always called with a one-dimensional CUDA block size. If not, we need to take care of all CUDA block dimensions.
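For illustration, a minimal sketch (not mallocMC code; the helper name is made up) of a block-local linear thread index that is valid for 1D, 2D, and 3D block sizes, and from which a block-local warp index could then be derived:

```cuda
// Hypothetical helper, shown only to illustrate the point above:
// a block-local linear thread index that is valid for 1D, 2D and 3D blocks.
// Dividing it by the warp size (32) would give a block-local warp index.
__device__ __forceinline__ unsigned int linearThreadIdxInBlock()
{
    return threadIdx.x
        + threadIdx.y * blockDim.x
        + threadIdx.z * blockDim.x * blockDim.y;
}
```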

@psychocoderHPC
Member

Could you please also provide the versions of the CUDA driver, the CUDA library, and the operating system?

@matthias-springer
Author

I hope this helps! CUDA version 9.0, Titan Xp, driver version 390.87 running on Ubuntu 16.04.1.

Basically, I noticed that warpid() sometimes returns values greater than 31. I am not sure if this is the right way to fix it, but there is a similar workaround in Scatter_impl.hpp, line 936.

https://github.com/ComputationalRadiationPhysics/mallocMC/blob/4b779a34cd8ba073b24f69435d71022f3988d42e/src/include/mallocMC/creationPolicies/Scatter_impl.hpp#L936

Right now I don't have a minimal example, just a few large benchmarks that do a pretty large number of allocations and deallocations. I need a day or so to clean up my code first...

@ax3l
Member

ax3l commented Jan 3, 2019

Could it be that we should add volatile to these device asm functions?

https://github.com/ComputationalRadiationPhysics/mallocMC/blob/1ca54d6572cb3f74d2df2cec79f6a59565da7771/src/include/mallocMC/mallocMC_utils.hpp#L125-L130

asm volatile("mov.u32 %0, %%warpid;"  : "=r"(mywarpid));

https://devtalk.nvidia.com/default/topic/518634/cuda-programming-and-performance/execution-id/post/3687295/#3687295

Table 120 in "PTX: Parallel Thread Execution ISA Version 2.3"

Note that %warpid is volatile and returns the location of a thread at the moment when read,
but its value may change during execution, e.g. due to rescheduling of threads following
preemption.  For this reason, %ctaid and %tid should be used to compute a virtual warp
index if such a value is needed in kernel code; %warpid is intended mainly to enable
profiling and diagnostic code to sample and log information such as work place mapping and
load distribution. 

btw, didn't know that %warpid is also quite expensive (besides being wrong for our case): NvForum and SO

@matthias-springer
Author

matthias-springer commented Jan 3, 2019

It still crashes with the volatile in place. I think volatile affects only memory accesses, so reading a register should not be affected by it.

What I don't quite understand is the meaning of the value %%warpid. Is it the ID of a warp on an SM? In any case, the Nvidia blog post that you linked says that it can have a value between 0 and 47 on Fermi. But then %%warpid is used as an index into an array of size 32 a few lines below.

https://github.com/ComputationalRadiationPhysics/mallocMC/blob/1ca54d6572cb3f74d2df2cec79f6a59565da7771/src/include/mallocMC/distributionPolicies/XMallocSIMD_impl.hpp#L98

@ax3l
Member

ax3l commented Jan 3, 2019

That looks like a mismatch to me as well. What I thought it was used for in the algorithm is to get a "thread index inside a warp" [0-31], and the asm %%warpid is indeed something completely different.

@ax3l
Member

ax3l commented Jan 3, 2019

@slizzered pinging you just in case you want to chime in :)

@matthias-springer
Author

Maybe you want to use %%laneid then.

@ax3l
Member

ax3l commented Jan 3, 2019

does mallocMC::laneid() work for you?

A predefined, read-only special register that returns the thread’s lane within the warp.  
The lane identifier ranges from zero to WARP_SZ-1.

The predefined integer constant WARP_SZ specifies the number of threads per warp for 
the target platform; the sm_1x and sm_20 targets have a WARP_SZ value of 32. 
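For reference, a minimal sketch of how such a lane id can be read via inline PTX, mirroring the warpid() asm quoted earlier in this thread (illustrative only, not necessarily mallocMC's actual implementation):

```cuda
// Illustrative only: read the %laneid special register (0 .. WARP_SZ-1).
__device__ __forceinline__ unsigned int laneid_sketch()
{
    unsigned int mylaneid;
    asm("mov.u32 %0, %%laneid;" : "=r"(mylaneid));
    return mylaneid;
}
```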

@psychocoderHPC
Member

No no, warpid looks correct in this place. It is not laneid. Let me check if Pascal can have more than 32 warps per multiprocessor.

@psychocoderHPC
Member

psychocoderHPC commented Jan 3, 2019

Uhhh, the maximum number of warps per multiprocessor is:

  • 64 for sm_30 - sm_70
  • 48 for sm_20
  • 32 for sm_75
  • 32 for sm_12 and sm_13
  • 24 for sm_10 and sm_11

This means we need to fix it. Nevertheless, I will first have another look at it tomorrow with fresh eyes.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities

@ax3l
Member

ax3l commented Jan 3, 2019

Yep, the question really is if the algorithm here just needs a thread index inside a warp (laneid) or if it really cares about the index of the warp (warpid) running on the SM. This could also just be a mismatch in naming.

@ax3l ax3l added this to the 2.4.0crp milestone Jan 3, 2019
@psychocoderHPC
Member

Yep, the question really is if the algorithm here just needs a thread index inside a warp (laneid) or if it really cares about the index of the warp (warpid) running on the SM. This could also just be a mismatch in naming.

It is a warp-collective allocation. Instead of each thread searching for a free memory slot in the heap, all threads in a warp aggregate the requested amount of memory and one thread of the warp searches for a free memory slot.

I will fix it by adding a larger shared memory array for the intermediate offset per warp. After that, I need to review this code again: since sm_70 (Volta), each thread in the warp has its own program counter, and it could be that we need to add some warp synchronizations to work correctly.
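To make the idea above concrete, here is a simplified sketch of a warp-collective allocation. It is not mallocMC's implementation; it assumes all 32 lanes of the warp are active, requires CUDA 9+ for the *_sync shuffles, and uses a hypothetical heapAlloc() placeholder for the actual heap search:

```cuda
// Hypothetical device-side heap search; placeholder, not a real mallocMC API.
__device__ void* heapAlloc(unsigned int totalBytes);

// Simplified illustration of a warp-collective allocation: the warp aggregates
// its requests and a single lane performs the heap search for the whole warp.
__device__ void* warpCollectiveAlloc(unsigned int bytes)
{
    const unsigned int fullMask = 0xffffffffu;     // assumes a full warp
    const unsigned int lane = threadIdx.x & 31u;   // lane id within the warp

    // Inclusive prefix sum of the requested bytes across the warp.
    unsigned int incl = bytes;
    for (unsigned int delta = 1; delta < 32; delta <<= 1)
    {
        const unsigned int n = __shfl_up_sync(fullMask, incl, delta);
        if (lane >= delta)
            incl += n;
    }
    const unsigned int myOffset = incl - bytes;                 // exclusive prefix
    const unsigned int total = __shfl_sync(fullMask, incl, 31); // warp total

    // Only one thread of the warp searches the heap, for the aggregated size.
    unsigned long long base = 0;
    if (lane == 0)
        base = reinterpret_cast<unsigned long long>(heapAlloc(total));
    base = __shfl_sync(fullMask, base, 0);  // broadcast the result to all lanes

    return base != 0 ? reinterpret_cast<char*>(base) + myOffset : nullptr;
}
```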

@psychocoderHPC
Member

I updated #149 (comment) to show the maximum number of warps for all past architectures.
The current number of 32 dates back to the sm_13 era.

@ax3l
Member

ax3l commented Jan 4, 2019

It is a warp-collective allocation. Instead of each thread searching for a free memory slot in the heap, all threads in a warp aggregate the requested amount of memory and one thread of the warp searches for a free memory slot.

But this sounds a lot like we should use laneid to me, no? We want to allocate with one thread per warp.

Otherwise it's one thread per currently active warp on a specific SM, if that matters. In that case, proceed as described. It's probably just confusingly described, because "warp collective operation" is not well defined.

Warps per SM: careful, this must not be a compile-time constant as PTX code is forward-compatible and compiling for sm_20 and running on Kepler+ will break the assumption. Probably use %%nwarpid.
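For completeness, a sketch of how %%nwarpid could be queried via inline PTX (illustrative only); it returns the maximum number of warp identifiers per multiprocessor as a runtime value, not a compile-time constant:

```cuda
// Illustrative only: read the %nwarpid special register at run time.
__device__ __forceinline__ unsigned int nwarpid_sketch()
{
    unsigned int n;
    asm("mov.u32 %0, %%nwarpid;" : "=r"(n));
    return n;
}
```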

@psychocoderHPC
Member

psychocoderHPC commented Jan 4, 2019

Warps per SM: careful, this must not be a compile-time constant as PTX code is forward-compatible and compiling for sm_20 and running on Kepler+ will break the assumption. Probably use %%nwarpid.

No, it is definitely not laneid. It is a warp operation.

But this sounds a lot like we should use laneid to me, no? We want to allocate with one thread per warp.

That is true. %%nwarpid is not compile time, therefore it cannot be used. Since we are using shared memory, which is only visible within the block, we need to know the warp id within the block instead of on the SM. I solved the problem in #150 by creating a helper to get the warpid within the block.

@matthias-springer Could you please check #150? This should solve your issue.
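As a rough illustration of the direction described above (the actual fix lives in #150 and is not reproduced here; the names and constants below are made up), a block-local warp id plus a shared memory array sized for the maximum number of warps per block could look like this:

```cuda
// Illustrative sketch only; see #150 for the real fix.
constexpr unsigned int warpSize32 = 32;
constexpr unsigned int maxThreadsPerBlock = 1024;   // sm_20 and newer
constexpr unsigned int maxWarpsPerBlock = maxThreadsPerBlock / warpSize32;

// Warp id within the block, valid for 1D, 2D and 3D block sizes.
__device__ __forceinline__ unsigned int warpIdWithinBlock()
{
    const unsigned int linearTid = threadIdx.x
        + threadIdx.y * blockDim.x
        + threadIdx.z * blockDim.x * blockDim.y;
    return linearTid / warpSize32;   // the compiler turns this into a shift
}

__global__ void exampleKernel()
{
    // Sized for the maximum number of warps per block, not for 32 warps per SM.
    __shared__ unsigned int warpOffsets[maxWarpsPerBlock];
    warpOffsets[warpIdWithinBlock()] = 0u;
    // ...
}
```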

@matthias-springer
Author

@psychocoderHPC #150 fixes this issue. Can be closed.

@psychocoderHPC
Member

@matthias-springer thanks again for the bug report and the help in solving the issue.

@ax3l
Member

ax3l commented Jan 8, 2019

@matthias-springer I can only second René's words, thanks a lot for your report and help!
