-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix illegal memory access #150
Conversation
The number of warps per multiprocessor depends on the architecture. On some places the warp id was used to access block shared memory with a fixed size of 32. Since sm_20 the number of warps per multiprocessor is 64 which can lead into a out of memory access. - add helper methods for: - MaxThreadsPerBlock - WarpSize - warpid_withinblock() - fix collective warp aggregations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interestingly, that's still 32
but with valid access now ;-)
*/ | ||
struct MaxThreadsPerBlock | ||
{ | ||
// valid for sm_2.X - sm_7.5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tabs
BOOST_STATIC_ASSERT uint32_t value = 1024; | ||
}; | ||
|
||
/** the maximal number threads per block |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copy paste in docs title :)
*/ | ||
struct WarpSize | ||
{ | ||
// valid for sm_2.X - sm_7.5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tabs
|
||
/** warp id within a block | ||
* | ||
* The id is constant over the livetime of the thread. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needs more details please:
the warpid you calculate here is: unique for the warps in a block over the lifetime of a whole block, right?
Because it could also be unique only for the active number of threads, etc.
typo: lifetime
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the warpid you calculate here is: unique for the warps in a block over the lifetime of a whole block, right?
Yes, but since a thread can only have a warpid during the lifetime, the doc is correct. As long as the thread exists it has a corresponding warpid. It do not matter if the thread is active or inactive.
* The id is constant over the livetime of the thread. | ||
* The id is not equal to warpid(). | ||
* | ||
* @return warp id within the block |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in warpid()
we should add a comment that it should not be used besides for diagnostics.
Maybe we want to implement the %ctaid and %tid
getter as described by the docs.
Can you please grep
the code after your change to check no warpid
is used anywhere else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in warpid() we should add a comment that it should not be used besides for diagnostics.
We can extend the documentation but the point with the diagnostic is not correct e.g. warpid
is still used for a hashing function where it makes sense and it is not critical.
Maybe we want to implement the %ctaid and %tid getter as described by the docs.
I do not understand how this helps because %%tid == threadIdx
MAMC_ACCELERATOR inline boost::uint32_t warpid_withinblock() | ||
{ | ||
return ( | ||
threadIdx.z * blockDim.y * blockDim.x + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that implementation is likely not true if any of the block dims is not exactly a multiple of the warp size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this reason, %ctaid and %tid should be used to compute a virtual warp index
if such a value is needed in kernel code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that implementation is likely not true if any of the block dims is not exactly a multiple of the warp size.
It is not required that it is a multiple. It gives the linear thread idx and than device it by the warp size.
Could you please explain why it should be wrong?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
offline discussion: the question is just about the strategy of thread-linearization to warps at runtime. do we have a reference that this is the same mapping during thread scheduling?
As we found out, CTA and T (cooperative thread arrays and threads) are just PTX speech for blocks in grids and threads.
struct MaxThreadsPerBlock | ||
{ | ||
// valid for sm_2.X - sm_7.5 | ||
BOOST_STATIC_ASSERT uint32_t value = 1024; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am getting a compile error here. Shouldn't there be parentheses around any static assert? Or what does BOOST_STATIC_ASSERT do here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ohh it is a copy past mistake it must be BOOST_STATIC_CONSTEXPR
struct WarpSize | ||
{ | ||
// valid for sm_2.X - sm_7.5 | ||
BOOST_STATIC_ASSERT uint32_t value = 32; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
must be BOOST_STATIC_CONSTEXPR
/** warp index within a multiprocessor | ||
* | ||
* Index of the warp within the multiprocessor at the moment of the query. | ||
* The result is volatile and can different with each query. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Index of the warp on its assigned multiprocessor
can be different
MAMC_ACCELERATOR inline boost::uint32_t warpid() | ||
{ | ||
boost::uint32_t mywarpid; | ||
asm("mov.u32 %0, %%warpid;" : "=r" (mywarpid)); | ||
return mywarpid; | ||
} | ||
|
||
/** maximum number of warps on the multiprocessor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the -> a
- remove tabs - update documentation - fix wrong used variable qualifier
1919f96
to
cab1dd5
Compare
@matthias-springer I fixed my copy past issue. Could you please test if this PR solved the issue for you. |
@psychocoderHPC It works! Thank you! |
fix #149
The number of warps per multiprocessor depends on the architecture.
In some places, the warp id was used to access block shared memory with a fixed size of 32.
Since sm_20 the number of warps per multiprocessor is 64 which can create an out of memory access.
This bug effects:
getAvailableSlotsAccelerator()
XMallocSIMD