Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix illegal memory access #150

Merged
merged 2 commits into from
Jan 8, 2019
Merged

Conversation

psychocoderHPC
Copy link
Member

@psychocoderHPC psychocoderHPC commented Jan 4, 2019

fix #149

The number of warps per multiprocessor depends on the architecture.
In some places, the warp id was used to access block shared memory with a fixed size of 32.
Since sm_20 the number of warps per multiprocessor is 64 which can create an out of memory access.

  • add helper methods for:
    • MaxThreadsPerBlock
    • WarpSize
    • warpid_withinblock()
  • fix collective warp aggregations

This bug effects:

  • the device function getAvailableSlotsAccelerator()
  • the distribution pollicy XMallocSIMD

The number of warps per multiprocessor depends on the architecture.
On some places the warp id was used to access block shared memory with a fixed size of 32.
Since sm_20 the number of warps per multiprocessor is 64 which can lead into a out of memory access.

- add helper methods for:
  - MaxThreadsPerBlock
  - WarpSize
  - warpid_withinblock()
- fix collective warp aggregations
Copy link
Member

@ax3l ax3l left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly, that's still 32 but with valid access now ;-)

*/
struct MaxThreadsPerBlock
{
// valid for sm_2.X - sm_7.5
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tabs

BOOST_STATIC_ASSERT uint32_t value = 1024;
};

/** the maximal number threads per block
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy paste in docs title :)

*/
struct WarpSize
{
// valid for sm_2.X - sm_7.5
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tabs


/** warp id within a block
*
* The id is constant over the livetime of the thread.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs more details please:
the warpid you calculate here is: unique for the warps in a block over the lifetime of a whole block, right?
Because it could also be unique only for the active number of threads, etc.

typo: lifetime

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the warpid you calculate here is: unique for the warps in a block over the lifetime of a whole block, right?

Yes, but since a thread can only have a warpid during the lifetime, the doc is correct. As long as the thread exists it has a corresponding warpid. It do not matter if the thread is active or inactive.

* The id is constant over the livetime of the thread.
* The id is not equal to warpid().
*
* @return warp id within the block
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in warpid() we should add a comment that it should not be used besides for diagnostics.
Maybe we want to implement the %ctaid and %tid getter as described by the docs.

Can you please grep the code after your change to check no warpid is used anywhere else?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in warpid() we should add a comment that it should not be used besides for diagnostics.

We can extend the documentation but the point with the diagnostic is not correct e.g. warpid is still used for a hashing function where it makes sense and it is not critical.

Maybe we want to implement the %ctaid and %tid getter as described by the docs.

I do not understand how this helps because %%tid == threadIdx

MAMC_ACCELERATOR inline boost::uint32_t warpid_withinblock()
{
return (
threadIdx.z * blockDim.y * blockDim.x +
Copy link
Member

@ax3l ax3l Jan 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that implementation is likely not true if any of the block dims is not exactly a multiple of the warp size.

Copy link
Member

@ax3l ax3l Jan 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this reason, %ctaid and %tid should be used to compute a virtual warp index
if such a value is needed in kernel code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that implementation is likely not true if any of the block dims is not exactly a multiple of the warp size.

It is not required that it is a multiple. It gives the linear thread idx and than device it by the warp size.
Could you please explain why it should be wrong?

Copy link
Member

@ax3l ax3l Jan 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

offline discussion: the question is just about the strategy of thread-linearization to warps at runtime. do we have a reference that this is the same mapping during thread scheduling?

As we found out, CTA and T (cooperative thread arrays and threads) are just PTX speech for blocks in grids and threads.

struct MaxThreadsPerBlock
{
// valid for sm_2.X - sm_7.5
BOOST_STATIC_ASSERT uint32_t value = 1024;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am getting a compile error here. Shouldn't there be parentheses around any static assert? Or what does BOOST_STATIC_ASSERT do here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ohh it is a copy past mistake it must be BOOST_STATIC_CONSTEXPR

struct WarpSize
{
// valid for sm_2.X - sm_7.5
BOOST_STATIC_ASSERT uint32_t value = 32;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

must be BOOST_STATIC_CONSTEXPR

/** warp index within a multiprocessor
*
* Index of the warp within the multiprocessor at the moment of the query.
* The result is volatile and can different with each query.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Index of the warp on its assigned multiprocessor
can be different

MAMC_ACCELERATOR inline boost::uint32_t warpid()
{
boost::uint32_t mywarpid;
asm("mov.u32 %0, %%warpid;" : "=r" (mywarpid));
return mywarpid;
}

/** maximum number of warps on the multiprocessor
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the -> a

- remove tabs
- update documentation
- fix wrong used variable qualifier
@psychocoderHPC
Copy link
Member Author

@matthias-springer I fixed my copy past issue. Could you please test if this PR solved the issue for you.

@matthias-springer
Copy link

@psychocoderHPC It works! Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants