[Bug] narrow thread extents to 32 bits for GPU lowering #10969
Conversation
python/tvm/relay/op/transform.py (outdated)
@@ -836,7 +836,7 @@ def broadcast_to(data, shape):
         The resulting tensor.
     """
     if isinstance(shape, Constant):
-        shape = list(shape.data.numpy())
+        shape = [int(i) for i in shape.data.numpy()]
Because tvm.runtime.convert handles all integer types as int32, would this cause issues with arrays larger than 4 GB? I don't think we have many of those in practice, but if we did, this change could reintroduce the same overflow.
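The overflow being described can be shown in miniature with NumPy alone (no TVM needed): a dimension one past INT32_MAX silently wraps when narrowed to int32 with a C-style cast.

```python
import numpy as np

# A dimension just past the int32 range. Narrowing it to int32 via a
# C-style cast (what astype does) wraps around instead of erroring.
dim = 2**31  # INT32_MAX + 1
wrapped = np.array([dim], dtype=np.int64).astype(np.int32)[0]
print(int(wrapped))  # -2147483648
```

This is why converting every Python int to int32 is only safe while shapes stay under the 32-bit limit.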
It's possible, although the default handler for a Python list of ints would hit the same problem. In either case, I'll push a proper fix for the reduction schedule on CUDA (forcibly cast the extent to int32, since that's needed for CUDA anyway if I understand correctly).
Do you think I should wrap ints using numpy int64 to avoid this problem? I'm slightly worried it will break assumptions elsewhere.
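The "forcibly cast, but only when it fits" idea can be sketched as a plain-Python helper (the function name and error message here are hypothetical, not TVM's actual lowering code):

```python
INT32_MAX = 2**31 - 1

def cast_extent_for_cuda(extent: int) -> int:
    """Hypothetical sketch of the fix described above: CUDA launch
    dimensions are 32-bit, so narrow an int64 extent to int32 only
    after verifying it actually fits."""
    if extent > INT32_MAX:
        raise ValueError(f"thread extent {extent} does not fit in int32")
    return extent
```

The check makes the narrowing safe: in-range extents pass through, while an out-of-range extent fails loudly instead of wrapping.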
I guess one possible option is to change the behavior of tvm.runtime.convert to check for overflow and use int64 when necessary.
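That suggested behavior (this is a sketch of the proposal, not TVM's actual convert implementation) would pick the narrowest dtype that holds the value:

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -2**31

def convert_int(value: int) -> tuple:
    """Hypothetical overflow-aware convert: keep the int32 default,
    but fall back to int64 when the value is out of range."""
    if INT32_MIN <= value <= INT32_MAX:
        return ("int32", value)
    return ("int64", value)
```

Small constants keep their current int32 behavior, so only the rare >4 GB shapes would see a dtype change.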
LGTM!
I like the verification that the size will fit in an int32. I could imagine edge cases, such as a user declaring a dynamically-sized input buffer with an int64 size, then using a schedule that chooses the number of threads based on that size. However, that feels like enough of an edge case that it isn't worth replacing CanProveLess with !CanProveGreaterEqual.
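The difference between the two checks matters exactly when the prover knows nothing about the extent. A minimal model (the helper names here are illustrative, not TVM's arith API) makes it concrete:

```python
from typing import Optional

BOUND = 2**31  # first value that no longer fits in int32

def can_prove_less(upper: Optional[int], bound: int) -> bool:
    # Proof succeeds only with a known upper bound strictly below `bound`.
    return upper is not None and upper < bound

def can_prove_greater_equal(lower: Optional[int], bound: int) -> bool:
    # Proof succeeds only with a known lower bound at or above `bound`.
    return lower is not None and lower >= bound

# A dynamically-sized int64 extent has no known bounds, so both proofs fail:
unknown_lower, unknown_upper = None, None
strict = can_prove_less(unknown_upper, BOUND)                # False: keep int64
lenient = not can_prove_greater_equal(unknown_lower, BOUND)  # True: would narrow
```

So CanProveLess refuses to narrow an unbounded extent, while !CanProveGreaterEqual would narrow it optimistically; the PR takes the conservative option.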
OK, IMHO, the difference between this fix and #10983 is that:
Interesting, I'm confused how it does the cast when the IterVar is int32 but the extent is int64 (since then
Yep, I think we're solving the same problem haha
Great point, and this is where I think your PR is better than mine, as I simply assume that extent.dtype.bits <= var.dtype.bits.
But I am not sure if
Yeah, I see those test failures now... my operating assumption is that for the
If that's the case, we should probably just force the thread index to be int32, i.e., remove CanProveLess. Or let's just use int64 and let the compiler complain about larger-than-int32 extents?
Let me try just removing the CanProveLess and see if it can pass CI.
@ganler it seems like I'm breaking some assumptions elsewhere with this change: some unit tests seem to want the thread extents to explicitly be int64. Maybe we should just go with your change if you're happy with it.
I'll close this PR once yours passes CI, thanks for tracking down the other relevant PRs!
I'm fine with either fix as well, and thank you for tracking it down!
Occasionally, int64 constants get piped through lowering and end up as thread extents, which can cause a dtype mismatch with the thread IterVar (which should be int32 on GPU). This PR narrows extents to int32 for GPU lowering to avoid the mismatch.
I added a test case for a small broadcast_to -> sum program that fails to compile before this fix.
cc @Lunderberg @mbrookhart
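The dtype mismatch in the PR description can be illustrated with a NumPy analogy (this is not TVM's actual lowering, just the promotion behavior in miniature): pairing an int32 loop variable with an int64 extent promotes the comparison to int64, while GPU thread indices must stay 32-bit.

```python
import numpy as np

# An int32 IterVar-style counter next to an int64 extent: any arithmetic
# mixing the two promotes to int64, which is the mismatch this PR
# narrows away on the GPU path.
itervar_dtype = np.dtype("int32")
extent_dtype = np.dtype("int64")
promoted = np.result_type(itervar_dtype, extent_dtype)
print(promoted)  # int64
```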