-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clean up index type handling #2570
Conversation
All tests are green. I was concerned if there would be any change with the benchmarks, but I confirmed all of the generated CUDA kernels (i.e., __tmp_kernel*.cu) are exactly the same as before, so I'm pretty confident nothing changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the cleanup! Left some minor comments.
at::Tensor t0 = at::randn({999}, options); | ||
std::vector<c10::IValue> small_inputs = {t0, t0}; | ||
|
||
at::Tensor t0_large = at::randn({std::numeric_limits<int>::max()}, options); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be safer to use (int64_t)std::numeric_limits<int>::max() + 1
? In theory std::numeric_limits<int>::max()
is still indexable with 32bit indexing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assumes how we determine the index mode, i.e., we only allow up to half of the int32 max. If this assumption breaks, the TORCH_ERROR
about large_inputs
should break, so this should be fine.
@@ -135,7 +135,8 @@ bool PointerOf::operator==(const PointerOf& other) const { | |||
|
|||
enum class KernelIndexMode { INT32, INT64 }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not blocking this PR because this PR is already making the logic much more clear, but I think we should consider completely remote KernelIndexMode
and just use PrimDataType
wherever KernelIndexMode
was used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally agreed, and I did consider it as well, but technically speaking, KernelIndexModel
would allow something more than just always int32 or int64, although I don't have any specific idea.
This is a cleanup PR of the code around the kernel index mode/type.
FusionExecutor::compileFusion
respects the optional index type parameter if given. Previously, it wasn't used at all. The index type is now used as long as it doesn't conflict with the required index type for the given kernel inputs. More specifically, if the index type is int32, but the kernel inputs need int64, it throws an error. On the other hand, int64 index type is allowed even when the kernel inputs are small enough to use int32.FusionExecutor::runFusion
also has an option,CompileParams
, which includes an index type. It's just unused before. It's now an error if a different index type is passed since the index type of aKernel
is immutable.Kernel
must not beDataType::Index
and must be eitherInt
orInt32
. Currently we assume the index type is resolved at the time when aFusion
is lowered to aKernel
and then it's immutable. We could relax this restriction, but that's not part of this PR.SchedulerEntry
and itsHeuristicParams
have index type info. It seems the one ofHeuristicParams
wasn't used. Removed the one fromSchedulerEntry
, and made sure the one ofHeuristicParams
is used.I'm sure we could do more, but I'll stop here for now.
Note that this does not address the issues @mmigdal-nv attempted to fix (#2522). Notably, these remain:
Fusion
is lowered to aKernel
, its index type cannot be changed. I believe we could relax this constraint, but it's unclear how important it would be. We don't want to compile back and forth between int32 and int64, so we would need to keep two compiled kernel images of a singleKernel
. This would certainly reduce the overhead of lowering aFusion
to aKernel
as it would need to be done just once for both int32 and int64, but the nvrtc compilation still needs to be done twice. And it only matters when the (rest of) scheduler heuristics are the same for problem sizes that range from small enough to use int32 and to large enough to use int64.I think issue 1 is important and should be fixed, but the second one doesn't seem that urgent.