Conversation
Hey @ptrendx, thanks for submitting the PR.
CI supported jobs: [windows-gpu, unix-cpu, website, centos-gpu, sanity, clang, windows-cpu, miscellaneous, centos-cpu, edge, unix-gpu]
@mxnet-bot run ci [unix-cpu]
Jenkins CI successfully triggered: [unix-cpu]
Nice work! This will be an important and complementary addition to the work you already PR'd in #18622. Some high-level questions:
- Do you have any data on the overheads involved in RTC launch vs. compiled kernel launch, e.g. on the first iteration and thereafter (perhaps for both hybridized and unhybridized models)?
- I'm sorry to see all those floating point constants in the MXNet RTC code. Are there no compiler-defined constants that can be used, or is there a motivation for avoiding them?
- Having worked on these reduce functions quite a bit, you probably have a good sense of the level of testing. Do you feel it's adequate? Can RTC-based reduction invoke any new regions of the operator parameter space?
The first launch of a given kernel incurs a 10 ms to 100 ms overhead, since the kernel needs to be compiled before use. After compilation it is stored in a cache, and any subsequent call is fast: I measured ~2 us of overhead for constructing the kernel code and doing the cache lookup, which is comparable with the cost of cudaLaunchKernel itself. There is not really any difference between hybridized and non-hybridized models, since the functionality works irrespective of hybridization.
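To illustrate the compile-once, cache-thereafter behaviour described above, here is a minimal sketch of the pattern using NVRTC and the CUDA driver API. This is not MXNet's actual implementation: the cache structure, kernel name, and compile options are assumptions, and error checking is omitted for brevity.

```cpp
// Sketch only: compile a generated kernel once, serve it from a cache after.
// Assumes a current CUDA context; all error checks omitted for brevity.
#include <cuda.h>
#include <nvrtc.h>
#include <string>
#include <unordered_map>
#include <vector>

// Cache keyed by the generated source: the first lookup pays the NVRTC
// compilation cost (tens of ms); later lookups only pay the map access.
static std::unordered_map<std::string, CUfunction> kernel_cache;

CUfunction GetCompiledKernel(const std::string& src, const char* kernel_name) {
  auto it = kernel_cache.find(src);
  if (it != kernel_cache.end()) return it->second;  // fast path, ~us

  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, src.c_str(), "rtc_kernel.cu", 0, nullptr, nullptr);
  const char* opts[] = {"--gpu-architecture=compute_70"};  // illustrative arch
  nvrtcCompileProgram(prog, 1, opts);  // slow path, 10-100 ms

  size_t ptx_size;
  nvrtcGetPTXSize(prog, &ptx_size);
  std::vector<char> ptx(ptx_size);
  nvrtcGetPTX(prog, ptx.data());
  nvrtcDestroyProgram(&prog);

  CUmodule module;
  CUfunction func;
  cuModuleLoadData(&module, ptx.data());
  cuModuleGetFunction(&func, module, kernel_name);
  kernel_cache.emplace(src, func);
  return func;
}
```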
None of the floating point constants are compiler-defined; they all come from header files. The motivation for avoiding external headers is to sidestep the potential issues of locating the headers, as well as the fact that in NVRTC we cannot include any header that contains host-only code.
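As a hedged illustration of keeping the generated source self-contained, the needed constants can be emitted directly into the source string handed to NVRTC rather than pulled in via an include. The macro names below are hypothetical, not MXNet's; the values match the standard FLT_MAX/DBL_MAX.

```cpp
// Illustrative only: instead of #include <cfloat> (which NVRTC may fail to
// locate, and which can pull in host-only code), emit the constants straight
// into the generated translation unit.
const char* rtc_constants = R"code(
#define FLT_MAX_RTC 3.402823466e+38f
#define DBL_MAX_RTC 1.7976931348623157e+308
)code";

// Prefixing the kernel source with the constants keeps the unit handed to
// NVRTC fully self-contained. kernel_body_src is a hypothetical placeholder.
std::string full_src = std::string(rtc_constants) + kernel_body_src;
```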
I think the level of testing is generally adequate, and the change to RTC does not introduce any additional parameters to be tested. It actually consolidates the functionality and so improves test coverage, since previously some functions were using customized versions of the kernel, e.g. from broadcast_reduce_customized-inl.h.
@mxnet-bot run ci [centos-gpu, unix-cpu]
Jenkins CI successfully triggered: [unix-cpu, centos-gpu]
My questions have been answered previously to my satisfaction.
As pointed out by the author, this PR is a continuation of the work started in #18622, which has seen ample use by the community without issue. I feel the benefits of a smaller libmxnet.so and a smaller global-memory model footprint outweigh the penalty of slower kernel execution on first use. Our ever-growing body of kernels is more maintainable with this RTC framework, and perf-enhancing fusions become possible.
LGTM.
Description
This PR is a continuation of the work started in #18622. It switches the reduction operations to runtime compilation (RTC). I will update the description as the work progresses.
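A rough sketch of the general RTC idea, assuming a hypothetical helper `BuildReduceKernelSource` (not MXNet's actual code): the kernel body is kept as a source template, and the reducer expression is spliced in at runtime before the result is compiled with NVRTC, as in the caching sketch above.

```cpp
// Hypothetical illustration: build reduce-kernel source at runtime by
// splicing a reducer expression into a template, then hand it to NVRTC.
#include <string>

std::string BuildReduceKernelSource(const std::string& reduce_op) {
  // reduce_op might be "a + b" for sum or "max(a, b)" for max. The kernel
  // below is a deliberately naive single-thread reduction, kept short for
  // illustration; it is not MXNet's actual reduce kernel.
  std::string src = R"code(
extern "C" __global__ void reduce_kernel(const float* in, float* out, int n) {
  float acc = in[0];
  for (int i = 1; i < n; ++i) {
    float a = acc;
    float b = in[i];
    acc = OP(a, b);
  }
  if (threadIdx.x == 0 && blockIdx.x == 0) *out = acc;
}
)code";
  return "#define OP(a, b) (" + reduce_op + ")\n" + src;
}
```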
Checklist
Essentials
Changes
- `broadcast_reduce-inl.cuh`, `broadcast_reduce_customized-inl.h`
- `norm` operator (`ReduceAxesComputeImplWithReducer`)
- `kron` operator