Running generated CUDA kernel outside of PyTorch #466
Comments
TC removes blocks and threads that do nothing.
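For illustration only (this is not actual TC output), here is the kind of specialization that makes the exact launch configuration matter: a generated kernel may omit the usual bounds guard because the mapping guarantees every launched block and thread has work, so launching it with a grid or block size derived some other way silently computes wrong results.

```cuda
// Illustrative sketch, not code generated by TC: a kernel specialized to one
// exact launch configuration can drop the usual bounds guard entirely.
__global__ void scale_specialized(float* x, float alpha) {
  // No `if (i < N)` check: the mapping guaranteed that every launched thread
  // handles exactly one valid element. Launching with a larger grid writes
  // out of bounds; launching with a smaller grid silently skips elements.
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  x[i] = alpha * x[i];
}
```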
@ftynse Thanks. I'm using the conda-installed version of TC, commit
Also, an orthogonal question: let's say I previously tuned a kernel with these cached output files:
If I want to start the autotuning process off of this already-tuned kernel, do I pass
@concretevitamin the commit mentioned is pretty ancient; any chance you could build from source using the new build system (see the new build instructions)? Regarding the caching and iterating, we have been using that approach successfully from C++. There may be something lurking on the Python side that we missed, so a repro would always be useful.
@concretevitamin in particular, if you only want to use TC from Python and don't care about C++ development or benchmarks, then #470 should be pretty easy to follow.
Well, I've made a typo and it should be

> Fundamentally, is there a way to figure out the launch config from already-tuned <hash>.{cuda,options} files?

No. I would not have suggested to look at the debug output had there been such a way.
On Sun, Jun 03, 2018 at 11:09:09PM -0700, ftynse wrote:
> > Fundamentally, is there a way to figure out the launch config from already-tuned <hash>.{cuda,options} files?
>
> No. I would not have suggested to look at the debug output had there been such a way.

Hmm... isn't the point that we _should_ store this information somewhere?

skimo
If we had stored the generated code in the actual codebase, then the answer would have been yes. Codegen returns the launch bounds; now it's a matter of exposing the codegen call itself to Python. The caller can do whatever it wants with the results.
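As a purely hypothetical sketch (the names and layout are invented for illustration and are not the actual TC API), the pieces such an exposed codegen call would need to hand back to a standalone caller are the generated source, the kernel name, and the launch bounds:

```cuda
#include <string>
#include <cuda_runtime.h>  // for dim3

// Hypothetical result type, for illustration only; the real TC codegen
// interface may expose this information differently.
struct CodegenResult {
  std::string cudaSource;  // text of the generated __global__ kernel
  std::string kernelName;  // name of the specialized kernel to launch
  dim3 grid;               // launch bounds chosen by the mapper
  dim3 block;
};
```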
@ftynse @nicolasvasilache I will give building from source a try. Regarding whether or not the correct launch bounds should be stored on disk after auto-tuning: it seems obvious that they should be, otherwise how can one reuse the tuned kernels across sessions? An analogy I can think of is having successfully trained a NN but not having stored the weights :)
Well, this is not how the TC tuner was designed. It does not produce CUDA, but mapping options. Storing the CUDA code is merely a side effect of running the kernel. I think we actually killed that storage completely in the master branch.

If you need the kernel and its bounds description, give those options to the TC compiler and it will produce the desired result. The Python interface seems to be missing the proper call for this, which has to be addressed. Nothing more.

Picking up your analogy, the autotuner is more like comparing different NNs on test error. You keep the best architecture, but not necessarily the test set.
Hi,

I'm interested in running a TC-generated CUDA kernel outside of PyTorch. Currently, I'm using the TC options to specify the grid and block dim3, and I launch the auto-generated kernel (the __global__ func in /tmp/<tc>cuda) with those dimensions. However, this seems to produce incorrect values compared to a reference implementation. Am I missing anything? Is there any other setup necessary for a TC kernel to work standalone?
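For illustration, here is a minimal host-side sketch of such a standalone launch (CUDA runtime API). The kernel name, its parameter list, the buffer sizes, and the grid/block values are all placeholders: the real ones have to come from the dumped .cu file and from the launch bounds TC actually used, not be reconstructed by hand from the mapping options.

```cuda
#include <cuda_runtime.h>

// Placeholder declaration: paste or #include the generated __global__ kernel
// from the dumped file here, with its real name and parameter list.
__global__ void tc_generated_kernel(float* O, const float* A, const float* B);

int main() {
  const size_t nA = 1024, nB = 1024, nO = 1024;  // placeholder sizes
  float *dA, *dB, *dO;
  cudaMalloc(&dA, nA * sizeof(float));
  cudaMalloc(&dB, nB * sizeof(float));
  cudaMalloc(&dO, nO * sizeof(float));
  // ... cudaMemcpy the input tensors into dA and dB ...

  // Placeholder launch bounds: use the exact grid/block the TC mapper chose
  // for this kernel, not values derived manually from the options string.
  dim3 grid(32, 1, 1);
  dim3 block(256, 1, 1);
  tc_generated_kernel<<<grid, block>>>(dO, dA, dB);
  cudaDeviceSynchronize();

  // ... cudaMemcpy dO back to the host and compare against the reference ...
  cudaFree(dA);
  cudaFree(dB);
  cudaFree(dO);
  return 0;
}
```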