
Running generated CUDA kernel outside of PyTorch #466

Open

concretevitamin opened this issue Jun 1, 2018 · 9 comments

@concretevitamin

Hi,

I'm interested in running a TC-generated CUDA kernel outside of PyTorch. Currently, I'm using the TC mapping options to specify the grid and block dimensions. E.g., with

    .mapToThreads(320)
    .mapToBlocks(32, 320)

from TC, I launch the auto-generated kernel (the __global__ function in /tmp/<tc>.cuda) with the following:

    dim3 grid(32, 320);
    dim3 block(320);
    tc_kernel<<<grid, block>>>(/* arguments with correct shapes; output buffer zeroed out */);

However, this seems to produce incorrect values compared to a reference implementation. Am I missing anything? Is there any other setup necessary for a TC kernel to work standalone?

@ftynse
Contributor

ftynse commented Jun 1, 2018

TC removes blocks and threads that do nothing, so .mapToThreads(320) does not mean the kernel will actually be launched with 320 threads per block. The generated code assumes it is executed with a specific number of threads and blocks, which lets it drop unnecessary guard conditions.

Try tc.GlobalDebugInit(["--debug_tc_mapper=true", "--logtosdterr"]) and see what it outputs around the words "tightened launch bounds". Those are the grid and block sizes effectively used to launch the kernel.
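For illustration, here is a minimal standalone-launch sketch from Python using pycuda (an assumption: the thread itself does not use pycuda, and the dumped source has to compile on its own). The kernel name, shapes, and launch sizes below are placeholders; the real grid/block must be the tightened launch bounds from the debug output:

    # Sketch only: launch the dumped TC kernel outside of PyTorch via
    # pycuda. The kernel name, shapes, and grid/block values are
    # placeholders; use the tightened launch bounds from the debug log,
    # not the raw .mapToThreads/.mapToBlocks values.
    import numpy as np
    import pycuda.autoinit  # noqa: F401 -- creates a CUDA context
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    with open("/tmp/<tc>.cuda") as f:  # the dumped kernel source
        source = f.read()

    mod = SourceModule(source, no_extern_c=True)
    kernel = mod.get_function("tc_kernel")  # placeholder kernel name

    inp = np.random.rand(320, 320).astype(np.float32)
    out = np.zeros((320, 320), dtype=np.float32)  # output buffer zeroed out

    # Replace with the "tightened launch bounds" reported by the mapper.
    kernel(cuda.Out(out), cuda.In(inp), grid=(32, 320), block=(320, 1, 1))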

@concretevitamin
Author

concretevitamin commented Jun 2, 2018

@ftynse Thanks. I'm using the conda-installed version of TC (git_version: "8e112e9dccda62c30ef29208a827e783b9a7f156"), where --logtosdterr is not available. Is there a workaround? Fundamentally, is there a way to figure out the launch config from already-tuned <hash>.{cuda,options} files?

Also, an orthogonal question. Let's say I previously had tuned a kernel with these cached output files:

/tmp/<hash>.cuda
/tmp/<hash>.options

If I want to start the autotuning process from this already-tuned kernel, do I pass layer.autotune(..., cache='/tmp/<hash>')? I'm seeing a 100x worse "best" timing when I do this.

@nicolasvasilache
Contributor

nicolasvasilache commented Jun 4, 2018

@concretevitamin the commit you mention is pretty ancient; any chance you could build from source using the new build system (see the new build instructions)?
That way you would have an up-to-date version of TC and get fixes as they come.
If that is too inconvenient, you can also wait until we push a new TC conda package, but it will take a few more days.

Regarding caching and iterating, we have been using this approach successfully from C++. There may be something lurking on the Python side that we missed, so a repro would be very useful.
Note that we deprecated the CUDA cache and now only keep the topK best options (defaults to 10).
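As a hedged sketch of that workflow from Python (assuming the tensor_comprehensions API of that era; the TC definition, sizes, and cache path are placeholders):

    # Sketch: tune once, persist the topK best options on disk, then seed
    # a later tuning run from the same cache file. All names and sizes
    # here are placeholders.
    import tensor_comprehensions as tc
    import torch

    lang = """
    def matmul(float(M, K) A, float(K, N) B) -> (C) {
        C(m, n) +=! A(m, k) * B(k, n)
    }
    """
    matmul = tc.define(lang, name="matmul")
    A = torch.randn(128, 64).cuda()
    B = torch.randn(64, 256).cuda()

    # First session: tune and write the best options to /tmp/matmul_cache.
    matmul.autotune(A, B, cache="/tmp/matmul_cache")

    # Later session: pointing cache= at the same path seeds the tuner
    # with the stored options instead of starting from scratch.
    matmul.autotune(A, B, cache="/tmp/matmul_cache")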

@nicolasvasilache
Contributor

@concretevitamin in particular, if you only want to use TC from Python and don't care about C++ development or benchmarks, then #470 should be pretty easy to follow.

@ftynse
Contributor

ftynse commented Jun 4, 2018

> where --logtosdterr is not available.

Well, I made a typo: it should be --logtostderr.

> Fundamentally, is there a way to figure out the launch config from already-tuned .{cuda,options} files?

No. I would not have suggested looking at the debug output if there were such a way.
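With the corrected flag, a minimal sketch looks like this (the TC to run afterwards is whatever layer you already have):

    # Sketch: enable mapper debug output, run the layer once, and look
    # for "tightened launch bounds" on stderr.
    import tensor_comprehensions as tc

    tc.GlobalDebugInit(["--debug_tc_mapper=true", "--logtostderr"])
    # ...then define and run the TC as usual; the mapper log prints the
    # grid/block sizes actually used to launch the kernel.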

@skimo-openhub
Contributor

skimo-openhub commented Jun 4, 2018 via email

Hmm... isn't the point that we should store this information somewhere?

@ftynse
Contributor

ftynse commented Jun 4, 2018

> Hmm... isn't the point that we should store this information somewhere?

If we had stored the generated code in the actual codebase, then the answer would have been yes. Codegen returns the launch bounds; now it's a matter of exposing the codegen call itself to Python. The caller can then do whatever it wants with the results.

@concretevitamin
Author

@ftynse @nicolasvasilache I will give building from source a try.

Regarding whether or not the correct launch bounds should be stored on disk after auto-tuning: it seems obvious that they should be; otherwise, how could one reuse tuned kernels across sessions? The analogy that comes to mind is successfully training a NN but not storing the weights :)

@ftynse
Contributor

ftynse commented Jun 8, 2018 via email
