This repository was archived by the owner on Apr 28, 2023. It is now read-only.
Running generated CUDA kernel outside of PyTorch #466
Open
Description
Hi,
I'm interested in running a TC-generated CUDA kernel outside of PyTorch. Currently I'm using the TC mapping options to specify the grid and block dim3 values, e.g.:
.mapToThreads(320)
.mapToBlocks(32, 320)
With these options, I launch the auto-generated kernel (the __global__ function in /tmp/<tc>cuda) as follows:
dim3 grid(32, 320);
dim3 block(320);
// arguments with the correct shapes; output buffer zeroed out beforehand
tc_kernel<<<grid, block>>>(/* ... */);
However, this seems to produce incorrect values compared to a reference implementation. Am I missing anything? Is there any other setup needed for a TC kernel to work standalone?
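For completeness, here's a sketch of the standalone harness I'm using. The kernel stub, argument list, and element counts below are placeholders; the real kernel body is copied from the /tmp dump and the sizes match my actual tensor shapes:

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Stub standing in for the generated kernel; in the real harness this is
// the __global__ function copied verbatim from the /tmp dump.
__global__ void tc_kernel(float* O, const float* A, const float* B) {}

#define CUDA_CHECK(expr)                                           \
  do {                                                             \
    cudaError_t err_ = (expr);                                     \
    if (err_ != cudaSuccess) {                                     \
      fprintf(stderr, "CUDA error %s at %s:%d\n",                  \
              cudaGetErrorString(err_), __FILE__, __LINE__);       \
      exit(EXIT_FAILURE);                                          \
    }                                                              \
  } while (0)

int main() {
  // Placeholder element counts; the real ones match the TC input/output shapes.
  const size_t nA = 1 << 20, nB = 1 << 20, nO = 1 << 20;

  float *dA, *dB, *dO;
  CUDA_CHECK(cudaMalloc(&dA, nA * sizeof(float)));
  CUDA_CHECK(cudaMalloc(&dB, nB * sizeof(float)));
  CUDA_CHECK(cudaMalloc(&dO, nO * sizeof(float)));
  // ... cudaMemcpy host inputs into dA/dB here ...

  // Zero the output buffer before launch, as noted above.
  CUDA_CHECK(cudaMemset(dO, 0, nO * sizeof(float)));

  // Grid/block taken directly from the mapping options.
  dim3 grid(32, 320);
  dim3 block(320);
  tc_kernel<<<grid, block>>>(dO, dA, dB);
  CUDA_CHECK(cudaGetLastError());
  CUDA_CHECK(cudaDeviceSynchronize());

  // ... cudaMemcpy dO back to the host and compare with the reference ...

  CUDA_CHECK(cudaFree(dA));
  CUDA_CHECK(cudaFree(dB));
  CUDA_CHECK(cudaFree(dO));
  return 0;
}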