
Running generated CUDA kernel outside of PyTorch #466

Open

concretevitamin opened this issue Jun 1, 2018 · 9 comments

@concretevitamin

Hi,

I'm interested in running a TC-generated CUDA kernel outside of PyTorch. Currently, I'm using the TC mapping options to specify the grid and block dimensions. E.g., with

    .mapToThreads(320)
    .mapToBlocks(32, 320)

from TC, I launch the auto-generated kernel (the __global__ function in /tmp/<tc>.cuda) with the following:

    dim3 grid(32, 320);
    dim3 block(320);
    tc_kernel<<<grid, block>>>(/* arguments with correct shapes; output buffer zeroed out */);

However, this seems to produce incorrect values compared to a reference implementation. Am I missing anything? Is there any other setup necessary for a TC kernel to work standalone?

@ftynse
Contributor

ftynse commented Jun 1, 2018

TC removes blocks and threads that do nothing, so .mapToThreads(320) does not mean the kernel will actually be launched with 320 threads per block. The generated code assumes it is executed with a specific number of threads and blocks, which lets it drop unnecessary guard conditions.

Try tc.GlobalDebugInit(["--debug_tc_mapper=true", "--logtosdterr"]) and see what it outputs around the words "tightened launch bounds". Those are the grid and block sizes effectively used to launch the kernel.
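For illustration, here is a minimal standalone-launch sketch from Python using pycuda (an assumption: the thread itself does not use pycuda, and the dumped source has to compile on its own). The kernel name, shapes, and launch sizes below are placeholders; the real grid/block must be the tightened launch bounds from the debug output:

    # Sketch only: launch the dumped TC kernel outside of PyTorch via
    # pycuda. The kernel name, shapes, and grid/block values are
    # placeholders; use the tightened launch bounds from the debug log,
    # not the raw .mapToThreads/.mapToBlocks values.
    import numpy as np
    import pycuda.autoinit  # noqa: F401 -- creates a CUDA context
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    with open("/tmp/<tc>.cuda") as f:  # the dumped kernel source
        source = f.read()

    mod = SourceModule(source, no_extern_c=True)
    kernel = mod.get_function("tc_kernel")  # placeholder kernel name

    inp = np.random.rand(320, 320).astype(np.float32)
    out = np.zeros((320, 320), dtype=np.float32)  # output buffer zeroed out

    # Replace with the "tightened launch bounds" reported by the mapper.
    kernel(cuda.Out(out), cuda.In(inp), grid=(32, 320), block=(320, 1, 1))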

@concretevitamin
Author

concretevitamin commented Jun 2, 2018

@ftynse Thanks. I'm using the conda-installed version of TC (git_version: "8e112e9dccda62c30ef29208a827e783b9a7f156"), where --logtosdterr is not available. Is there a workaround? Fundamentally, is there a way to figure out the launch config from already-tuned <hash>.{cuda,options} files?

Also, an orthogonal question. Let's say I previously had tuned a kernel with these cached output files:

/tmp/<hash>.cuda
/tmp/<hash>.options

If I want to start the autotuning process from this already-tuned kernel, do I pass layer.autotune(..., cache='/tmp/<hash>')? I'm seeing a 100x worse "best" timing when I do this.

@nicolasvasilache
Contributor

nicolasvasilache commented Jun 4, 2018

@concretevitamin the commit you mention is pretty ancient; any chance you could build from source using the new build system (see the new build instructions)?
That way you would have an up-to-date version of TC and get fixes as they come.
If that is too inconvenient, you can also wait until we push a new TC conda package, but it will take a few more days.

Regarding caching and iterating, we have been using this approach successfully from C++. There may be something lurking on the Python side that we missed, so a repro would be very useful.
Note that we deprecated the CUDA cache and now only keep the topK best options (defaults to 10).
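As a hedged sketch of that workflow from Python (assuming the tensor_comprehensions API of that era; the TC definition, sizes, and cache path are placeholders):

    # Sketch: tune once, persist the topK best options on disk, then seed
    # a later tuning run from the same cache file. All names and sizes
    # here are placeholders.
    import tensor_comprehensions as tc
    import torch

    lang = """
    def matmul(float(M, K) A, float(K, N) B) -> (C) {
        C(m, n) +=! A(m, k) * B(k, n)
    }
    """
    matmul = tc.define(lang, name="matmul")
    A = torch.randn(128, 64).cuda()
    B = torch.randn(64, 256).cuda()

    # First session: tune and write the best options to /tmp/matmul_cache.
    matmul.autotune(A, B, cache="/tmp/matmul_cache")

    # Later session: pointing cache= at the same path seeds the tuner
    # with the stored options instead of starting from scratch.
    matmul.autotune(A, B, cache="/tmp/matmul_cache")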

@nicolasvasilache
Contributor

@concretevitamin in particular, if you only want to use TC from Python and don't care about C++ development or benchmarks, then #470 should be pretty easy to follow.

@ftynse
Contributor

ftynse commented Jun 4, 2018

> where --logtosdterr is not available.

Well, I made a typo: it should be --logtostderr.

> Fundamentally, is there a way to figure out the launch config from already-tuned .{cuda,options} files?

No. I would not have suggested looking at the debug output if there were such a way.
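With the corrected flag, a minimal sketch looks like this (the TC to run afterwards is whatever layer you already have):

    # Sketch: enable mapper debug output, run the layer once, and look
    # for "tightened launch bounds" on stderr.
    import tensor_comprehensions as tc

    tc.GlobalDebugInit(["--debug_tc_mapper=true", "--logtostderr"])
    # ...then define and run the TC as usual; the mapper log prints the
    # grid/block sizes actually used to launch the kernel.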

@skimo-openhub
Contributor

skimo-openhub commented Jun 4, 2018 via email

Hmm... isn't the point that we should store this information somewhere?

@ftynse
Contributor

ftynse commented Jun 4, 2018

> Hmm... isn't the point that we should store this information somewhere?

If we had stored the generated code in the actual codebase, then the answer would have been yes. Codegen returns the launch bounds; now it's a matter of exposing the codegen call itself to Python. The caller can then do whatever it wants with the results.

@concretevitamin
Author

@ftynse @nicolasvasilache I will give building from source a try.

Regarding whether or not the correct launch bounds should be stored on disk after auto-tuning: it seems obvious that they should be; otherwise, how could one reuse tuned kernels across sessions? The analogy that comes to mind is successfully training a NN but not storing the weights :)

@ftynse
Contributor

ftynse commented Jun 8, 2018 via email
