In `examples/flash_decoding/example_mha_inference.py`, if I change the execution backend from `dlpack` to the default (`cython`), i.e. replace `kernel = tilelang.compile(program, out_idx=[5], target="cuda", execution_backend="dlpack")` with `kernel = tilelang.compile(program, out_idx=[5])`, the program fails to run.
The error is:
Traceback (most recent call last):
  File "/cfy/tilelang/examples/flash_decoding/example_mha_inference.py", line 311, in <module>
    profiler.assert_allclose(ref_program, rtol=0.01, atol=0.01)
  File "/cfy/tilelang/tilelang/profiler/__init__.py", line 99, in assert_allclose
    torch.cuda.synchronize()
  File "/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py", line 985, in synchronize
    return torch._C._cuda_synchronize()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
My device is an H200 with CUDA 12.6.
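As the error message notes, the illegal-memory-access report is asynchronous, so the traceback above may not point at the faulting kernel. To localize it, `CUDA_LAUNCH_BLOCKING=1` has to be in the environment before the CUDA context is created. A minimal sketch of how I run the repro with it set from inside the script (the `torch`/`tilelang` imports stand in for the rest of `example_mha_inference.py` and are commented out here):

```python
import os

# CUDA_LAUNCH_BLOCKING must be set before the first CUDA call creates
# the context, so it goes at the very top of the script, ahead of the
# torch / tilelang imports (or is exported in the shell instead:
#   CUDA_LAUNCH_BLOCKING=1 python example_mha_inference.py).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch      # rest of the repro script follows from here
# import tilelang
```

With launches made synchronous this way, the traceback should stop at the actual failing kernel launch rather than at the later `torch.cuda.synchronize()` call.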