In `examples/flash_decoding/example_mha_inference.py`, if I change the execution backend from `dlpack` to the default (`cython`), i.e. replace `kernel = tilelang.compile(program, out_idx=[5], target="cuda", execution_backend="dlpack")` with `kernel = tilelang.compile(program, out_idx=[5])`, the program fails to run.
The error is:
Traceback (most recent call last):
  File "/cfy/tilelang/examples/flash_decoding/example_mha_inference.py", line 311, in <module>
    profiler.assert_allclose(ref_program, rtol=0.01, atol=0.01)
  File "/cfy/tilelang/tilelang/profiler/__init__.py", line 99, in assert_allclose
    torch.cuda.synchronize()
  File "/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py", line 985, in synchronize
    return torch._C._cuda_synchronize()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
My device is an H200 with CUDA 12.6.
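As the error message notes, the illegal-memory-access report is asynchronous, so the traceback above may not point at the faulting kernel. To localize it, `CUDA_LAUNCH_BLOCKING=1` has to be in the environment before the CUDA context is created. A minimal sketch of how I run the repro with it set from inside the script (the `torch`/`tilelang` imports stand in for the rest of `example_mha_inference.py` and are commented out here):

```python
import os

# CUDA_LAUNCH_BLOCKING must be set before the first CUDA call creates
# the context, so it goes at the very top of the script, ahead of the
# torch / tilelang imports (or is exported in the shell instead:
#   CUDA_LAUNCH_BLOCKING=1 python example_mha_inference.py).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch      # rest of the repro script follows from here
# import tilelang
```

With launches made synchronous this way, the traceback should stop at the actual failing kernel launch rather than at the later `torch.cuda.synchronize()` call.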