Hi, we see NaN values in the output of the BatchPrefillWithPagedKVCacheSM90Run kernel when it runs after an unrelated rope init kernel. The inputs and outputs of the rope init kernel have no influence on the inputs of the BatchPrefillWithPagedKVCacheSM90Run kernel.
More specifically, and strangely, running either rope init or BatchPrefillWithPagedKVCacheSM90Run on its own produces no NaN. The NaN only appears when rope init and BatchPrefillWithPagedKVCacheSM90Run are run back to back, and it always appears in the first attention head of the output tensor.
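For anyone who wants to check the per-head observation on their own output, here is a minimal sketch of how the affected heads could be listed. It assumes the output has been copied to the host and converted to nested Python lists shaped [num_tokens][num_heads][head_dim]; that layout, the helper name `heads_with_nan`, and the toy data are illustrative assumptions and are not taken from the kernel or the demo repo.

```python
import math

def heads_with_nan(output):
    """Return the sorted indices of attention heads that contain any NaN.

    Assumption: `output` is nested lists shaped
    [num_tokens][num_heads][head_dim] (hypothetical layout).
    """
    bad = set()
    for token in output:
        for head_idx, head in enumerate(token):
            if any(math.isnan(x) for x in head):
                bad.add(head_idx)
    return sorted(bad)

# Toy data: 2 tokens, 3 heads, head_dim 2, with NaN only in head 0.
nan = float("nan")
toy = [
    [[nan, 0.1], [0.2, 0.3], [0.4, 0.5]],
    [[0.6, nan], [0.7, 0.8], [0.9, 1.0]],
]
print(heads_with_nan(toy))  # [0]
```

With a real output tensor, the nested lists would first have to be produced from the device data; the scan itself does not depend on how they were obtained.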
After several rounds of checking, we still could not pinpoint where the error comes from, so we put together a reproduction demo at https://github.com/JamesLim-sy/gpu_issues. Please take a look.