NaN occurs when an unrelated CUDA kernel runs before BatchPrefillWithPagedKVCacheSM90Run #1018

@JamesLim-sy

Description

Hi, we found that NaNs appear in the output of the BatchPrefillWithPagedKVCacheSM90Run kernel when it runs after an unrelated rope init kernel. The inputs and outputs of the rope init kernel have no influence on the inputs of the BatchPrefillWithPagedKVCacheSM90Run kernel.
More specifically, and strangely, running either rope init or BatchPrefillWithPagedKVCacheSM90Run alone produces no NaNs. The NaNs appear only when rope init and BatchPrefillWithPagedKVCacheSM90Run run back to back, and they always appear in the first attention head of the output tensor.
After several rounds of checking we still cannot pinpoint where the error comes from, so we put together a reproduction demo at https://github.com/JamesLim-sy/gpu_issues. Please take a look.
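
For anyone reproducing this, here is a minimal sketch of how the "which head contains NaN" check could be done on a host-side copy of the output. It is not taken from the repro repository: the helper name `heads_with_nan` and the row-major `[num_tokens, num_heads, head_dim]` layout are assumptions made for illustration only.

```python
import math


def heads_with_nan(flat_out, num_tokens, num_heads, head_dim):
    """Return sorted indices of attention heads whose slice contains a NaN.

    Assumes flat_out is a host-side, row-major copy of the output tensor
    with shape [num_tokens, num_heads, head_dim] (an assumed layout).
    """
    expected = num_tokens * num_heads * head_dim
    if len(flat_out) != expected:
        raise ValueError(f"expected {expected} elements, got {len(flat_out)}")
    bad = set()
    for idx, value in enumerate(flat_out):
        if math.isnan(value):
            # Row-major: idx = (token * num_heads + head) * head_dim + dim
            bad.add((idx // head_dim) % num_heads)
    return sorted(bad)


# Tiny synthetic example: 2 tokens, 3 heads, head_dim 2, NaNs planted in head 0.
out = [0.0] * (2 * 3 * 2)
out[1] = float("nan")  # token 0, head 0, dim 1
out[6] = float("nan")  # token 1, head 0, dim 0
print(heads_with_nan(out, 2, 3, 2))
```

Running the same check on the output of each configuration (rope init alone, attention alone, both back to back) should make it easy to confirm whether the NaNs are confined to head 0.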

Labels

bug: Something isn't working
