Fix KV chunking for POD. #1054
Conversation
For some reason cudaOccupancyMaxActiveBlocksPerMultiprocessor returns 0, so manually calculate the value instead.
```cpp
    cudaDeviceGetAttribute(&num_sm, cudaDevAttrMultiProcessorCount, dev_id));
    FLASHINFER_CUDA_CALL(cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &num_blocks_per_sm, kernel, num_threads_p, smem_size_p));
    // FLASHINFER_CUDA_CALL(cudaOccupancyMaxActiveBlocksPerMultiprocessor(
```
It's interesting to me, and likely a bug in cudaOccupancyMaxActiveBlocksPerMultiprocessor.
Let's merge this first. Thanks for the contribution!
cudaOccupancyMaxActiveBlocksPerMultiprocessor is buggy on both A100 40G and H100 for me.
@Edenzzzz there might be some problem with the bandwidth measurement, because the reported numbers exceed the hardware limit (for H100, the maximum memory bandwidth is 3352 GB/s).

Yeah, it's not obvious to me why. Will try NCU.

@yzh119 The kernel timing is correct according to nsys.

Huh, this is leading me to believe there's still some bug in POD. Looking into it.

There was a bug: #1059