@AKKamath
Contributor

For some reason cudaOccupancyMaxActiveBlocksPerMultiprocessor returns 0, so manually calculate the value instead.
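A manual calculation of this kind can be sketched as below. This is only an illustration under stated assumptions, not the code from this PR: `SmLimits` and `EstimateBlocksPerSm` are hypothetical names, the limits would normally be read with `cudaDeviceGetAttribute` (for example `cudaDevAttrMaxSharedMemoryPerMultiprocessor` and `cudaDevAttrMaxThreadsPerMultiProcessor`), and register pressure is ignored, so the result is an upper bound rather than an exact occupancy.

```cpp
#include <algorithm>

// Hypothetical container for per-SM resource limits. In a real program these
// values would be queried from the device rather than hard-coded.
struct SmLimits {
  int max_smem_bytes;  // shared memory available per SM, in bytes
  int max_threads;     // maximum resident threads per SM
  int max_blocks;      // maximum resident blocks per SM
};

// Upper-bound estimate of resident blocks per SM for one kernel configuration.
// Assumption: only thread count and shared memory are considered; register
// usage and per-block reserved shared memory are ignored.
int EstimateBlocksPerSm(const SmLimits& sm, int threads_per_block,
                        int smem_per_block) {
  if (threads_per_block <= 0) return 0;
  const int by_threads = sm.max_threads / threads_per_block;
  // A kernel that uses no shared memory is not limited by it.
  const int by_smem =
      smem_per_block > 0 ? sm.max_smem_bytes / smem_per_block : sm.max_blocks;
  return std::min({by_threads, by_smem, sm.max_blocks});
}
```

For example, with illustrative limits of 164 KiB shared memory, 2048 threads, and 32 blocks per SM, a kernel using 128 threads and 48 KiB of shared memory per block is bounded by shared memory at 3 blocks per SM.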

@Edenzzzz
Contributor

Edenzzzz commented May 13, 2025

Confirmed that this combined with setting prefill bs to 1 does make the kernel faster. (H100)
image

@yzh119
Collaborator

yzh119 commented May 13, 2025

> Confirmed that this combined with setting prefill bs to 1 does make the kernel faster.

Which GPU architecture are you testing on?

cudaDeviceGetAttribute(&num_sm, cudaDevAttrMultiProcessorCount, dev_id));
FLASHINFER_CUDA_CALL(cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&num_blocks_per_sm, kernel, num_threads_p, smem_size_p));
// FLASHINFER_CUDA_CALL(cudaOccupancyMaxActiveBlocksPerMultiprocessor(
Collaborator

This is interesting, and likely a bug in cudaOccupancyMaxActiveBlocksPerMultiprocessor.
Let's merge this first, thanks for the contribution!

Contributor

cudaOccupancyMaxActiveBlocksPerMultiprocessor is buggy on both A100 40G and H100 for me.

@yzh119 yzh119 merged commit 25fb405 into flashinfer-ai:main May 13, 2025
2 checks passed
@Edenzzzz
Contributor

> Which GPU architecture are you testing on?

The one above was on H100; this one is on A100 40G.
image

@yzh119
Collaborator

yzh119 commented May 13, 2025

@Edenzzzz there may be a problem with the bandwidth measurement, because the reported values exceed the hardware limit (for H100, the maximum bandwidth is 3352 GB/s).
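A sanity check of this kind can be sketched as follows. It is a minimal illustration, not the benchmark's actual code: `AchievedBandwidthGBps` and `ExceedsPeak` are hypothetical helper names, 1 GB is taken as 1e9 bytes, and the byte count is assumed to be the total global-memory traffic that the benchmark attributes to the kernel.

```cpp
#include <cstdint>

// Achieved bandwidth in GB/s (1 GB = 1e9 bytes) for `bytes_moved` bytes
// transferred in `elapsed_ms` milliseconds.
double AchievedBandwidthGBps(std::uint64_t bytes_moved, double elapsed_ms) {
  const double seconds = elapsed_ms * 1e-3;
  return static_cast<double>(bytes_moved) / seconds / 1e9;
}

// Returns true when a measured figure is above the hardware peak, which
// indicates that the byte count or the timing fed into the measurement is off.
bool ExceedsPeak(double achieved_gbps, double peak_gbps) {
  return achieved_gbps > peak_gbps;
}
```

For instance, a report of 4e9 bytes moved in 1 ms works out to about 4000 GB/s, which is above the 3352 GB/s peak quoted above and so cannot be a real memory throughput; either the assumed byte count is too high or the measured time is too short.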

@Edenzzzz
Contributor

Edenzzzz commented May 13, 2025

Yeah, it's not obvious to me why. Will try NCU

@Edenzzzz
Contributor

@yzh119 The kernel timing is correct according to nsys
image
image
Bandwidth not so much
image

@AKKamath
Contributor Author

Huh, this is leading me to believe there's still some bug in POD. Looking into it.

@AKKamath
Contributor Author

There was a bug: #1059
