-
Notifications
You must be signed in to change notification settings - Fork 199
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove block size template parameter from CAGRA search #1740
Remove block size template parameter from CAGRA search #1740
Conversation
|
CAGRA binary size24M -> 6.8M (only for CC80) Object files
|
…ft into remove-block_size-from-CAGRA-2
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @enp1s0 for this work! With these changes, the CAGRA binary size is reduced to 40 MiB (from 150 MiB) which is a great size reduction!
As discussed offline:
- we have seen 5-7% perf degradation in some tests (large batch size) due to this changes.
- small batch size (1, 10) perf shall be checked with
multi_cta
kernel. If the perf is affected for themulti_cta
kernel, then the changes can be reverted for that kernel. Most of the binary size reduction comes from changes insingle_cta
kernel, themulti_cta
kernel size reduction is 8.9 MiB --> 4.2 MiB
cpp/include/raft/neighbors/detail/cagra/search_multi_cta_kernel-inl.cuh
Outdated
Show resolved
Hide resolved
@@ -739,34 +743,13 @@ __launch_bounds__(BLOCK_SIZE, BLOCK_COUNT) __global__ | |||
|
|||
template <unsigned TEAM_SIZE, unsigned MX_DIM, typename T, typename IdxT, typename DistT> | |||
struct search_kernel_config { | |||
using kernel_t = decltype(&search_kernel<TEAM_SIZE, 64, 16, 64, 64, 0, MX_DIM, T, DistT, IdxT>); | |||
using kernel_t = decltype(&search_kernel<TEAM_SIZE, 64, 64, 0, MX_DIM, T, DistT, IdxT>); | |||
|
|||
template <unsigned MAX_ITOPK, unsigned CANDIDATES, unsigned USE_BITONIC_SORT> | |||
static auto choose_block_size(unsigned block_size) -> kernel_t |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above: either rename, or remove this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed the function
@@ -302,15 +299,15 @@ __device__ inline void select_best_index_for_next_threshold( | |||
// index under the condition that the sum of the number of elements found | |||
// so far ('nx_below_threshold') and the csum value does not exceed the | |||
// topk value. | |||
typedef BlockScan<uint32_t, blockDim_x> BlockScanT; | |||
typedef block_scan<uint32_t> BlockScanT; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to replace cub::BlockScan with a custom one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Capturing our offline discussion: cub has the blocksize as a template argument, and this PR removes this template arg, therefore we cannot use cub directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Block size can have the following values: 64, 128, 256, 512, 1024. Could we still keep cub
, and do a dispatch based on the runtime arg, like:
switch(blockDim.x) {
case 64:
typedef cub::BlockScan<uint32_t, 64> BlockScanT;
BlockScanT(temp_storage).InclusiveSum(csum, csum);
break;
case 128:
...
}
Instead of adding a custom blockselect implementation that we need to maintain long term, it would be strongly preferred to rely on cub
. Since we are instantiating multiple variants of BlockScan, I expect that it will slightly increase the binary size. But the main search kernels are still without the block size template args, and I hope that constitutes the main part of binary size saving.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you that it is preferred to rely on cub. Let me check if the register usage of your suggested method is less than or equal to the custom block scan implementation, as it can cause throughput degradation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The register usage is the same in most cases, but the CUB implementation you mentioned is less in some cases. So, I have changed the implementation as you indicated. The search throughput is almost the same as the custom block scan implementation.
@enp1s0 burndown is starting tomorrow and lasts for a week. Do you think we can get this PR merged before burndown ends? |
@tfeher Thank you for reviewing the code. @cjnolet Yes, we can probably merge this PR or decide not to merge it by the burndown ends. The basic implementation is already done. The remaining tasks are as follows:
|
…ft into remove-block_size-from-CAGRA-2
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @enp1s0 for the updates and for the thorough benchmarks. The PR looks good to me.
/merge |
PR rapidsai#1740 forgot to rename `BLOCK_SIZE` in `#ifdef _CLK_BREAKDOWN` blocks. also remove an unused function in search_single_cta_kernel-inl.cuh
PR rapidsai#1740 forgot to rename `BLOCK_SIZE` in `#ifdef _CLK_BREAKDOWN` blocks. The use of `RAFT_LOG_DEBUG` in kernel function results in compilation errors, replace it with `printf`. Also remove an unused function in search_single_cta_kernel-inl.cuh
PR #1740 forgot to rename `BLOCK_SIZE` in `#ifdef _CLK_BREAKDOWN` blocks. The use of `RAFT_LOG_DEBUG` in kernel function results in compilation errors, replace it with `printf`. Also remove an unused function in search_single_cta_kernel-inl.cuh After merging: - [x] port to cuVS rapidsai/cuvs#202 Authors: - Yinzuo Jiang (https://github.com/jiangyinzuo) - Tamas Bela Feher (https://github.com/tfeher) - Corey J. Nolet (https://github.com/cjnolet) Approvers: - tsuki (https://github.com/enp1s0) - Tamas Bela Feher (https://github.com/tfeher) URL: #2350
This PR removes block size template parameters from CAGRA search kernel functions to reduce the library size and build time.
rel: #1459