
Remove block size template parameter from CAGRA search #1740

Merged

Conversation

@enp1s0 (Member) commented Aug 16, 2023

This PR removes block size template parameters from CAGRA search kernel functions to reduce the library size and build time.

rel: #1459
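
As a rough illustration of the change described above (the names below are hypothetical, not the actual RAFT kernels): dropping the block-size template parameter means the kernel reads blockDim.x at run time, so only one kernel is compiled per remaining template combination and the block size becomes an ordinary launch argument.

#include <cstdint>

// Before: BLOCK_SIZE is a template parameter, so a separate kernel is compiled per block size.
template <unsigned TEAM_SIZE, unsigned MAX_DIM, unsigned BLOCK_SIZE, typename T>
__launch_bounds__(BLOCK_SIZE) __global__ void search_kernel_old(const T* dataset, std::uint32_t* result)
{
  // BLOCK_SIZE is a compile-time constant inside the kernel body.
}

// After: no block-size template parameter; the kernel queries blockDim.x instead.
template <unsigned TEAM_SIZE, unsigned MAX_DIM, typename T>
__global__ void search_kernel_new(const T* dataset, std::uint32_t* result)
{
  const unsigned block_size = blockDim.x;  // runtime value, formerly the BLOCK_SIZE template argument
  (void)block_size;                        // the search body would use block_size here
}

// Launch side: a single kernel symbol, block size chosen at run time, e.g.
//   search_kernel_new<8, 128, float><<<n_queries, block_size, 0, stream>>>(dataset, result);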

@enp1s0 enp1s0 requested a review from a team as a code owner August 16, 2023 03:56
@enp1s0 enp1s0 self-assigned this Aug 16, 2023
@github-actions github-actions bot added the cpp label Aug 16, 2023
@enp1s0 enp1s0 added the improvement (Improvement / enhancement to an existing function) and non-breaking (Non-breaking change) labels and removed the cpp label Aug 16, 2023
@enp1s0 (Member, Author) commented Aug 16, 2023

  • Check accuracy on some datasets
  • Check for performance degradation
  • Remove BlockScan

@cjnolet (Member) commented Aug 16, 2023

@enp1s0 you should consider installing and using pre-commit. It'll automatically fix the style errors for you. Here are the instructions.

@enp1s0 (Member, Author) commented Aug 17, 2023

> @enp1s0 you should consider installing and using pre-commit. It'll automatically fix the style errors for you. Here are the instructions.

Thanks @cjnolet!
I forgot to install it when I re-cloned the repository; it's installed again now.

@github-actions github-actions bot added the cpp label Aug 17, 2023
@enp1s0 (Member, Author) commented Aug 23, 2023

CAGRA binary size

24M -> 6.8M (only for CC80)

Object files

# raft/cpp/build/CMakeFiles/raft_lib.dir/src/neighbors/detail/cagra/
search_multi_cta_float_uint32_dim1024_t32.cu.o
search_multi_cta_float_uint32_dim128_t8.cu.o
search_multi_cta_float_uint32_dim256_t16.cu.o
search_multi_cta_float_uint32_dim512_t32.cu.o
search_multi_cta_int8_uint32_dim1024_t32.cu.o
search_multi_cta_int8_uint32_dim128_t8.cu.o
search_multi_cta_int8_uint32_dim256_t16.cu.o
search_multi_cta_int8_uint32_dim512_t32.cu.o
search_multi_cta_uint8_uint32_dim1024_t32.cu.o
search_multi_cta_uint8_uint32_dim128_t8.cu.o
search_multi_cta_uint8_uint32_dim256_t16.cu.o
search_multi_cta_uint8_uint32_dim512_t32.cu.o
search_single_cta_float_uint32_dim1024_t32.cu.o
search_single_cta_float_uint32_dim128_t8.cu.o
search_single_cta_float_uint32_dim256_t16.cu.o
search_single_cta_float_uint32_dim512_t32.cu.o
search_single_cta_int8_uint32_dim1024_t32.cu.o
search_single_cta_int8_uint32_dim128_t8.cu.o
search_single_cta_int8_uint32_dim256_t16.cu.o
search_single_cta_int8_uint32_dim512_t32.cu.o
search_single_cta_uint8_uint32_dim1024_t32.cu.o
search_single_cta_uint8_uint32_dim128_t8.cu.o
search_single_cta_uint8_uint32_dim256_t16.cu.o
search_single_cta_uint8_uint32_dim512_t32.cu.o
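
To see why the size drops, here is a back-of-the-envelope sketch; the axis counts are assumptions read off the file names above plus the block sizes discussed later in this thread, and the old kernels had further template axes, so the numbers are only illustrative. The point is that removing one template axis divides the number of kernel instantiations behind these object files.

#include <cstdio>

int main()
{
  const int dtypes       = 3;  // float, int8, uint8 (per the file names above)
  const int dims         = 4;  // 128, 256, 512, 1024
  const int kernel_kinds = 2;  // single_cta, multi_cta
  const int block_sizes  = 5;  // 64, 128, 256, 512, 1024 (assumed values of the removed template parameter)

  std::printf("kernel instantiations before: %d\n", dtypes * dims * kernel_kinds * block_sizes);
  std::printf("kernel instantiations after : %d\n", dtypes * dims * kernel_kinds);
  return 0;
}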

@copy-pr-bot (bot) commented Sep 11, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cjnolet (Member) commented Sep 11, 2023

/ok to test

@tfeher (Contributor) left a comment

Thanks @enp1s0 for this work! With these changes, the CAGRA binary size is reduced to 40 MiB (from 150 MiB), which is a great size reduction!

As discussed offline:

  • We have seen a 5-7% perf degradation in some tests (large batch sizes) due to these changes.
  • Small-batch (1, 10) perf should be checked with the multi_cta kernel. If the perf is affected for the multi_cta kernel, then the changes can be reverted for that kernel. Most of the binary size reduction comes from changes in the single_cta kernel; the multi_cta kernel size reduction is 8.9 MiB --> 4.2 MiB.

@@ -739,34 +743,13 @@ __launch_bounds__(BLOCK_SIZE, BLOCK_COUNT) __global__

template <unsigned TEAM_SIZE, unsigned MX_DIM, typename T, typename IdxT, typename DistT>
struct search_kernel_config {
- using kernel_t = decltype(&search_kernel<TEAM_SIZE, 64, 16, 64, 64, 0, MX_DIM, T, DistT, IdxT>);
+ using kernel_t = decltype(&search_kernel<TEAM_SIZE, 64, 64, 0, MX_DIM, T, DistT, IdxT>);

template <unsigned MAX_ITOPK, unsigned CANDIDATES, unsigned USE_BITONIC_SORT>
static auto choose_block_size(unsigned block_size) -> kernel_t
Contributor:
Same comment as above: either rename, or remove this function.

Member Author:
removed the function

@@ -302,15 +299,15 @@ __device__ inline void select_best_index_for_next_threshold(
// index under the condition that the sum of the number of elements found
// so far ('nx_below_threshold') and the csum value does not exceed the
// topk value.
- typedef BlockScan<uint32_t, blockDim_x> BlockScanT;
+ typedef block_scan<uint32_t> BlockScanT;
Contributor:
Why do we need to replace cub::BlockScan with a custom one?

@tfeher (Contributor) commented Sep 11, 2023
Capturing our offline discussion: cub has the block size as a template argument, and this PR removes that template argument, so we cannot use cub directly.

Contributor:
Block size can have the following values: 64, 128, 256, 512, 1024. Could we still keep cub, and do a dispatch based on the runtime arg, like:

switch (blockDim.x) {
  case 64:
    typedef cub::BlockScan<uint32_t, 64> BlockScanT;
    BlockScanT(temp_storage).InclusiveSum(csum, csum);
    break;
  case 128:
    ...
}

Instead of adding a custom block scan implementation that we would need to maintain long term, it would be strongly preferred to rely on cub. Since we are instantiating multiple variants of BlockScan, I expect that it will slightly increase the binary size. But the main search kernels are still without the block size template args, and I hope that constitutes the main part of the binary size saving.

Member Author:
I agree with you that it is preferable to rely on cub. Let me check whether the register usage of your suggested method is less than or equal to that of the custom block scan implementation, as higher register usage can cause throughput degradation.

Member Author:
The register usage is the same in most cases, and the CUB implementation you mentioned uses fewer registers in some cases. So I have changed the implementation as you suggested. The search throughput is almost the same as with the custom block scan implementation.
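
For readers following along, here is a minimal self-contained sketch of the adopted pattern; the helper name, shared-memory layout, and the set of supported block sizes are assumptions for illustration, not the exact RAFT code. The idea is to dispatch on blockDim.x at run time to a cub::BlockScan specialization whose block size is a compile-time constant, so the calling kernel itself needs no block-size template parameter.

#include <cstdint>
#include <cub/block/block_scan.cuh>

// Hypothetical helper: block-wide inclusive sum without a block-size template
// parameter on the caller. Only the listed block sizes are handled in this sketch.
__device__ inline void block_inclusive_sum(std::uint32_t& csum)
{
  __shared__ union {
    typename cub::BlockScan<std::uint32_t, 64>::TempStorage s64;
    typename cub::BlockScan<std::uint32_t, 128>::TempStorage s128;
    typename cub::BlockScan<std::uint32_t, 256>::TempStorage s256;
    typename cub::BlockScan<std::uint32_t, 512>::TempStorage s512;
    typename cub::BlockScan<std::uint32_t, 1024>::TempStorage s1024;
  } temp_storage;

  switch (blockDim.x) {
    case 64:   cub::BlockScan<std::uint32_t, 64>(temp_storage.s64).InclusiveSum(csum, csum); break;
    case 128:  cub::BlockScan<std::uint32_t, 128>(temp_storage.s128).InclusiveSum(csum, csum); break;
    case 256:  cub::BlockScan<std::uint32_t, 256>(temp_storage.s256).InclusiveSum(csum, csum); break;
    case 512:  cub::BlockScan<std::uint32_t, 512>(temp_storage.s512).InclusiveSum(csum, csum); break;
    case 1024: cub::BlockScan<std::uint32_t, 1024>(temp_storage.s1024).InclusiveSum(csum, csum); break;
  }
  // Real code must also make sure the shared temp storage is not reused before all
  // threads are done with it (e.g. via __syncthreads()).
}

Inside a search kernel the call site then stays the same regardless of the launch-time block size, e.g. block_inclusive_sum(csum);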

@cjnolet (Member) commented Sep 20, 2023

@enp1s0 burndown is starting tomorrow and lasts for a week. Do you think we can get this PR merged before burndown ends?

@enp1s0 (Member, Author) commented Sep 20, 2023

@tfeher Thank you for reviewing the code.

@cjnolet Yes, we can probably merge this PR, or decide not to merge it, before burndown ends. The basic implementation is already done. The remaining tasks are as follows:

  • Check the register usage of the custom block scan and CUB. (Comment: Remove block size template parameter from CAGRA search #1740)
  • Detailed throughput comparison
    • There is performance degradation on large-batch queries with single-cta, as expected.
    • There is a performance improvement on small-batch queries with multi-cta, contrary to what was expected.

@enp1s0 enp1s0 changed the title [WIP] Remove block size template parameter from CAGRA search Remove block size template parameter from CAGRA search Sep 26, 2023
@cjnolet (Member) commented Sep 26, 2023

/ok to test

@tfeher (Contributor) left a comment

Thank you @enp1s0 for the updates and for the thorough benchmarks. The PR looks good to me.

@cjnolet (Member) commented Sep 26, 2023

/merge

@rapids-bot rapids-bot bot merged commit 6c7cada into rapidsai:branch-23.10 Sep 26, 2023
@enp1s0 enp1s0 deleted the remove-block_size-from-CAGRA-2 branch September 27, 2023 01:26
jiangyinzuo added a commit to jiangyinzuo/raft that referenced this pull request Jun 2, 2024
PR rapidsai#1740 forgot to rename `BLOCK_SIZE` in `#ifdef _CLK_BREAKDOWN`
blocks.

also remove an unused function in search_single_cta_kernel-inl.cuh
jiangyinzuo added a commit to jiangyinzuo/raft that referenced this pull request Jun 2, 2024
PR rapidsai#1740 forgot to rename `BLOCK_SIZE` in `#ifdef _CLK_BREAKDOWN`
blocks.

The use of `RAFT_LOG_DEBUG` in a kernel function results in compilation
errors; replace it with `printf`.

Also remove an unused function in search_single_cta_kernel-inl.cuh
rapids-bot bot pushed a commit that referenced this pull request Jun 27, 2024
PR #1740 forgot to rename `BLOCK_SIZE` in `#ifdef _CLK_BREAKDOWN` blocks.

The use of `RAFT_LOG_DEBUG` in a kernel function results in compilation
errors; replace it with `printf`.

Also remove an unused function in search_single_cta_kernel-inl.cuh

After merging:

- [x] port to cuVS rapidsai/cuvs#202

Authors:
  - Yinzuo Jiang (https://github.com/jiangyinzuo)
  - Tamas Bela Feher (https://github.com/tfeher)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - tsuki (https://github.com/enp1s0)
  - Tamas Bela Feher (https://github.com/tfeher)

URL: #2350
Labels: cpp, improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change)