Remove block size template parameter from CAGRA search #1740

enp1s0 · 2023-08-16T03:56:04Z

This PR removes block size template parameters from CAGRA search kernel functions to reduce the library size and build time.

rel: #1459

enp1s0 · 2023-08-16T04:01:04Z

- Check accuracy on some datasets
- Check performance degradation
- Remove BlockScan

cjnolet · 2023-08-16T22:52:59Z

@enp1s0 you should consider installing and using pre-commit. It'll automatically fix the style errors for you. Here are the instructions.

enp1s0 · 2023-08-17T01:30:36Z

> @enp1s0 you should consider installing and using pre-commit. It'll automatically fix the style errors for you. Here are the instructions.

Thanks @cjnolet
I forgot to install it when I re-cloned the repository. I have now installed it again.

enp1s0 · 2023-08-23T06:20:04Z

CAGRA binary size

24M -> 6.8M (only for CC80)

Object files

# raft/cpp/build/CMakeFiles/raft_lib.dir/src/neighbors/detail/cagra/
search_multi_cta_float_uint32_dim1024_t32.cu.o
search_multi_cta_float_uint32_dim128_t8.cu.o
search_multi_cta_float_uint32_dim256_t16.cu.o
search_multi_cta_float_uint32_dim512_t32.cu.o
search_multi_cta_int8_uint32_dim1024_t32.cu.o
search_multi_cta_int8_uint32_dim128_t8.cu.o
search_multi_cta_int8_uint32_dim256_t16.cu.o
search_multi_cta_int8_uint32_dim512_t32.cu.o
search_multi_cta_uint8_uint32_dim1024_t32.cu.o
search_multi_cta_uint8_uint32_dim128_t8.cu.o
search_multi_cta_uint8_uint32_dim256_t16.cu.o
search_multi_cta_uint8_uint32_dim512_t32.cu.o
search_single_cta_float_uint32_dim1024_t32.cu.o
search_single_cta_float_uint32_dim128_t8.cu.o
search_single_cta_float_uint32_dim256_t16.cu.o
search_single_cta_float_uint32_dim512_t32.cu.o
search_single_cta_int8_uint32_dim1024_t32.cu.o
search_single_cta_int8_uint32_dim128_t8.cu.o
search_single_cta_int8_uint32_dim256_t16.cu.o
search_single_cta_int8_uint32_dim512_t32.cu.o
search_single_cta_uint8_uint32_dim1024_t32.cu.o
search_single_cta_uint8_uint32_dim128_t8.cu.o
search_single_cta_uint8_uint32_dim256_t16.cu.o
search_single_cta_uint8_uint32_dim512_t32.cu.o

copy-pr-bot · 2023-09-11T03:13:58Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

cjnolet · 2023-09-11T03:29:22Z

/ok to test

tfeher

Thanks @enp1s0 for this work! With these changes, the CAGRA binary size is reduced to 40 MiB (from 150 MiB) which is a great size reduction!

As discussed offline:

- We have seen 5-7% perf degradation in some tests (large batch size) due to these changes.
- Small batch size (1, 10) perf should be checked with the multi_cta kernel. If perf is affected for the multi_cta kernel, the changes can be reverted for that kernel. Most of the binary size reduction comes from changes in the single_cta kernel; the multi_cta kernel size reduction is 8.9 MiB --> 4.2 MiB.

cpp/include/raft/neighbors/detail/cagra/search_multi_cta_kernel-inl.cuh

tfeher · 2023-09-11T06:23:09Z

cpp/include/raft/neighbors/detail/cagra/search_single_cta_kernel-inl.cuh

@@ -739,34 +743,13 @@ __launch_bounds__(BLOCK_SIZE, BLOCK_COUNT) __global__

 template <unsigned TEAM_SIZE, unsigned MX_DIM, typename T, typename IdxT, typename DistT>
 struct search_kernel_config {
-  using kernel_t = decltype(&search_kernel<TEAM_SIZE, 64, 16, 64, 64, 0, MX_DIM, T, DistT, IdxT>);
+  using kernel_t = decltype(&search_kernel<TEAM_SIZE, 64, 64, 0, MX_DIM, T, DistT, IdxT>);

  template <unsigned MAX_ITOPK, unsigned CANDIDATES, unsigned USE_BITONIC_SORT>
  static auto choose_block_size(unsigned block_size) -> kernel_t


Same comment as above: either rename, or remove this function.

Removed the function.

tfeher · 2023-09-11T06:33:32Z

cpp/include/raft/neighbors/detail/cagra/topk_for_cagra/topk_core.cuh

@@ -302,15 +299,15 @@ __device__ inline void select_best_index_for_next_threshold(
  // index under the condition that the sum of the number of elements found
  // so far ('nx_below_threshold') and the csum value does not exceed the
  // topk value.
-  typedef BlockScan<uint32_t, blockDim_x> BlockScanT;
+  typedef block_scan<uint32_t> BlockScanT;


Why do we need to replace cub::BlockScan with a custom one?

Capturing our offline discussion: `cub::BlockScan` takes the block size as a template argument, and this PR removes that template argument from the kernels, so we cannot use cub directly.

Block size can have the following values: 64, 128, 256, 512, 1024. Could we still keep cub, and do a dispatch based on the runtime arg, like:

switch (blockDim.x) {
  case 64: {
    typedef cub::BlockScan<uint32_t, 64> BlockScanT;
    BlockScanT(temp_storage).InclusiveSum(csum, csum);
  } break;
  case 128: ...
}

Instead of adding a custom block scan implementation that we need to maintain long term, it would be strongly preferred to rely on cub. Since we would be instantiating multiple variants of BlockScan, I expect the binary size to increase slightly. But the main search kernels would still be free of the block size template args, and I hope that accounts for most of the binary size saving.

I agree with you that it is preferred to rely on cub. Let me check if the register usage of your suggested method is less than or equal to the custom block scan implementation, as it can cause throughput degradation.

The register usage is the same in most cases, but the CUB-based implementation you suggested uses fewer registers in some cases. So, I have changed the implementation as you suggested. The search throughput is almost the same as with the custom block scan implementation.
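The dispatch idea discussed in this thread can be illustrated on the host. The following is a minimal C++ sketch, not the code merged in this PR: `inclusive_sum_fixed` and `inclusive_sum` are hypothetical names, and a simple in-place scan over a fixed-size array stands in for `cub::BlockScan`. It only shows how a runtime block size can be mapped onto a small, fixed set of compile-time instantiations.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a primitive that needs the block size at compile
// time (the role cub::BlockScan<T, BLOCK_SIZE> plays inside a kernel).
template <unsigned BLOCK_SIZE>
void inclusive_sum_fixed(std::vector<std::uint32_t>& values)
{
  assert(values.size() <= BLOCK_SIZE);

  // One slot per "thread"; the array extent must be a compile-time constant.
  std::uint32_t lane[BLOCK_SIZE] = {};
  for (std::size_t i = 0; i < values.size(); ++i) { lane[i] = values[i]; }

  // In-place Hillis-Steele scan. Walking i downwards guarantees that
  // lane[i - offset] still holds the value from the previous pass.
  for (unsigned offset = 1; offset < BLOCK_SIZE; offset *= 2) {
    for (unsigned i = BLOCK_SIZE - 1; i >= offset; --i) { lane[i] += lane[i - offset]; }
  }

  for (std::size_t i = 0; i < values.size(); ++i) { values[i] = lane[i]; }
}

// Runtime dispatch: the block size is an ordinary argument here, and each
// case selects one of the five template instantiations.
void inclusive_sum(unsigned block_size, std::vector<std::uint32_t>& values)
{
  switch (block_size) {
    case 64: inclusive_sum_fixed<64>(values); break;
    case 128: inclusive_sum_fixed<128>(values); break;
    case 256: inclusive_sum_fixed<256>(values); break;
    case 512: inclusive_sum_fixed<512>(values); break;
    case 1024: inclusive_sum_fixed<1024>(values); break;
    default: throw std::invalid_argument("unsupported block size");
  }
}
```

Only the small scan helper is instantiated once per supported block size in this sketch; the caller itself stays free of the block size template parameter, which is the trade-off the thread describes.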

cjnolet · 2023-09-20T11:26:36Z

@enp1s0 burndown is starting tomorrow and lasts for a week. Do you think we can get this PR merged before burndown ends?

enp1s0 · 2023-09-20T13:03:45Z

@tfeher Thank you for reviewing the code.

@cjnolet Yes, we can probably either merge this PR or decide not to merge it by the time burndown ends. The basic implementation is already done. The remaining tasks are as follows:

- Check the register usage of custom block scan and CUB. (Comment: Remove block size template parameter from CAGRA search #1740)
- Detailed throughput comparison
  - ~~There is performance degradation on large-batch queries by single-CTA, as expected.~~
  - ~~There is a performance improvement on small-batch queries by multi-CTA, contrary to what is expected.~~

cjnolet · 2023-09-26T14:22:52Z

/ok to test

tfeher

Thank you @enp1s0 for the updates and for the thorough benchmarks. The PR looks good to me.

cjnolet · 2023-09-26T23:23:00Z

/merge

PR #1740 forgot to rename `BLOCK_SIZE` in `#ifdef _CLK_BREAKDOWN` blocks. The use of `RAFT_LOG_DEBUG` in kernel function results in compilation errors, replace it with `printf`. Also remove an unused function in search_single_cta_kernel-inl.cuh.

After merging:
- [x] port to cuVS rapidsai/cuvs#202

Authors:
- Yinzuo Jiang (https://github.com/jiangyinzuo)
- Tamas Bela Feher (https://github.com/tfeher)
- Corey J. Nolet (https://github.com/cjnolet)

Approvers:
- tsuki (https://github.com/enp1s0)
- Tamas Bela Feher (https://github.com/tfeher)

URL: #2350

enp1s0 added 5 commits August 8, 2023 17:58

Remove BLOCK_SIZE form CAGRA single-CTA

d6dd6ba

Update search_plan_impl

23c9e17

Update block size

43214a0

Merge branch 'branch-23.10' into remove-block_size-from-CAGRA-2

4e6f911

Merge branch 'branch-23.10' into remove-block_size-from-CAGRA-2

859b9cc

enp1s0 requested a review from a team as a code owner August 16, 2023 03:56

enp1s0 self-assigned this Aug 16, 2023

github-actions bot added the cpp label Aug 16, 2023

enp1s0 added the improvement (Improvement / enhancement to an existing function) and non-breaking (Non-breaking change) labels and removed the cpp label Aug 16, 2023

Fix format

c470414

github-actions bot added the cpp label Aug 17, 2023

enp1s0 and others added 5 commits August 18, 2023 17:45

Remove cub from CAGRA

d0cef64

Update select_best_index_for_next_threshold algorithm not to require block size template parameter

b5f2695

Update CAGRA topk

70df860

Update __launch_bounds__

a945c7b

Merge branch 'branch-23.10' into remove-block_size-from-CAGRA-2

1db2979

enp1s0 and others added 3 commits August 23, 2023 01:17

Update max block size

95aac87

Merge branch 'remove-block_size-from-CAGRA-2' of github.com:enp1s0/raft into remove-block_size-from-CAGRA-2

881bb15

Merge branch 'branch-23.10' into remove-block_size-from-CAGRA-2

91d9975

tfeher requested changes Sep 11, 2023

Remove kernel choose functions

f8253b8

Merge branch 'branch-23.10' into remove-block_size-from-CAGRA-2

a383916

enp1s0 and others added 9 commits September 21, 2023 00:00

Fix block sort to use cub instead of custom one

8362402

Merge branch 'remove-block_size-from-CAGRA-2' of github.com:enp1s0/raft into remove-block_size-from-CAGRA-2

3564d32

Merge branch 'branch-23.10' into remove-block_size-from-CAGRA-2

b9121ca

Fix block_scan

38a462f

Update select_best_index_for_next_threshold

d3856ab

Fix typo

a2fd042

Merge branch 'branch-23.10' into remove-block_size-from-CAGRA-2

95d010e

Fix format

ef4fb91

Merge branch 'branch-23.10' into remove-block_size-from-CAGRA-2

3bbccc5

enp1s0 changed the title ~~[WIP] Remove block size template parameter from CAGRA search~~ Remove block size template parameter from CAGRA search Sep 26, 2023

tfeher approved these changes Sep 26, 2023

rapids-bot bot merged commit 6c7cada into rapidsai:branch-23.10 Sep 26, 2023

enp1s0 deleted the remove-block_size-from-CAGRA-2 branch September 27, 2023 01:26

jiangyinzuo mentioned this pull request Jun 2, 2024

Fix compilation error when _CLK_BREAKDOWN is defined in cagra. #2350

Merged

1 task
