Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce compile times of distance specializations #1307

Conversation

ahendriksen
Copy link
Contributor

@ahendriksen ahendriksen commented Feb 28, 2023

Following the findings in https://github.com/ahendriksen/raft/tree/investigate-compile-time-reduction-strategies#investigation-of-compile-times, this PR reduces the compile times of the pairwise distance specializations.
This is achieved by:

  1. Reducing the number of included files in the translation units where kernels are instantiated, specifically spdlog and rmm are avoided.
  2. Limiting loop unrolling in kernels with expensive operations in the inner loop.

Additional improvements geared towards iterative development:

  1. The tests do not have to be recompiled when the internals of a pairwise distance kernel change. Before, a rebuilt was triggered due an include of raft/distance/distance.cuh.
  2. Addition of a fine tuning benchmark for the pairwise distance kernels that separates building the kernel from the benchmark code. This dramatically speeds up development. Compiling an empty benchmark takes roughly 18 seconds on my machine. Whereas recompiling a kernel takes ~3.8 seconds. Without this addition, a commit like 35a2ad4 would require substantially more time to make sure that performance is not degraded.

image

Parallel build time before: 270 seconds (6 cores, SMT, 12 jobs)
Parallel build time before: 147 seconds (6 cores, SMT, 12 jobs)

Sum of compile times before: 3022.6 seconds
Sum of compile times after:  1816.2 seconds

Comparison of compile times between headers and compiled: 
path                                                                         before (s)     after (s)  change (s) change (%)
pairwise_test                                                                    None        0.486       None     None
ance/distance/specializations/detail/lp_unexpanded_double_double_double_int.cu.o  101.1          10.3       -90.8     -89.8%
src/distance/distance/specializations/detail/canberra_float_float_float_int.cu.o   52.9           6.3       -46.6     -88.0%
/distance/distance/specializations/detail/canberra_double_double_double_int.cu.o   48.5           6.4       -42.1     -86.8%
stance/distance/specializations/detail/jensen_shannon_float_float_float_int.cu.o   65.3          10.4       -55.0     -84.1%
istance/distance/specializations/detail/kl_divergence_float_float_float_int.cu.o   70.2          12.6       -57.6     -82.0%
stance/distance/specializations/detail/correlation_double_double_double_int.cu.o   46.7           8.9       -37.8     -80.9%
distance/specializations/detail/hellinger_expanded_double_double_double_int.cu.o   41.6           8.1       -33.5     -80.6%
nce/distance/specializations/detail/jensen_shannon_double_double_double_int.cu.o   74.6          15.1       -59.5     -79.7%
ir/src/distance/distance/specializations/detail/l1_double_double_double_int.cu.o   40.9           8.4       -32.5     -79.4%
ance/distance/specializations/detail/l2_unexpanded_double_double_double_int.cu.o   40.7           8.6       -32.1     -78.8%
distance/specializations/detail/hamming_unexpanded_double_double_double_int.cu.o   40.8           9.0       -31.7     -77.8%
istance/distance/specializations/detail/lp_unexpanded_float_float_float_int.cu.o   45.9          10.2       -35.7     -77.8%
src/distance/distance/specializations/detail/l_inf_double_double_double_int.cu.o   41.2           9.5       -31.8     -77.0%
istance/distance/specializations/detail/russel_rao_double_double_double_int.cu.o   29.5           7.2       -22.3     -75.6%
t.dir/src/distance/distance/specializations/detail/l1_float_float_float_int.cu.o   47.3          13.2       -34.1     -72.2%
ce/distance/specializations/detail/hamming_unexpanded_float_float_float_int.cu.o   47.0          13.3       -33.7     -71.6%
/distance/distance/specializations/detail/correlation_float_float_float_int.cu.o   49.4          14.1       -35.3     -71.5%
ce/distance/specializations/detail/hellinger_expanded_float_float_float_int.cu.o   43.6          12.5       -31.1     -71.4%
c/distance/distance/specializations/detail/russel_rao_float_float_float_int.cu.o   28.5           8.2       -20.3     -71.2%
ance/distance/specializations/detail/kl_divergence_double_double_double_int.cu.o   75.8          21.9       -53.9     -71.1%
istance/distance/specializations/detail/l2_unexpanded_float_float_float_int.cu.o   46.2          13.5       -32.7     -70.7%
ir/src/distance/distance/specializations/detail/l_inf_float_float_float_int.cu.o   43.1          12.7       -30.4     -70.6%
stance/distance/specializations/detail/l2_expanded_double_double_double_int.cu.o   52.3          24.9       -27.3     -52.3%
/distance/distance/specializations/detail/l2_expanded_float_float_float_int.cu.o   75.8          40.3       -35.5     -46.8%
rc/distance/distance/specializations/detail/cosine_double_double_double_int.cu.o   53.5          28.7       -24.8     -46.4%
r/src/distance/distance/specializations/detail/cosine_float_float_float_int.cu.o   83.9          50.1       -33.8     -40.3%
CMakeFiles/pairwise_test.dir/test/distance/fused_l2_nn.cu.o                        85.1          64.1       -21.1     -24.7%
wise_test.dir/src/distance/distance/specializations/fused_l2_nn_float_int64.cu.o   56.2          42.9       -13.3     -23.6%
irwise_test.dir/src/distance/distance/specializations/fused_l2_nn_float_int.cu.o   52.5          40.2       -12.3     -23.5%
CMakeFiles/pairwise_test.dir/test/distance/dist_lp_unexp.cu.o                      56.3          43.3       -13.0     -23.1%
CMakeFiles/pairwise_test.dir/test/distance/dist_russell_rao.cu.o                   55.7          44.0       -11.7     -21.0%
rwise_test.dir/src/distance/distance/specializations/fused_l2_nn_double_int.cu.o   45.3          36.4        -9.0     -19.8%
CMakeFiles/pairwise_test.dir/test/distance/dist_l2_unexp.cu.o                      54.6          44.1       -10.6     -19.3%
CMakeFiles/pairwise_test.dir/test/distance/dist_canberra.cu.o                      51.6          42.1        -9.6     -18.6%
CMakeFiles/pairwise_test.dir/test/distance/dist_l2_exp.cu.o                        53.1          43.4        -9.6     -18.2%
CMakeFiles/pairwise_test.dir/test/distance/dist_l_inf.cu.o                         53.2          43.9        -9.3     -17.5%
CMakeFiles/pairwise_test.dir/test/distance/dist_hellinger.cu.o                     53.1          44.0        -9.0     -17.0%
CMakeFiles/pairwise_test.dir/test/distance/dist_hamming.cu.o                       52.3          43.4        -8.9     -17.0%
CMakeFiles/pairwise_test.dir/test/distance/dist_l2_sqrt_exp.cu.o                   54.0          45.6        -8.4     -15.6%
CMakeFiles/pairwise_test.dir/test/distance/dist_l1.cu.o                            52.6          44.5        -8.1     -15.4%
CMakeFiles/pairwise_test.dir/test/distance/dist_kl_divergence.cu.o                 52.4          44.7        -7.7     -14.8%
ise_test.dir/src/distance/distance/specializations/fused_l2_nn_double_int64.cu.o   43.5          37.2        -6.4     -14.7%
CMakeFiles/pairwise_test.dir/test/distance/dist_cos.cu.o                           52.4          44.8        -7.6     -14.5%
CMakeFiles/pairwise_test.dir/test/distance/dist_jensen_shannon.cu.o                53.2          45.7        -7.6     -14.2%
CMakeFiles/pairwise_test.dir/test/distance/dist_inner_product.cu.o                 51.1          44.8        -6.3     -12.4%
istance/distance/specializations/detail/inner_product_float_float_float_int.cu.o   39.5          35.1        -4.5     -11.3%
CMakeFiles/pairwise_test.dir/test/distance/dist_correlation.cu.o                   51.7          46.8        -4.9     -9.5%
ance/distance/specializations/detail/inner_product_double_double_double_int.cu.o   37.1          33.9        -3.1     -8.5%
src/distance/distance/specializations/detail/kernels/gram_matrix_base_float.cu.o   45.3          41.7        -3.6     -8.0%
rc/distance/distance/specializations/detail/kernels/gram_matrix_base_double.cu.o   42.5          39.6        -2.9     -6.8%
stance/distance/specializations/detail/kernels/polynomial_kernel_double_int.cu.o   40.4          38.5        -1.9     -4.8%
CMakeFiles/pairwise_test.dir/test/distance/dist_adj.cu.o                          123.3         117.8        -5.4     -4.4%
CMakeFiles/pairwise_test.dir/test/distance/gram.cu.o                               55.3          53.4        -1.9     -3.5%
build.ninja                                                                         4.0           4.0        +0.0     +0.1%
istance/distance/specializations/detail/kernels/polynomial_kernel_float_int.cu.o   45.2          45.6        +0.4     +0.8%
.dir/src/distance/distance/specializations/detail/kernels/tanh_kernel_float.cu.o   45.2          46.0        +0.8     +1.7%
dir/src/distance/distance/specializations/detail/kernels/tanh_kernel_double.cu.o   39.0          39.8        +0.8     +2.1%
CMakeFiles/pairwise_test.dir/src/distance/distance/pairwise_distance.cu.o          39.6          50.1       +10.5     +26.6%

The calculation of the tile indices are now performed in ldgXY(). This
will make it possible to remove all state related to the tile index out
of the class in the next commit.

Note that the calculation of the tile index can depend on which
overloaded constructor is called(!)
This commit moves all grid and tile indexing logic into the caller.
Contractions_NT is now only responsible for *intra*-tile indexing.

Due to the complexity of the epilog function, the ldgNextGridStride
function is not yet called from within the main loop. That is the next
goal so that we have all the grid and tile indexing localized in the
loop.
This commit removes the epilog function and moves its functionality into
the run loop. The next step might be to see if the ldgNextGridStride()
method has to be called the current location, or if performance is the
same if its called at the start of the loop.
This results in subtle issues with non-square KernelPolicy, as found in
fusedL2KNN.
This is more general than just for L1. Making use of it more is work in
progress.
This did remove support for the CUTLASS kernels. Has to be put back.
I wasted a lot of time because I had not replaced the op::core() method
of the l2_exp_distance_op after I copied it from l2_unexp_distance_op...

If I copy something from the template and forget to fill it in, I get a
compile error.
I am testing on CUDA 12, where it does not seem to work. Prior to my
commits, the CUTLASS kernels were also not working. So not sure what's
up.

In any case: consider this untested.
This indicates that the operator uses expensive operations (pow, exp,
log) in the inner loop. Therefore, unrolling and or veclen parameters
should be adjusted
@ahendriksen
Copy link
Contributor Author

sorry @ahendriksen, we had a high priority to fix the RAFT packaging for 23.04 (in time for GTC keynote) but this caused a bunch of conflicts in this PR. They should be straightforward to fix, though- the libraft source files directory changed from src/distance/.../ to /src/.../ so most of it is just renaming.

No problem! The merge was quite painless actually. Really looking forward to the GTC keynote and all the great progress we will make going forward!

Copy link
Member

@divyegala divyegala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two minor comments that I'll leave to you to decide if you wish to implement! LGTM

@cjnolet cjnolet added 5 - Ready to Merge and removed 5 - Merge After Dependencies Depends on another PR: do not merge out of order labels Mar 21, 2023
@ahendriksen
Copy link
Contributor Author

I think we have to wait for #1363 to clear so that the CI passes for this run.

@cjnolet
Copy link
Member

cjnolet commented Mar 22, 2023

@ahendriksen yep, and CI hasn't been particularly friendly lately.... Jobs have been getting queued up for several hours at a time. If it helps, you can merge that branch into this one, then at least this one will be up to date and pass through CI so we can merge it right after.

@ahendriksen ahendriksen requested review from a team as code owners March 22, 2023 16:31
@cjnolet
Copy link
Member

cjnolet commented Mar 22, 2023

We should be able to remove the need for the ops-codeowners review once the other PR gets merged

@cjnolet
Copy link
Member

cjnolet commented Mar 22, 2023

/merge

@github-actions github-actions bot removed the python label Mar 22, 2023
@rapids-bot rapids-bot bot merged commit 08e7012 into rapidsai:branch-23.04 Mar 23, 2023
rapids-bot bot pushed a commit that referenced this pull request Mar 25, 2023
This should cut compilation time for refine_d_int64_t_float.cu.o et al from ~900 seconds to 29 seconds.

The refine specialization contain >100 instances of the ivfflat_interleaved_scan kernel, even though these should be seperately compiled by the ivfflat_search specializations. 

The call to ivf_flat_interleaved_scan is [here](https://github.com/rapidsai/raft/blob/56ac43ad93a319a61073dce1b3b937f6f13ade63/cpp/include/raft/neighbors/detail/refine.cuh#L121). 

Depends on (so please merge after) PR #1307.

Authors:
  - Allard Hendriksen (https://github.com/ahendriksen)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #1360
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Build Time Improvement CMake cpp improvement Improvement / enhancement to an existing function non-breaking Non-breaking change
Projects
Development

Successfully merging this pull request may close these issues.

3 participants