Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Branch 23.06 merge 23.04 #1379

Merged
merged 26 commits into from
Mar 30, 2023
Merged

Conversation

vyasr
Copy link
Contributor

@vyasr vyasr commented Mar 29, 2023

Replaces #1368

benfred and others added 22 commits March 18, 2023 02:14
Remove faiss_mr.hpp - in favour of rapidsai/cuml#5281

Authors:
  - Ben Frederickson (https://github.com/benfred)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#1351
This depends on rapidsai#1202 so those changes are also included here. This should also not be merged until that PR is merged.

Authors:
  - Corey J. Nolet (https://github.com/cjnolet)
  - Ben Frederickson (https://github.com/benfred)

Approvers:
  - Ben Frederickson (https://github.com/benfred)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Ray Douglass (https://github.com/raydouglass)

URL: rapidsai#1340
…rapidsai#1333)

Now that we no longer need to rely on the FAISS dependency, this PR:
1. Consolidates  the `libraft-distance` and `libraft-nn` targets into a single `libraft` artifact
2. Introduces a new `raft::compiled` target to link against the new `libraft.so` binary. Removes `raft::distance`and `raft::nn` targets.
3. Consolidates `RAFT_DISTANCE_COMPILED` and `RAFT_NN_COMPILED` pre-processor vars into a single `RAFT_COMPILED` (to match similar pattern implementated by `spdlog`)
4. Consolidates `specializations.cuh` headers
5. Updates all docs, scripts, and build infra

This change has been a long time coming and is intended to be a 23.04 feature. This is further going to require updates to several projects downstream. Here's a checklist to track that progress:
- [x] offline announcement
- [x] cuml rapidsai/cuml#5272
- [x] cugraph rapidsai/cugraph#3348
- [x] cuopt rapidsai/cuopt#1023
- [x] cugraph-ops rapidsai/cugraph-ops#429


This PR depended on rapidsai#1340 (removing FAISS from the build) and on rapidsai#1202 (replacing the FAISS bfknn w/ our own), both of which have been merged. 

Closes rapidsai#824

Authors:
  - Corey J. Nolet (https://github.com/cjnolet)
  - Ben Frederickson (https://github.com/benfred)

Approvers:
  - Sevag H (https://github.com/sevagh)
  - Divye Gala (https://github.com/divyegala)
  - Ben Frederickson (https://github.com/benfred)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: rapidsai#1333
…idsai#1358)

The ivf_flat specialization header declarations used a wrong index type. 

The specializations for ivf flat are defined for int64_t. The raft_runtime interface also uses the int64_t instances. The ivf_flat specialization header, however, declared an interface using uint64_t.

This is fixed with this PR. Should also reduce compile times for `src/distance/neighbors/ivf_flat_search.cu.o`

Authors:
  - Allard Hendriksen (https://github.com/ahendriksen)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)
  - Divye Gala (https://github.com/divyegala)

URL: rapidsai#1358
This PR removes modification of the `__init__.py::version` attribute that occurs during the wheel build process. See rapidsai/ops#2592 for more information.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Sevag H (https://github.com/sevagh)

URL: rapidsai#1359
In dask/distributed#7580 get_worker was modified to return the worker of a task, thus it cannot be used by client.run, and we must now use dask_worker as the first argument to client.run to obtain the worker.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)
  - AJ Schmidt (https://github.com/ajschmidt8)

URL: rapidsai#1365
Following the findings in https://github.com/ahendriksen/raft/tree/investigate-compile-time-reduction-strategies#investigation-of-compile-times, this PR reduces the compile times of the pairwise distance specializations.
This is achieved by:
1. Reducing the number of included files in the translation units where kernels are instantiated, specifically `spdlog` and `rmm` are avoided. 
2. Limiting loop unrolling in kernels with expensive operations in the inner loop.

Additional improvements geared towards iterative development:
1. The tests do not have to be recompiled when the internals of a pairwise distance kernel change. Before, a rebuilt was triggered due an include of `raft/distance/distance.cuh`.
2. Addition of a  fine tuning benchmark for the pairwise distance kernels that separates building the kernel from the benchmark code. This dramatically speeds up development. Compiling an empty benchmark takes roughly 18 seconds on my machine. Whereas recompiling a kernel takes ~3.8 seconds. Without this addition, a commit like 35a2ad4 would require substantially more time to make sure that performance is not degraded. 

![image](https://user-images.githubusercontent.com/4172822/225383120-5f8a82f9-0b46-4c39-bc1d-7b2a0551e881.png)

```
Parallel build time before: 270 seconds (6 cores, SMT, 12 jobs)
Parallel build time before: 147 seconds (6 cores, SMT, 12 jobs)

Sum of compile times before: 3022.6 seconds
Sum of compile times after:  1816.2 seconds

Comparison of compile times between headers and compiled: 
path                                                                         before (s)     after (s)  change (s) change (%)
pairwise_test                                                                    None        0.486       None     None
ance/distance/specializations/detail/lp_unexpanded_double_double_double_int.cu.o  101.1          10.3       -90.8     -89.8%
src/distance/distance/specializations/detail/canberra_float_float_float_int.cu.o   52.9           6.3       -46.6     -88.0%
/distance/distance/specializations/detail/canberra_double_double_double_int.cu.o   48.5           6.4       -42.1     -86.8%
stance/distance/specializations/detail/jensen_shannon_float_float_float_int.cu.o   65.3          10.4       -55.0     -84.1%
istance/distance/specializations/detail/kl_divergence_float_float_float_int.cu.o   70.2          12.6       -57.6     -82.0%
stance/distance/specializations/detail/correlation_double_double_double_int.cu.o   46.7           8.9       -37.8     -80.9%
distance/specializations/detail/hellinger_expanded_double_double_double_int.cu.o   41.6           8.1       -33.5     -80.6%
nce/distance/specializations/detail/jensen_shannon_double_double_double_int.cu.o   74.6          15.1       -59.5     -79.7%
ir/src/distance/distance/specializations/detail/l1_double_double_double_int.cu.o   40.9           8.4       -32.5     -79.4%
ance/distance/specializations/detail/l2_unexpanded_double_double_double_int.cu.o   40.7           8.6       -32.1     -78.8%
distance/specializations/detail/hamming_unexpanded_double_double_double_int.cu.o   40.8           9.0       -31.7     -77.8%
istance/distance/specializations/detail/lp_unexpanded_float_float_float_int.cu.o   45.9          10.2       -35.7     -77.8%
src/distance/distance/specializations/detail/l_inf_double_double_double_int.cu.o   41.2           9.5       -31.8     -77.0%
istance/distance/specializations/detail/russel_rao_double_double_double_int.cu.o   29.5           7.2       -22.3     -75.6%
t.dir/src/distance/distance/specializations/detail/l1_float_float_float_int.cu.o   47.3          13.2       -34.1     -72.2%
ce/distance/specializations/detail/hamming_unexpanded_float_float_float_int.cu.o   47.0          13.3       -33.7     -71.6%
/distance/distance/specializations/detail/correlation_float_float_float_int.cu.o   49.4          14.1       -35.3     -71.5%
ce/distance/specializations/detail/hellinger_expanded_float_float_float_int.cu.o   43.6          12.5       -31.1     -71.4%
c/distance/distance/specializations/detail/russel_rao_float_float_float_int.cu.o   28.5           8.2       -20.3     -71.2%
ance/distance/specializations/detail/kl_divergence_double_double_double_int.cu.o   75.8          21.9       -53.9     -71.1%
istance/distance/specializations/detail/l2_unexpanded_float_float_float_int.cu.o   46.2          13.5       -32.7     -70.7%
ir/src/distance/distance/specializations/detail/l_inf_float_float_float_int.cu.o   43.1          12.7       -30.4     -70.6%
stance/distance/specializations/detail/l2_expanded_double_double_double_int.cu.o   52.3          24.9       -27.3     -52.3%
/distance/distance/specializations/detail/l2_expanded_float_float_float_int.cu.o   75.8          40.3       -35.5     -46.8%
rc/distance/distance/specializations/detail/cosine_double_double_double_int.cu.o   53.5          28.7       -24.8     -46.4%
r/src/distance/distance/specializations/detail/cosine_float_float_float_int.cu.o   83.9          50.1       -33.8     -40.3%
CMakeFiles/pairwise_test.dir/test/distance/fused_l2_nn.cu.o                        85.1          64.1       -21.1     -24.7%
wise_test.dir/src/distance/distance/specializations/fused_l2_nn_float_int64.cu.o   56.2          42.9       -13.3     -23.6%
irwise_test.dir/src/distance/distance/specializations/fused_l2_nn_float_int.cu.o   52.5          40.2       -12.3     -23.5%
CMakeFiles/pairwise_test.dir/test/distance/dist_lp_unexp.cu.o                      56.3          43.3       -13.0     -23.1%
CMakeFiles/pairwise_test.dir/test/distance/dist_russell_rao.cu.o                   55.7          44.0       -11.7     -21.0%
rwise_test.dir/src/distance/distance/specializations/fused_l2_nn_double_int.cu.o   45.3          36.4        -9.0     -19.8%
CMakeFiles/pairwise_test.dir/test/distance/dist_l2_unexp.cu.o                      54.6          44.1       -10.6     -19.3%
CMakeFiles/pairwise_test.dir/test/distance/dist_canberra.cu.o                      51.6          42.1        -9.6     -18.6%
CMakeFiles/pairwise_test.dir/test/distance/dist_l2_exp.cu.o                        53.1          43.4        -9.6     -18.2%
CMakeFiles/pairwise_test.dir/test/distance/dist_l_inf.cu.o                         53.2          43.9        -9.3     -17.5%
CMakeFiles/pairwise_test.dir/test/distance/dist_hellinger.cu.o                     53.1          44.0        -9.0     -17.0%
CMakeFiles/pairwise_test.dir/test/distance/dist_hamming.cu.o                       52.3          43.4        -8.9     -17.0%
CMakeFiles/pairwise_test.dir/test/distance/dist_l2_sqrt_exp.cu.o                   54.0          45.6        -8.4     -15.6%
CMakeFiles/pairwise_test.dir/test/distance/dist_l1.cu.o                            52.6          44.5        -8.1     -15.4%
CMakeFiles/pairwise_test.dir/test/distance/dist_kl_divergence.cu.o                 52.4          44.7        -7.7     -14.8%
ise_test.dir/src/distance/distance/specializations/fused_l2_nn_double_int64.cu.o   43.5          37.2        -6.4     -14.7%
CMakeFiles/pairwise_test.dir/test/distance/dist_cos.cu.o                           52.4          44.8        -7.6     -14.5%
CMakeFiles/pairwise_test.dir/test/distance/dist_jensen_shannon.cu.o                53.2          45.7        -7.6     -14.2%
CMakeFiles/pairwise_test.dir/test/distance/dist_inner_product.cu.o                 51.1          44.8        -6.3     -12.4%
istance/distance/specializations/detail/inner_product_float_float_float_int.cu.o   39.5          35.1        -4.5     -11.3%
CMakeFiles/pairwise_test.dir/test/distance/dist_correlation.cu.o                   51.7          46.8        -4.9     -9.5%
ance/distance/specializations/detail/inner_product_double_double_double_int.cu.o   37.1          33.9        -3.1     -8.5%
src/distance/distance/specializations/detail/kernels/gram_matrix_base_float.cu.o   45.3          41.7        -3.6     -8.0%
rc/distance/distance/specializations/detail/kernels/gram_matrix_base_double.cu.o   42.5          39.6        -2.9     -6.8%
stance/distance/specializations/detail/kernels/polynomial_kernel_double_int.cu.o   40.4          38.5        -1.9     -4.8%
CMakeFiles/pairwise_test.dir/test/distance/dist_adj.cu.o                          123.3         117.8        -5.4     -4.4%
CMakeFiles/pairwise_test.dir/test/distance/gram.cu.o                               55.3          53.4        -1.9     -3.5%
build.ninja                                                                         4.0           4.0        +0.0     +0.1%
istance/distance/specializations/detail/kernels/polynomial_kernel_float_int.cu.o   45.2          45.6        +0.4     +0.8%
.dir/src/distance/distance/specializations/detail/kernels/tanh_kernel_float.cu.o   45.2          46.0        +0.8     +1.7%
dir/src/distance/distance/specializations/detail/kernels/tanh_kernel_double.cu.o   39.0          39.8        +0.8     +2.1%
CMakeFiles/pairwise_test.dir/src/distance/distance/pairwise_distance.cu.o          39.6          50.1       +10.5     +26.6%
```

Authors:
  - Allard Hendriksen (https://github.com/ahendriksen)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)
  - Divye Gala (https://github.com/divyegala)

URL: rapidsai#1307
raft-dask requires nccl to build, add to the dependencies.yaml so that when creating a clean raft conda environment - we can build all of raft out of the box

Authors:
  - Ben Frederickson (https://github.com/benfred)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)
  - AJ Schmidt (https://github.com/ajschmidt8)

URL: rapidsai#1361
This PR uses dependencies.yaml to generate the dependency lists in pyproject.toml

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#1364
UCX 1.14.0 was recently released and conda-forge package was updated in conda-forge/ucx-split-feedstock#111 with several packaging improvements. Relax the pin to allow installing UCX v1.14.x as well.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - AJ Schmidt (https://github.com/ajschmidt8)

URL: rapidsai#1366
I noticed some folks on Twitter asking about an architecture diagram for RAFT. I think it's a good idea to provide this in the README.

Authors:
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Ben Frederickson (https://github.com/benfred)
  - Divye Gala (https://github.com/divyegala)

URL: rapidsai#1370
The main changes are:

- Add a one-block version. It uses single thread block for one row of a batch and is used when `len` is relatively small (<= 16384)
- Avoid writing candidates to buffers when the number of candidates is larger than buffer length.
- Add a parameter to control whether to use a fused filter in the last pass or use a standalone filter kernel. The later case is preferable when the leading bits of inputs are almost same.
- Early stopping: when the target bucket contains `k` values, we can stop the computation earlier
- Many implementation details are polished, like the initialization of `counter`, calculation of kernel launch parameters, and the scan step
- Tests and benchmarks are updated to include the new implementations. New benchmarks are added to demonstrate the advantage of adaptive version.

Authors:
  - Yong Wang (https://github.com/yong-wang)
  - Corey J. Nolet (https://github.com/cjnolet)
  - Tamas Bela Feher (https://github.com/tfeher)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)

URL: rapidsai#1175
This is a copy and modification of a user's project but I think this is going to be generally useful to users as the same types of challenges are going to come up again. In this case, the user wasn't able to build/link because they weren't using `rapids-cmake` to propagate important configuration settings. I think having a skeleton project available that we build in CI and keep up to date will help new users build more applications on RAFT. 


TODO:
- [x] Make building the template optional 
- [x] Verify this can build in CMake and reuse already built/installed bits
- [x] Add to docs / readme and reference in README.md
- [x] Add a little example of invoking an API (maybe `pairwise_distances`?) to `main()`

Authors:
  - Corey J. Nolet (https://github.com/cjnolet)
  - Ben Frederickson (https://github.com/benfred)

Approvers:
  - Micka (https://github.com/lowener)
  - Dante Gama Dessavre (https://github.com/dantegd)
  - Divye Gala (https://github.com/divyegala)
  - AJ Schmidt (https://github.com/ajschmidt8)

URL: rapidsai#1312
This should cut compilation time for refine_d_int64_t_float.cu.o et al from ~900 seconds to 29 seconds.

The refine specialization contain >100 instances of the ivfflat_interleaved_scan kernel, even though these should be seperately compiled by the ivfflat_search specializations. 

The call to ivf_flat_interleaved_scan is [here](https://github.com/rapidsai/raft/blob/56ac43ad93a319a61073dce1b3b937f6f13ade63/cpp/include/raft/neighbors/detail/refine.cuh#L121). 

Depends on (so please merge after) PR rapidsai#1307.

Authors:
  - Allard Hendriksen (https://github.com/ahendriksen)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#1360
Add the ability for a user to specify an epilogue function to run after the distance in the brute_force::knn call.

This lets us remove faiss from cuml, by updating the hdbscan reachability code (rapidsai/cuml#5293)

Authors:
  - Ben Frederickson (https://github.com/benfred)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#1371
New `bench/ann` artifact for comparing (C++ APIs for) GPU-acclerated algorithms end-to-end. Working on this w/ @tfeher but had to squash the original commits into a single commit.

Things left to do:
- [x] Separate `benchmarks` executables for each different algorithm
- [x] Separate build targets for `ggnn` and `hnswlib`
- [x] Revise `bench/ann` docs
- [x] Break `factory.cuh` abd `benchmark.cu` / `benchmark.cpp` into individual files for each different algorithm to make it easier to plug in new algorithms. 
- [x] Separate into its own conda package

closes rapidsai#1211

Authors:
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Ray Douglass (https://github.com/raydouglass)
  - Ben Frederickson (https://github.com/benfred)
  - Tamas Bela Feher (https://github.com/tfeher)

URL: rapidsai#1304
Add `--time` option to `build.sh` that enables compile time logging of `nvcc`. 

Also, add a script `cpp/scripts/analyze_nvcc_log.py` to find the translation units that take the longest time. 

Output looks like: 

```
$ cpp/scripts/analyze_nvcc_log.py cpp/build/nvcc_compile_log.csv
-- loading data
-- analyzing data
-- Ten longest translation units:
phase  index                                               file        cicc   cudafe++  fatbinary  gcc (compiling)  gcc (preprocessing 1)  gcc (preprocessing 4)        ptxas   total time
0         10  ions/detail/canberra_double_double_double_int.cu    42.431063  10.601856   0.020979         6.747153               3.721194               2.093567  1618.390375  1684.006186
1         11  zations/detail/canberra_float_float_float_int.cu    36.928960   9.804138   0.011537         6.796088               3.481156               1.790703  1584.262875  1643.075457
2         85  ors/specializations/refine_d_uint64_t_uint8_t.cu   602.935531  14.980877   0.529673        36.300566               6.270717               2.889723   933.622969  1597.530056
3         84  bors/specializations/refine_d_uint64_t_int8_t.cu   606.513281  16.243960   0.729282        39.981113               5.608029               3.028493   897.241469  1569.345628
4         53  stance/neighbors/ivfpq_search_int8_t_uint64_t.cu   841.049750   8.233967   1.025554        24.248578               4.069022               1.747108   631.193734  1511.567713
5         52  istance/neighbors/ivfpq_search_float_uint64_t.cu   837.241437   8.145278   1.042313        24.400606               3.433528               1.882623   627.786672  1503.932457
6         54  tance/neighbors/ivfpq_search_uint8_t_uint64_t.cu   846.706656   8.371286   1.025517        24.094691               3.432749               1.645345   618.319234  1503.595479
7         76  izations/detail/ivfpq_search_uint8_t_uint64_t.cu   698.726266   7.086368   1.050021        39.727723               3.259101               1.333935   406.509937  1157.693351
8         74  alizations/detail/ivfpq_search_float_uint64_t.cu   706.702516   6.905794   1.049731        39.923895               2.814361               2.057154   395.604000  1155.057450
9         75  lizations/detail/ivfpq_search_int8_t_uint64_t.cu   689.390281   6.483386   1.025864        39.865668               3.121696               1.297788   409.099562  1150.284245
10        83  hbors/specializations/refine_d_uint64_t_float.cu   334.705594  15.466444   0.680270        36.551977               5.405133               2.947568   715.708781  1111.465767
-- Plotting absolute compile times
-- Wrote absolute compile time plot to cpp/build/nvcc_compile_log.csv.absolute.compile_times.png
-- Plotting relative compile times
-- Wrote relative compile time plot to cpp/build/nvcc_compile_log.csv.relative.compile_times.png
```

Authors:
  - Allard Hendriksen (https://github.com/ahendriksen)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#1262
@vyasr vyasr requested review from a team as code owners March 29, 2023 21:18
@vyasr vyasr self-assigned this Mar 29, 2023
@vyasr vyasr added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Mar 29, 2023
@vyasr vyasr changed the base branch from branch-23.04 to branch-23.06 March 29, 2023 21:19
@github-actions github-actions bot added the ci label Mar 30, 2023
Copy link
Contributor

@msadang msadang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@msadang msadang merged commit b1089aa into rapidsai:branch-23.06 Mar 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci CMake cpp improvement Improvement / enhancement to an existing function non-breaking Non-breaking change python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants