Support for precomputed distance matrix in DBSCAN #3585

Nyrio · 2021-03-04T20:18:14Z

Notes about performance

Excluding the cost of precomputing the distance matrix (which is done by the user), the single-GPU code runs slightly faster when the distance matrix is precomputed. (Note: this was measured on 2D data; greater speedups are expected in higher dimensions.)

As I noted in a comment in the code, it works with two kernels: one that uses a coalesced reduction to compute the vertex degrees from the distance matrix, and one that uses a 2D copy fused with a unary operation to get the boolean neighborhood matrix.

Note: this step could be faster if adj were a row-major B*N matrix instead of column-major. We could then fuse everything into one efficient kernel. It is something to keep in mind when we rewrite csr_adj_graph_batched.
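
To make the two steps concrete, here is a minimal CPU sketch in plain Python of what the two kernels compute for one batch of B rows of the N x N distance matrix. It is an illustration only, not the CUDA code from this PR: the function names are hypothetical, and the `dist <= eps` comparison is an assumption about how the neighborhood is defined.

```python
from typing import List


def vertex_degrees(dist_rows: List[List[float]], eps: float) -> List[int]:
    """Row-wise reduction: count the entries of each row within eps.

    Mirrors the first kernel (a reduction over each row of the batch).
    Assumption: a point j is a neighbor of i when dist[i][j] <= eps.
    """
    return [sum(1 for d in row if d <= eps) for row in dist_rows]


def adjacency_column_major(dist_rows: List[List[float]], eps: float) -> List[bool]:
    """Element-wise unary op: build the boolean neighborhood matrix.

    Mirrors the second kernel (a 2D copy fused with a unary operation).
    The result is a flat B*N array in column-major order, so element
    (i, j) is stored at index i + j * B.
    """
    n_batch = len(dist_rows)
    n_cols = len(dist_rows[0])
    adj = [False] * (n_batch * n_cols)
    for i, row in enumerate(dist_rows):
        for j, d in enumerate(row):
            adj[i + j * n_batch] = d <= eps
    return adj


# Hypothetical 3-point example; here the batch covers the whole matrix.
dist = [
    [0.0, 0.5, 2.0],
    [0.5, 0.0, 0.4],
    [2.0, 0.4, 0.0],
]
print(vertex_degrees(dist, eps=0.5))          # [2, 3, 2]
print(adjacency_column_major(dist, eps=0.5))
```

With a row-major adj, both outputs could be produced in a single pass over each row of the batch, which is the fusion opportunity mentioned above.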

Notes about MNMG

Cf #3615


tfeher

Thanks @Nyrio for the PR! It looks good in general, I have just marked a few smaller issues.

Please file an issue about the MNMG case with distributed data, to promote discussion on its priority.

python/cuml/test/test_dbscan.py

python/cuml/test/dask/test_dbscan.py

python/cuml/cluster/dbscan.pyx

cpp/src/dbscan/vertexdeg/precomputed.cuh

cpp/test/sg/dbscan_test.cu

python/cuml/test/test_dbscan.py

tfeher

Thanks @Nyrio for the update. Changes look good in general, there are only some minor things left:

Fix the failing test_base_children_get_param_names unit test by adding 'metric' to get_param_names() in dbscan.pyx.
The PR description will be added as a commit message of the merge commit. I think the TODO section can be removed from the description.
Consider linking the original feature request with the appropriate keyword.
Please file an issue about "distance matrix that is scattered across the nodes".
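
For readers unfamiliar with the first item, the following sketch shows the kind of consistency check that such a test plausibly performs: every constructor argument should be listed by get_param_names(). The classes below are hypothetical stand-ins, not cuML's actual Base or DBSCAN, and the exact logic of test_base_children_get_param_names is an assumption.

```python
import inspect


class BaseSketch:
    """Hypothetical stand-in for an estimator base class."""

    def __init__(self, verbose=False):
        self.verbose = verbose

    def get_param_names(self):
        return ["verbose"]


class DBSCANSketch(BaseSketch):
    """Hypothetical estimator with a 'metric' constructor argument."""

    def __init__(self, eps=0.5, min_samples=5, metric="euclidean", verbose=False):
        super().__init__(verbose=verbose)
        self.eps = eps
        self.min_samples = min_samples
        self.metric = metric

    def get_param_names(self):
        # Omitting "metric" here is the kind of mismatch the check reports.
        return super().get_param_names() + ["eps", "min_samples", "metric"]


def missing_param_names(estimator):
    """Return constructor arguments that get_param_names() does not list."""
    sig = inspect.signature(type(estimator).__init__)
    ctor_args = [name for name in sig.parameters if name != "self"]
    declared = set(estimator.get_param_names())
    return [name for name in ctor_args if name not in declared]


print(missing_param_names(DBSCANSketch()))  # [] when every argument is declared
```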

tfeher

Thanks Louis for the update! The PR looks good to me.

teju85

For official-sake, approving this PR based on the reviews done by @tfeher (he's not yet a cpp-codeowner). @dantegd or @JohnZed can we get python-side approval from one of you?

dantegd

Sorry for the delayed review, long week of reviews. I just had one question and one very minor suggestion to improve the Python docstring; otherwise it looks good on the Python side as well.

cpp/include/cuml/cluster/dbscan.hpp

python/cuml/cluster/dbscan.pyx

JohnZed · 2021-03-24T05:05:54Z

Blocked waiting on RAFT change

@dantegd

Suggested by @dantegd in: rapidsai/cuml#3585 (comment)
Authors: - Louis Sugy (@Nyrio)
Approvers: - Thejaswi. N. S (@teju85)
URL: #177

Nyrio · 2021-03-24T19:17:13Z

@dantegd @teju85 I made the change to use raft::distance::DistanceType but I'm having a very weird issue in dbscan.pyx that I don't understand. If I use DistanceType.Precomputed, I get a compiler error:

UnicodeEncodeError: 'latin-1' codec can't encode character '\u0308' in position 15936: ordinal not in range(256)

If I comment out line 259, it compiles. The problem doesn't seem to be a special character: I've tried rewriting this line from scratch, and it compiles when the line is commented out. I don't see why the enum value would cause that either; the value is 100, which is under 256.

I'm quite confused. Does someone know what's happening here?

dantegd · 2021-03-25T16:35:07Z

@Nyrio that's quite an odd issue, I would suggest merging branch-0.19 into the branch of the PR (to solve the copyright issues) and then we can see if the unicode error persists.


Nyrio · 2021-03-26T13:34:31Z

I've merged branch-0.19, but it unleashed a dependency nightmare. The latest changes seem to require Faiss 1.7.0, and the latest RAPIDS Docker images, released a few hours ago, come with Faiss 1.6.3. Forcing the update of the libfaiss package creates linking errors with other libraries...

Edit: it works with 11.2. The dependency issue was with 11.0

Nyrio · 2021-03-26T14:09:00Z

Update: I managed to solve my dependency issues with 11.2 but the problem of the odd codec error persists.

I've also tried using contiguous values in the enum, which didn't seem to work.

tfeher · 2021-03-26T14:52:29Z

I confirm the problem with the enum. Python install fails as long as we have the Precomputed name here.

A workaround is to rename the enum to Precalculated. With that it compiles for me, but that requires another RAFT PR.

Nyrio · 2021-03-26T16:09:50Z

It turns out it was an encoding bug, with an invisible character that didn't appear in the editor. Fixed.
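
For anyone hitting a similar error, a small script along these lines can locate characters that editors often do not display. This is a minimal sketch using only the standard library; the function name and the sample line are hypothetical, and the set of Unicode categories flagged is a judgment call rather than an exhaustive list.

```python
import unicodedata

# Categories that are typically invisible or easy to miss in an editor:
# Mn/Me = combining marks (e.g. U+0308), Cf = format characters such as
# zero-width spaces, Zs = space separators (only non-ASCII ones are flagged).
SUSPICIOUS_CATEGORIES = {"Mn", "Me", "Cf", "Zs"}


def find_suspicious_chars(text):
    """Return (line, column, code point, name) for each suspicious character."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) < 128:
                continue  # plain ASCII is never flagged
            if unicodedata.category(ch) in SUSPICIOUS_CATEGORIES:
                name = unicodedata.name(ch, "UNKNOWN")
                hits.append((lineno, col, f"U+{ord(ch):04X}", name))
    return hits


# Hypothetical line containing the character from the error message above.
sample = "metric = DistanceType.Precomputed\u0308\n"
for hit in find_suspicious_chars(sample):
    print(hit)
```

To scan a real file, read it with an explicit encoding, for example `open(path, encoding="utf-8").read()`, and pass the result to the function.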

codecov-io · 2021-03-26T19:10:10Z

Codecov Report

Merging #3585 (99797c5) into branch-0.19 (fd9ec89) will decrease coverage by 35.36%.
The diff coverage is 57.07%.

@@               Coverage Diff                @@
##           branch-0.19    #3585       +/-   ##
================================================
- Coverage        80.70%   45.34%   -35.37%     
================================================
  Files              227      224        -3     
  Lines            17615    17189      -426     
================================================
- Hits             14217     7794     -6423     
- Misses            3398     9395     +5997

| Flag | Coverage Δ | |
|---|---|---|
| dask | `45.34% <57.07%> (+0.35%)` | ⬆️ |
| non-dask | `?` | |


| Impacted Files | Coverage Δ | |
|---|---|---|
| python/cuml/cluster/kmeans.pyx | `58.29% <ø> (-33.67%)` | ⬇️ |
| python/cuml/common/memory_utils.py | `65.85% <0.00%> (-13.27%)` | ⬇️ |
| python/cuml/fil/fil.pyx | `63.77% <ø> (-28.07%)` | ⬇️ |
| python/cuml/model_selection/_split.py | `5.58% <0.00%> (-84.78%)` | ⬇️ |
| python/cuml/neighbors/__init__.py | `100.00% <ø> (ø)` | |
| python/cuml/neighbors/ann.pyx | `8.04% <0.00%> (-53.59%)` | ⬇️ |
| python/cuml/neighbors/nearest_neighbors.pyx | `41.47% <0.00%> (-51.18%)` | ⬇️ |
| python/cuml/pipeline/__init__.py | `0.00% <0.00%> (ø)` | |
| python/cuml/preprocessing/encoders.py | `88.04% <ø> (-7.04%)` | ⬇️ |
| python/cuml/solvers/qn.pyx | `17.31% <7.14%> (-80.32%)` | ⬇️ |

... and 150 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0883026...99797c5.

dantegd · 2021-03-30T01:45:07Z

@gpucibot merge

Nyrio added 2 commits March 3, 2021 10:52

Add support for precomputed in DBSCAN

8a98acd

Add dist_to_adj_transposed_kernel (not used yet)

be643a0

Nyrio requested review from a team as code owners March 4, 2021 20:18

github-actions bot added CUDA/C++ Cython / Python Cython or Python issue labels Mar 4, 2021

Nyrio added 2 - In Progress Currenty a work in progress feature request New feature or request labels Mar 4, 2021

Fix style, compute better estimation of available memory, make code m…

d41cd85

…ore robust to index overflows with the distance matrix, test multi-batch cases

Nyrio changed the title ~~[WIP] Support for precomputed distance matrix in DBSCAN~~ Support for precomputed distance matrix in DBSCAN Mar 5, 2021

Nyrio added 3 - Ready for Review Ready for review by team non-breaking Non-breaking change and removed 2 - In Progress Currenty a work in progress labels Mar 5, 2021

tfeher requested changes Mar 11, 2021


Nyrio added 4 - Waiting on Author Waiting for author to respond to review and removed 3 - Ready for Review Ready for review by team labels Mar 11, 2021

Nyrio added 3 commits March 12, 2021 08:02

Fix style, remove commented blocks, add precomputed C++ test, etc

66c4176

Merge branch 'branch-0.19' into fea-dbscan-precomputed

ced34e6

Style fix + test multi-batch in unit test, not only quality test

2401a02

Nyrio requested a review from tfeher March 12, 2021 16:50

Nyrio added 4 - Waiting on Reviewer Waiting for reviewer to review or respond and removed 4 - Waiting on Author Waiting for author to respond to review labels Mar 12, 2021

tfeher requested changes Mar 15, 2021


Nyrio mentioned this pull request Mar 15, 2021

[ENH] Support for scattered precomputed distance matrix for MNMG DBSCAN #3615

Open

Add 'metric' to get_param_names

1369829

Nyrio requested a review from tfeher March 15, 2021 19:01

tfeher approved these changes Mar 16, 2021


Nyrio added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 4 - Waiting on Reviewer Waiting for reviewer to review or respond labels Mar 16, 2021

teju85 approved these changes Mar 17, 2021


dantegd requested changes Mar 19, 2021

cpp/include/cuml/cluster/dbscan.hpp (outdated)

python/cuml/cluster/dbscan.pyx (outdated)

Nyrio mentioned this pull request Mar 19, 2021

Add Precomputed to the DistanceType enum (for cuML DBSCAN) rapidsai/raft#177

Merged

JohnZed added 4 - Waiting on Author Waiting for author to respond to review 0 - Blocked Cannot progress due to external reasons and removed 5 - Ready to Merge Testing and reviews complete, ready to merge labels Mar 24, 2021

Change metric enum to raft::distance::DistanceType

edd5a0b

Nyrio requested a review from a team as a code owner March 24, 2021 19:11

github-actions bot added the CMake label Mar 24, 2021

Merge branch 'branch-0.19' into fea-dbscan-precomputed

4273c96

# Conflicts: # cpp/cmake/Dependencies.cmake

Fix encoding bug

99797c5

Nyrio added 4 - Waiting on Reviewer Waiting for reviewer to review or respond and removed 0 - Blocked Cannot progress due to external reasons 4 - Waiting on Author Waiting for author to respond to review labels Mar 26, 2021

dantegd approved these changes Mar 30, 2021


rapids-bot bot merged commit 4f4ae58 into rapidsai:branch-0.19 Mar 30, 2021
