-
Notifications
You must be signed in to change notification settings - Fork 532
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEA] Enable HDBSCAN to build knn graph using NN Descent #5939
base: branch-24.10
Are you sure you want to change the base?
Conversation
Adding distance epilogue to NN Descent. Planning to use them for the following use cases as of now; - Calculating mutual reachability distance for HDBSCAN (possibly for usage with HDBSCAN - [related PR here](rapidsai/cuml#5939)) - Enabling `L2SqrtExpanded` distance metric, by `sqrt`-ing the current supported metric (`L2Expanded`) of NN Descent in the distance epilogue. (for usage with UMAP - [related PR here](rapidsai/cuml#5910)) Authors: - Jinsol Park (https://github.com/jinsolp) Approvers: - Tarang Jain (https://github.com/tarang-jain) - Tamas Bela Feher (https://github.com/tfeher) - Divye Gala (https://github.com/divyegala) URL: #2364
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly looked the Python side, LGTM. Congrats Jinsol!
if self.build_algo == "auto": | ||
if self.n_rows <= 50000: | ||
# brute force is faster for small datasets | ||
logger.warn("Building knn graph using brute force") | ||
self.build_algo = "brute_force_knn" | ||
else: | ||
logger.warn("Building knn graph using nn descent") | ||
self.build_algo = "nn_descent" | ||
|
||
if self.build_algo == 'brute_force_knn': | ||
params.build_algo = GRAPH_BUILD_ALGO.BRUTE_FORCE_KNN | ||
elif self.build_algo == 'nn_descent': | ||
params.build_algo = GRAPH_BUILD_ALGO.NN_DESCENT | ||
if self.build_kwds is None: | ||
params.nn_descent_params.graph_degree = <size_t> 64 | ||
params.nn_descent_params.intermediate_graph_degree = <size_t> 128 | ||
params.nn_descent_params.max_iterations = <size_t> 20 | ||
params.nn_descent_params.termination_threshold = <float> 0.0001 | ||
params.nn_descent_params.return_distances = <bool> True | ||
else: | ||
params.nn_descent_params.graph_degree = <size_t> self.build_kwds.get("nnd_graph_degree", 64) | ||
params.nn_descent_params.intermediate_graph_degree = <size_t> self.build_kwds.get("nnd_intermediate_graph_degree", 128) | ||
params.nn_descent_params.max_iterations = <size_t> self.build_kwds.get("nnd_max_iterations", 20) | ||
params.nn_descent_params.termination_threshold = <float> self.build_kwds.get("nnd_termination_threshold", 0.0001) | ||
params.nn_descent_params.return_distances = <bool> self.build_kwds.get("nnd_return_distances", True) | ||
else: | ||
raise ValueError("Build algo not supported. " | ||
"Must one of {'brute_force_knn', 'nn_descent'}") | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice to create a function to avoid repeating this section of code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
100% agree w/ @viclafargue here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry Jinsol! I reviewed this awhile back and didn't submit. I think I was still working my way through all of your changes but let me submit this to start.
} | ||
}; | ||
|
||
template <typename T> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you create an issue in raft to eventually move these functions over to raft::matrix
? They are certainly useful. Please reference the issue in a comment here.
|
||
DI value_t operator()(value_t value, value_idx row, value_idx col) const | ||
{ | ||
return max(core_dists[col], max(core_dists[row], powf(fabsf(alpha * value), 0.5))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sqrt is pretty expensive, which is why we like to hold off on doing until the end (and only if a user needs it). The thing about the sqrt is that it since it affects all the distances and since it doesn't change the relative ordering of the distances, we can usually avoid having to do it in place. Just a thougt- you might be able to save some precious compute cycles here.
"n_neighbors should be smaller than the graph degree computed by nn descent"); | ||
|
||
auto epilogue = ReachabilityPostProcessSqrt<value_idx, value_t>(core_dists, alpha); | ||
build_params.return_distances = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I almost wonder if we should add this to all the nn algos eventually. Scikit-learn has this exact option, for example, which applies to all the nn estimators. Can you create a cuVS issue for this? You don't have to reference it in the code here.
cdef index_params build_params | ||
|
||
if self.build_algo == "auto": | ||
if self.n_rows <= 50000: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing this in multiple places is a good indicator that we should create a new function for it so we can consolidate the similarities (and differences).
/ok to test |
/ok to test |
Description
Adds
build_algo
option to HDBSCAN, and allow HDBSCAN to build knn graphs using nn descentNow user can choose the knn graph build algorithm between "brute_force_knn" and "nn_descent"
Defaults to "auto", in which case decides to run with brute force knn or nn descent depending on the given dataset size.
"auto" decides to run with brute_force_knn if data has <= 50K rows. Otherwise decides to run with nn_descent.
Running Benchmarks