Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop using boston housing dataset #6180

Open
wants to merge 1 commit into
base: branch-25.02
Choose a base branch
from

Conversation

betatim
Copy link
Member

@betatim betatim commented Dec 12, 2024

Instead we use the california housing dataset.

closes #5158

Copy link

copy-pr-bot bot commented Dec 12, 2024

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions bot added the Cython / Python Cython or Python issue label Dec 12, 2024
Instead we use the california housing dataset.
@betatim betatim marked this pull request as ready for review December 12, 2024 13:52
@betatim betatim requested a review from a team as a code owner December 12, 2024 13:52
@betatim betatim requested review from bdice and viclafargue December 12, 2024 13:52
@bdice
Copy link
Contributor

bdice commented Dec 12, 2024

Looks like there are some failures in hdbscan with the new dataset:

=========================== short test summary info ============================
FAILED test_hdbscan.py::test_hdbscan_sklearn_datasets[housing_dataset-knn-leaf-True-min_samples_cluster_size_bounds1-0.0] - AssertionError: 
Arrays are not almost equal to -25 decimals

(shapes (47,), (48,) mismatch)
 ACTUAL: array([   25,    27,    28,    29,    29,    29,    31,    32,    33,
          35,    36,    36,    40,    41,    43,    45,    47,    49,
          50,    54,    61,    62,    65,    71,    72,    72,    78,...
 DESIRED: array([   25,    25,    25,    26,    29,    30,    30,    31,    32,
          33,    35,    35,    35,    37,    40,    41,    44,    44,
          47,    49,    50,    57,    60,    61,    64,    71,    71,...
FAILED test_hdbscan.py::test_hdbscan_sklearn_datasets[housing_dataset-knn-leaf-False-min_samples_cluster_size_bounds1-0.0] - AssertionError: 
Arrays are not almost equal to -25 decimals

(shapes (47,), (48,) mismatch)
 ACTUAL: array([   25,    27,    28,    29,    29,    29,    31,    32,    33,
          35,    36,    36,    40,    41,    43,    45,    47,    49,
          50,    54,    61,    62,    65,    71,    72,    72,    78,...
 DESIRED: array([   25,    25,    25,    26,    29,    30,    30,    31,    32,
          33,    35,    35,    35,    37,    40,    41,    44,    44,
          47,    49,    50,    57,    60,    61,    64,    71,    71,...
= 2 failed, 14387 passed, 6025 skipped, 646 xfailed, 66 xpassed, 1152 warnings in 4889.78s (1:21:29) =

https://github.com/rapidsai/cuml/actions/runs/12297823749/job/34322294951?pr=6180#step:8:8331

@dantegd
Copy link
Member

dantegd commented Dec 13, 2024

@betatim can you open an issue with the log of the error?

I think we should use this PR to debug that failure. Specifically, this fails with this specific set of relevant parameters:

  • cluster_selection_method=leaf not eom
  • cluster_selection_epsilon=0.0 not the larger ones
  • min_samples=50
  • min_cluster_size=25

This combination of parameters should give us a clue when debugging. Important that it is with cluster_selection_epsilon=0.0, which means it'd affect out of sample inference.

On a very first quick thought, this seems to suggest to me a bug either in the leaf code in

void leaf(const raft::handle_t& handle,
or somewhere else in the cluster selection logic.

@dantegd
Copy link
Member

dantegd commented Dec 13, 2024

Something additional that always frustrates me about assert in pytests, I'd love to know beforehand the results of the following assertions (calculating adjusted_rand_score and checking for matching cluster_persistence_). In more recent tests, I've been moving away from a simple assert/failing on first thing, and accumulating the errors in an error list to fail and display at the end of the test:

# Report all errors at the end

Might be useful to addopt that approach in general, will open an issue to capture that as well.

@betatim
Copy link
Member Author

betatim commented Dec 13, 2024

An interesting thing that I didn't spot initially is that the length of the arrays is different. So isn't just the number of samples assigned to each cluster that differs, but also the number of clusters in total (I think).

Poking around a bit to understand what is what

@betatim
Copy link
Member Author

betatim commented Dec 13, 2024

adjusted_rand_score(cuml_agg.labels_, sk_agg.labels_)=0.9694956618348944

@divyegala
Copy link
Member

divyegala commented Dec 17, 2024

@betatim @dantegd

Here's a full sequence of min_samples values that occur in the mutual reachability graph for the last data point in the housing dataset:

10.9316,10.9316,10.9316,10.9316,13.351,11.2694,10.9316,15.2398,10.9316,10.9316,11.3027,12.227,11.1692,10.9772,13.3884,10.9316,13.3978,11.4673,12.6491,11.3798,10.9316,10.9316,10.9316,11.2694,10.9316,10.9316,10.9316,10.9316,10.9316,10.9316,10.9316,12.1347,15.6445,10.9316,18.2688,11.8954,10.9316,11.5434,12.052,10.9316,11.2138,10.9316,12.5399,10.9316,10.9316,10.9316,10.9316,10.9316,11.3248,10.9316,10.9316,14.2653,10.9316,11.1131,11.6833,10.9316,10.9316,12.0934,10.9316,11.2027,10.9316,10.9316,10.9316,10.9316,10.9316,10.9316,3.40282e+38

where the min of this array ,10.9316, is highly repetitive. In fact, there appear to be very few unique edge weights for this data point. Thus, no parallel MST can be deterministic as you could select any edge weight with 10.9316 to build the MST. A sequential MST, however, would be deterministic and always select the same edge despite the repetitions (most likely the first instance of 10.9316). Once this difference occurs, we are looking at a possibility of completely different dendrograms being constructed, which leads to very different final solutions.

NOTE: I can also verify with an eye-test that almost all data points in the mutual reachability graph seem to have similar edge weight patterns.

@betatim
Copy link
Member Author

betatim commented Dec 19, 2024

The take away here is that we can't make the current test pass and that is a feature (not a bug)? Because cuml uses a parallel MST structure we can't promise to break ties in the same way as a sequential implementation.

If yes, I see two(+1) options:

  1. special case the california housing dataset to only check e.g. the score but not the exact clustering
  2. in general loosen the requirements in the test because we are just being lucky that it passes for the other datasets
  3. try other datasets until we find one that passes the test

WDYT?

@divyegala
Copy link
Member

@betatim I would want to try a dataset with fewer repetitive patterns. California housing dataset has a bunch of repetitive values in the first 3 columns.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Cython / Python Cython or Python issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Remove usage of Boston dataset in test files
4 participants