Error with high min_cluster_size - struct.error 'i' format requires.... #250

Open
birdsarah opened this issue Nov 7, 2018 · 3 comments

I was working with data that has ~100,000 rows and 2 columns. I was exploring increasing min_cluster_size to high values to watch the effect. At min_cluster_size=5000 I got the following error, which surprised me somewhat.

I'm really not sure what's happening, or whether this is even an issue I should be reporting to HDBSCAN, so feel free to close if it doesn't look relevant.

Here are the plots for N=3000 and N=4000. The large cluster is very big, so I was expecting 5000 to work as well.

min_samples was set to None.

[Screenshot from 2018-11-07 15-14-20: plots for N=3000 and N=4000]

---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/bird/miniconda3/envs/ovscrptd/lib/python3.6/site-packages/sklearn/externals/joblib/externals/loky/process_executor.py", line 346, in _sendback_result
    exception=exception))
  File "/home/bird/miniconda3/envs/ovscrptd/lib/python3.6/site-packages/sklearn/externals/joblib/externals/loky/backend/queues.py", line 241, in put
    self._writer.send_bytes(obj)
  File "/home/bird/miniconda3/envs/ovscrptd/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/home/bird/miniconda3/envs/ovscrptd/lib/python3.6/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
"""

The above exception was the direct cause of the following exception:

error                                     Traceback (most recent call last)
<ipython-input-62-5c6dbedb6337> in <module>
      7         min_cluster_size=N,
      8         min_samples=None
----> 9     ).fit_predict(embedding_df)
     10     labels_df[labels_col] = labels
     11 

~/miniconda3/envs/ovscrptd/lib/python3.6/site-packages/hdbscan/hdbscan_.py in fit_predict(self, X, y)
    874             cluster labels
    875         """
--> 876         self.fit(X)
    877         return self.labels_
    878 

~/miniconda3/envs/ovscrptd/lib/python3.6/site-packages/hdbscan/hdbscan_.py in fit(self, X, y)
    852          self._condensed_tree,
    853          self._single_linkage_tree,
--> 854          self._min_spanning_tree) = hdbscan(X, **kwargs)
    855 
    856         if self.prediction_data:

~/miniconda3/envs/ovscrptd/lib/python3.6/site-packages/hdbscan/hdbscan_.py in hdbscan(X, min_cluster_size, min_samples, alpha, metric, p, leaf_size, algorithm, memory, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, cluster_selection_method, allow_single_cluster, match_reference_implementation, **kwargs)
    572                                              approx_min_span_tree,
    573                                              gen_min_span_tree,
--> 574                                              core_dist_n_jobs, **kwargs)
    575         else:  # Metric is a valid BallTree metric
    576             # TO DO: Need heuristic to decide when to go to boruvka;

~/miniconda3/envs/ovscrptd/lib/python3.6/site-packages/sklearn/externals/joblib/memory.py in __call__(self, *args, **kwargs)
    327 
    328     def __call__(self, *args, **kwargs):
--> 329         return self.func(*args, **kwargs)
    330 
    331     def call_and_shelve(self, *args, **kwargs):

~/miniconda3/envs/ovscrptd/lib/python3.6/site-packages/hdbscan/hdbscan_.py in _hdbscan_boruvka_kdtree(X, min_samples, alpha, metric, p, leaf_size, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, **kwargs)
    251                                  leaf_size=leaf_size // 3,
    252                                  approx_min_span_tree=approx_min_span_tree,
--> 253                                  n_jobs=core_dist_n_jobs, **kwargs)
    254     min_spanning_tree = alg.spanning_tree()
    255     # Sort edges of the min_spanning_tree by weight

hdbscan/_hdbscan_boruvka.pyx in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__()

hdbscan/_hdbscan_boruvka.pyx in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds()

~/miniconda3/envs/ovscrptd/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
    994 
    995             with self._backend.retrieval_context():
--> 996                 self.retrieve()
    997             # Make sure that we get a last message telling us we are done
    998             elapsed_time = time.time() - self._start_time

~/miniconda3/envs/ovscrptd/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in retrieve(self)
    897             try:
    898                 if getattr(self._backend, 'supports_timeout', False):
--> 899                     self._output.extend(job.get(timeout=self.timeout))
    900                 else:
    901                     self._output.extend(job.get())

~/miniconda3/envs/ovscrptd/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
    515         AsyncResults.get from multiprocessing."""
    516         try:
--> 517             return future.result(timeout=timeout)
    518         except LokyTimeoutError:
    519             raise TimeoutError()

~/miniconda3/envs/ovscrptd/lib/python3.6/concurrent/futures/_base.py in result(self, timeout)
    430                 raise CancelledError()
    431             elif self._state == FINISHED:
--> 432                 return self.__get_result()
    433             else:
    434                 raise TimeoutError()

~/miniconda3/envs/ovscrptd/lib/python3.6/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

error: 'i' format requires -2147483648 <= number <= 2147483647
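
The innermost frame of the traceback packs the message length into a signed 32-bit header, so a single result larger than 2**31 - 1 bytes (about 2 GiB) cannot be sent back from a worker process. The sketch below reproduces that limit with the standard library and adds a rough size estimate. `fits_in_header` and `estimated_knn_bytes` are illustrative names, not part of joblib or hdbscan. The 16 bytes per neighbour entry, the worker count of 4, and the idea that the neighbour count follows min_cluster_size when min_samples is None are all assumptions.

```python
import struct

INT32_MAX = 2**31 - 1  # 2147483647, the upper bound quoted in the error


def fits_in_header(n_bytes):
    """Return True if a payload length can be packed as a signed 32-bit int,
    which is what struct.pack("!i", n) in the traceback attempts."""
    try:
        struct.pack("!i", n_bytes)
        return True
    except struct.error:
        return False


def estimated_knn_bytes(n_rows, k, n_workers, bytes_per_entry=16):
    """Rough size of one worker's nearest-neighbour result.

    Assumption: each (row, neighbour) pair costs one float64 distance plus
    one int64 index, and rows are split evenly across workers.
    """
    rows_per_worker = -(-n_rows // n_workers)  # ceiling division
    return rows_per_worker * k * bytes_per_entry


# The largest 32-bit length fits; one byte more does not.
print(fits_in_header(INT32_MAX), fits_in_header(INT32_MAX + 1))

# Hypothetical estimate for ~100,000 rows split across 4 workers.
for k in (4000, 5000):
    size = estimated_knn_bytes(100_000, k, n_workers=4)
    print(k, size, fits_in_header(size))
```

With these assumed numbers the k=5000 estimate comes out close to the 2 GiB limit. The real payload depends on the actual row count, the number of workers, and pickling overhead.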

birdsarah (Author) commented Nov 7, 2018

conda env pieces (let me know if you need more):

scikit-learn              0.20.0           py36h4989274_1 
hdbscan                   0.8.18           py36h7eb728f_0    conda-forge
python                    3.6.6                h5001a0f_3    conda-forge

lmcinnes (Collaborator) commented Nov 7, 2018 via email

@birdsarah (Author)

No stress. I'm definitely not going to use values in this range. I was just curious, and the error felt like it could be reported more clearly, so I figured I'd post it.

Thanks so much for the swift reply.

Keep up the great work @lmcinnes. Am using hdbscan and umap extensively.
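
As an illustration of what clearer reporting could look like, here is a hypothetical sketch that checks the pickled size before anything is sent and raises a readable message instead of a bare struct.error. `payload_error_message` and `checked_dumps` are made-up names for this example; nothing like them is known to exist in joblib or hdbscan, and the wording of the message is only a suggestion.

```python
import pickle

# Largest length a signed 32-bit header can carry (the bound in the error above).
MAX_PAYLOAD_BYTES = 2**31 - 1


def payload_error_message(n_bytes, limit=MAX_PAYLOAD_BYTES):
    """Return a readable explanation if n_bytes exceeds limit, else None."""
    if n_bytes <= limit:
        return None
    return (
        f"Result is {n_bytes} bytes, but at most {limit} bytes can be sent "
        "between processes in one message. Consider fewer neighbours, "
        "smaller chunks, or a single process."
    )


def checked_dumps(obj, limit=MAX_PAYLOAD_BYTES):
    """Pickle obj and raise a descriptive ValueError if the result is too large."""
    data = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    message = payload_error_message(len(data), limit)
    if message is not None:
        raise ValueError(message)
    return data


# A small object passes the check; an artificially low limit shows the message.
small = checked_dumps({"labels": [0, 1, 1]})
print(len(small))
print(payload_error_message(len(small), limit=8))
```

The check is split from the pickling step so the message can be produced from a size estimate alone, without building the oversized payload first.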
