Error with high min_cluster_size - struct.error 'i' format requires.... #250

birdsarah · 2018-11-07T21:18:07Z

I was working with data that has ~100,000 rows, and 2 columns. I was exploring increasing min_cluster_size up to high numbers to watch the effect. At min_cluster_size=5000 I got the following error, which surprised me somewhat.

I'm really not sure what's happening, or whether this is even an issue I should be reporting to HDBSCAN, so feel free to close if it doesn't look relevant.

Here are the plots for N=3000 and N=4000. The large cluster is very big, so I was expecting this to work.

min_samples was set to None.

---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/bird/miniconda3/envs/ovscrptd/lib/python3.6/site-packages/sklearn/externals/joblib/externals/loky/process_executor.py", line 346, in _sendback_result
    exception=exception))
  File "/home/bird/miniconda3/envs/ovscrptd/lib/python3.6/site-packages/sklearn/externals/joblib/externals/loky/backend/queues.py", line 241, in put
    self._writer.send_bytes(obj)
  File "/home/bird/miniconda3/envs/ovscrptd/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/home/bird/miniconda3/envs/ovscrptd/lib/python3.6/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
"""

The above exception was the direct cause of the following exception:

error                                     Traceback (most recent call last)
<ipython-input-62-5c6dbedb6337> in <module>
      7         min_cluster_size=N,
      8         min_samples=None
----> 9     ).fit_predict(embedding_df)
     10     labels_df[labels_col] = labels
     11 

~/miniconda3/envs/ovscrptd/lib/python3.6/site-packages/hdbscan/hdbscan_.py in fit_predict(self, X, y)
    874             cluster labels
    875         """
--> 876         self.fit(X)
    877         return self.labels_
    878 

~/miniconda3/envs/ovscrptd/lib/python3.6/site-packages/hdbscan/hdbscan_.py in fit(self, X, y)
    852          self._condensed_tree,
    853          self._single_linkage_tree,
--> 854          self._min_spanning_tree) = hdbscan(X, **kwargs)
    855 
    856         if self.prediction_data:

~/miniconda3/envs/ovscrptd/lib/python3.6/site-packages/hdbscan/hdbscan_.py in hdbscan(X, min_cluster_size, min_samples, alpha, metric, p, leaf_size, algorithm, memory, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, cluster_selection_method, allow_single_cluster, match_reference_implementation, **kwargs)
    572                                              approx_min_span_tree,
    573                                              gen_min_span_tree,
--> 574                                              core_dist_n_jobs, **kwargs)
    575         else:  # Metric is a valid BallTree metric
    576             # TO DO: Need heuristic to decide when to go to boruvka;

~/miniconda3/envs/ovscrptd/lib/python3.6/site-packages/sklearn/externals/joblib/memory.py in __call__(self, *args, **kwargs)
    327 
    328     def __call__(self, *args, **kwargs):
--> 329         return self.func(*args, **kwargs)
    330 
    331     def call_and_shelve(self, *args, **kwargs):

~/miniconda3/envs/ovscrptd/lib/python3.6/site-packages/hdbscan/hdbscan_.py in _hdbscan_boruvka_kdtree(X, min_samples, alpha, metric, p, leaf_size, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, **kwargs)
    251                                  leaf_size=leaf_size // 3,
    252                                  approx_min_span_tree=approx_min_span_tree,
--> 253                                  n_jobs=core_dist_n_jobs, **kwargs)
    254     min_spanning_tree = alg.spanning_tree()
    255     # Sort edges of the min_spanning_tree by weight

hdbscan/_hdbscan_boruvka.pyx in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__()

hdbscan/_hdbscan_boruvka.pyx in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds()

~/miniconda3/envs/ovscrptd/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
    994 
    995             with self._backend.retrieval_context():
--> 996                 self.retrieve()
    997             # Make sure that we get a last message telling us we are done
    998             elapsed_time = time.time() - self._start_time

~/miniconda3/envs/ovscrptd/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in retrieve(self)
    897             try:
    898                 if getattr(self._backend, 'supports_timeout', False):
--> 899                     self._output.extend(job.get(timeout=self.timeout))
    900                 else:
    901                     self._output.extend(job.get())

~/miniconda3/envs/ovscrptd/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
    515         AsyncResults.get from multiprocessing."""
    516         try:
--> 517             return future.result(timeout=timeout)
    518         except LokyTimeoutError:
    519             raise TimeoutError()

~/miniconda3/envs/ovscrptd/lib/python3.6/concurrent/futures/_base.py in result(self, timeout)
    430                 raise CancelledError()
    431             elif self._state == FINISHED:
--> 432                 return self.__get_result()
    433             else:
    434                 raise TimeoutError()

~/miniconda3/envs/ovscrptd/lib/python3.6/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

error: 'i' format requires -2147483648 <= number <= 2147483647


birdsarah · 2018-11-07T21:20:04Z

conda env pieces (let me know if you need more):

scikit-learn              0.20.0           py36h4989274_1 
hdbscan                   0.8.18           py36h7eb728f_0    conda-forge
python                    3.6.6                h5001a0f_3    conda-forge

lmcinnes · 2018-11-07T22:17:12Z

A quick workaround is to set ``min_samples`` to something. With it set to ``None`` it defaults to ``min_samples=min_cluster_size``, which means the algorithm will be hunting for the 5000 nearest neighbors of every point internally. That may be a little expensive, and is almost undoubtedly associated with this error. That being said, it should still not be erroring like this. I don't know quite what has gone wrong, but it seems to be in the distribution of the nearest neighbor search stage. It is possible that setting ``core_dist_n_jobs=1`` may resolve the issue, but I honestly can't say. I will try to look into this when I get some time, but I can't promise a swift resolution beyond the workarounds offered here.
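A rough sketch of why a 5000-neighbor search could plausibly hit this limit, and of the workaround parameters described above. The traceback shows the message length being packed with ``struct.pack("!i", n)``, so any single message over 2147483647 bytes fails. The helper names here are hypothetical, the 16 bytes per neighbor entry (one float64 distance plus one int64 index) is an assumption about what the workers send back, and ``min_samples=10`` is an arbitrary illustrative value, not a recommendation.

```python
import struct

INT32_MAX = 2**31 - 1  # largest value the 'i' struct format accepts


def fits_in_header(n_bytes):
    """True if a payload length survives struct.pack("!i", n), the call
    that fails in multiprocessing/connection.py in the traceback."""
    try:
        struct.pack("!i", n_bytes)
        return True
    except struct.error:
        return False


def knn_result_bytes(n_points, k, bytes_per_entry=16):
    """Estimated size of a dense k-nearest-neighbor result.

    ASSUMPTION: each (point, neighbor) entry costs 16 bytes
    (float64 distance + int64 index); pickling overhead is ignored.
    """
    return n_points * k * bytes_per_entry


def workaround_kwargs(min_cluster_size, min_samples=10, single_process=True):
    """Keyword arguments for the workaround discussed in this thread.

    min_samples is set explicitly so it does not default to
    min_cluster_size; core_dist_n_jobs=1 is the untested suggestion
    for avoiding the inter-process transfer altogether.
    """
    kwargs = {"min_cluster_size": min_cluster_size, "min_samples": min_samples}
    if single_process:
        kwargs["core_dist_n_jobs"] = 1
    return kwargs


# ~100,000 rows with min_samples defaulting to min_cluster_size=5000
estimate = knn_result_bytes(100_000, 5000)
print(estimate, fits_in_header(estimate))

# Usage with hdbscan installed (not executed here):
#   import hdbscan
#   labels = hdbscan.HDBSCAN(**workaround_kwargs(5000)).fit_predict(embedding_df)
```

The estimate covers the whole dataset. How the work is actually split across processes is not shown in the thread, so the per-message size may be smaller than this figure.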


birdsarah · 2018-11-08T00:51:17Z

No stress. I'm definitely not going to use values in this range. I was just curious, and the error felt like it could be reported more clearly, so I figured I'd post it.

Thanks so much for the swift reply.

Keep up the great work @lmcinnes. Am using hdbscan and umap extensively.

gymbeijing mentioned this issue Nov 2, 2023

struct.error: 'i' format requires -2147483648 <= number <= 2147483647 MaartenGr/BERTopic#1612

Closed
