Hi, thank you for releasing this helpful tool! I hit this error when I tried to cluster my 555k documents (tweet texts, actually).
[2023-11-02 03:16:17,821]:[MainProcess][INFO]:[root] Model fit and transform documents(len=555297)
[2023-11-02 03:16:17,985]:[MainProcess][INFO]:[sentence_transformers.SentenceTransformer] Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2
[2023-11-02 03:16:18,477]:[MainProcess][INFO]:[sentence_transformers.SentenceTransformer] Use pytorch device: cuda
joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
File "/homes/yg007/nytimes_project/venv_nyc/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 344, in _sendback_result
exception=exception))
File "/homes/yg007/nytimes_project/venv_nyc/lib/python3.7/site-packages/joblib/externals/loky/backend/queues.py", line 192, in put
self._writer.send_bytes(obj)
File "/import/linux/python/3.7.7/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/import/linux/python/3.7.7/lib/python3.7/multiprocessing/connection.py", line 393, in _send_bytes
header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "cluster_text.py", line 88, in <module>
topics, probs = topic_model.fit_transform(tweet_texts)
File "/homes/yg007/nytimes_project/venv_nyc/lib/python3.7/site-packages/bertopic/_bertopic.py", line 389, in fit_transform
documents, probabilities = self._cluster_embeddings(umap_embeddings, documents, y=y)
File "/homes/yg007/nytimes_project/venv_nyc/lib/python3.7/site-packages/bertopic/_bertopic.py", line 3218, in _cluster_embeddings
self.hdbscan_model.fit(umap_embeddings, y=y)
File "/homes/yg007/nytimes_project/venv_nyc/lib/python3.7/site-packages/hdbscan/hdbscan_.py", line 1205, in fit
) = hdbscan(clean_data, **kwargs)
File "/homes/yg007/nytimes_project/venv_nyc/lib/python3.7/site-packages/hdbscan/hdbscan_.py", line 849, in hdbscan
**kwargs
File "/homes/yg007/nytimes_project/venv_nyc/lib/python3.7/site-packages/joblib/memory.py", line 349, in __call__
return self.func(*args, **kwargs)
File "/homes/yg007/nytimes_project/venv_nyc/lib/python3.7/site-packages/hdbscan/hdbscan_.py", line 347, in _hdbscan_boruvka_kdtree
**kwargs
File "hdbscan/_hdbscan_boruvka.pyx", line 392, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__
File "hdbscan/_hdbscan_boruvka.pyx", line 426, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds
File "/homes/yg007/nytimes_project/venv_nyc/lib/python3.7/site-packages/joblib/parallel.py", line 1098, in __call__
self.retrieve()
File "/homes/yg007/nytimes_project/venv_nyc/lib/python3.7/site-packages/joblib/parallel.py", line 975, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/homes/yg007/nytimes_project/venv_nyc/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 567, in wrap_future_result
return future.result(timeout=timeout)
File "/import/linux/python/3.7.7/lib/python3.7/concurrent/futures/_base.py", line 435, in result
return self.__get_result()
File "/import/linux/python/3.7.7/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
Could you help me solve this problem? Thanks!
Attached my code for reference:
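The attached code did not survive here, so the following is only a minimal reconstruction of the call visible in the traceback and log above (the embedding model name is taken from the log; `tweet_texts` stands in for the real 555k-document list):

```python
from bertopic import BERTopic

# tweet_texts: a list of ~555k tweet strings (placeholder for the real data)
topic_model = BERTopic(embedding_model="sentence-transformers/all-MiniLM-L6-v2")
topics, probs = topic_model.fit_transform(tweet_texts)
```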
Fixed. The error originated in the call to hdbscan: its parallel Boruvka step sends a worker result larger than 2 GiB back through a multiprocessing pipe, and the pipe frames each message with a signed 32-bit length header (the struct.pack("!i", n) in the traceback), which overflows. Following hdbscan issue #250 and hdbscan issue #372, I lowered min_cluster_size and the error is resolved for now.
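For anyone hitting the same thing, here is a minimal sketch of that workaround, passing a custom clusterer through BERTopic's `hdbscan_model` parameter. The values are illustrative, not the poster's actual settings. A smaller `min_cluster_size` also lowers `min_samples` (which defaults to it), shrinking the k-NN result arrays the Boruvka workers return so they stay under the 2 GiB pipe limit:

```python
import hdbscan
from bertopic import BERTopic

# Smaller min_cluster_size -> smaller min_samples -> smaller arrays
# shipped between the parallel Boruvka worker processes.
cluster_model = hdbscan.HDBSCAN(
    min_cluster_size=100,           # illustrative; tune for your corpus
    metric="euclidean",
    cluster_selection_method="eom",
    prediction_data=True,           # BERTopic needs this for probabilities
    # core_dist_n_jobs=1,           # alternative: skip inter-process transfer entirely
)

topic_model = BERTopic(
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    hdbscan_model=cluster_model,
)
topics, probs = topic_model.fit_transform(tweet_texts)
```

Setting `core_dist_n_jobs=1` (commented out above) is another commonly suggested way around this class of error, since single-process core-distance computation avoids serializing large results across the pipe at all, at the cost of slower fitting.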