
struct.error: 'i' format requires -2147483648 <= number <= 2147483647 #1612

Closed · gymbeijing opened this issue Nov 2, 2023 · 2 comments

gymbeijing commented Nov 2, 2023

Hi, thank you for releasing this helpful tool! I hit this error when I tried to run clustering on my 555k documents (tweet texts, actually).

[2023-11-02 03:16:17,821]:[MainProcess][INFO]:[root] Model fit and transform documents(len=555297)
[2023-11-02 03:16:17,985]:[MainProcess][INFO]:[sentence_transformers.SentenceTransformer] Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2
[2023-11-02 03:16:18,477]:[MainProcess][INFO]:[sentence_transformers.SentenceTransformer] Use pytorch device: cuda
joblib.externals.loky.process_executor._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/homes/yg007/nytimes_project/venv_nyc/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 344, in _sendback_result
    exception=exception))
  File "/homes/yg007/nytimes_project/venv_nyc/lib/python3.7/site-packages/joblib/externals/loky/backend/queues.py", line 192, in put
    self._writer.send_bytes(obj)
  File "/import/linux/python/3.7.7/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/import/linux/python/3.7.7/lib/python3.7/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "cluster_text.py", line 88, in <module>
    topics, probs = topic_model.fit_transform(tweet_texts)
  File "/homes/yg007/nytimes_project/venv_nyc/lib/python3.7/site-packages/bertopic/_bertopic.py", line 389, in fit_transform
    documents, probabilities = self._cluster_embeddings(umap_embeddings, documents, y=y)
  File "/homes/yg007/nytimes_project/venv_nyc/lib/python3.7/site-packages/bertopic/_bertopic.py", line 3218, in _cluster_embeddings
    self.hdbscan_model.fit(umap_embeddings, y=y)
  File "/homes/yg007/nytimes_project/venv_nyc/lib/python3.7/site-packages/hdbscan/hdbscan_.py", line 1205, in fit
    ) = hdbscan(clean_data, **kwargs)
  File "/homes/yg007/nytimes_project/venv_nyc/lib/python3.7/site-packages/hdbscan/hdbscan_.py", line 849, in hdbscan
    **kwargs
  File "/homes/yg007/nytimes_project/venv_nyc/lib/python3.7/site-packages/joblib/memory.py", line 349, in __call__
    return self.func(*args, **kwargs)
  File "/homes/yg007/nytimes_project/venv_nyc/lib/python3.7/site-packages/hdbscan/hdbscan_.py", line 347, in _hdbscan_boruvka_kdtree
    **kwargs
  File "hdbscan/_hdbscan_boruvka.pyx", line 392, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__
  File "hdbscan/_hdbscan_boruvka.pyx", line 426, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds
  File "/homes/yg007/nytimes_project/venv_nyc/lib/python3.7/site-packages/joblib/parallel.py", line 1098, in __call__
    self.retrieve()
  File "/homes/yg007/nytimes_project/venv_nyc/lib/python3.7/site-packages/joblib/parallel.py", line 975, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/homes/yg007/nytimes_project/venv_nyc/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 567, in wrap_future_result
    return future.result(timeout=timeout)
  File "/import/linux/python/3.7.7/lib/python3.7/concurrent/futures/_base.py", line 435, in result
    return self.__get_result()
  File "/import/linux/python/3.7.7/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
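
The innermost frame, struct.pack("!i", n), is multiprocessing's 4-byte message-length header: joblib's loky backend pickles each worker's result and sends it over a pipe whose signed 32-bit length field overflows once a single payload exceeds 2 GiB. (Python 3.8 reportedly lifted this particular limit, but this environment runs 3.7.7.) The limit reproduces in isolation; a minimal sketch, independent of hdbscan:

import struct

struct.pack("!i", 2**31 - 1)  # OK: the largest size the header can encode
struct.pack("!i", 2**31)      # struct.error: 'i' format requires
                              # -2147483648 <= number <= 2147483647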

Could you help me solve this problem? Thanks!

Attached my code for reference:

#!/usr/bin/env python
# coding: utf-8

import argparse
import json
import logging
import os

import numpy as np
import pandas as pd
from tqdm.auto import tqdm
from bertopic import BERTopic
from hdbscan import HDBSCAN
from umap import UMAP
import re
from nltk.tokenize import TweetTokenizer


# Logger
logger = logging.getLogger()
logging.basicConfig(
    level=os.environ.get("LOGLEVEL", "INFO"),
    format="[%(asctime)s]:[%(processName)-11s]" + "[%(levelname)-s]:[%(name)s] %(message)s",
)

# Tokenizer
tt = TweetTokenizer()

# # Environment variable - doesn't affect the error
# os.environ["TOKENIZERS_PARALLELISM"] = "false"


def remove_url(text):
    """Remove URLs from a sample string"""
    return re.sub(r"http\S+", '', text)


def remove_punc(text):
    """Remove punctuation from a sample string"""
    return re.sub(r'[^\w\s]', '', text)


def preprocess(text):
    preprocessed_text = ' '.join(tt.tokenize(text))
    preprocessed_text = remove_punc(remove_url(preprocessed_text))
    return preprocessed_text


def parse_args():
    p = argparse.ArgumentParser()
    p.add_argument("--topic", type=str, required=True, help="{climate, covid, military}")

    args = p.parse_args()
    return args


if __name__ == "__main__":
    param_dict = {"climate": {"min_cluster_size": 400, "cluster_selection_epsilon": 0.56},
                  "covid": {"min_cluster_size": 1200, "cluster_selection_epsilon": 0.65},
                  "military": {"min_cluster_size": 100, "cluster_selection_epsilon": 0.6}}
    # Parse arguments
    args = parse_args()
    topic = args.topic

    logging.info("Prepare topic model")
    param = param_dict[topic]
    hdbscan_model = HDBSCAN(min_cluster_size=param["min_cluster_size"], metric='euclidean',
                            cluster_selection_method='eom', cluster_selection_epsilon=param["cluster_selection_epsilon"],
                            prediction_data=True)
    umap_model = UMAP(n_neighbors=10, n_components=20, min_dist=0.0, metric='cosine')
    topic_model = BERTopic(hdbscan_model=hdbscan_model, umap_model=umap_model)

    logging.info("Read .feather file")
    train_feather_path = '../raw_data/train_completed_exist.feather'
    train_df = pd.read_feather(train_feather_path)  # non-existent rows already dropped

    val_feather_path = '../raw_data/val_completed_exist.feather'
    val_df = pd.read_feather(val_feather_path)  # non-existent rows already dropped

    df = pd.concat([train_df, val_df])

    logging.info("Prepare documents")
    df['preprocessed_full_text'] = df['full_text'].apply(preprocess)
    temp_df = df[df['topic'].str.contains(topic)]
    tweet_texts = temp_df['preprocessed_full_text'].unique().tolist()

    logging.info(f"Model fit and transform documents(len={len(tweet_texts)})")
    topics, probs = topic_model.fit_transform(tweet_texts)
    logging.info("Get topic info")
    print(topic_model.get_topic_info()[["Topic", "Count", "Name"]])
gymbeijing (Author) commented

Fixed. The error originated from the call into hdbscan. Following hdbscan issue #250 and hdbscan issue #372, I lowered min_cluster_size and the error is resolved for now.
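
In case it helps anyone else, here is a sketch of what the fix looks like against the script above. The parameter values are illustrative, not the ones I actually settled on, and core_dist_n_jobs is an hdbscan parameter I did not test here:

from hdbscan import HDBSCAN

# What resolved it here: a smaller min_cluster_size shrinks the
# intermediate arrays that joblib pickles between worker processes,
# keeping each payload under the 2 GiB pipe limit.
# (min_cluster_size=600 is illustrative, not the value actually used.)
hdbscan_model = HDBSCAN(min_cluster_size=600, metric='euclidean',
                        cluster_selection_method='eom',
                        cluster_selection_epsilon=0.65,
                        prediction_data=True)

# Untested alternative: compute core distances in a single process so
# nothing oversized crosses the multiprocessing pipe at all.
hdbscan_model_alt = HDBSCAN(min_cluster_size=1200, metric='euclidean',
                            cluster_selection_method='eom',
                            prediction_data=True,
                            core_dist_n_jobs=1)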

MaartenGr (Owner) commented

Great! Glad to hear that the issue was resolved.
