
UnicodeEncodeError: 'ascii' codec can't encode characters in position 18-20: ordinal not in range(128) #2107

Open
rudysterner opened this issue Aug 2, 2024 · 0 comments
Labels
bug Something isn't working

Comments

@rudysterner

Have you searched existing issues? 🔎

  • I have searched and found no existing issues

Describe the bug

Hello, and thank you for sharing this project. I am having a problem with the code below and would like to ask about it. There are two tasks: task A succeeds, but task B fails with the error UnicodeEncodeError: 'ascii' codec can't encode characters in position 18-20: ordinal not in range(128). Looking forward to your reply! (Note: task A has 9,995 texts; task B has more than 36,000 texts.)

  1. My code (task B) is as follows:

import numpy as np
from bertopic import BERTopic
from transformers.pipelines import pipeline
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer

with open(r'C:\Users\李书智\Desktop\切词后.txt', 'r', encoding='utf-8') as file:
  docs = file.readlines()
print('条数: ', len(docs))
print('预览第一条: ', docs[0])

vectorizer_model = None

# 1. Embedding model; load locally pre-computed document embeddings
#embedding_model = pipeline("feature-extraction", model="bert-base-chinese")  # use bert-base-chinese
#embeddings = np.load(R'C:\Users\李书智\Downloads\BBC.npy')  # bert-base-chinese embeddings
#print('向量shape:', embeddings.shape)

# Alternative: use the hfl model
# embedding_model = pipeline("feature-extraction", model="hfl/chinese-bert-wwm")
# embeddings = np.load(r'C:\Users\李书智\Downloads\BBC\emb.npy')
# print('向量shape:', embeddings.shape)

# Alternative: use a SentenceTransformers model
embedding_model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')
embeddings = np.load(r'C:\Users\李书智\Desktop\STweiboNEIRONG.npy')
print(embeddings.shape)

# 2. Create the UMAP dimensionality-reduction model
umap_model = UMAP(
  n_neighbors=15,
  n_components=5,
  min_dist=0.0,
  metric='cosine',
  random_state=30  # ⚠️ fix the random state for reproducible results: https://maartengr.github.io/BERTopic/faq.html
)

# 3. Create the HDBSCAN clustering model
# To reduce outliers, decrease min_cluster_size and min_samples
# https://hdbscan.readthedocs.io/en/latest/faq.html
hdbscan_model = HDBSCAN(
  min_cluster_size=50,
  min_samples=50,
  metric='euclidean'
)

# 5. Create the CountVectorizer model
vectorizer_model = CountVectorizer(stop_words=['洛阳', '旅游', '文化'])

# 6. Create the BERTopic model
topic_model = BERTopic(
  embedding_model=embedding_model,
  vectorizer_model=vectorizer_model,
  umap_model=umap_model,
  hdbscan_model=hdbscan_model,
)

# Inspect the topics
topics, probs = topic_model.fit_transform(docs, embeddings=embeddings)  # pass the pre-computed embeddings
topic_model.get_topic_info()

  2. The error is reported as follows:

{
"name": "UnicodeEncodeError",
"message": "'ascii' codec can't encode characters in position 18-20: ordinal not in range(128)",
"stack": "---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
Cell In[22], line 2
1 # 查看主题
----> 2 topics, probs = topic_model.fit_transform(docs, embeddings=embeddings) #传入训练好的词向量
3 topic_model.get_topic_info()

File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\bertopic\_bertopic.py:389, in BERTopic.fit_transform(self, documents, embeddings, images, y)
386 umap_embeddings = self._reduce_dimensionality(embeddings, y)
388 # Cluster reduced embeddings
--> 389 documents, probabilities = self._cluster_embeddings(umap_embeddings, documents, y=y)
391 # Sort and Map Topic IDs by their frequency
392 if not self.nr_topics:

File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\bertopic\_bertopic.py:3218, in BERTopic._cluster_embeddings(self, umap_embeddings, documents, partial_fit, y)
3216 else:
3217 try:
-> 3218 self.hdbscan_model.fit(umap_embeddings, y=y)
3219 except TypeError:
3220 self.hdbscan_model.fit(umap_embeddings)

File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\hdbscan\hdbscan_.py:1205, in HDBSCAN.fit(self, X, y)
1195 kwargs.pop("prediction_data", None)
1196 kwargs.update(self.metric_kwargs)
1198 (
1199 self.labels_,
1200 self.probabilities_,
1201 self.cluster_persistence_,
1202 self._condensed_tree,
1203 self._single_linkage_tree,
1204 self._min_spanning_tree,
-> 1205 ) = hdbscan(clean_data, **kwargs)
1207 if self.metric != "precomputed" and not self._all_finite:
1208 # remap indices to align with original data in the case of non-finite entries.
1209 self._condensed_tree = remap_condensed_tree(
1210 self._condensed_tree, internal_to_raw, outliers
1211 )

File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\hdbscan\hdbscan_.py:837, in hdbscan(X, min_cluster_size, min_samples, alpha, cluster_selection_epsilon, max_cluster_size, metric, p, leaf_size, algorithm, memory, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, cluster_selection_method, allow_single_cluster, match_reference_implementation, **kwargs)
824 (single_linkage_tree, result_min_span_tree) = memory.cache(
825 _hdbscan_prims_kdtree
826 )(
(...)
834 **kwargs
835 )
836 else:
--> 837 (single_linkage_tree, result_min_span_tree) = memory.cache(
838 _hdbscan_boruvka_kdtree
839 )(
840 X,
841 min_samples,
842 alpha,
843 metric,
844 p,
845 leaf_size,
846 approx_min_span_tree,
847 gen_min_span_tree,
848 core_dist_n_jobs,
849 **kwargs
850 )
851 else: # Metric is a valid BallTree metric
852 # TO DO: Need heuristic to decide when to go to boruvka;
853 # still debugging for now
854 if X.shape[1] > 60:

File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\joblib\memory.py:312, in NotMemorizedFunc.__call__(self, *args, **kwargs)
311 def __call__(self, *args, **kwargs):
--> 312 return self.func(*args, **kwargs)

File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\hdbscan\hdbscan_.py:340, in _hdbscan_boruvka_kdtree(X, min_samples, alpha, metric, p, leaf_size, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, **kwargs)
337 X = X.astype(np.float64)
339 tree = KDTree(X, metric=metric, leaf_size=leaf_size, **kwargs)
--> 340 alg = KDTreeBoruvkaAlgorithm(
341 tree,
342 min_samples,
343 metric=metric,
344 leaf_size=leaf_size // 3,
345 approx_min_span_tree=approx_min_span_tree,
346 n_jobs=core_dist_n_jobs,
347 **kwargs
348 )
349 min_spanning_tree = alg.spanning_tree()
350 # Sort edges of the min_spanning_tree by weight

File hdbscan\_hdbscan_boruvka.pyx:392, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__()

File hdbscan\_hdbscan_boruvka.pyx:426, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds()

File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\joblib\parallel.py:1909, in Parallel.__call__(self, iterable)
1906 self._start_time = time.time()
1908 if not self._managed_backend:
-> 1909 n_jobs = self._initialize_backend()
1910 else:
1911 n_jobs = self._effective_n_jobs()

File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\joblib\parallel.py:1359, in Parallel._initialize_backend(self)
1357 """Build a process or thread pool and return the number of workers"""
1358 try:
-> 1359 n_jobs = self._backend.configure(n_jobs=self.n_jobs, parallel=self,
1360 **self._backend_args)
1361 if self.timeout is not None and not self._backend.supports_timeout:
1362 warnings.warn(
1363 'The backend class {!r} does not support timeout. '
1364 "You have set 'timeout={}' in Parallel but "
1365 "the 'timeout' parameter will not be used.".format(
1366 self._backend.__class__.__name__,
1367 self.timeout))

File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\joblib\_parallel_backends.py:538, in LokyBackend.configure(self, n_jobs, parallel, prefer, require, idle_worker_timeout, **memmappingexecutor_args)
534 if n_jobs == 1:
535 raise FallbackToBackend(
536 SequentialBackend(nesting_level=self.nesting_level))
--> 538 self._workers = get_memmapping_executor(
539 n_jobs, timeout=idle_worker_timeout,
540 env=self._prepare_worker_env(n_jobs=n_jobs),
541 context_id=parallel._id, **memmappingexecutor_args)
542 self.parallel = parallel
543 return n_jobs

File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\joblib\executor.py:20, in get_memmapping_executor(n_jobs, **kwargs)
19 def get_memmapping_executor(n_jobs, **kwargs):
---> 20 return MemmappingExecutor.get_memmapping_executor(n_jobs, **kwargs)

File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\joblib\executor.py:42, in MemmappingExecutor.get_memmapping_executor(cls, n_jobs, timeout, initializer, initargs, env, temp_folder, context_id, **backend_args)
39 reuse = _executor_args is None or _executor_args == executor_args
40 _executor_args = executor_args
---> 42 manager = TemporaryResourcesManager(temp_folder)
44 # reducers access the temporary folder in which to store temporary
45 # pickles through a call to manager.resolve_temp_folder_name. resolving
46 # the folder name dynamically is useful to use different folders across
47 # calls of a same reusable executor
48 job_reducers, result_reducers = get_memmapping_reducers(
49 unlink_on_gc_collect=True,
50 temp_folder_resolver=manager.resolve_temp_folder_name,
51 **backend_args)

File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\joblib\_memmapping_reducer.py:540, in TemporaryResourcesManager.__init__(self, temp_folder_root, context_id)
534 if context_id is None:
535 # It would be safer to not assign a default context id (less silent
536 # bugs), but doing this while maintaining backward compatibility
537 # with the previous, context-unaware version get_memmaping_executor
538 # exposes too many low-level details.
539 context_id = uuid4().hex
--> 540 self.set_current_context(context_id)

File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\joblib\_memmapping_reducer.py:544, in TemporaryResourcesManager.set_current_context(self, context_id)
542 def set_current_context(self, context_id):
543 self._current_context_id = context_id
--> 544 self.register_new_context(context_id)

File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\joblib\_memmapping_reducer.py:569, in TemporaryResourcesManager.register_new_context(self, context_id)
562 new_folder_name = (
563 "joblib_memmapping_folder_{}_{}_{}".format(
564 os.getpid(), self._id, context_id)
565 )
566 new_folder_path, _ = _get_temp_dir(
567 new_folder_name, self._temp_folder_root
568 )
--> 569 self.register_folder_finalizer(new_folder_path, context_id)
570 self._cached_temp_folders[context_id] = new_folder_path

File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\joblib\_memmapping_reducer.py:585, in TemporaryResourcesManager.register_folder_finalizer(self, pool_subfolder, context_id)
578 def register_folder_finalizer(self, pool_subfolder, context_id):
579 # Register the garbage collector at program exit in case caller forgets
580 # to call terminate explicitly: note we do not pass any reference to
581 # ensure that this callback won't prevent garbage collection of
582 # parallel instance and related file handler resources such as POSIX
583 # semaphores and pipes
584 pool_module_name = whichmodule(delete_folder, 'delete_folder')
--> 585 resource_tracker.register(pool_subfolder, "folder")
587 def _cleanup():
588 # In some cases the Python runtime seems to set delete_folder to
589 # None just before exiting when accessing the delete_folder
(...)
594 # because joblib should only use relative imports to allow
595 # easy vendoring.
596 delete_folder = __import__(
597 pool_module_name, fromlist=['delete_folder']
598 ).delete_folder

File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\joblib\externals\loky\backend\resource_tracker.py:179, in ResourceTracker.register(self, name, rtype)
177 """Register a named resource, and increment its refcount."""
178 self.ensure_running()
--> 179 self._send("REGISTER", name, rtype)

File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\joblib\externals\loky\backend\resource_tracker.py:196, in ResourceTracker._send(self, cmd, name, rtype)
192 if len(name) > 512:
193 # posix guarantees that writes to a pipe of less than PIPE_BUF
194 # bytes are atomic, and that PIPE_BUF >= 512
195 raise ValueError("name too long")
--> 196 msg = f"{cmd}:{name}:{rtype}\n".encode("ascii")
197 nbytes = os.write(self._fd, msg)
198 assert nbytes == len(msg)

UnicodeEncodeError: 'ascii' codec can't encode characters in position 18-20: ordinal not in range(128)"
}
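
The last frame of the traceback shows loky's resource tracker encoding the name of joblib's memmapping temp folder with the ASCII codec. A folder path containing non-ASCII characters cannot be encoded that way, and the reported positions 18-20 are consistent with a message that begins with something like REGISTER:C:\Users\李书智\.... Below is a minimal sketch of that failing step; the path is hypothetical and only assumes that joblib's temporary folder lives under a non-ASCII Windows user profile.

# Hypothetical path: assumes joblib's temp folder sits under C:\Users\李书智.
cmd, rtype = "REGISTER", "folder"
name = r"C:\Users\李书智\AppData\Local\Temp\joblib_memmapping_folder_1234_0_abc"
msg = f"{cmd}:{name}:{rtype}\n".encode("ascii")  # UnicodeEncodeError at positions 18-20

Whether the temporary folder actually differs between the task A and task B environments is an assumption here, not something the traceback confirms.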

Reproduction

1. Task A

import numpy as np
from bertopic import BERTopic
from transformers.pipelines import pipeline
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer


with open(r'D:\zhuhongchang\python_study\萝卜快跑\2.数据预处理\4.分词\微博内容_切词.txt', 'r', encoding='utf-8') as file:
  docs = file.readlines()
print('条数: ', len(docs))
print('预览第一条: ', docs[0])

vectorizer_model = None

# 1. Embedding model; load locally pre-computed document embeddings
#embedding_model = pipeline("feature-extraction", model="bert-base-chinese")  # use bert-base-chinese
#embeddings = np.load(R'C:\Users\李书智\Downloads\BBC.npy')  # bert-base-chinese embeddings
#print('向量shape:', embeddings.shape)

# Alternative: use the hfl model
# embedding_model = pipeline("feature-extraction", model="hfl/chinese-bert-wwm")
# embeddings = np.load(r'C:\Users\李书智\Downloads\BBC\emb.npy')
# print('向量shape:', embeddings.shape)

# Alternative: use a SentenceTransformers model
embedding_model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')
embeddings = np.load(r'D:\zhuhongchang\python_study\萝卜快跑\3.数据处理\1.文本转向量\词向量setence-transformer.npy')
print(embeddings.shape)

# 2. Create the UMAP dimensionality-reduction model
umap_model = UMAP(
  n_neighbors=15,
  n_components=5,
  min_dist=0.0,
  metric='cosine',
  random_state=30  # ⚠️ fix the random state for reproducible results: https://maartengr.github.io/BERTopic/faq.html
)

# 3. Create the HDBSCAN clustering model
# To reduce outliers, decrease min_cluster_size and min_samples
# https://hdbscan.readthedocs.io/en/latest/faq.html
hdbscan_model = HDBSCAN(
  min_cluster_size=50,
  min_samples=50,
  metric='euclidean'
)

# 5. Create the CountVectorizer model
vectorizer_model = CountVectorizer(stop_words=['洛阳', '旅游', '文化'])

# 6. Create the BERTopic model
topic_model = BERTopic(
  embedding_model=embedding_model,
  vectorizer_model=vectorizer_model,
  umap_model=umap_model,
  hdbscan_model=hdbscan_model,
)

# Inspect the topics
topics, probs = topic_model.fit_transform(docs, embeddings=embeddings)  # pass the pre-computed embeddings
topic_model.get_topic_info()

2. Task B

import numpy as np
from bertopic import BERTopic
from transformers.pipelines import pipeline
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer

with open(r'C:\Users\李书智\Desktop\切词后.txt', 'r', encoding='utf-8') as file:
  docs = file.readlines()
print('条数: ', len(docs))
print('预览第一条: ', docs[0])

vectorizer_model = None

# 1. Embedding model; load locally pre-computed document embeddings
#embedding_model = pipeline("feature-extraction", model="bert-base-chinese")  # use bert-base-chinese
#embeddings = np.load(R'C:\Users\李书智\Downloads\BBC.npy')  # bert-base-chinese embeddings
#print('向量shape:', embeddings.shape)

# Alternative: use the hfl model
# embedding_model = pipeline("feature-extraction", model="hfl/chinese-bert-wwm")
# embeddings = np.load(r'C:\Users\李书智\Downloads\BBC\emb.npy')
# print('向量shape:', embeddings.shape)

# Alternative: use a SentenceTransformers model
embedding_model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')
embeddings = np.load(r'C:\Users\李书智\Desktop\STweiboNEIRONG.npy')
print(embeddings.shape)

# 2. Create the UMAP dimensionality-reduction model
umap_model = UMAP(
  n_neighbors=15,
  n_components=5,
  min_dist=0.0,
  metric='cosine',
  random_state=30  # ⚠️ fix the random state for reproducible results: https://maartengr.github.io/BERTopic/faq.html
)

# 3. Create the HDBSCAN clustering model
# To reduce outliers, decrease min_cluster_size and min_samples
# https://hdbscan.readthedocs.io/en/latest/faq.html
hdbscan_model = HDBSCAN(
  min_cluster_size=50,
  min_samples=50,
  metric='euclidean'
)

# 5. Create the CountVectorizer model
vectorizer_model = CountVectorizer(stop_words=['洛阳', '旅游', '文化'])

# 6. Create the BERTopic model
topic_model = BERTopic(
  embedding_model=embedding_model,
  vectorizer_model=vectorizer_model,
  umap_model=umap_model,
  hdbscan_model=hdbscan_model,
)

# Inspect the topics
topics, probs = topic_model.fit_transform(docs, embeddings=embeddings)  # pass the pre-computed embeddings
topic_model.get_topic_info()
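
Two possible workarounds suggested by the traceback (untested sketches under stated assumptions, not confirmed fixes): either keep joblib on its sequential backend so the memmapped temp folder is never created, or move that temp folder to an ASCII-only path.

# Option 1 (assumption): compute HDBSCAN core distances sequentially. With
# n_jobs == 1, joblib falls back to SequentialBackend (see the traceback) and
# never calls get_memmapping_executor.
from hdbscan import HDBSCAN
hdbscan_model = HDBSCAN(
  min_cluster_size=50,
  min_samples=50,
  metric='euclidean',
  core_dist_n_jobs=1
)

# Option 2 (assumption): point joblib's temp folder at an ASCII-only directory
# before calling fit_transform. The directory below is hypothetical.
import os
os.environ['JOBLIB_TEMP_FOLDER'] = r'D:\joblib_tmp'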


### BERTopic Version

unknown
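
If helpful, the installed version can be printed with the snippet below (BERTopic exposes a __version__ attribute):

import bertopic
print(bertopic.__version__)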