pyo3_runtime.PanicException: AddedVocabulary bad split #1

kalvinchang · 2023-06-09T05:59:38Z

The following code fails with "pyo3_runtime.PanicException: AddedVocabulary bad split":

from transformers import pipeline
classifier = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-ws-xiandai")

def word_segment(sentence):
    segmented = classifier(sentence)
    sentence = []
    for word in segmented:
        sentence.append(word['word'])
    return sentence

print(word_segment("我想去吃飯"))

This happens with both transformers 4.22.1 and 4.30.0.

thread '<unnamed>' panicked at 'AddedVocabulary bad split', tokenizers-lib/src/tokenizer/added_vocabulary.rs:360:22
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
Traceback (most recent call last):
  File "/Users/kalvin/Research/speech/asru-23/utils/phrase_translate.py", line 104, in <module>
  File "/Users/kalvin/Research/speech/asru-23/utils/phrase_translate.py", line 98, in word_segment
    for word in segmented:
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/pipelines/token_classification.py", line 192, in __call__
    return super().__call__(inputs, **kwargs)
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/pipelines/base.py", line 1074, in __call__
    return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/pipelines/base.py", line 1080, in run_single
    model_inputs = self.preprocess(inputs, **preprocess_params)
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/pipelines/token_classification.py", line 196, in preprocess
    model_inputs = self.tokenizer(
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2484, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2590, in _call_one
    return self.encode_plus(
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2663, in encode_plus
    return self._encode_plus(
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 500, in _encode_plus
    batched_output = self._batch_encode_plus(
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 427, in _batch_encode_plus
    encodings = self._tokenizer.encode_batch(
pyo3_runtime.PanicException: AddedVocabulary bad split
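
The traceback ends in the fast tokenizer's encode_batch, so the model weights are probably not involved. Below is a minimal sketch for narrowing this down, assuming the panic can be reproduced by calling the tokenizer directly. The helper names load_tokenizer, try_encode and compare_fast_and_slow are made up for illustration, and comparing against the slow tokenizer (use_fast=False) is only a diagnostic idea, not a confirmed fix.

```python
def load_tokenizer(model_name, use_fast=True):
    """Load only the tokenizer; no model weights are needed for this check."""
    # Imported lazily so try_encode can be used without transformers installed.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(model_name, use_fast=use_fast)


def try_encode(tokenizer, text):
    """Call the tokenizer and report whether it raised.

    Returns (ok, detail): detail is the list of input ids on success,
    or "<ExceptionType>: <message>" on failure.
    """
    try:
        return True, tokenizer(text)["input_ids"]
    except BaseException as exc:
        # pyo3's PanicException derives from BaseException, not Exception,
        # so a plain `except Exception` would not catch it.
        if isinstance(exc, (KeyboardInterrupt, SystemExit)):
            raise
        return False, f"{type(exc).__name__}: {exc}"


def compare_fast_and_slow(model_name, text):
    """Encode the same text with the fast (Rust) and slow (Python) tokenizer."""
    results = {}
    for use_fast in (True, False):
        tokenizer = load_tokenizer(model_name, use_fast=use_fast)
        results[use_fast] = try_encode(tokenizer, text)
    return results
```

Calling compare_fast_and_slow("ckiplab/bert-base-han-chinese-ws-xiandai", "我想去吃飯") would show whether only the fast tokenizer fails. If the slow one encodes the sentence, it could be passed to pipeline(...) through its tokenizer argument as a possible workaround; whether that avoids the panic for this model is untested here.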


weihanchen · 2023-07-03T03:53:10Z

Is there any solution?

kalvinchang · 2023-07-05T03:18:30Z

Not that I know of :/

Jiahao004 · 2023-11-08T12:33:26Z

Hi, I came across the same issue, after I added new vocabulary to the "bert-base-multilingual-cased" tokenizer. May I know your solution?

shivanraptor · 2024-07-29T02:55:07Z

> Hi, I came across the same issue, after I added new vocabulary to the "bert-base-multilingual-cased" tokenizer. May I know your solution?

I had the same issue as well, but my base model is "bart-base-chinese".
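
The panic message points at tokenizers-lib/src/tokenizer/added_vocabulary.rs, i.e. the Rust tokenizers package rather than transformers itself, so it may help anyone reporting the same error to include the tokenizers version too. A small sketch using only the standard library; the function name installed_versions is made up for illustration:

```python
from importlib import metadata


def installed_versions(packages=("transformers", "tokenizers")):
    """Return {package name: installed version, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions
```

Printing installed_versions() alongside the base model name and the tokens that were added should make reports in this thread easier to compare.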
