pyo3_runtime.PanicException: AddedVocabulary bad split #1

Open
kalvinchang opened this issue Jun 9, 2023 · 4 comments

@kalvinchang

kalvinchang commented Jun 9, 2023

The following code triggers pyo3_runtime.PanicException: AddedVocabulary bad split:

from transformers import pipeline
classifier = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-ws-xiandai")

def word_segment(sentence):
    segmented = classifier(sentence)
    sentence = []
    for word in segmented:
        sentence.append(word['word'])
    return sentence

print(word_segment("我想去吃飯"))

The panic occurs with both transformers 4.22.1 and 4.30.0. Output:

thread '<unnamed>' panicked at 'AddedVocabulary bad split', tokenizers-lib/src/tokenizer/added_vocabulary.rs:360:22
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
Traceback (most recent call last):
  File "/Users/kalvin/Research/speech/asru-23/utils/phrase_translate.py", line 104, in <module>
  File "/Users/kalvin/Research/speech/asru-23/utils/phrase_translate.py", line 98, in word_segment
    for word in segmented:
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/pipelines/token_classification.py", line 192, in __call__
    return super().__call__(inputs, **kwargs)
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/pipelines/base.py", line 1074, in __call__
    return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/pipelines/base.py", line 1080, in run_single
    model_inputs = self.preprocess(inputs, **preprocess_params)
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/pipelines/token_classification.py", line 196, in preprocess
    model_inputs = self.tokenizer(
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2484, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2590, in _call_one
    return self.encode_plus(
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2663, in encode_plus
    return self._encode_plus(
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 500, in _encode_plus
    batched_output = self._batch_encode_plus(
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 427, in _batch_encode_plus
    encodings = self._tokenizer.encode_batch(
pyo3_runtime.PanicException: AddedVocabulary bad split
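
The traceback ends inside the tokenizer call, before the model runs, so the pipeline may not be needed to reproduce this. Below is a minimal sketch that follows the note in the panic message by setting RUST_BACKTRACE=1 and then calls the tokenizer directly. It is untested against this model: the helper name `reproduce` is made up for illustration, and whether the tokenizer alone triggers the panic is an assumption based on the traceback.

```python
import os

# Set the variable before transformers/tokenizers is imported, so the Rust
# side can see it when the panic happens (assumption: setting it this early
# is sufficient; exporting it in the shell is the safer alternative).
os.environ["RUST_BACKTRACE"] = "1"


def reproduce(sentence="我想去吃飯"):
    """Hypothetical helper: call the tokenizer without the pipeline."""
    # Imported lazily so the environment variable above is already set.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        "ckiplab/bert-base-han-chinese-ws-xiandai"
    )
    # The traceback shows the panic is raised from this kind of call.
    return tokenizer(sentence)
```

Calling `reproduce()` should then either return the encoding or print a fuller Rust backtrace alongside the PanicException.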

@weihanchen

Is there any solution?

@kalvinchang
Author

Not that I know of :/

@Jiahao004

Hi, I ran into the same issue after adding new vocabulary to the "bert-base-multilingual-cased" tokenizer. Have you found a solution?

@shivanraptor

> Hi, I came across the same issue, after I added new vocabulary to the "bert-base-multilingual-cased" tokenizer. May I know your solution?

I had the same issue as well, but my base model is "bart-base-chinese".
