Hello~ I'm trying to train a BPE tokenizer with a customized pre_tokenizer.
The customized pre_tokenizer uses a third-party package, like what is shown in tokenizers/bindings/python/examples/custom_components.py (line 12 at dcb3bba).
After training the tokenizer, I tried to save it, but an exception appeared:

Exception: Custom PreTokenizer cannot be serialized

I can see that a customized pre_tokenizer cannot be saved with the main tokenizer model, so I should save the main model separately. When loading the tokenizer, I should manually add the pre_tokenizer back. Am I right?
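For reference, here is a minimal sketch of a setup that hits this exception. It follows the custom-component pattern from custom_components.py; `my_segmenter` is a hypothetical stand-in for the third-party package:

```python
import my_segmenter  # hypothetical stand-in for the third-party package

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import PreTokenizer
from tokenizers.trainers import BpeTrainer


class CustomPreTokenizer:
    def segmenter_split(self, i, normalized_string):
        # Delegate the actual segmentation to the third-party package and
        # return slices of the NormalizedString, as in custom_components.py.
        return [
            normalized_string[start:stop]
            for _, start, stop in my_segmenter.tokenize(str(normalized_string))
        ]

    def pre_tokenize(self, pretok):
        pretok.split(self.segmenter_split)


tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = PreTokenizer.custom(CustomPreTokenizer())
tokenizer.train(["corpus.txt"], trainer=BpeTrainer())

tokenizer.save("tokenizer.json")  # raises: Custom PreTokenizer cannot be serialized
```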
Yes, you are right. This is something we'd like to support in the future, though.
In the meantime you can either:

- save the model and then load everything back manually (see the first sketch below). If you don't have a complicated tokenizer with many special tokens and components, this approach might suit you well.
- use a "placeholder" PreTokenizer before saving your tokenizer, then replace it with your custom one after loading it back (see the second sketch below).