Add the ability to serialize custom Python components #581
This is a useful feature. We can probably serialize Python objects using
The end result has to be saved as JSON, so I don't think it's doable. Currently the workaround is to override the component before saving, and override it again after loading:

# Use the custom (non-serializable) pre-tokenizer while working.
tokenizer.pre_tokenizer = Custom()
# Swap in a serializable pre-tokenizer just before saving.
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.save("tok.json")

# Load later and re-attach the custom pre-tokenizer.
tokenizer = Tokenizer.from_file("tok.json")
tokenizer.pre_tokenizer = Custom()

It is a bit inconvenient, but at least it's safe and portable.
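For context, a complete version of that swap-before-save / re-attach-after-load pattern could look like the sketch below. The class name DummyPreTokenizer, the BPE model, and the file name are illustrative assumptions; only the overall pattern comes from the snippet above.

```python
from typing import List

from tokenizers import Tokenizer, NormalizedString, PreTokenizedString
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import PreTokenizer, Whitespace


class DummyPreTokenizer:
    """Stand-in for any custom Python pre-tokenizer (hypothetical name)."""

    def _chunks(self, i: int, normalized: NormalizedString) -> List[NormalizedString]:
        # Split the input into fixed-width 4-character pieces, something the
        # built-in pre-tokenizers cannot express.
        text = str(normalized)
        return [normalized[start:min(start + 4, len(text))] for start in range(0, len(text), 4)]

    def pre_tokenize(self, pretok: PreTokenizedString):
        pretok.split(self._chunks)


tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = PreTokenizer.custom(DummyPreTokenizer())

# Swap in a serializable pre-tokenizer right before saving; saving with the
# custom component still attached raises an error.
tokenizer.pre_tokenizer = Whitespace()
tokenizer.save("tok.json")

# Re-attach the custom component after loading.
tokenizer = Tokenizer.from_file("tok.json")
tokenizer.pre_tokenizer = PreTokenizer.custom(DummyPreTokenizer())
```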
You also can't load it as a PreTrainedTokenizerFast:

from transformers import PreTrainedTokenizerFast
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)

As a workaround I do:

from transformers import PreTrainedTokenizerFast
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
fast_tokenizer._tokenizer.pre_tokenizer = PreTokenizer.custom(CustomPreTokenizer())

but overriding via the private _tokenizer attribute does not feel ideal.
Totally understandable. What kind of pre-tokenizer are you saving?
Is it now possible to save a custom pre-tokenizer?
No. A custom pre-tokenizer is Python code; it's not serializable by nature.
Hi @Narsil, k-mer tokenization is used in many applications in bioinformatics. Right now I am doing the following to define my tokenizer and to save and load my model, which I now know is not ideal. I wondered if there is a way to use serializable building blocks to save/load the tokenizer like any other HF tokenizer. Thank you.
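The code this comment refers to isn't included in this excerpt, but a k-mer splitter registered through PreTokenizer.custom would typically look something like the sketch below; the class name KmerPreTokenizer and the default k=6 are assumptions for illustration. Being a custom Python component, it still has to be detached before saving and re-attached after loading, as in the workaround above.

```python
from typing import List

from tokenizers import NormalizedString, PreTokenizedString
from tokenizers.pre_tokenizers import PreTokenizer


class KmerPreTokenizer:
    """Hypothetical example: split a sequence into overlapping k-mers."""

    def __init__(self, k: int = 6):
        self.k = k

    def _kmers(self, i: int, normalized: NormalizedString) -> List[NormalizedString]:
        seq = str(normalized)
        if len(seq) <= self.k:
            return [normalized]
        # One k-mer per starting position, e.g. "ACGTA" -> "ACG", "CGT", "GTA" for k=3.
        return [normalized[start:start + self.k] for start in range(len(seq) - self.k + 1)]

    def pre_tokenize(self, pretok: PreTokenizedString):
        pretok.split(self._kmers)


# Usage (assuming an existing Tokenizer instance):
# tokenizer.pre_tokenizer = PreTokenizer.custom(KmerPreTokenizer(k=6))
```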
It is currently impossible to serialize custom Python components, so if a Tokenizer embeds some of them, the user can't save it. I didn't really dig into this, so I don't know exactly what the constraints/requirements would be, but this is something we should explore at some point.