There are several steps.
Firstly, I train a tokenizer from a corpus and save it, like this:
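(Roughly along these lines; the exact setup below is only a sketch, and the `WordPiece` model, the `Whitespace` pre-tokenizer, and the `Corpus/wiki.txt` path are placeholders rather than my real configuration:)

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

def Trainer():
    # Build a fresh WordPiece tokenizer and train it on the first (wiki) corpus
    tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = WordPieceTrainer(vocab_size=2000, min_frequency=1,
                               special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
    tokenizer.train(["Corpus/wiki.txt"], trainer)  # placeholder corpus path
    tokenizer.save("Models/berttokenizer-wiki.json")
```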
Then maybe I build a BERT model based on that tokenizer.
Secondly, I try to load it and continue training on another corpus (maybe a different language), like this:
```python
from tokenizers import Tokenizer
from tokenizers.trainers import WordPieceTrainer

def ContinueTrainer():
    # Load the tokenizer trained in the first phase
    tokenizer = Tokenizer.from_file("Models/berttokenizer-wiki.json")
    trainer = WordPieceTrainer(vocab_size=2000, min_frequency=1,
                               special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
    files = [f"Corpus/MSRIMECorpus/news{num}.read.3000.txt" for num in range(1, 3)]
    # Training again replaces the learned vocab rather than extending it
    tokenizer.train(files, trainer)
    tokenizer.save("Models/berttokenizer-wiki-msrimecorpus.json")
```
What I expect is that the new tokenizer keeps the vocab learned in the first phase and then incrementally adds the new tokens learned from the new corpus.
For example, I want to re-use the BERT model built on the first tokenizer and just extend its vocab to a new language.
But in fact, the new tokenizer changes the vocab learned in the first phase. My question is: is there any way to keep the vocab learned in the first training phase?
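For instance, comparing the two saved tokenizers shows how much of the first-phase vocab survives retraining (a sketch that just reuses the file names from the snippets above):

```python
from tokenizers import Tokenizer

old_tok = Tokenizer.from_file("Models/berttokenizer-wiki.json")
new_tok = Tokenizer.from_file("Models/berttokenizer-wiki-msrimecorpus.json")

old_vocab = set(old_tok.get_vocab().keys())
new_vocab = set(new_tok.get_vocab().keys())

# Tokens from the first phase that survived retraining vs. those that were dropped
kept = old_vocab & new_vocab
dropped = old_vocab - new_vocab
print(f"kept {len(kept)} of {len(old_vocab)} first-phase tokens, dropped {len(dropped)}")
```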
BTW: I went through the Python code and wanted to modify it to meet my need, but I found that the tokenizer's trainer is implemented in Rust for performance, and I can't read and modify the Rust code.
Thanks in advance!
> But in fact, the new tokenizer changes the vocab learned in the first phase. My question is: is there any way to keep the vocab learned in the first training phase?
Not really. There are ideas you could try, but all would be implementation-specific.
The reason it's not really possible is that, by definition, tokenizer training is really about compressing the original dataset. So compressing two datasets at once is just "weird": either you do a suboptimal job at it, or you lose some information.
A bit more explanation of potential workarounds: #690
IMO the easiest would be to fuse the two datasets into one, in the relative abundance needed to obtain what you want. But there are other options for sure.
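A rough sketch of that fused-dataset idea (the wiki path, the vocab size, and the duplication factor below are just placeholders; listing one corpus several times is a crude way to control its relative abundance):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

def TrainOnFusedCorpora():
    # Train one tokenizer on both corpora at once instead of training twice
    tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = WordPieceTrainer(vocab_size=4000, min_frequency=1,
                               special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
    wiki_files = ["Corpus/wiki.txt"]  # placeholder path for the first corpus
    msr_files = [f"Corpus/MSRIMECorpus/news{num}.read.3000.txt" for num in range(1, 3)]
    # Repeating a file list weights its relative abundance in the fused dataset
    tokenizer.train(wiki_files * 2 + msr_files, trainer)
    tokenizer.save("Models/berttokenizer-wiki-msrimecorpus.json")
```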