
How can I keep the initial input vocab and incrementally add new tokens when re-training a tokenizer? #1109

Closed
henryxiao1997 opened this issue Nov 17, 2022 · 2 comments

henryxiao1997 commented Nov 17, 2022

There are several steps.

First, I train a tokenizer from a corpus and save it, like this:
```python
from tokenizers import BertWordPieceTokenizer

def Trainer():
    # Train a WordPiece tokenizer from scratch on wikitext-2
    tokenizer = BertWordPieceTokenizer()
    files = [f"Corpus/wikitext-2-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
    tokenizer.train(files, vocab_size=1000, min_frequency=1,
                    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
    tokenizer.save("Models/berttokenizer-wiki.json")
```

Then I might build a BERT model based on that tokenizer.

Secondly, I try to load it and continue training on another corpus (maybe a different language), like this:
```python
from tokenizers import Tokenizer
from tokenizers.trainers import WordPieceTrainer

def ContinueTrainer():
    # Load the saved tokenizer and continue training it on a second corpus
    tokenizer = Tokenizer.from_file("Models/berttokenizer-wiki.json")
    trainer = WordPieceTrainer(vocab_size=2000, min_frequency=1,
                               special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
    files = [f"Corpus/MSRIMECorpus/news{num}.read.3000.txt" for num in range(1, 3)]
    tokenizer.train(files, trainer)
    tokenizer.save("Models/berttokenizer-wiki-msrimecorpus.json")
```

What I expect is that the new tokenizer keeps the vocab learned in the first phase, then incrementally adds the new tokens learned from the new corpus.

For example, I want to re-use the BERT model built on the first tokenizer and just extend its vocab to a new language.

But in fact the new tokenizer changes the vocab learned in the first phase. My question is: is there any way to keep the vocab learned in the first training phase?
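
A minimal check of this drift (not part of the original issue, and assuming the two JSON files saved above exist) could look like this:

```python
from tokenizers import Tokenizer

# Compare the vocabs of the phase-1 and phase-2 tokenizers saved above
old = Tokenizer.from_file("Models/berttokenizer-wiki.json").get_vocab()
new = Tokenizer.from_file("Models/berttokenizer-wiki-msrimecorpus.json").get_vocab()

dropped = [t for t in old if t not in new]                  # phase-1 tokens that disappeared
moved = [t for t in old if t in new and old[t] != new[t]]   # tokens whose ids changed
print(f"{len(dropped)} tokens dropped, {len(moved)} tokens re-indexed")
```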

BTW: I went through the Python code and wanted to modify it to meet my need, but I found that the tokenizer's trainer is implemented in Rust for performance, and I can't read or modify the Rust code.

Thanks in advance!

Narsil (Collaborator) commented Nov 17, 2022

> But in fact the new tokenizer changes the vocab learned in the first phase. My question is: is there any way to keep the vocab learned in the first training phase?

Not really. There are ideas you could try, but all of them would be implementation specific.

The reason it's not really possible is that, by definition, tokenizer training is really about compressing the original dataset. So compressing two datasets is just "weird": either you do a suboptimal job at it, or you lose some information.

There is a bit more explanation of potential workarounds in #690.
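
One implementation-specific idea in that spirit (a sketch, not a method endorsed in this thread): freeze the first tokenizer and append the new corpus's tokens via `add_tokens`, which leaves the existing ids untouched. The appended entries are matched as whole added tokens rather than through the WordPiece model, so segmentation quality for the new language will differ from a real retrain; the `berttokenizer-msrime-only.json` file is hypothetical (a tokenizer trained from scratch on just the new corpus).

```python
from tokenizers import Tokenizer

# Frozen phase-1 tokenizer: its vocab and ids stay untouched
old_tok = Tokenizer.from_file("Models/berttokenizer-wiki.json")

# Hypothetical: a tokenizer trained from scratch on the new corpus only,
# used purely as a source of candidate tokens for the new language
new_tok = Tokenizer.from_file("Models/berttokenizer-msrime-only.json")

# Append tokens the old vocab lacks; existing ids are preserved and the
# new tokens get ids after them
old_vocab = old_tok.get_vocab()
candidates = sorted(t for t in new_tok.get_vocab() if t not in old_vocab)
num_added = old_tok.add_tokens(candidates)
print(f"appended {num_added} tokens without disturbing the original vocab")

old_tok.save("Models/berttokenizer-wiki-extended.json")
```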

IMO the easiest would be to fuse the two datasets into one, in whatever relative abundance gets you what you want, and train a single tokenizer on it. But there are other options for sure.
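
A sketch of that fused-dataset approach, reusing the file lists from the question (assuming both corpora exist on disk as named above). Note this trains once from scratch, so it avoids the two-phase problem rather than preserving the phase-1 ids:

```python
from tokenizers import BertWordPieceTokenizer

wiki_files = [f"Corpus/wikitext-2-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
msr_files = [f"Corpus/MSRIMECorpus/news{num}.read.3000.txt" for num in range(1, 3)]

# Duplicate entries in either list to tune the relative abundance of
# each corpus (a crude but effective oversampling knob)
tokenizer = BertWordPieceTokenizer()
tokenizer.train(wiki_files + msr_files, vocab_size=2000, min_frequency=1,
                special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.save("Models/berttokenizer-wiki-msrimecorpus-fused.json")
```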


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label on Jan 23, 2024
github-actions bot closed this as not planned on Jan 29, 2024