
How can I keep the initial input vocab and incrementally add new tokens when re-training a tokenizer? #1109

Closed
henryxiao1997 opened this issue Nov 17, 2022 · 2 comments

henryxiao1997 commented Nov 17, 2022

There are several steps.

First, I train a tokenizer from a corpus and save it, like this:
```python
from tokenizers import BertWordPieceTokenizer

def Trainer():
    # Train a WordPiece tokenizer from scratch on wikitext-2
    tokenizer = BertWordPieceTokenizer()
    files = [f"Corpus/wikitext-2-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
    tokenizer.train(files, vocab_size=1000, min_frequency=1,
                    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
    tokenizer.save("Models/berttokenizer-wiki.json")
```

Then I might build a BERT model based on that tokenizer.

Secondly, I try to load it and continue training on another corpus (maybe a different language), like this:
```python
from tokenizers import Tokenizer
from tokenizers.trainers import WordPieceTrainer

def ContinueTrainer():
    # Load the saved tokenizer and continue training it on a second corpus
    tokenizer = Tokenizer.from_file("Models/berttokenizer-wiki.json")
    trainer = WordPieceTrainer(vocab_size=2000, min_frequency=1,
                               special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
    files = [f"Corpus/MSRIMECorpus/news{num}.read.3000.txt" for num in range(1, 3)]
    tokenizer.train(files, trainer)
    tokenizer.save("Models/berttokenizer-wiki-msrimecorpus.json")
```

What I expect is that the new tokenizer keeps the vocab learned in the first phase, then incrementally adds the new tokens learned from the new corpus.

For example, I want to re-use the BERT model built on the first tokenizer and just extend its vocab to a new language.

But in fact the new tokenizer changes the vocab learned in the first phase. My question is: is there any way to keep the vocab learned in the first training phase?
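
A minimal check of this drift (not part of the original issue, and assuming the two JSON files saved above exist) could look like this:

```python
from tokenizers import Tokenizer

# Compare the vocabs of the phase-1 and phase-2 tokenizers saved above
old = Tokenizer.from_file("Models/berttokenizer-wiki.json").get_vocab()
new = Tokenizer.from_file("Models/berttokenizer-wiki-msrimecorpus.json").get_vocab()

dropped = [t for t in old if t not in new]                  # phase-1 tokens that disappeared
moved = [t for t in old if t in new and old[t] != new[t]]   # tokens whose ids changed
print(f"{len(dropped)} tokens dropped, {len(moved)} tokens re-indexed")
```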

BTW: I went through the Python code and wanted to modify it to meet my need, but I found that the tokenizer's trainer is implemented in Rust for performance, and I can't read or modify the Rust code.

Thanks in advance!

Narsil (Collaborator) commented Nov 17, 2022

> But in fact the new tokenizer changes the vocab learned in the first phase. My question is: is there any way to keep the vocab learned in the first training phase?

Not really. There are ideas you could try, but all of them would be implementation specific.

The reason it's not really possible is that, by definition, tokenizer training is really about compressing the original dataset. So compressing two datasets is just "weird": either you do a suboptimal job at it, or you lose some information.

There is a bit more explanation of potential workarounds in #690.
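
One implementation-specific idea in that spirit (a sketch, not a method endorsed in this thread): freeze the first tokenizer and append the new corpus's tokens via `add_tokens`, which leaves the existing ids untouched. The appended entries are matched as whole added tokens rather than through the WordPiece model, so segmentation quality for the new language will differ from a real retrain; the `berttokenizer-msrime-only.json` file is hypothetical (a tokenizer trained from scratch on just the new corpus).

```python
from tokenizers import Tokenizer

# Frozen phase-1 tokenizer: its vocab and ids stay untouched
old_tok = Tokenizer.from_file("Models/berttokenizer-wiki.json")

# Hypothetical: a tokenizer trained from scratch on the new corpus only,
# used purely as a source of candidate tokens for the new language
new_tok = Tokenizer.from_file("Models/berttokenizer-msrime-only.json")

# Append tokens the old vocab lacks; existing ids are preserved and the
# new tokens get ids after them
old_vocab = old_tok.get_vocab()
candidates = sorted(t for t in new_tok.get_vocab() if t not in old_vocab)
num_added = old_tok.add_tokens(candidates)
print(f"appended {num_added} tokens without disturbing the original vocab")

old_tok.save("Models/berttokenizer-wiki-extended.json")
```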

IMO the easiest would be to fuse the two datasets into one, in whatever relative abundance gets you what you want, and train a single tokenizer on it. But there are other options for sure.
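
A sketch of that fused-dataset approach, reusing the file lists from the question (assuming both corpora exist on disk as named above). Note this trains once from scratch, so it avoids the two-phase problem rather than preserving the phase-1 ids:

```python
from tokenizers import BertWordPieceTokenizer

wiki_files = [f"Corpus/wikitext-2-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
msr_files = [f"Corpus/MSRIMECorpus/news{num}.read.3000.txt" for num in range(1, 3)]

# Duplicate entries in either list to tune the relative abundance of
# each corpus (a crude but effective oversampling knob)
tokenizer = BertWordPieceTokenizer()
tokenizer.train(wiki_files + msr_files, vocab_size=2000, min_frequency=1,
                special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.save("Models/berttokenizer-wiki-msrimecorpus-fused.json")
```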


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label on Jan 23, 2024
github-actions bot closed this as not planned on Jan 29, 2024