Adding words to vocabulary and training for a new domain #1160
No there is not. Anything involving modifying the vocabulary is very hands-on and error-prone. If you're going to create a new vocabulary, training from scratch should always at least be considered IMO, because fine-tuning with partly new tokens causes a class imbalance among the tokens: the old tokens are already well learned and omnipresent, while the new tokens need a lot of gradient updates but usually appear only rarely.
Everything from this lib is in
Check for yourself (it should).
Everything you're intending to do is very much hands-on as I said, so pick whatever works best for you. Using a new fresh version will work.
As explained above, this will have some downsides. This is not something that is widely practiced (to the extent of my knowledge), so you're basically on your own figuring out how to do this correctly. I shared my understanding of the problem above, but I don't have an "all-included" solution for you. Please do consider training from scratch, as it won't have the issues mentioned before. But if you really want to fine-tune, then yes, you most likely need some kind of specific behavior for the gradients to let them flow to your new vocab while freezing the old one. It cannot be done (afaik) by default in PyTorch, so you probably have to manually split both parts to get this working. Now, with all the above being said, I find that approach really nice. If you figure out something that works (or doesn't), I think it would be super nice to share it back here (or somewhere else)!
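To make the "manually split both parts" idea concrete, here is a minimal PyTorch sketch (an illustration, not code from this thread): it keeps the pretrained rows in a frozen embedding table and routes the newly added token ids to a second, trainable table. The `SplitEmbedding` name is hypothetical.

```python
import torch
import torch.nn as nn

class SplitEmbedding(nn.Module):
    """Hypothetical helper (not from transformers/tokenizers): frozen
    pretrained rows plus a trainable table for newly added tokens."""

    def __init__(self, pretrained_weight: torch.Tensor, num_new_tokens: int):
        super().__init__()
        old_vocab, dim = pretrained_weight.shape
        self.old_vocab = old_vocab
        # Pretrained rows: copied from the checkpoint and frozen.
        self.old = nn.Embedding(old_vocab, dim)
        self.old.weight.data.copy_(pretrained_weight)
        self.old.weight.requires_grad = False
        # New rows: randomly initialised and trainable.
        self.new = nn.Embedding(num_new_tokens, dim)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        is_new = input_ids >= self.old_vocab
        # Clamp ids so both lookups are always in range, then pick per position.
        old_emb = self.old(input_ids.clamp(max=self.old_vocab - 1))
        new_emb = self.new((input_ids - self.old_vocab).clamp(min=0))
        return torch.where(is_new.unsqueeze(-1), new_emb, old_emb)
```

Only `self.new.weight` receives gradients, so the optimizer can be handed just that tensor (or given a higher learning rate for it). If the model ties its input embeddings to the output projection, the same split would have to be mirrored on the output side.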
Narsil: Thank you so much for your thorough answer! Elsewhere, @cronoik suggests a way to perhaps change the learning rate for new tokens in 'pre-fine-tuning' (huggingface/transformers#2691 (comment)). There is a paper showing that domain-adapting the vocabulary for Roberta can lead to some substantial task improvements; it's described here: https://medium.com/jasonwu0731/pre-finetuning-domain-adaptive-pre-training-of-language-models-db8fa9747668 . However, that paper describes an adaptation process involving anywhere from 10-50GB of domain data. I have a bit under 1GB and, to make it worse, the data is multilingual: about 70% is English and Spanish, and about 95% is covered by five languages. So even if I had a good way of changing the vocabulary and training the new embeddings, I may well not have enough data to get anywhere. Then, given what you are saying about the pitfalls of training the new embeddings, I'm increasingly inclined not to try this unless I have a good method of pre-finetuning that would target those new embeddings. Will think on it. Again, thanks for your thoughts!
Compliments of ChatGPT, below is code that allows the weights of new tokens to take a different learning rate (or be unfrozen). I'm still inclined to think that 1GB of multilingual data will not be enough to try this.
Nice, thanks for sharing.
Have you tried it? Classic ChatGPT imo, it looks sane, but it simply doesn't work. Not even close.
Thanks Narsil! Good point about not taking ChatGPT code at face value; I've seen it make mistakes. I haven't tried the code it suggested. On the other hand, the learning rate can't be a fully global optimizer state: I've seen professional code on Kaggle that shows how to apply varying learning rates by layer. But I don't see a viable way of adjusting learning rates by parameter mentioned anywhere except in a comment here: https://stackoverflow.com/questions/59013887/parameter-specific-learning-rate-in-pytorch. Apparently, however, that requires programming your own optimizer. Good to know how difficult this will be, if not impossible.
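For reference, there is a middle ground that avoids writing a custom optimizer: a PyTorch gradient hook can scale the gradient row by row, which gives a per-token effect even though the learning rate itself remains per parameter group. A rough sketch, assuming the new tokens are appended at the end of the embedding matrix (which is what `resize_token_embeddings` does) and using placeholder token names:

```python
import torch
from transformers import RobertaForMaskedLM, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# Placeholder token list; the new rows are appended at the end of the matrix.
num_added = tokenizer.add_tokens(["newdomainword1", "newdomainword2"])
model.resize_token_embeddings(len(tokenizer))

emb_weight = model.get_input_embeddings().weight
old_rows = emb_weight.shape[0] - num_added
scale = 0.0  # 0.0 freezes the pretrained rows; e.g. 0.1 only slows them down

def scale_old_rows(grad):
    grad = grad.clone()
    grad[:old_rows] *= scale
    return grad

# The hook fires every backward pass, and the tensor it returns replaces the
# gradient, so the pretrained rows get damped updates while the new rows
# train normally.
emb_weight.register_hook(scale_old_rows)
```

Setting `scale = 0.0` effectively freezes the pretrained rows; a small positive value just slows them down relative to the new ones.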
This issue is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.
Hi, I'm a noob and am wondering about a few things having to do with adding words to a tokenizer (Roberta's) when I try to apply a pretrained model to a new domain with more specialized vocabulary. I have about one GB of text in the new domain.
Is there an easy/quick way for me to identify which words in my new domain text are missing from Roberta's vocabulary? I guess I could get the Roberta vocabulary, find a way to strip out the special characters (e.g. the space character), run through the domain text to get its vocabulary, and then compare the two. Just wondering if there's a simple solution I don't know about.
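One way to sidestep the special-character bookkeeping is to let the tokenizer itself reveal the gaps: a word the model already covers comes back as a single piece, while unknown domain terms get fragmented into several subword tokens. A rough sketch along those lines (the corpus path is just a placeholder):

```python
from collections import Counter
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# Hypothetical path to the domain corpus.
words = Counter()
with open("domain_corpus.txt", encoding="utf-8") as f:
    for line in f:
        words.update(line.split())

# Words that get split into more than one subword piece are candidates for
# new vocabulary entries (the leading space makes Roberta's BPE treat them
# as whole words rather than word continuations).
fragmented = {w: n for w, n in words.items()
              if len(tokenizer.tokenize(" " + w)) > 1}
print(sorted(fragmented, key=fragmented.get, reverse=True)[:50])
```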
When I use add_tokens to add the words I select, does this make a permanent change to the vocabulary file, or does it create a second 'new words' file? Do I have to explicitly save the new words or the vocab file? Likewise, when I extend the embeddings in the model for the new words, is that saved to disk? If I want to revert to the original Roberta, is the best option to download a fresh copy?
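As far as I understand the `transformers` API (worth double-checking against the docs), `add_tokens` and `resize_token_embeddings` only change the objects in memory; the downloaded checkpoint is untouched until you call `save_pretrained`, and reloading `roberta-base` gives you the original back. Roughly:

```python
from transformers import RobertaForMaskedLM, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# In-memory only: the cached roberta-base files are not modified.
tokenizer.add_tokens(["newdomainword1", "newdomainword2"])  # placeholder tokens
model.resize_token_embeddings(len(tokenizer))

# Persist the extended tokenizer and model to a separate directory.
tokenizer.save_pretrained("./roberta-domain")
model.save_pretrained("./roberta-domain")

# Reverting is just reloading the original checkpoint.
original_tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
original_model = RobertaForMaskedLM.from_pretrained("roberta-base")
```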
Finally, about training the embedding vectors for the new words: I'll want to train Roberta as a language model on the new domain text anyway, so that should help train the embedding vectors. However, this is much like fine-tuning a pretrained model, and the advice on that, as I understand it, is to freeze layers or give them lower learning rates the deeper they sit in the model (relative to the model output). That would mean freezing, or giving an especially low learning rate to, the embedding layer. Is there any good way to let the model train just the embeddings for the new words with a high learning rate (or unfrozen)?
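For completeness, the usual way to get layer-wise learning rates in PyTorch is optimizer parameter groups; they work per group of parameters, not per embedding row, which is why the new-token case needs the gradient tricks discussed above. A sketch under the assumption of the standard `roberta.encoder.layer` module layout in `transformers`:

```python
import torch
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM.from_pretrained("roberta-base")

base_lr = 1e-4
seen = set()

def only_new(params):
    """Skip parameters already placed in an earlier group (e.g. the LM head
    weight that is tied to the input embeddings)."""
    fresh = [p for p in params if id(p) not in seen]
    seen.update(id(p) for p in fresh)
    return fresh

groups = []
# Embeddings (furthest from the output) get the smallest learning rate.
groups.append({"params": only_new(model.roberta.embeddings.parameters()),
               "lr": base_lr * 0.01})
# Encoder layers: the closer to the output, the larger the rate.
layers = model.roberta.encoder.layer
for i, layer in enumerate(layers):
    groups.append({"params": only_new(layer.parameters()),
                   "lr": base_lr * 0.9 ** (len(layers) - 1 - i)})
# LM head at the full rate (its tied weight was already grouped above).
groups.append({"params": only_new(model.lm_head.parameters()), "lr": base_lr})

optimizer = torch.optim.AdamW(groups)
```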