
Adding words to vocabulary and training for a new domain #1160

Closed · PeterM18 opened this issue Feb 9, 2023 · 6 comments

PeterM18 commented Feb 9, 2023

Hi, I'm a noob and am wondering about a few things having to do with adding words to a tokenizer (Roberta's) when I try to apply a pretrained model to a new domain with more specialized vocabulary. I have about one GB of text in the new domain.

Is there any easy / quick way for me to identify what words in my new domain text are missing from Roberta's vocabulary? I guess I could get the Roberta vocabulary, find ways to get rid of special characters (e.g. space character), run through the domain text and get its vocabulary, then compare the two. Just wondering if there's a simple solution I don't know about.
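
Roughly what I have in mind, in case it clarifies the question (a crude sketch; "domain.txt" stands in for my corpus, and the lowercasing and regex are simplifications):

from collections import Counter
import re

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# Roberta's byte-level BPE marks a leading space with "Ġ"; strip the marker so
# vocab entries look like plain words.
vocab_words = {tok.lstrip("Ġ").lower() for tok in tokenizer.get_vocab()}

# "domain.txt" is a placeholder for the domain corpus.
with open("domain.txt", encoding="utf-8") as f:
    domain_counts = Counter(re.findall(r"[a-z]+", f.read().lower()))

# Domain words that never appear as a single entry in Roberta's vocab,
# most frequent first.
missing = [(w, n) for w, n in domain_counts.most_common() if w not in vocab_words]
print(missing[:50])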

When I use add_tokens to add the words I select, does this make a permanent change to the vocabulary file or create a second 'new words' file? Do I have to explicitly save the new words or vocab file? Likewise, when I extend the embeddings in the model for the new words, is that saved to disk? If I want to revert to the original Roberta, is the best option to download a fresh copy of Roberta?

Finally, about training the embedding vectors for the new words: I'll want to train Roberta as a language model on the new-domain text anyway, so that should help train the embedding vectors. However, this is much like fine-tuning a pre-trained model, and the advice there, as I understand it, is to freeze layers or give them lower learning rates the deeper they sit in the model (relative to the output). That would mean freezing, or giving an especially low learning rate to, the embedding layer. Is there any good way to train just the embeddings for the new words with a high learning rate (or leave only them unfrozen)?

Narsil (Collaborator) commented Feb 9, 2023

Just wondering if there's a simple solution I don't know about.

No, there is not. Anything involving modifying the vocabulary is very hands-on and error-prone.
Usually it's not necessary, though, because BPE will cover any possible text, and fine-tuning can reuse the same vocab.
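
To illustrate what I mean by BPE covering every text (the word here is only an example):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# An unseen domain word is not an error: it is simply split into known subword
# pieces; there is no <unk>.
print(tokenizer.tokenize("immunohistochemistry"))
# something like ['imm', 'uno', 'hist', 'ochemistry'] (the exact split may differ)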

If you're going to create a new vocabulary, training from scratch should at least be considered, IMO, because fine-tuning with partly new tokens causes a class imbalance among the tokens (the old tokens are well learned and omnipresent, while the new tokens need a lot of gradient updates and are usually rare).

When I use add_tokens to add the words I select, does this make a permanent change to the vocabulary file or create a second 'new words' file?

Everything from this lib lives in the tokenizer.json file. You can see that the AddedTokens are the ones which get added there. They are handled very differently from the rest of the tokenization (they are extracted before the regular BPE model has a chance to see the text).
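
A minimal sketch of that flow (the example tokens and the output path are placeholders):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

num_added_toks = tokenizer.add_tokens(["immunohistochemistry", "cholangiocarcinoma"])
print(num_added_toks)  # how many of these were actually new

# Nothing on disk changes until you save explicitly; the original roberta-base
# files are untouched. The AddedTokens end up in the saved tokenizer.json.
tokenizer.save_pretrained("./roberta-domain")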

Likewise, when I extend the embeddings in the model for the new words, is that saved to disk?

Check for yourself (it should).
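
A quick way to check (a sketch; the added token and the path are placeholders):

from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

num_added_toks = tokenizer.add_tokens(["immunohistochemistry"])
model.resize_token_embeddings(len(tokenizer))  # grows the embedding matrix (and the tied head)

model.save_pretrained("./roberta-domain")      # the resized weights are part of the checkpoint
tokenizer.save_pretrained("./roberta-domain")

# Reload to confirm the new size round-trips from disk.
reloaded = AutoModelForMaskedLM.from_pretrained("./roberta-domain")
print(reloaded.get_input_embeddings().weight.shape[0] == len(tokenizer))  # True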

If I want to revert to the original Roberta, is the best option to download a fresh copy of Roberta?

Everything you're intending to do is very hands-on, as I said, so pick whatever works best for you. Downloading a fresh copy will work.

Is there any good way to train just the embeddings for the new words with a high learning rate (or leave only them unfrozen)?

As explained above, this approach has downsides. It is not something that is widely practiced (to the extent of my knowledge), so you're basically on your own figuring out how to do it correctly. I shared my understanding of the problem above, but I don't have an all-inclusive solution for you. Please do consider training from scratch, as it won't have the aforementioned issues. But if you really want to fine-tune, then yes, you most likely need some specific handling of the gradients to let them flow to your new vocab while freezing the old one. It cannot be done by default in PyTorch (afaik), so you probably have to manually split the two parts to get this working.
Since the embedding is also shared at the end of the model (in the CausalLM head), there are most likely other tricks needed there.
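
For what it's worth, here is roughly the kind of manual split I mean, done with a gradient hook (a sketch only, not something I've validated; the added token is a placeholder):

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

num_added_toks = tokenizer.add_tokens(["immunohistochemistry"])
model.resize_token_embeddings(len(tokenizer))

embedding = model.get_input_embeddings()
old_vocab_size = embedding.weight.shape[0] - num_added_toks

# Zero the gradient for the original rows on every backward pass so that only
# the new rows get updated. Because the output head is tied to the input
# embeddings by default for Roberta, the same hook covers both sides.
def mask_old_rows(grad):
    grad = grad.clone()
    grad[:old_vocab_size] = 0
    return grad

embedding.weight.register_hook(mask_old_rows)

Note that an optimizer with weight decay (AdamW, for instance) will still shrink the masked rows a little even with zero gradient, so you'd probably want to exclude the embedding from weight decay or restore the old rows afterwards.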

Now, with all the above being said, I find that approach really nice. If you figure out something that works (or doesn't), I think it would be super nice to share it back here (or somewhere else)!

PeterM18 (Author) commented Feb 9, 2023

Narsil: Thank you so much for your thorough answer!

Elsewhere, @cronoik suggests a way to perhaps change the learning rate for the new tokens during 'pre-fine-tuning' (huggingface/transformers#2691 (comment)): "You might want to add an additional word_embedding layer for the new tokens and freeze all other layers to save some time." Unfortunately, I'm at a loss for how this would be implemented if it really means adding an embedding layer specific to the new tokens.

There is a paper showing that adapting Roberta's vocabulary to a domain can lead to some substantial task improvements; it's described here: https://medium.com/jasonwu0731/pre-finetuning-domain-adaptive-pre-training-of-language-models-db8fa9747668 .

However, that paper describes an adaptation process involving anywhere from 10 to 50 GB of domain data. I have a bit under 1 GB and, to make it worse, the data is multilingual: about 70% is English and Spanish, and about 95% is covered by five languages. So even if I had a good way of changing the vocabulary and training the new embeddings, I may well not have enough data to get anywhere. And given what you are saying about the pitfalls of training the new embeddings, I'm increasingly inclined not to try this, unless I had a good method of pre-fine-tuning that would target those new embeddings. I'll think on it. Again, thanks for your thoughts!

PeterM18 (Author) commented Feb 9, 2023

Compliments of ChatGPT, below is code that allows the weights of the new tokens to take a different learning rate (or be unfrozen). I'm still inclined to think that 1 GB of multilingual data will not be enough to try this.

import torch
import transformers

model = transformers.RobertaModel.from_pretrained('roberta-base')
embedding_layer = model.embeddings

weight = embedding_layer.weight

vocab_size = weight.size(0)

new_token_indices = [vocab_size - num_added_toks + i for i in range(num_added_toks)]

# Set a high learning rate for the new tokens' embeddings
high_lr = 1e-3
for index in new_token_indices:
    weight[index].requires_grad = True
    weight[index].lr = high_lr

# Set a low learning rate for the existing tokens' embeddings
low_lr = 1e-5
for index in range(vocab_size - num_added_toks):
    weight[index].requires_grad = True
    weight[index].lr = low_lr

# Alternatively, you can freeze the embeddings of the existing tokens entirely:
for index in range(vocab_size - num_added_toks):
    weight[index].requires_grad = False

Narsil (Collaborator) commented Feb 9, 2023

There is a paper showing that adapting Roberta's vocabulary to a domain can lead to some substantial task improvements; it's described here: https://medium.com/jasonwu0731/pre-finetuning-domain-adaptive-pre-training-of-language-models-db8fa9747668 .

Nice, thanks for sharing.

Compliments of ChatGPT,

Have you tried it? Classic ChatGPT, IMO: it looks sane, but it simply doesn't work, not even close.
The learning rate cannot be applied weight by weight; it's a global optimizer state.
Don't mind my old-dev grumpiness. ChatGPT is impressive, but it really cannot code or think.

PeterM18 (Author) commented
Thanks, Narsil! Good point about not taking ChatGPT code at face value; I've seen it make mistakes, and I haven't tried the code it suggested. On the other hand, the learning rate can't be a fully global optimizer state: I've seen professional code on Kaggle that applies different learning rates per layer. But I don't see a viable way of adjusting the learning rate per individual parameter mentioned anywhere except in a comment here: https://stackoverflow.com/questions/59013887/parameter-specific-learning-rate-in-pytorch. Apparently, though, that requires writing your own optimizer. Good to know how difficult this will be, if not impossible.
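
For reference, I believe the per-layer version I've seen boils down to PyTorch optimizer parameter groups, roughly like this (a sketch; it gives the whole embedding matrix one learning rate rather than individual rows):

import torch
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("roberta-base")

embedding_params = list(model.get_input_embeddings().parameters())
embedding_ids = {id(p) for p in embedding_params}
other_params = [p for p in model.parameters() if id(p) not in embedding_ids]

# Per-group learning rates: one rate for the embedding matrix, a lower one for
# everything else. This is per parameter group, not per weight.
optimizer = torch.optim.AdamW(
    [
        {"params": embedding_params, "lr": 1e-3},
        {"params": other_params, "lr": 1e-5},
    ]
)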

github-actions bot commented: This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label on Jan 13, 2024 and closed this as not planned on Jan 18, 2024.