Adding New Vocabulary Tokens to the Models #1413
Comments
Hi, I believe this method does exactly what you're looking for: `add_tokens`. There's an example right below it.
Thanks @LysandreJik! Yes, that's exactly what I was looking for. A follow-up question: how could I initialize the embeddings of these "new tokens" to something I already have pre-computed? I assume that currently the embeddings for these new tokens will be randomly initialized.
You are right, these tokens will be randomly initialized. What I would do if I wanted to assign new values to this embedding (as an initialization) is to directly change the Embeddings weight. Here's an example with the BertModel:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertModel.from_pretrained("bert-base-cased")

print(len(tokenizer))  # 28996
tokenizer.add_tokens(["NEW_TOKEN"])
print(len(tokenizer))  # 28997

model.resize_token_embeddings(len(tokenizer))
# The new vector is added at the end of the embedding matrix
print(model.embeddings.word_embeddings.weight[-1, :])

# Randomly generated matrix
model.embeddings.word_embeddings.weight[-1, :] = torch.zeros([model.config.hidden_size])
print(model.embeddings.word_embeddings.weight[-1, :])
# outputs a vector of zeros of shape [768]
```
thanks @LysandreJik! That should solve it quite neatly. I will reopen the issue in case I run into any issues.
Hello @LysandreJik, what is the difference between the following two approaches: training a new tokenizer from scratch versus adding new tokens to an existing tokenizer with `add_tokens`?
Thank you in advance.
Training a tokenizer from scratch would imply training a model from scratch as well - depending on the corpus used for the tokenizer, the tokens may be entirely different from another model's tokens trained on a similar corpus (except if you train the tokenizer using the exact same method and the exact same data).

Adding tokens adds tokens at the end of the tokenizer's vocabulary, essentially extending the vocabulary. The model's embedding matrix would need to be resized as well to take into account the new tokens, but all the other tokens would keep their representation as-is. Seeing as the new rows in the embedding matrix are randomly initialized, you would still need to fine-tune the model on a dataset containing such tokens.
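To make that last point concrete, here is a minimal sketch of such a fine-tuning step using masked language modeling; the tiny dataset, output directory, and hyperparameters are placeholder assumptions, not part of the answer above:

```python
from datasets import Dataset
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")

# Extend the vocabulary and resize the embedding matrix, as described above.
tokenizer.add_tokens(["NEW_TOKEN"])
model.resize_token_embeddings(len(tokenizer))

# Hypothetical in-domain sentences that actually contain the new token.
texts = ["NEW_TOKEN is used like this in my domain.", "Another sentence mentioning NEW_TOKEN."]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=64),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm-finetune", num_train_epochs=1, per_device_train_batch_size=2),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15),
)
trainer.train()
```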
@LysandreJik Thanks!
Hey, I would like to fine-tune the model, as you suggested at the end, on a dataset containing such tokens. Can you help me out with how I can do that?
If I add unknown tokens to the tokenizer and train the model on, say, sentence pair similarity - while I suppose the new tokens' embeddings will not have the correct relationship with other tokens - will the model output still be able to find similarity correctly given sufficient training of the model?
@LysandreJik Thank you for your suggestion. However, I ran into trouble because altering the embedding turns the embedding tensor into a non-leaf tensor, and hence it cannot be optimized, i.e. `model.embeddings.word_embeddings.weight.is_leaf  # False`. I cannot figure out how to fix this (I am a torch beginner; sorry). Do you have any suggestions?
Facing the same issue; getting False for is_leaf.
Hi, I tried this, but my code still gets stuck at the tokenizing-the-sentences step and doesn't get past it. It may be lagging or have some problem... What should I do?
Have you solved the problem? If so, can you share it with us?
Yes, it was because it takes a very long time to add all the tokens. I installed transformers from source: pip install -U git+https://github.com/huggingface/transformers, since a PR that should speed this up dramatically was recently merged, and my problem was solved.
thank you!
Why can't we repurpose the existing 999 unused ([unusedX]) tokens in the vocabulary instead of extending the vocab size?
@LysandreJik when I ran your code, the following error popped up. Please help: RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.
You can fix that error by temporarily disabling gradient calculation, because initializing the weights is not an operation that needs to be accounted for in backpropagation:

```python
with torch.no_grad():
    model.embeddings.word_embeddings.weight[-1, :] = torch.zeros([model.config.hidden_size])
```
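Putting the pieces together, the earlier bert-base-cased example with the initialization wrapped in torch.no_grad() would then look roughly like this (a sketch, not an official recipe):

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertModel.from_pretrained("bert-base-cased")

tokenizer.add_tokens(["NEW_TOKEN"])
model.resize_token_embeddings(len(tokenizer))

# Initialize the newly added row without recording the operation in autograd.
with torch.no_grad():
    model.embeddings.word_embeddings.weight[-1, :] = torch.zeros([model.config.hidden_size])

print(model.embeddings.word_embeddings.weight.is_leaf)  # True: the weight stays a leaf tensor
print(model.embeddings.word_embeddings.weight[-1, :])   # a vector of zeros of shape [768]
```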
I finally chose the following solution:

```python
DEFAULT_BOS_TOKEN = "<s>"
DEFAULT_EOS_TOKEN = "</s>"
DEFAULT_PAD_TOKEN = "[PAD]"
DEFAULT_UNK_TOKEN = "<unk>"


def tokenizer_embedding_resize(special_tokens_dict, tokenizer, model):
    """Resize tokenizer and embedding.

    Note: This is the unoptimized version that may make your embedding size not be divisible by 64.
    """
    num_new_tokens = tokenizer.add_special_tokens(special_tokens_dict)
    model.resize_token_embeddings(len(tokenizer))

    if num_new_tokens > 0:
        input_embeddings = model.get_input_embeddings().weight.data
        output_embeddings = model.get_output_embeddings().weight.data

        input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
        output_embeddings_avg = output_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)

        input_embeddings[-num_new_tokens:] = input_embeddings_avg
        output_embeddings[-num_new_tokens:] = output_embeddings_avg


def add_special_token(tokenizer):
    """Add special tokens to the tokenizer."""
    tokenizer.add_special_tokens(
        {
            "pad_token": DEFAULT_PAD_TOKEN,
            "eos_token": DEFAULT_EOS_TOKEN,
            "bos_token": DEFAULT_BOS_TOKEN,
            "unk_token": DEFAULT_UNK_TOKEN,
        }
    )
    return tokenizer


# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    cache_dir=save_dir,
    model_max_length=train_max_len,
    add_eos_token=True,
    add_bos_token=True,
    padding='longest',
    padding_side="right",
    truncation=True,
    return_tensors="pt",
    use_fast=False,
    trust_remote_code=True,
    use_auth_token=hf_auth_token,
    device_map=device_map,
)

if tokenizer.pad_token is None:
    tokenizer_embedding_resize(
        special_tokens_dict=dict(pad_token="[PAD]"),
        tokenizer=tokenizer,
        model=model,
    )

tokenizer = add_special_token(tokenizer)

# Resize the input token embedding matrix of the model if new_num_tokens != config.vocab_size.
model.resize_token_embeddings(len(tokenizer))
```

It works well on my side. Best regards, Shuyue
I'm unable to select the checkpoint. Since ...
If I give it just the directory, I get ...
Alright, you can't load a pipeline without the configuration and the required sub-checkpoints, like here: https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/main (AFAIK). I would recommend you ask on the diffusers repository.
@ArthurZucker #1413 (comment) Some context on my situation, which I believe is quite typical: I have a reasonably large domain-specific text dataset available, all in English, and my end goal is to produce a domain-specific language model that can be used for various downstream NLP tasks (e.g., text classification, sentence similarity, etc.) after additional fine-tuning on those tasks. From my current understanding, to obtain that domain-specific language model I basically have two options: (1) train a tokenizer from scratch and then use that tokenizer to train a LM from scratch, or (2) add the new domain-specific tokens to an existing pretrained tokenizer and model. I am starting to explore the second option, but I am confused about how to properly modify the vocabulary in a way that makes sense for the model, and concerned about what other side effects this could cause. To summarize: I'd really like to know whether there is a lower-cost alternative to option 1 above (training a LM from scratch).
Alright. If you need to add new tokens to the vocab but are not sure how, there are a few ways you can do this: ...
I think that's pretty much it 😓
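For example, if the route you pick is retraining a tokenizer of the same type on your own corpus, fast tokenizers expose train_new_from_iterator; a minimal sketch, where the corpus and vocabulary size are placeholder assumptions:

```python
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Placeholder in-domain corpus; in practice this would be an iterator over your dataset.
corpus = ["first domain-specific document", "second domain-specific document"]

# Train a new tokenizer with the same algorithm and settings as the old one.
new_tokenizer = old_tokenizer.train_new_from_iterator(corpus, vocab_size=30000)
new_tokenizer.save_pretrained("my-domain-tokenizer")
```

As noted earlier in the thread, a tokenizer retrained this way no longer matches the pretrained model's embedding matrix, so the model itself would need to be (re)trained accordingly.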
So I asked over on diffusers and got no answer, then I asked again and got a response. All they did was argue over why I should not use their project, for the exact reasons the project exists... They basically said "Why are you trying to do this instead of being a sheep?", and they won't answer why the code is erroring out. It errors out at line 18.
The problem was the types.
This likely falls under transformers, since I think it's just a text model issue.
What is "current model"? |
That just means the |
You mean the config.json file in the text_encoder folder? It already says the new number.
Hi @ArthurZucker,
Then I tried to add a token to the tokenizer (for example, adding "rn" to bert-base-cased), and afterwards a word containing it, like "California", was no longer recognized as a single token.
I didn't understand why the tokenizer behaved like this. If I add a token B that is a substring of another token A, does this imply that the tokenizer will not recognize A, like in the example?
```python
# tokenizer.save_pretrained("model")
with open("filecontainstokens.txt", "r") as file:
    new_tokens = [line.strip() for line in file]

num_added_toks = tokenizer.add_tokens(new_tokens)
print('We have added', num_added_toks, 'tokens')

model.resize_token_embeddings(len(tokenizer))

average_embedding = torch.mean(model.get_input_embeddings().weight, axis=0)
for token_id in range(-num_added_toks, 0, 1):
    model.get_input_embeddings().weight.data[token_id, :] = average_embedding

new_embedding_display = model.get_input_embeddings().weight[-1]
# print(new_embedding_display)

len(tokenizer)
```
The problem seems to be that the tokens are never actually added to the text encoder, so the tokenizer shows the new tokens and the config says the new number, but the actual CLIPTextModel (the neural network?) doesn't match that. If I load the SD model, make some changes, save it as a Diffusers model, then convert it to an SD model, then load it again and re-save it as a Diffusers model, the tokenizer has the original tokens again, before my changes. Clearly the values in the tokenizer are pulled from somewhere; the only place I can think of is the text encoder.
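If the text encoder is the usual transformers CLIPTextModel, one way to keep the tokenizer and the encoder in sync is to resize and save both; a rough sketch, where the checkpoint path, subfolder layout, and token name are assumptions about the setup:

```python
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("my-diffusers-model", subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained("my-diffusers-model", subfolder="text_encoder")

num_added = tokenizer.add_tokens(["<my-new-token>"])
if num_added > 0:
    # Grow the token embedding matrix so the encoder matches the tokenizer.
    text_encoder.resize_token_embeddings(len(tokenizer))

# Both pieces have to be saved, otherwise the saved checkpoint keeps the old vocabulary.
tokenizer.save_pretrained("my-diffusers-model/tokenizer")
text_encoder.save_pretrained("my-diffusers-model/text_encoder")
```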
You can control this using the `single_word` argument of `AddedToken`:

```python
from transformers import BertTokenizer, AddedToken

original_tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
text = "California"
original_tokenizer.add_tokens([AddedToken("rn", single_word=True)], special_tokens=False)
original_tokens = original_tokenizer.tokenize(text)
```

which does not work here. Feel free to open a new issue, but the bert tokenizer is old, so I am not surprised that this does not work.

```python
In [16]: from transformers import AutoTokenizer, AddedToken
    ...: original_tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
    ...:
    ...: text = "California"
    ...: original_tokenizer.add_tokens([AddedToken("rn", single_word=True)], special_tokens=False)
    ...: original_tokens = original_tokenizer.tokenize(text)

In [17]: original_tokens
Out[17]: ['California']
```
Thanks, @ArthurZucker; I will open an issue. I tried another one that exploits XLMRoberta (...).
Hi @ArthurZucker, thank you for this insightful message. So, after adding new tokens to the model, how do I get these new tokens to reflect in the tokenizer? Do I train a new tokenizer from scratch? If not, please, what are the specific modifications I need to make to the tokenizer?
I am not sure I understand: if you add the tokens to the tokenizer, it should already be reflected!
Ahh, I get it now: just updating the JSON file would suffice. I stumbled on your post while looking for answers on how to extend the vocab of an Nvidia-Nemo model. I know this is not directly related to this thread, but I would appreciate it if you shared anything you know about this with me.
😅 I am not familiar at all with the Nvidia-Nemo models.
Hi @ArthurZucker, my target is to successfully merge a pre-trained model's tokenizer with a newly trained tokenizer for an Indian language (Telugu).
A little intro to my problem: my aim is to build or adapt any open-source LLM to one or more Indian languages without losing its existing knowledge. Later, this can be fine-tuned for various downstream tasks in Indic languages. To my knowledge, there are two options.
So far, I have figured out how to train a SentencePiece BPE tokenizer from scratch and then merge it with a pre-trained tokenizer (see the sketch after this comment).
The new merged tokenizer works efficiently on both languages (English and Telugu). I adapted my code from here: https://github.com/google/sentencepiece/blob/master/python/add_new_vocab.ipynb. Next, I trained a tiktoken-style byte-level tokenizer using the code from this repo: https://github.com/gautierdag/tokenizer-bench
Can you help me figure out the correct way to merge so that the Telugu encoding performance is brought back to the level of the individual new tokenizer trained only on Telugu?
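For reference, a rough sketch of the kind of SentencePiece-level merge described above, along the lines of the linked add_new_vocab.ipynb; the model file names are placeholders and this is not the poster's actual code:

```python
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

# Placeholder paths: the pretrained model's SentencePiece file and the newly trained Telugu one.
base = sp_pb2.ModelProto()
base.ParseFromString(open("base_tokenizer.model", "rb").read())
telugu = sp_pb2.ModelProto()
telugu.ParseFromString(open("telugu_bpe.model", "rb").read())

# Append every Telugu piece that the base vocabulary does not already contain.
existing_pieces = {p.piece for p in base.pieces}
for p in telugu.pieces:
    if p.piece not in existing_pieces:
        new_piece = sp_pb2.ModelProto.SentencePiece()
        new_piece.piece = p.piece
        new_piece.score = 0.0
        base.pieces.append(new_piece)

with open("merged_tokenizer.model", "wb") as f:
    f.write(base.SerializeToString())

# Sanity check with the merged model.
sp = spm.SentencePieceProcessor(model_file="merged_tokenizer.model")
print(sp.encode("తెలుగు and English in one sentence", out_type=str))
```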
Seems related to huggingface/tokenizers#627
Hello, there is a new case: I only want to train the embeddings of the new tokens and keep the original embeddings unchanged. Is there a way to do this?
Of course, you simply need to freeze the embeddings of the old tokens. What people usually do is this: either you create a new embedding layer, or you add partial embeddings that are trainable (see the sketch below)!
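As one concrete way to do the partial freeze, a gradient hook can zero out the updates for the original rows so that only the newly added embeddings train; a minimal sketch, assuming the new tokens are appended at the end of a BERT embedding matrix as in the earlier examples:

```python
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertModel.from_pretrained("bert-base-cased")

num_original_tokens = len(tokenizer)  # vocab size before adding anything
tokenizer.add_tokens(["NEW_TOKEN_1", "NEW_TOKEN_2"])
model.resize_token_embeddings(len(tokenizer))

embedding = model.get_input_embeddings()

def zero_grad_for_old_rows(grad):
    # Zero the gradient of the original rows so only the new embeddings get updated.
    grad = grad.clone()
    grad[:num_original_tokens] = 0
    return grad

embedding.weight.register_hook(zero_grad_for_old_rows)
```

Note that optimizer-side effects such as decoupled weight decay can still move the "frozen" rows slightly, so weight decay may need to be disabled for the embedding parameters.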
Thanks for your reply! It really helps. I'll try it.
So I did this:

```python
input_embeddings = model.get_input_embeddings()
new_embed = IdeficsDecoupledEmbedding(
    num_embeddings=len(tokenizer),
    num_additional_embeddings=5000,
    embedding_dim=input_embeddings.weight.shape[1],
    partially_freeze=True,
)
model.set_input_embeddings(new_embed)
```

But the gradients of the new model do not update during training. Do you know what could be wrong?
No idea, this should do it!
I think it might be an issue with how the Trainer sets up the optimizer.
Hi! I'm trying to add new tokens to the BERT tokenizer but am facing some unexpected behavior. Setup: ... Expectation: ... What is observed: ... Any idea on this issue?
❓ Questions & Help
Hi,
How could I extend the vocabulary of the pre-trained models, e.g. by adding new tokens to the lookup table?
Any examples demonstrating this?