Adding New Vocabulary Tokens to the Models #1413

Closed
vyraun opened this issue Oct 3, 2019 · 67 comments

@vyraun

vyraun commented Oct 3, 2019

❓ Questions & Help

Hi,

How can I extend the vocabulary of the pre-trained models, e.g. by adding new tokens to the lookup table?

Are there any examples demonstrating this?

@LysandreJik
Member

Hi, I believe this method does exactly what you're looking for: add_tokens. There's an example right below it.

@vyraun
Author

vyraun commented Oct 3, 2019

Thanks @LysandreJik! Yes, that's exactly what I was looking for. A follow-up question: how could I initialize the embeddings of these "new tokens" to something I have already pre-computed? I assume the embeddings for these new tokens will currently be randomly initialized.

@LysandreJik
Member

You are right, these tokens will be randomly initialized. If I wanted to assign new values to these embeddings (as an initialization), I would directly change the embedding weights. Here's an example with the BertModel.

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertModel.from_pretrained("bert-base-cased")

print(len(tokenizer))  # 28996
tokenizer.add_tokens(["NEW_TOKEN"])
print(len(tokenizer))  # 28997

model.resize_token_embeddings(len(tokenizer)) 
# The new vector is added at the end of the embedding matrix

print(model.embeddings.word_embeddings.weight[-1, :])
# Randomly initialized vector

model.embeddings.word_embeddings.weight[-1, :] = torch.zeros([model.config.hidden_size])

print(model.embeddings.word_embeddings.weight[-1, :])
# outputs a vector of zeros of shape [768]

@vyraun
Author

vyraun commented Oct 3, 2019

Thanks @LysandreJik! That should solve it quite neatly. I will reopen the issue if I run into any problems.

@celsofranssa

Hello @LysandreJik ,

What is the difference between the following approaches?

  1. training a tokenizer from scratch, as described in the Hugging Face blog; or
  2. using the add_tokens method?

Thank you in advance.

@LysandreJik
Member

Training a tokenizer from scratch would imply training a model from scratch as well - depending on the corpus used for the tokenizer, the tokens may be entirely different from another model's tokens trained on a similar corpus (except if you train the tokenizer using the exact same method and the exact same data).

Adding tokens adds tokens at the end of the tokenizer's vocabulary, essentially extending the vocabulary. The model's embedding matrix would need to be resized as well to take into account the new tokens, but all the other tokens would keep their representation as-is. Seeing as the new rows in the embedding matrix are randomly initialized, you would still need to fine-tune the model to a dataset containing such tokens.
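
For readers with the same follow-up question (how to actually fine-tune after extending the vocabulary), here is a minimal sketch of the masked-LM route described above. It is not from this thread; the toy corpus, output directory, and hyperparameters are placeholders.

from datasets import Dataset
from transformers import (
    BertTokenizerFast,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")

tokenizer.add_tokens(["NEW_TOKEN"])
model.resize_token_embeddings(len(tokenizer))  # the new rows are randomly initialized

# Toy in-domain corpus; replace with text that actually contains the new tokens.
texts = ["A sentence mentioning NEW_TOKEN in context.", "Another NEW_TOKEN example."]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=64),
    batched=True,
    remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm-out", num_train_epochs=1, per_device_train_batch_size=2),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()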

@PieterDujardin

@LysandreJik
I have a Dutch medical dataset (for Named Entity Recognition) that contains a lot of domain-specific words, so the Dutch BERT tokenizer outputs a lot of [UNK] tokens when it tokenizes.
Given that I have a labelled corpus of 60k tokens and also a relatively small unannotated corpus of 185k tokens, would it be best to:

  • just add the most frequent out-of-vocabulary words to the vocab of the tokenizer, or
  • start from a BERT checkpoint and do further pretraining on the unlabelled dataset (which is currently 185k tokens, which I assume is pretty small)? There might be a possibility for me to obtain a much larger unannotated dataset of potentially millions of (unlabelled) tokens, but I was wondering whether even millions of tokens is enough to do meaningful further pretraining.

Thanks!

@vinayannam

[quoting LysandreJik's explanation above on training a tokenizer from scratch vs. add_tokens]

Hey, I would like to fine-tune the model, as you suggested at the end, on a dataset containing such tokens. Can you help me figure out how to do that?

@crispin-nosidam

crispin-nosidam commented Jul 30, 2020

If I add unknown tokens to the tokenizer and train the model on, say, sentence-pair similarity, I suppose the new tokens' embeddings will not have the correct relationship with the other tokens at first. Will the model output still be able to capture similarity correctly given sufficient training?

@JensMadsen

@LysandreJik Thank you for your suggestion. However, I run into trouble because altering the embedding turns the embedding tensor into a non-leaf tensor, which therefore cannot be optimized, i.e.

model.embeddings.word_embeddings.weight.is_leaf # False

I cannot figure out how to fix this (I am a torch beginner; sorry). Do you have any suggestions?

@vjagannath786

Facing the same issue; is_leaf returns False.

@HenryPaik1

BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True).get_vocab() does not return the added tokens. How can I check whether a new token has been properly added to the vocab dictionary?
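
(For reference, a minimal way to check this, assuming a recent transformers version: tokens added with add_tokens are tracked separately from the base vocab and can be listed with get_added_vocab().)

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)
tokenizer.add_tokens(["NEW_TOKEN"])

print(len(tokenizer))                                # base vocab size plus the added tokens
print(tokenizer.get_added_vocab())                   # e.g. {'NEW_TOKEN': 30522}
print(tokenizer.convert_tokens_to_ids("NEW_TOKEN"))  # id assigned to the added token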

@ReySadeghi

[quoting LysandreJik's BertModel example above]

Hi,
I tried this, but my code still gets stuck at the sentence-tokenization step and doesn't get past it.
It may be lagging, or there may be some problem...
What should I do?

@zellford

zellford commented May 9, 2021

[quoting ReySadeghi's comment above]

Have you solved the problem? If so, can you share it with us?

@ReySadeghi

[quoting zellford's question above]

Yes, it was because it takes a very long time to add all the tokens. I installed transformers from source
(pip install -U git+https://github.com/huggingface/transformers), since a PR that should speed this up dramatically was recently merged, and my problem was solved.

@zellford

zellford commented May 10, 2021 via email

@ptheru

ptheru commented Jul 29, 2021

[quoting LysandreJik's explanation above on training a tokenizer from scratch vs. add_tokens]

Why can't we repurpose the existing 999 [unused] tokens instead of extending the vocab size?
google-research/bert#9 (comment)

@KairaNithin

[quoting LysandreJik's BertModel example above]

@LysandreJik, when I ran your code the following error popped up. Please help:

RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.

@cm107

cm107 commented Aug 19, 2021

RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.

You can fix that error by temporarily disabling gradient calculation. (Because initializing the weights is not an operation that needs to be accounted for in backpropagation.)

with torch.no_grad():
    model.embeddings.word_embeddings.weight[-1, :] = torch.zeros([model.config.hidden_size])

@SuperBruceJia

SuperBruceJia commented Dec 20, 2023

I finally chose the following solution:

DEFAULT_BOS_TOKEN = "<s>"
DEFAULT_EOS_TOKEN = "</s>"
DEFAULT_PAD_TOKEN = "[PAD]"
DEFAULT_UNK_TOKEN = "<unk>"

def tokenizer_embedding_resize(special_tokens_dict, tokenizer, model):
    """Resize tokenizer and embedding.

    Note: This is the unoptimized version that may make your embedding size not be divisible by 64.
    """
    num_new_tokens = tokenizer.add_special_tokens(special_tokens_dict)
    model.resize_token_embeddings(len(tokenizer))

    if num_new_tokens > 0:
        input_embeddings = model.get_input_embeddings().weight.data
        output_embeddings = model.get_output_embeddings().weight.data

        input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
        output_embeddings_avg = output_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)

        input_embeddings[-num_new_tokens:] = input_embeddings_avg
        output_embeddings[-num_new_tokens:] = output_embeddings_avg


def add_special_token(tokenizer):
    """
    Add special tokens to the tokenizer
    """
    tokenizer.add_special_tokens(
        {
            "pad_token": DEFAULT_PAD_TOKEN,
            "eos_token": DEFAULT_EOS_TOKEN,
            "bos_token": DEFAULT_BOS_TOKEN,
            "unk_token": DEFAULT_UNK_TOKEN,
        }
    )

    return tokenizer


# Load tokenizer (assumes `from transformers import AutoTokenizer` and that `model` is already loaded)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    cache_dir=save_dir,
    model_max_length=train_max_len,
    add_eos_token=True,
    add_bos_token=True,
    padding='longest',
    padding_side="right",
    truncation=True,
    return_tensors="pt",
    use_fast=False,
    trust_remote_code=True,
    use_auth_token=hf_auth_token,
    device_map=device_map,
)
if tokenizer.pad_token is None:
    tokenizer_embedding_resize(
        special_tokens_dict=dict(pad_token="[PAD]"),
        tokenizer=tokenizer,
        model=model,
    )
tokenizer = add_special_token(tokenizer)

# Resize input token embeddings matrix of the model if new_num_tokens != config.vocab_size.
model.resize_token_embeddings(len(tokenizer))

The reference code for tokenizer_embedding_resize(): https://github.com/meta-math/MetaMath/blob/main/train_math.py#L90-L110

The reference code for add_special_token(): https://github.com/meta-math/MetaMath/blob/main/train_math.py#L259-L279

It works well on my side.

Best regards,

Shuyue
Dec. 20th, 2023

@TeKett

TeKett commented Dec 31, 2023

from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler
import torch 

pipeline = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

pipe = StableDiffusionPipeline
num_new_tokens = pipeline.tokenizer.add_tokens(["new_token_1", "new_token_2"], special_tokens=True)

# simple resize
pipeline.text_encoder.resize_token_embeddings(len(pipeline.tokenizer))

# overwrite the content to have better results
input_embeddings = pipeline.text_encoder.get_input_embeddings().weight.data
input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
input_embeddings[-num_new_tokens:] = input_embeddings_avg
pipeline.text_encoder.set_input_embeddings(input_embeddings)

should work 😉

I'm unable to select the checkpoint, since StableDiffusionPipeline.from_pretrained wants a path to a directory containing a pipeline object. I don't have that; what even is that? I can't stress enough that all I have is an SD 1.5 checkpoint: the kind you load into A1111 to generate images, that can be trained with Kohya, and that is shared on Civitai. I don't have a pipeline object, and the CLIP model I want to add tokens to is packaged inside a .safetensors file.

ValueError: The provided pretrained_model_name_or_path "C:/Train/checkpoint.safetensors" is neither a valid local path nor a valid repo id.

If I give it just the directory, I get OSError: Error no file named model_index.json found in directory.

@ArthurZucker
Collaborator

Alright, you can't load a pipeline without the configuration and the required sub-checkpoints, like here: https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/main (AFAIK). I would recommend asking on the diffusers repo, as this is outside the scope of transformers 🤗

@kumarme072

@ArthurZucker
I have come across many similar issues asking about how to add new tokens to a vocabulary, for reference, here are a couple links to useful comments made for doing roughly that:

#1413 (comment)
#2691 (comment)
huggingface/tokenizers#627 (comment)
However, I am concerned with how to first identify tokens that make sense to add to an existing tokenizer's vocabulary, and also possibly whether or not it makes sense to consider removing tokens from a vocabulary.

Some context into my situation:

My situation, I believe, is quite typical: I have a reasonably large domain-specific text dataset available, all in English, and my end goal is to produce a domain-specific language model that can be used for various downstream NLP tasks (e.g., text classification, sentence similarity, etc.), after additional fine-tuning on those tasks.

But from my current understanding, to first obtain that domain-specific language model, I basically have two options:

  1. train a tokenizer from scratch and then use that tokenizer to train a LM from scratch; or
  2. modify the vocabulary of a pretrained tokenizer, adjust an (also pretrained) LM's embedding matrix to work with this new vocab size, and fine-tune the pretrained LM on my domain-specific text dataset on something like MLM.

I am struggling with the first option because (as far as I know) training a language model from scratch is quite expensive, and although I do have some budget for this, I do not have on the order of thousands of dollars.

I am starting to explore the second option, but I am confused on how to properly modify the vocabulary in a way that makes sense for the model, and concerned about what other side effects this could cause.

To summarize:

  • I'd really like to know if there is a low-cost option for training a LM from scratch to do option 1 above.
  • Or, if option 2 makes more sense, how to properly modify a vocabulary (find good new tokens, remove unused ones, etc.) and adapt the model to overcome potential negative side effects of messing with the embeddings.

Thanks for the help. Sorry for the long question, but I thought some context may be needed since I might be asking the wrong question in the first place. Cheers. (Question asked by someone else.)

@ArthurZucker
Collaborator

Alright. If you need to add new tokens to the vocab but are not sure how, there are a few ways you can do this.

  1. Train a new tokenizer, using https://huggingface.co/learn/nlp-course/chapter6/2#training-a-new-tokenizer. This will make use of train_new_from_iterator (see the sketch after this comment). If you have language-specific data that uses none of the "old" tokens, that might be okay, but otherwise, as you mentioned, you would need to retrain the model.
  2. Train a new small tokenizer on a small corpus, then merge the new vocab with the old vocab (merge the vocabs, and the merges if it is a BPE tokenizer, by just adding the new tokens at the end). More on that here: "How can I keep the initial input vocab and incrementally add the new tokens during re-training a tokenizer?" tokenizers#1109. It might not be optimal, but if certain languages have fewer tokens it should be alright.
  3. Manually add all the new tokens using add_tokens(), which will just be adding characters / words for simplicity. This can grow the vocab enormously if the vocabulary of the language is huge.

I think that's pretty much it 😓
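
For option 1 above, a minimal sketch of train_new_from_iterator (not from this thread; it assumes a fast tokenizer, and the corpus, vocab size, and output directory are placeholders):

from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # loads a fast tokenizer by default

# Placeholder corpus; in practice, stream your own domain-specific text here.
corpus = ["domain specific sentence one", "domain specific sentence two"]

def batch_iterator(batch_size=1000):
    for i in range(0, len(corpus), batch_size):
        yield corpus[i : i + batch_size]

new_tokenizer = old_tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=32000)
new_tokenizer.save_pretrained("new-tokenizer")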

@TeKett

TeKett commented Feb 10, 2024

So I asked over on diffusers and got no answer, then I asked again and got a response. All they did was argue over why I should not use their project for the exact reasons the project exists... Basically, "Why are you trying to do this instead of being a sheep?", and they won't answer why the code is erroring out.

It errors out at the line marked # Error below: cannot assign 'torch.FloatTensor' as child module 'token_embedding'

from diffusers import StableDiffusionPipeline

array = []
with open("D:/tagstest.txt",encoding="utf8") as file:
        array = [row.rstrip("\n") for row in file.readlines()]

pipeline = StableDiffusionPipeline.from_single_file("C:/Train/checkpoint.safetensors")

num_new_tokens = pipeline.tokenizer.add_tokens(array, special_tokens=False)

# simple resize (is this correct?)
pipeline.text_encoder.resize_token_embeddings(len(pipeline.tokenizer))

# overwrite the content to have better results
input_embeddings = pipeline.text_encoder.get_input_embeddings().weight.data
input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
input_embeddings[-num_new_tokens:] = input_embeddings_avg
pipeline.text_encoder.set_input_embeddings(input_embeddings) # Error
pipeline.model.save_pretrained("c:/test")

@ArthurZucker
Collaborator

pipeline.text_encoder.set_input_embeddings(input_embeddings) should be given an nn.Embedding, if I am not mistaken. Thus you first call get_input_embeddings(), change its data, and then call set_input_embeddings().

@TeKett

TeKett commented Feb 13, 2024

The problem was the types: .weight.data is a torch.FloatTensor, while get_input_embeddings() returns an nn.Embedding. I could just omit this step completely, no? Since all it does is "unlearn" the model?

input_embeddings = pipeline.text_encoder.get_input_embeddings()
input_embeddings_avg = input_embeddings.weight.data[:-num_new_tokens].mean(dim=0, keepdim=True)
input_embeddings.weight.data[-num_new_tokens:] = input_embeddings_avg
pipeline.text_encoder.set_input_embeddings(input_embeddings)

This likely falls under transformers, since I think it's just a text-model issue.
When I try to load the model again I'm getting:

RuntimeError: Error(s) in loading state_dict for CLIPTextModel:
	size mismatch for text_model.embeddings.token_embedding.weight: copying a param with shape torch.Size([90323, 768]) from checkpoint, the shape in current model is torch.Size([49408, 768]).

What is "current model"?

@ArthurZucker
Collaborator

That just means the config.vocab_size is wrong and should be updated to 90323. The "current model" is the one initialized with the config.

@TeKett

TeKett commented Feb 19, 2024

That just means the config.vocab_size is wrong and should be updated to 90323. The "current model" is the one initialized with the config.

You mean the config.json file in the text_encoder folder? It already says the new number.

{
  "architectures": [
    "CLIPTextModel"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "dropout": 0.0,
  "eos_token_id": 2,
  "hidden_act": "quick_gelu",
  "hidden_size": 768,
  "initializer_factor": 1.0,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 77,
  "model_type": "clip_text_model",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "projection_dim": 768,
  "torch_dtype": "float32",
  "transformers_version": "4.35.2",
  "vocab_size": 90320
}

@feliperviegas

[quoting ArthurZucker's three options above]

Hi @ArthurZucker,
I tried to extend the tokenizer vocabulary using the add_tokens method, but I got odd behavior; I'm not sure whether I used it correctly. I will try to demonstrate with the following example:

from transformers import BertTokenizer
original_tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

text = "California"
original_tokens = original_tokenizer.tokenize(text)
original_tokens # And here the tokenizer knows the token, returning it with no issues.

Then I tried to add a token to the tokenizer

from transformers import BertTokenizer
original_tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

text = "California"
original_tokenizer.add_tokens(["rn"], special_tokens=False)
original_tokens = original_tokenizer.tokenize(text) # And here it returns ['Cal', '##if', '##o', 'rn', 'i', '##a']

I didn't understand why the tokenizer behaved like this. If I add a token B that is a substring of another token A, does this imply that the tokenizer will not recognize the A, like in the example?

@kumarme072

kumarme072 commented Mar 5, 2024

# tokenizer.save_pretrained("model")

with open("filecontainstokens.txt","r") as file:
    new_tokens = [line.strip() for line in file]

num_added_toks = tokenizer.add_tokens(new_tokens)
print('We have added', num_added_toks, 'tokens')

model.resize_token_embeddings(len(tokenizer))

average_embedding = torch.mean(model.get_input_embeddings().weight, axis=0)
for token_id in range(-num_added_toks, 0, 1):
    model.get_input_embeddings().weight.data[token_id, :] = average_embedding

new_embedding_display = model.get_input_embeddings().weight[-1]
# print(new_embedding_display)

len(tokenizer)

@TeKett

TeKett commented Mar 6, 2024

That just means the config.vocab_size is wrong and should be updated to 90323. The "current model" is the one initialized with the config.

The problem seems to be that the tokens are never actually added to the text encoder: the tokenizer has the new tokens and the config has the new number, but the actual CLIPTextModel (the neural network?) doesn't match. If I load the SD model, make some changes, save it as a Diffusers model, convert it back to an SD model, then load it again and re-save it as a Diffusers model, the tokenizer has the original tokens again, without my changes. Clearly the tokenizer's values are pulled from somewhere, and the only place I can think of is the text encoder.

@ArthurZucker
Collaborator

I didn't understand why the tokenizer behaved like this. If I add a token B that is a substring of another token A, does this imply that the tokenizer will not recognize the A, like in the example?

You can control this using the single_word option of the AddedToken:

from transformers import BertTokenizer, AddedToken
original_tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

text = "California"
original_tokenizer.add_tokens([AddedToken("rn", single_word=True)], special_tokens=False)
original_tokens = original_tokenizer.tokenize(text) 

which does not work here. Feel free to open a new issue, but the BERT tokenizer is old, so I am not surprised that this does not work.

In [16]: from transformers import AutoTokenizer, AddedToken
    ...: original_tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
    ...: 
    ...: text = "California"
    ...: original_tokenizer.add_tokens([AddedToken("rn", single_word=True)], special_tokens=False)
    ...: original_tokens = original_tokenizer.tokenize(text)

In [17]: original_tokens
Out[17]: ['California']

@feliperviegas

[quoting ArthurZucker's reply above]

Thanks, @ArthurZucker; I will open an issue. I tried another one that uses XLMRoberta (intfloat/multilingual-e5-small) and got the same behavior.

@owos

owos commented Mar 28, 2024

[quoting LysandreJik's BertModel example above]

Hi @ArthurZucker, thank you for this insightful message. After adding new tokens to the model, how do I get these new tokens to be reflected in the tokenizer? Do I train a new tokenizer from scratch? If not, what are the specific modifications I need to make to the tokenizer?

@ArthurZucker
Collaborator

I am not sure I understand; if you add the tokens to the tokenizer, it should already be reflected!

@owos

owos commented Mar 28, 2024

I am not sure I understand; if you add the tokens to the tokenizer, it should already be reflected!

Ahh, I get it now; just updating the JSON file would suffice. I stumbled on your post while looking for answers on how to extend the vocab of an NVIDIA NeMo model. I know this is not directly related to this thread, but I would appreciate it if you shared anything you know about this with me.

@ArthurZucker
Collaborator

😅 I am not familiar at all with the nemo library, so no idea here!

@savanth14

Hi @ArthurZucker, my target is to successfully merge:

  1. A sentencepiece BPE tokenizer trained from scratch on a custom domain and use it to extend the vocab of a pre-trained tokenizer of the same family - COMPLETED
  2. A tiktoken based bytelevel BPE tokenizer trained from scratch and use it to extend the vocab of Llama 3 tokenizer - STUCK HERE

A little intro to my problem. My aim is to build or adapt any open-source LLM to one or more Indian languages without losing its existing knowledge. Later, this can be fine-tuned for various downstream tasks in Indic languages. To my knowledge, there are two options.

  1. Build a tokenizer and a LLM from scratch using all the english and Indic language corpus - Like we all know, this is super expensive and not a feasible option for me.
  2. Train a new tokenizer either using sentencepiece or huggingface repo on my custom language corpus. Then, extend the vocab of any existing tokenizer associated with a corresponding pre-trained LLM. Finally, resize the embedding layer of the LLM and continue pre-training on my custom language corpus. - This is a feasible option for me.

So, far I figured out how to train a sentencepiece BPE tokenizer from scratch and then merge this with a pre-trained tokenizer. Here's the code for it:

# Imports assumed by this snippet (not shown in the original post):
import os
import re
from huggingface_hub import hf_hub_download
from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model

# Load the pre-trained tokenizer to be extended
original_tokenizer_path = hf_hub_download(repo_id="mistralai/mistral-7b-v0.1", filename="tokenizer.model", local_dir="original_tokenizer")
original_tokenizer_spm = sp_pb2_model.ModelProto()
original_tokenizer_spm.ParseFromString(open(original_tokenizer_path, "rb").read())

# Load the newly trained tokenizer
new_tokenizer_spm = sp_pb2_model.ModelProto()
new_tokenizer_spm.ParseFromString(open("/content/mistral_tel_tokenizer.model", "rb").read())


# Check if the new tokenizer contains english tokens
def contains_eng(text):
    eng_pattern = re.compile(r"[\u0020-\u007E]+")
    return True if eng_pattern.search(text) else False


original_tokenizer_tokenset = set(p.piece for p in original_tokenizer_spm.pieces)
print(f"Number of tokens before merge: {len(original_tokenizer_tokenset)}")
for p in new_tokenizer_spm.pieces:
    piece = p.piece
    if piece not in original_tokenizer_tokenset and not contains_eng(piece):
        new_p = sp_pb2_model.ModelProto().SentencePiece()
        new_p.piece = piece
        new_p.score = 0
        original_tokenizer_spm.pieces.append(new_p)
print(f"Number of tokens after merge: {len(original_tokenizer_spm.pieces)}")

# Save the extended tokenizer to a checkpoint
extended_tokenizer_save_path="/content/english-telugu-tokenizer"
os.makedirs(extended_tokenizer_save_path, exist_ok=True)
with open(os.path.join(extended_tokenizer_save_path, "tokenizer.model"), "wb") as f:
    f.write(original_tokenizer_spm.SerializeToString())

The new merged tokenizer is working efficiently on both languages (english and telugu). I adapted this code from here: https://github.com/google/sentencepiece/blob/master/python/add_new_vocab.ipynb

Next, I trained a tiktoken's bytelevel tokenizer using the code from this repo: https://github.com/gautierdag/tokenizer-bench
PROBLEM:
However, when I merged this new tokenizer with a pre-trained one, the Telugu encoding performance degraded: the merged tokenizer splits Telugu text into too many fine-grained tokens. Encoding and decoding for English, and decoding for Telugu, work fine. Here's the code I used for merging:

import json

def merge_tokenizers(file1, file2, output_file):
    # Load the tokenizers
    with open(file1, 'r') as f:
        tokenizer1 = json.load(f)
    with open(file2, 'r') as f:
        tokenizer2 = json.load(f)

    # Get the maximum rank in tokenizer1's vocab
    max_rank = max(tokenizer1['model']['vocab'].values())

    # Combine the vocabs and merges
    combined_vocab = tokenizer1['model']['vocab'].copy()
    for token, rank in tokenizer2['model']['vocab'].items():
        if token not in combined_vocab:
            combined_vocab[token] = len(combined_vocab) + 1


    combined_merges = tokenizer1['model']['merges'].copy()
    for merge in tokenizer2['model']['merges']:
        if merge not in combined_merges:
            combined_merges.append(merge)

    # combined_merges = tokenizer1['model']['merges'].copy()
    # combined_merges.extend(merge for merge in tokenizer2['model']['merges'] if merge not in combined_merges)

    # Update the vocab and merges in tokenizer1
    tokenizer1['model']['vocab'] = combined_vocab
    tokenizer1['model']['merges'] = combined_merges

    # Save the updated tokenizer
    with open(output_file, 'w') as f:
        json.dump(tokenizer1, f)

# Usage
merge_tokenizers("/content/gpt_32k.json", "/content/telugu_tokenizer_tiktoken.json", 'tokenizer_18.json')

Can you help me figure out the correct way to merge so that the Telugu encoding performance is brought back to the level of the individual tokenizer trained only on Telugu?
I know there's something wrong with the tokens and ranks. In the case of sentencepiece it is token and score: you have to initialize all the newly appended tokens with a score of 0. Here, I am not able to figure out what to do.

@ArthurZucker
Collaborator

Seems related to huggingface/tokenizers#627

@xin-ran-w

Hello, I have a new case: I only want to train the embeddings of the new tokens and keep the original embeddings unchanged. Is there a way to do this?

@ArthurZucker
Collaborator

Of course; you simply need to freeze the embeddings of the old tokens. What people usually do is this:

class IdeficsDecoupledEmbedding(nn.Embedding):

Either you create a new embedding layer, or you add partial embeddings that are trainable!
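
As an alternative to the decoupled-embedding class, here is a minimal sketch of the gradient-hook approach (my own illustration, not the Idefics implementation): keep the single resized embedding matrix and zero the gradient of the old rows, so only the new rows get updated.

def train_only_new_token_embeddings(model, num_new_tokens):
    """Zero the gradient of the original embedding rows so only the new rows are trained."""
    embeddings = model.get_input_embeddings()
    num_old_tokens = embeddings.weight.shape[0] - num_new_tokens

    def zero_grad_for_old_rows(grad):
        grad = grad.clone()
        grad[:num_old_tokens] = 0.0
        return grad

    embeddings.weight.register_hook(zero_grad_for_old_rows)

# Hypothetical usage, after tokenizer.add_tokens(...) and model.resize_token_embeddings(len(tokenizer)):
# train_only_new_token_embeddings(model, num_new_tokens=5)
# Note: optimizer weight decay can still nudge the old rows; set weight_decay=0 for this parameter if that matters.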

@xin-ran-w

Thanks for your reply! It really helps. I'll try it.

@owos

owos commented Oct 10, 2024

[quoting ArthurZucker's reply above]

So I did this

input_embeddings = model.get_input_embeddings()
new_embed = IdeficsDecoupledEmbedding(num_embeddings=len(tokenizer), num_additional_embeddings=5000, embedding_dim=input_embeddings.weight.shape[1], partially_freeze=True)
model.set_input_embeddings(new_embed)

But the gradients of the new model do not update during training. Do you know what could be wrong?

@ArthurZucker
Collaborator

No idea, this should do it!

@owos

owos commented Oct 17, 2024

I think it might be an issue with how the Trainer sets up the optimizer.

@KarlosMuradyan

Hi! I'm trying to add new tokens to the BERT tokenizer but facing some unexpected behavior.

Setup:
Initially, inspect, ec, ##ec, ##t tokens are present in the vocabulary and I'm adding insp to it.

Expectation:
When tokenizing inspect, I expect inspect to be the output token as BERT uses WordPiece tokenization that finds the longest subword that is in the vocabulary, then splits on it from left to right.

What is observed:
inspect is split into three tokens: insp, ec, ##t. This not only fails to match the expectation, but it also uses the ec token instead of the ##ec token, falsely indicating that insp and ect are separate words.

Any idea on this issue?

>> from transformers import BertTokenizer

>> model_checkpoint = 'bert-base-uncased'
>> tokenizer = BertTokenizer.from_pretrained(model_checkpoint)

>> ('inspect' in tokenizer.vocab, 'insp' in tokenizer.vocab, 'ec' in tokenizer.vocab, '##ec' in tokenizer.vocab, '##t' in tokenizer.vocab)
# (True, False, True, True, True)
>> tokenizer.convert_ids_to_tokens(tokenizer.encode('inspect'))
# ['[CLS]', 'inspect', '[SEP]']

>> tokenizer.add_tokens(['insp'])
# 1
>> ('inspect' in tokenizer.vocab, 'insp' in tokenizer.vocab, 'ec' in tokenizer.vocab, '##ec' in tokenizer.vocab, '##t' in tokenizer.vocab)
# (True, True, True, True, True)
>> tokenizer.convert_ids_to_tokens(tokenizer.encode('inspect'))
# ['[CLS]', 'insp', 'ec', '##t', '[SEP]']
