Adding New Vocabulary Tokens to the Models #1413

Closed
vyraun opened this issue Oct 3, 2019 · 67 comments

@vyraun

vyraun commented Oct 3, 2019

❓ Questions & Help

Hi,

How can I extend the vocabulary of the pre-trained models, e.g. by adding new tokens to the lookup table?

Are there any examples demonstrating this?

@LysandreJik
Member

Hi, I believe this method does exactly what you're looking for: add_tokens. There's an example right below it.

@vyraun
Author

vyraun commented Oct 3, 2019

Thanks @LysandreJik! Yes, that's exactly what I was looking for. A follow-up question: how could I initialize the embeddings of these "new tokens" to something I have already pre-computed? I assume the embeddings for these new tokens will currently be randomly initialized.

@LysandreJik
Member

You are right, these tokens will be randomly initialized. If I wanted to assign new values to these embeddings (as an initialization), I would directly change the embedding weights. Here's an example with the BertModel.

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertModel.from_pretrained("bert-base-cased")

print(len(tokenizer))  # 28996
tokenizer.add_tokens(["NEW_TOKEN"])
print(len(tokenizer))  # 28997

model.resize_token_embeddings(len(tokenizer)) 
# The new vector is added at the end of the embedding matrix

print(model.embeddings.word_embeddings.weight[-1, :])
# Randomly initialized vector

model.embeddings.word_embeddings.weight[-1, :] = torch.zeros([model.config.hidden_size])

print(model.embeddings.word_embeddings.weight[-1, :])
# outputs a vector of zeros of shape [768]

@vyraun
Author

vyraun commented Oct 3, 2019

Thanks @LysandreJik! That should solve it quite neatly. I will reopen the issue if I run into any problems.

@celsofranssa

Hello @LysandreJik ,

What is the difference between the following approaches?

  1. training a tokenizer from scratch, as described in the Hugging Face blog; or
  2. using the add_tokens method?

Thank you in advance.

@LysandreJik
Member

Training a tokenizer from scratch would imply training a model from scratch as well - depending on the corpus used for the tokenizer, the tokens may be entirely different from another model's tokens trained on a similar corpus (except if you train the tokenizer using the exact same method and the exact same data).

Adding tokens adds tokens at the end of the tokenizer's vocabulary, essentially extending the vocabulary. The model's embedding matrix would need to be resized as well to take into account the new tokens, but all the other tokens would keep their representation as-is. Seeing as the new rows in the embedding matrix are randomly initialized, you would still need to fine-tune the model to a dataset containing such tokens.
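
For readers with the same follow-up question (how to actually fine-tune after extending the vocabulary), here is a minimal sketch of the masked-LM route described above. It is not from this thread; the toy corpus, output directory, and hyperparameters are placeholders.

from datasets import Dataset
from transformers import (
    BertTokenizerFast,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")

tokenizer.add_tokens(["NEW_TOKEN"])
model.resize_token_embeddings(len(tokenizer))  # the new rows are randomly initialized

# Toy in-domain corpus; replace with text that actually contains the new tokens.
texts = ["A sentence mentioning NEW_TOKEN in context.", "Another NEW_TOKEN example."]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=64),
    batched=True,
    remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm-out", num_train_epochs=1, per_device_train_batch_size=2),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()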

@PieterDujardin

@LysandreJik
I have a Dutch medical dataset (for Named Entity Recognition) that contains a lot of domain-specific words, so the Dutch BERT tokenizer outputs a lot of [UNK] tokens when it tokenizes.
Given that I have a labelled corpus of 60k tokens and also a relatively small unannotated corpus of 185k tokens, would it be best to:

  • just add the most frequent out-of-vocabulary words to the vocab of the tokenizer, or
  • start from a BERT checkpoint and do further pretraining on the unlabelled dataset (which is currently 185k tokens, which I assume is pretty small)? There might be a possibility for me to obtain a much larger unannotated dataset of potentially millions of (unlabelled) tokens, but I was wondering whether even millions of tokens is enough to do meaningful further pretraining.

Thanks!

@vinayannam

[quoting LysandreJik's explanation above on training a tokenizer from scratch vs. add_tokens]

Hey, I would like to fine-tune the model, as you suggested at the end, on a dataset containing such tokens. Can you help me figure out how to do that?

@crispin-nosidam

crispin-nosidam commented Jul 30, 2020

If I add unknown tokens to the tokenizer and train the model on, say, sentence-pair similarity, I suppose the new tokens' embeddings will not have the correct relationship with the other tokens at first. Will the model output still be able to capture similarity correctly given sufficient training?

@JensMadsen

@LysandreJik Thank you for your suggestion. However, I run into trouble because altering the embedding turns the embedding tensor into a non-leaf tensor, which therefore cannot be optimized, i.e.

model.embeddings.word_embeddings.weight.is_leaf # False

I cannot figure out how to fix this (I am a torch beginner; sorry). Do you have any suggestions?

@vjagannath786

Facing the same issue; is_leaf returns False.

@HenryPaik1

BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True).get_vocab() does not return the added tokens. How can I check whether a new token has been properly added to the vocab dictionary?
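
(For reference, a minimal way to check this, assuming a recent transformers version: tokens added with add_tokens are tracked separately from the base vocab and can be listed with get_added_vocab().)

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)
tokenizer.add_tokens(["NEW_TOKEN"])

print(len(tokenizer))                                # base vocab size plus the added tokens
print(tokenizer.get_added_vocab())                   # e.g. {'NEW_TOKEN': 30522}
print(tokenizer.convert_tokens_to_ids("NEW_TOKEN"))  # id assigned to the added token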

@ReySadeghi

[quoting LysandreJik's BertModel example above]

Hi,
I tried this, but my code still gets stuck at the sentence-tokenization step and doesn't get past it.
It may be lagging, or there may be some problem...
What should I do?

@zellford

zellford commented May 9, 2021

[quoting ReySadeghi's comment above]

Have you solved the problem? If so, can you share it with us?

@ReySadeghi

[quoting zellford's question above]

Yes, it was because it takes a very long time to add all the tokens. I installed transformers from source
(pip install -U git+https://github.com/huggingface/transformers), since a PR that should speed this up dramatically was recently merged, and my problem was solved.

@zellford

zellford commented May 10, 2021 via email

@ptheru

ptheru commented Jul 29, 2021

[quoting LysandreJik's explanation above on training a tokenizer from scratch vs. add_tokens]

Why can't we repurpose the existing 999 [unused] tokens instead of extending the vocab size?
google-research/bert#9 (comment)

@KairaNithin

[quoting LysandreJik's BertModel example above]

@LysandreJik, when I ran your code the following error popped up. Please help:

RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.

@cm107

cm107 commented Aug 19, 2021

RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.

You can fix that error by temporarily disabling gradient calculation. (Because initializing the weights is not an operation that needs to be accounted for in backpropagation.)

with torch.no_grad():
    model.embeddings.word_embeddings.weight[-1, :] = torch.zeros([model.config.hidden_size])

@SuperBruceJia

SuperBruceJia commented Dec 20, 2023

I finally chose the following solution:

DEFAULT_BOS_TOKEN = "<s>"
DEFAULT_EOS_TOKEN = "</s>"
DEFAULT_PAD_TOKEN = "[PAD]"
DEFAULT_UNK_TOKEN = "<unk>"

def tokenizer_embedding_resize(special_tokens_dict, tokenizer, model):
    """Resize tokenizer and embedding.

    Note: This is the unoptimized version that may make your embedding size not be divisible by 64.
    """
    num_new_tokens = tokenizer.add_special_tokens(special_tokens_dict)
    model.resize_token_embeddings(len(tokenizer))

    if num_new_tokens > 0:
        input_embeddings = model.get_input_embeddings().weight.data
        output_embeddings = model.get_output_embeddings().weight.data

        input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
        output_embeddings_avg = output_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)

        input_embeddings[-num_new_tokens:] = input_embeddings_avg
        output_embeddings[-num_new_tokens:] = output_embeddings_avg


def add_special_token(tokenizer):
    """
    Add special tokens to the tokenizer
    """
    tokenizer.add_special_tokens(
        {
            "pad_token": DEFAULT_PAD_TOKEN,
            "eos_token": DEFAULT_EOS_TOKEN,
            "bos_token": DEFAULT_BOS_TOKEN,
            "unk_token": DEFAULT_UNK_TOKEN,
        }
    )

    return tokenizer


# Load tokenizer (assumes `from transformers import AutoTokenizer` and that `model` is already loaded)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    cache_dir=save_dir,
    model_max_length=train_max_len,
    add_eos_token=True,
    add_bos_token=True,
    padding='longest',
    padding_side="right",
    truncation=True,
    return_tensors="pt",
    use_fast=False,
    trust_remote_code=True,
    use_auth_token=hf_auth_token,
    device_map=device_map,
)
if tokenizer.pad_token is None:
    tokenizer_embedding_resize(
        special_tokens_dict=dict(pad_token="[PAD]"),
        tokenizer=tokenizer,
        model=model,
    )
tokenizer = add_special_token(tokenizer)

# Resize input token embeddings matrix of the model if new_num_tokens != config.vocab_size.
model.resize_token_embeddings(len(tokenizer))

The reference code for tokenizer_embedding_resize(): https://github.com/meta-math/MetaMath/blob/main/train_math.py#L90-L110

The reference code for add_special_token(): https://github.com/meta-math/MetaMath/blob/main/train_math.py#L259-L279

It works well on my side.

Best regards,

Shuyue
Dec. 20th, 2023

@TeKett

TeKett commented Dec 31, 2023

from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler
import torch 

pipeline = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

pipe = StableDiffusionPipeline
num_new_tokens = pipeline.tokenizer.add_tokens(["new_token_1", "new_token_2"], special_tokens=True)

# simple resize
pipeline.text_encoder.resize_token_embeddings(len(pipeline.tokenizer))

# overwrite the content to have better results
input_embeddings = pipeline.text_encoder.get_input_embeddings().weight.data
input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
input_embeddings[-num_new_tokens:] = input_embeddings_avg
pipeline.text_encoder.set_input_embeddings(input_embeddings)

should work 😉

I'm unable to select the checkpoint, since StableDiffusionPipeline.from_pretrained wants a path to a directory containing a pipeline object. I don't have that; what even is that? I can't stress enough that all I have is an SD 1.5 checkpoint: the kind you load into A1111 to generate images, that can be trained with Kohya, and that is shared on Civitai. I don't have a pipeline object, and the CLIP model I want to add tokens to is packaged inside a .safetensors file.

ValueError: The provided pretrained_model_name_or_path "C:/Train/checkpoint.safetensors" is neither a valid local path nor a valid repo id.

If I give it just the directory, I get OSError: Error no file named model_index.json found in directory.

@ArthurZucker
Collaborator

Alright, you can't load a pipeline without the configuration and the required sub-checkpoints, like here: https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/main (AFAIK). I would recommend asking on the diffusers repo, as this is outside the scope of transformers 🤗

@kumarme072

@ArthurZucker
I have come across many similar issues asking about how to add new tokens to a vocabulary, for reference, here are a couple links to useful comments made for doing roughly that:

#1413 (comment)
#2691 (comment)
huggingface/tokenizers#627 (comment)
However, I am concerned with how to first identify tokens that make sense to add to an existing tokenizer's vocabulary, and also possibly whether or not it makes sense to consider removing tokens from a vocabulary.

Some context into my situation:

My situation, I believe, is quite typical: I have a reasonably large domain-specific text dataset available, all in English, and my end goal is to produce a domain-specific language model that can be used for various downstream NLP tasks (e.g., text classification, sentence similarity, etc.), after additional fine-tuning on those tasks.

But from my current understanding, to first obtain that domain-specific language model, I basically have two options:

  1. train a tokenizer from scratch and then use that tokenizer to train a LM from scratch; or
  2. modify the vocabulary of a pretrained tokenizer, adjust an (also pretrained) LM's embedding matrix to work with this new vocab size, and fine-tune the pretrained LM on my domain-specific text dataset on something like MLM.

I am struggling with the first option because (as far as I know) training a language model from scratch is quite expensive, and although I do have some budget for this, I do not have on the order of thousands of dollars.

I am starting to explore the second option, but I am confused on how to properly modify the vocabulary in a way that makes sense for the model, and concerned about what other side effects this could cause.

To summarize:

  • I'd really like to know if there is a low-cost option for training a LM from scratch to do option 1 above.
  • Or, if option 2 makes more sense, how to properly modify a vocabulary (find good new tokens, remove unused ones, etc.) and adapt the model to overcome potential negative side effects of messing with the embeddings.

Thanks for the help. Sorry for the long question, but I thought some context may be needed since I might be asking the wrong question in the first place. Cheers. (Question asked by someone else.)

@ArthurZucker
Collaborator

Alright. If you need to add new tokens to the vocab but are not sure how, there are a few ways you can do this.

  1. Train a new tokenizer, using https://huggingface.co/learn/nlp-course/chapter6/2#training-a-new-tokenizer. This will make use of train_new_from_iterator (see the sketch after this comment). If you have language-specific data that uses none of the "old" tokens, that might be okay, but otherwise, as you mentioned, you would need to retrain the model.
  2. Train a new small tokenizer on a small corpus, then merge the new vocab with the old vocab (merge the vocabs, and the merges if it is a BPE tokenizer, by just adding the new tokens at the end). More on that here: "How can I keep the initial input vocab and incrementally add the new tokens during re-training a tokenizer?" tokenizers#1109. It might not be optimal, but if certain languages have fewer tokens it should be alright.
  3. Manually add all the new tokens using add_tokens(), which will just be adding characters / words for simplicity. This can grow the vocab enormously if the vocabulary of the language is huge.

I think that's pretty much it 😓
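
For option 1 above, a minimal sketch of train_new_from_iterator (not from this thread; it assumes a fast tokenizer, and the corpus, vocab size, and output directory are placeholders):

from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # loads a fast tokenizer by default

# Placeholder corpus; in practice, stream your own domain-specific text here.
corpus = ["domain specific sentence one", "domain specific sentence two"]

def batch_iterator(batch_size=1000):
    for i in range(0, len(corpus), batch_size):
        yield corpus[i : i + batch_size]

new_tokenizer = old_tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=32000)
new_tokenizer.save_pretrained("new-tokenizer")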

@TeKett

TeKett commented Feb 10, 2024

So I asked over on diffusers and got no answer, then I asked again and got a response. All they did was argue over why I should not use their project for the exact reasons the project exists... Basically, "Why are you trying to do this instead of being a sheep?", and they won't answer why the code is erroring out.

It errors out at the line marked # Error below: cannot assign 'torch.FloatTensor' as child module 'token_embedding'

from diffusers import StableDiffusionPipeline

array = []
with open("D:/tagstest.txt",encoding="utf8") as file:
        array = [row.rstrip("\n") for row in file.readlines()]

pipeline = StableDiffusionPipeline.from_single_file("C:/Train/checkpoint.safetensors")

num_new_tokens = pipeline.tokenizer.add_tokens(array, special_tokens=False)

# simple resize (is this correct?)
pipeline.text_encoder.resize_token_embeddings(len(pipeline.tokenizer))

# overwrite the content to have better results
input_embeddings = pipeline.text_encoder.get_input_embeddings().weight.data
input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
input_embeddings[-num_new_tokens:] = input_embeddings_avg
pipeline.text_encoder.set_input_embeddings(input_embeddings) # Error
pipeline.model.save_pretrained("c:/test")

@ArthurZucker
Collaborator

pipeline.text_encoder.set_input_embeddings(input_embeddings) should be given an nn.Embedding, if I am not mistaken. Thus you first call get_input_embeddings(), change its data, and then call set_input_embeddings().

@TeKett

TeKett commented Feb 13, 2024

The problem was the types: .weight.data is a torch.FloatTensor, while get_input_embeddings() returns an nn.Embedding. I could just omit this step completely, no? Since all it does is "unlearn" the model?

input_embeddings = pipeline.text_encoder.get_input_embeddings()
input_embeddings_avg = input_embeddings.weight.data[:-num_new_tokens].mean(dim=0, keepdim=True)
input_embeddings.weight.data[-num_new_tokens:] = input_embeddings_avg
pipeline.text_encoder.set_input_embeddings(input_embeddings)

This likely falls under transformers, since I think it's just a text-model issue.
When I try to load the model again I'm getting:

RuntimeError: Error(s) in loading state_dict for CLIPTextModel:
	size mismatch for text_model.embeddings.token_embedding.weight: copying a param with shape torch.Size([90323, 768]) from checkpoint, the shape in current model is torch.Size([49408, 768]).

What is "current model"?

@ArthurZucker
Collaborator

That just means the config.vocab_size is wrong and should be updated to 90323. The "current model" is the one initialized with the config.

@TeKett

TeKett commented Feb 19, 2024

That just means the config.vocab_size is wrong and should be updated to 90323. The "current model" is the one initialized with the config.

You mean the config.json file in the text_encoder folder? It already says the new number.

{
  "architectures": [
    "CLIPTextModel"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "dropout": 0.0,
  "eos_token_id": 2,
  "hidden_act": "quick_gelu",
  "hidden_size": 768,
  "initializer_factor": 1.0,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 77,
  "model_type": "clip_text_model",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "projection_dim": 768,
  "torch_dtype": "float32",
  "transformers_version": "4.35.2",
  "vocab_size": 90320
}

@feliperviegas

[quoting ArthurZucker's three options above]

Hi @ArthurZucker,
I tried to extend the tokenizer vocabulary using the add_tokens method, but I got odd behavior; I'm not sure whether I used it correctly. I will try to demonstrate with the following example:

from transformers import BertTokenizer
original_tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

text = "California"
original_tokens = original_tokenizer.tokenize(text)
original_tokens # And here the tokenizer knows the token, returning it with no issues.

Then I tried to add a token to the tokenizer

from transformers import BertTokenizer
original_tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

text = "California"
original_tokenizer.add_tokens(["rn"], special_tokens=False)
original_tokens = original_tokenizer.tokenize(text) # And here it returns ['Cal', '##if', '##o', 'rn', 'i', '##a']

I didn't understand why the tokenizer behaved like this. If I add a token B that is a substring of another token A, does this imply that the tokenizer will not recognize the A, like in the example?

@kumarme072

kumarme072 commented Mar 5, 2024

# tokenizer.save_pretrained("model")

with open("filecontainstokens.txt","r") as file:
    new_tokens = [line.strip() for line in file]

num_added_toks = tokenizer.add_tokens(new_tokens)
print('We have added', num_added_toks, 'tokens')

model.resize_token_embeddings(len(tokenizer))

average_embedding = torch.mean(model.get_input_embeddings().weight, axis=0)
for token_id in range(-num_added_toks, 0, 1):
    model.get_input_embeddings().weight.data[token_id, :] = average_embedding

new_embedding_display = model.get_input_embeddings().weight[-1]
# print(new_embedding_display)

len(tokenizer)

@TeKett

TeKett commented Mar 6, 2024

That just means the config.vocab_size is wrong and should be updated to 90323. The "current model" is the one initialized with the config.

The problem seems to be that the tokens are never actually added to the text encoder: the tokenizer has the new tokens and the config has the new number, but the actual CLIPTextModel (the neural network?) doesn't match. If I load the SD model, make some changes, save it as a Diffusers model, convert it back to an SD model, then load it again and re-save it as a Diffusers model, the tokenizer has the original tokens again, without my changes. Clearly the tokenizer's values are pulled from somewhere, and the only place I can think of is the text encoder.

@ArthurZucker
Collaborator

I didn't understand why the tokenizer behaved like this. If I add a token B that is a substring of another token A, does this imply that the tokenizer will not recognize the A, like in the example?

You can control this using the single_word option of the AddedToken:

from transformers import BertTokenizer, AddedToken
original_tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

text = "California"
original_tokenizer.add_tokens([AddedToken("rn", single_word=True)], special_tokens=False)
original_tokens = original_tokenizer.tokenize(text) 

which does not work here. Feel free to open a new issue, but the BERT tokenizer is old, so I am not surprised that this does not work.

In [16]: from transformers import AutoTokenizer, AddedToken
    ...: original_tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
    ...: 
    ...: text = "California"
    ...: original_tokenizer.add_tokens([AddedToken("rn", single_word=True)], special_tokens=False)
    ...: original_tokens = original_tokenizer.tokenize(text)

In [17]: original_tokens
Out[17]: ['California']

@feliperviegas

[quoting ArthurZucker's reply above]

Thanks, @ArthurZucker; I will open an issue. I tried another one that uses XLMRoberta (intfloat/multilingual-e5-small) and got the same behavior.

@owos

owos commented Mar 28, 2024

[quoting LysandreJik's BertModel example above]

Hi @ArthurZucker, thank you for this insightful message. After adding new tokens to the model, how do I get these new tokens to be reflected in the tokenizer? Do I train a new tokenizer from scratch? If not, what are the specific modifications I need to make to the tokenizer?

@ArthurZucker
Collaborator

I am not sure I understand; if you add the tokens to the tokenizer, it should already be reflected!

@owos

owos commented Mar 28, 2024

I am not sure I understand; if you add the tokens to the tokenizer, it should already be reflected!

Ahh, I get it now; just updating the JSON file would suffice. I stumbled on your post while looking for answers on how to extend the vocab of an NVIDIA NeMo model. I know this is not directly related to this thread, but I would appreciate it if you shared anything you know about this with me.

@ArthurZucker
Collaborator

😅 I am not familiar at all with the nemo library, so no idea here!

@savanth14

Hi @ArthurZucker, my target is to successfully merge:

  1. A sentencepiece BPE tokenizer trained from scratch on a custom domain and use it to extend the vocab of a pre-trained tokenizer of the same family - COMPLETED
  2. A tiktoken based bytelevel BPE tokenizer trained from scratch and use it to extend the vocab of Llama 3 tokenizer - STUCK HERE

A little intro to my problem. My aim is to build or adapt any open-source LLM to one or more Indian languages without losing its existing knowledge. Later, this can be fine-tuned for various downstream tasks in Indic languages. To my knowledge, there are two options.

  1. Build a tokenizer and a LLM from scratch using all the english and Indic language corpus - Like we all know, this is super expensive and not a feasible option for me.
  2. Train a new tokenizer either using sentencepiece or huggingface repo on my custom language corpus. Then, extend the vocab of any existing tokenizer associated with a corresponding pre-trained LLM. Finally, resize the embedding layer of the LLM and continue pre-training on my custom language corpus. - This is a feasible option for me.

So, far I figured out how to train a sentencepiece BPE tokenizer from scratch and then merge this with a pre-trained tokenizer. Here's the code for it:

# Imports assumed by this snippet (not shown in the original post):
import os
import re
from huggingface_hub import hf_hub_download
from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model

# Load the pre-trained tokenizer to be extended
original_tokenizer_path = hf_hub_download(repo_id="mistralai/mistral-7b-v0.1", filename="tokenizer.model", local_dir="original_tokenizer")
original_tokenizer_spm = sp_pb2_model.ModelProto()
original_tokenizer_spm.ParseFromString(open(original_tokenizer_path, "rb").read())

# Load the newly trained tokenizer
new_tokenizer_spm = sp_pb2_model.ModelProto()
new_tokenizer_spm.ParseFromString(open("/content/mistral_tel_tokenizer.model", "rb").read())


# Check if the new tokenizer contains english tokens
def contains_eng(text):
    eng_pattern = re.compile(r"[\u0020-\u007E]+")
    return True if eng_pattern.search(text) else False


original_tokenizer_tokenset = set(p.piece for p in original_tokenizer_spm.pieces)
print(f"Number of tokens before merge: {len(original_tokenizer_tokenset)}")
for p in new_tokenizer_spm.pieces:
    piece = p.piece
    if piece not in original_tokenizer_tokenset and not contains_eng(piece):
        new_p = sp_pb2_model.ModelProto().SentencePiece()
        new_p.piece = piece
        new_p.score = 0
        original_tokenizer_spm.pieces.append(new_p)
print(f"Number of tokens after merge: {len(original_tokenizer_spm.pieces)}")

# Save the extended tokenizer to a checkpoint
extended_tokenizer_save_path="/content/english-telugu-tokenizer"
os.makedirs(extended_tokenizer_save_path, exist_ok=True)
with open(os.path.join(extended_tokenizer_save_path, "tokenizer.model"), "wb") as f:
    f.write(original_tokenizer_spm.SerializeToString())

The new merged tokenizer is working efficiently on both languages (english and telugu). I adapted this code from here: https://github.com/google/sentencepiece/blob/master/python/add_new_vocab.ipynb

Next, I trained a tiktoken's bytelevel tokenizer using the code from this repo: https://github.com/gautierdag/tokenizer-bench
PROBLEM:
However, when I merged this new tokenizer with a pre-trained one, the Telugu encoding performance degraded: the merged tokenizer splits Telugu text into too many fine-grained tokens. Encoding and decoding for English, and decoding for Telugu, work fine. Here's the code I used for merging:

import json

def merge_tokenizers(file1, file2, output_file):
    # Load the tokenizers
    with open(file1, 'r') as f:
        tokenizer1 = json.load(f)
    with open(file2, 'r') as f:
        tokenizer2 = json.load(f)

    # Get the maximum rank in tokenizer1's vocab
    max_rank = max(tokenizer1['model']['vocab'].values())

    # Combine the vocabs and merges
    combined_vocab = tokenizer1['model']['vocab'].copy()
    for token, rank in tokenizer2['model']['vocab'].items():
        if token not in combined_vocab:
            combined_vocab[token] = len(combined_vocab) + 1


    combined_merges = tokenizer1['model']['merges'].copy()
    for merge in tokenizer2['model']['merges']:
        if merge not in combined_merges:
            combined_merges.append(merge)

    # combined_merges = tokenizer1['model']['merges'].copy()
    # combined_merges.extend(merge for merge in tokenizer2['model']['merges'] if merge not in combined_merges)

    # Update the vocab and merges in tokenizer1
    tokenizer1['model']['vocab'] = combined_vocab
    tokenizer1['model']['merges'] = combined_merges

    # Save the updated tokenizer
    with open(output_file, 'w') as f:
        json.dump(tokenizer1, f)

# Usage
merge_tokenizers("/content/gpt_32k.json", "/content/telugu_tokenizer_tiktoken.json", 'tokenizer_18.json')

Can you help me figure out the correct way to merge so that the Telugu encoding performance is brought back to the level of the individual tokenizer trained only on Telugu?
I know there's something wrong with the tokens and ranks. In the case of sentencepiece it is token and score: you have to initialize all the newly appended tokens with a score of 0. Here, I am not able to figure out what to do.

@ArthurZucker
Collaborator

Seems related to huggingface/tokenizers#627

@xin-ran-w

Hello, I have a new case: I only want to train the embeddings of the new tokens and keep the original embeddings unchanged. Is there a way to do this?

@ArthurZucker
Collaborator

Of course; you simply need to freeze the embeddings of the old tokens. What people usually do is this:

class IdeficsDecoupledEmbedding(nn.Embedding):

Either you create a new embedding layer, or you add partial embeddings that are trainable!
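
As an alternative to the decoupled-embedding class, here is a minimal sketch of the gradient-hook approach (my own illustration, not the Idefics implementation): keep the single resized embedding matrix and zero the gradient of the old rows, so only the new rows get updated.

def train_only_new_token_embeddings(model, num_new_tokens):
    """Zero the gradient of the original embedding rows so only the new rows are trained."""
    embeddings = model.get_input_embeddings()
    num_old_tokens = embeddings.weight.shape[0] - num_new_tokens

    def zero_grad_for_old_rows(grad):
        grad = grad.clone()
        grad[:num_old_tokens] = 0.0
        return grad

    embeddings.weight.register_hook(zero_grad_for_old_rows)

# Hypothetical usage, after tokenizer.add_tokens(...) and model.resize_token_embeddings(len(tokenizer)):
# train_only_new_token_embeddings(model, num_new_tokens=5)
# Note: optimizer weight decay can still nudge the old rows; set weight_decay=0 for this parameter if that matters.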

@xin-ran-w

Thanks for your reply! It really helps. I'll try it.

@owos

owos commented Oct 10, 2024

[quoting ArthurZucker's reply above]

So I did this

input_embeddings = model.get_input_embeddings()
new_embed = IdeficsDecoupledEmbedding(num_embeddings=len(tokenizer), num_additional_embeddings=5000, embedding_dim=input_embeddings.weight.shape[1], partially_freeze=True)
model.set_input_embeddings(new_embed)

But the gradients of the new model do not update during training. Do you know what could be wrong?

@ArthurZucker
Collaborator

No idea, this should do it!

@owos

owos commented Oct 17, 2024

I think it might be an issue with how the Trainer sets up the optimizer.

@KarlosMuradyan

Hi! I'm trying to add new tokens to the BERT tokenizer but facing some unexpected behavior.

Setup:
Initially, inspect, ec, ##ec, ##t tokens are present in the vocabulary and I'm adding insp to it.

Expectation:
When tokenizing inspect, I expect inspect to be the output token as BERT uses WordPiece tokenization that finds the longest subword that is in the vocabulary, then splits on it from left to right.

What is observed:
inspect is split into three tokens: insp, ec, ##t. This not only fails to match the expectation, but it also uses the ec token instead of the ##ec token, falsely indicating that insp and ect are separate words.

Any idea on this issue?

>> from transformers import BertTokenizer

>> model_checkpoint = 'bert-base-uncased'
>> tokenizer = BertTokenizer.from_pretrained(model_checkpoint)

>> ('inspect' in tokenizer.vocab, 'insp' in tokenizer.vocab, 'ec' in tokenizer.vocab, '##ec' in tokenizer.vocab, '##t' in tokenizer.vocab)
# (True, False, True, True, True)
>> tokenizer.convert_ids_to_tokens(tokenizer.encode('inspect'))
# ['[CLS]', 'inspect', '[SEP]']

>> tokenizer.add_tokens(['insp'])
# 1
>> ('inspect' in tokenizer.vocab, 'insp' in tokenizer.vocab, 'ec' in tokenizer.vocab, '##ec' in tokenizer.vocab, '##t' in tokenizer.vocab)
# (True, True, True, True, True)
>> tokenizer.convert_ids_to_tokens(tokenizer.encode('inspect'))
# ['[CLS]', 'insp', 'ec', '##t', '[SEP]']
