how can i finetune BertTokenizer? #2691

Closed
raj5287 opened this issue Jan 31, 2020 · 16 comments


raj5287 commented Jan 31, 2020

Is it possible to fine-tune BertTokenizer so that the vocab.txt file it uses gets updated on my custom dataset? Or do I need to retrain the BERT model from scratch for that?

cronoik (Contributor) commented Feb 1, 2020

You can add new words to the tokenizer with add_tokens:
tokenizer.add_tokens(['newWord', 'newWord2'])
After that you need to resize the token embedding matrix of the model with:
model.resize_token_embeddings(len(tokenizer))
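
Putting it together, a minimal runnable sketch (the checkpoint name here is just an example):

from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# add_tokens returns how many of the words were actually new to the vocabulary
num_added = tokenizer.add_tokens(['newWord', 'newWord2'])
print(num_added, "tokens were added")

# grow the embedding matrix so the new token ids get (randomly initialized) rows
model.resize_token_embeddings(len(tokenizer))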

BramVanroy (Collaborator) commented

> You can add new words to the tokenizer with add_tokens:
> tokenizer.add_tokens(['newWord', 'newWord2'])
> After that you need to resize the token embedding matrix of the model with:
> model.resize_token_embeddings(len(tokenizer))

Note that this simply adds a new token to the vocabulary but doesn't train its embedding (obviously). This implies that your results will be quite poor if your training data contains a lot of newly added (untrained) tokens.

raj5287 (Author) commented Feb 4, 2020

@cronoik Once the dictionary is resized, don't I have to train the tokenizer model again?

@BramVanroy Umm.. so what would be a possible solution if I have a custom dataset? How can I retrain this BertTokenizer model to get a new vocab.txt file?

cronoik (Contributor) commented Feb 18, 2020

What do you mean by "tokenizer model"? The tokenizer, in simple terms, is a class which splits your text into tokens from a huge dictionary. What you have to train is the embedding layer of your model, because the weights of the new tokens will be random. This will happen during the training of your model (but it could be undertrained for the new tokens).

In case you have plenty of new words (e.g. technical terms) or even a different language, it might make sense to start from scratch (definitely for the latter). Here is a blog post from Hugging Face which shows you how to train a tokenizer + model for Esperanto: link. It really depends on your data (e.g. number of new tokens, importance of new tokens, relation between the tokens...).
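
For illustration, the tokenizer part of that blog post boils down to roughly the following (a sketch; the corpus file and output directory are placeholders):

from tokenizers import ByteLevelBPETokenizer

# train a byte-level BPE tokenizer on your own corpus
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["my_corpus.txt"], vocab_size=52_000, min_frequency=2,
                special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])

# writes vocab.json and merges.txt, which RobertaTokenizerFast can load later
tokenizer.save_model("my_new_tokenizer")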


stale bot commented Apr 18, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Apr 18, 2020
stale bot closed this as completed Apr 25, 2020

INF800 commented Feb 1, 2021

Hi @cronoik,

I tried replacing RobertaTokenizerFast with DistilBertTokenizerFast

from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

from transformers import RobertaTokenizerFast
tokenizer = RobertaTokenizerFast.from_pretrained("/content/EsperBERTo", max_len=512)

worked absolutely fine. But,

from transformers import DistilBertConfig

config = DistilBertConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    #num_attention_heads=12,
    #num_hidden_layers=6,
    #type_vocab_size=1,
)

from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained("/content/EsperBERTo", max_len=512)

throws error:

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-17-7f80e1d47bf5> in <module>()
      1 from transformers import DistilBertTokenizerFast
      2 
----> 3 tokenizer = DistilBertTokenizerFast.from_pretrained("/content/EsperBERTo", max_len=512)

/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
   1772                 f"- or '{pretrained_model_name_or_path}' is the correct path to a directory containing relevant tokenizer files\n\n"
   1773             )
-> 1774             raise EnvironmentError(msg)
   1775 
   1776         for file_id, file_path in vocab_files.items():

OSError: Can't load tokenizer for '/content/EsperBERTo'. Make sure that:

- '/content/EsperBERTo' is a correct model identifier listed on 'https://huggingface.co/models'

- or '/content/EsperBERTo' is the correct path to a directory containing relevant tokenizer files

What is the best way to add vocabulary to DistilBertTokenizer?

cronoik (Contributor) commented Feb 1, 2021

@rakesh4real What is /content/EsperBERTo? Which files are in this directory? Please keep in mind that RoBERTa uses a BPE tokenizer, while BERT uses a WordPiece tokenizer. You can't simply use different kinds of tokenization with the same configuration files.
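
For example, a RoBERTa-style tokenizer directory contains vocab.json and merges.txt (BPE), while BERT/DistilBERT tokenizers expect a WordPiece vocab.txt. A rough sketch of training a WordPiece vocabulary that DistilBertTokenizerFast should be able to load (corpus and directory names are placeholders):

from tokenizers import BertWordPieceTokenizer
from transformers import DistilBertTokenizerFast

# train a WordPiece vocabulary with BERT-style special tokens
wp = BertWordPieceTokenizer(lowercase=True)
wp.train(files=["esperanto_corpus.txt"], vocab_size=52_000,
         special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"])
wp.save_model("EsperBERTo-wordpiece")  # writes vocab.txt

tokenizer = DistilBertTokenizerFast.from_pretrained("EsperBERTo-wordpiece")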


INF800 commented Feb 2, 2021

Thank you @cronoik. I did not know the tokenizer needed to be changed. Are there any references where I can learn which tokenizers must be used for a given model / task?

And I had to use different special tokens as well. Kindly let me know where to find which special tokens must be used (and when and why).

Using BertWordPieceTokenizer the code runs just fine. Added code here

cronoik (Contributor) commented Feb 16, 2021

@rakesh4real This site [1] gives you a general overview of different tokenization approaches, and the documentation page for each model tells you which tokenization algorithm was used (e.g. [2] for BERT).

[1] https://huggingface.co/transformers/tokenizer_summary.html
[2] https://huggingface.co/transformers/model_doc/bert.html#berttokenizer


tolgayan commented Aug 24, 2021

> What do you mean by "tokenizer model"? The tokenizer, in simple terms, is a class which splits your text into tokens from a huge dictionary. What you have to train is the embedding layer of your model, because the weights of the new tokens will be random. This will happen during the training of your model (but it could be undertrained for the new tokens).
>
> In case you have plenty of new words (e.g. technical terms) or even a different language, it might make sense to start from scratch (definitely for the latter). Here is a blog post from Hugging Face which shows you how to train a tokenizer + model for Esperanto: link. It really depends on your data (e.g. number of new tokens, importance of new tokens, relation between the tokens...).

As far as I understand, there are two options mentioned. The first one is training from scratch using tokenizer.train(files, trainer). But this method requires training the BERT model from scratch too, as mentioned in #747. The second option is extending the vocabulary as @cronoik said, but this leads to the problem @BramVanroy mentioned.

So the options are either to train from scratch, or to randomly initialize the embeddings of the newly added tokens and hope for good performance. Isn't it possible to finetune the model to train the embeddings of these newly added tokens? Why does it have to be either using random embeddings or training from scratch? Am I missing something?

Thanks in advance.

cronoik (Contributor) commented Aug 28, 2021

> So the options are either to train from scratch, or to randomly initialize the embeddings of the newly added tokens and hope for good performance. Isn't it possible to finetune the model to train the embeddings of these newly added tokens? Why does it have to be either using random embeddings or training from scratch? Am I missing something?

resize_token_embeddings does not reset the embedding layer, it just extends it. The new tokens are randomly initialized and you need to train them:

  • In case you have only a few new tokens, you can do it during finetuning.
  • In case you have a lot of new tokens, you should probably train your model with the pretraining objective that was used to train the model the first time. You might want to add an additional word embedding layer for the new tokens and freeze all other layers to save some time (see the sketch after this list).
  • In case you have a very large number of new tokens (like a new language that is not related to the original language of your model), you should probably train a model from scratch.
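
A simplified sketch of that idea (here just freezing everything except the resized input embedding matrix, rather than wiring up a separate embedding module; the checkpoint and token names are placeholders):

from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

tokenizer.add_tokens(["domainTermA", "domainTermB"])  # hypothetical new tokens
model.resize_token_embeddings(len(tokenizer))

# freeze all parameters, then unfreeze only the input word embeddings
for param in model.parameters():
    param.requires_grad = False
model.get_input_embeddings().weight.requires_grad = True

# ... then continue training with the usual masked-language-modeling objective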

@tolgayan

tolgayan commented

Thank you for the nice and clear explanation! @cronoik

ma-batita commented

Hello,
I know it has been a long time since the last comment in this issue, but I couldn't hold back and I have to ask @cronoik.
Could you please explain more what you mean by...

> In case you have only a few new tokens, you can do it during finetuning.

By "during finetuning", do you mean the new tokens will be randomly initialised first and then the embeddings will be updated during the model training?

For my case I have a list of emojis (all the emojis that exist, so 3,633 of them) and the vocab_size of my tokenizer is 32005. Does this count as "a few new tokens" or not? Should I consider training my model from scratch?
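
Concretely, what I am planning looks roughly like this (a sketch; the checkpoint name is just a placeholder for my model):

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my-base-checkpoint")  # placeholder
model = AutoModel.from_pretrained("my-base-checkpoint")          # placeholder

emojis = ["😀", "😂", "🤖"]  # ... extended to the full list of 3,633 emojis

# add_tokens returns how many of the emojis were actually new to the vocabulary
num_added = tokenizer.add_tokens(emojis)
print(num_added, "emojis were added")

model.resize_token_embeddings(len(tokenizer))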

thanks in advance!

naarkhoo commented

I still don't know how one can finetune a tokenizer. By finetuning I don't mean just adding words to the dictionary, but also updating the embeddings.

I am dealing with text classification. Since the text uses informal language (Arabic), e.g. salam vs. saloom or sssaam, a lot of vowels are spelled out differently. Do I have to train a new language model from scratch, or can I use an existing model and finetune it?


tolgayan commented Jun 17, 2022

Tokenizers are nothing but separators. They split sentences into subparts. The most common splitting method is using whitespace. When we say "training a tokenizer", it actually creates a vocabulary from the given text data. It assigns an id to each token so that you can feed these tokens as numbers to a BERT model. When you tokenize a sentence with a so-called "pretrained" tokenizer, it splits the sentence with its splitting algorithm and assigns ids to each token from its vocabulary. Sometimes it encounters unknown words. In this case, it splits that word further into meaningful subparts that are in the vocabulary. These generally look like "Hou ##se". The purpose of this "training" operation is to prevent the tokenizer from splitting important or domain-specific tokens, so that the meaning is kept.

Back to your question. When you have some specific words that need to be in the vocabulary, you can directly add them to the vocabulary, and they will be assigned ids, continuing from the last id in the vocabulary I guess (I would be happy if somebody verified this). But the main problem is that your model does not know what to do with these new ids. In this case, the embeddings for these tokens will be created randomly. Here you have three options, as @cronoik suggested, to train the embedding layer for these new tokens:

  • You can leave them as they are, and while finetuning, the model will figure out what to do with these new tokens by updating the embedding layer.
  • You can add a new embedding layer and freeze all the previous layers. Then finetune the model with the same task as the base model so that the new layer covers your new embeddings.
  • You can start from scratch, adding your tokens to the training corpus, initializing the tokenizer from the ground up, and pretraining a language model from scratch.

don-tpanic commented

> Tokenizers are nothing but separators. They split sentences into subparts. The most common splitting method is using whitespace. When we say "training a tokenizer", it actually creates a vocabulary from the given text data. It assigns an id to each token so that you can feed these tokens as numbers to a BERT model. When you tokenize a sentence with a so-called "pretrained" tokenizer, it splits the sentence with its splitting algorithm and assigns ids to each token from its vocabulary. Sometimes it encounters unknown words. In this case, it splits that word further into meaningful subparts that are in the vocabulary. These generally look like "Hou ##se". The purpose of this "training" operation is to prevent the tokenizer from splitting important or domain-specific tokens, so that the meaning is kept.
>
> Back to your question. When you have some specific words that need to be in the vocabulary, you can directly add them to the vocabulary, and they will be assigned ids, continuing from the last id in the vocabulary I guess (I would be happy if somebody verified this). But the main problem is that your model does not know what to do with these new ids. In this case, the embeddings for these tokens will be created randomly. Here you have three options, as @cronoik suggested, to train the embedding layer for these new tokens:
>
>   • You can leave them as they are, and while finetuning, the model will figure out what to do with these new tokens by updating the embedding layer.
>   • You can add a new embedding layer and freeze all the previous layers. Then finetune the model with the same task as the base model so that the new layer covers your new embeddings.
>   • You can start from scratch, adding your tokens to the training corpus, initializing the tokenizer from the ground up, and pretraining a language model from scratch.

Thanks a lot for your explanation. I suppose, if I go for the first approach where I fine-tune my embedding layer, it would be a good idea to fine-tune the entire embedding layer, not just the newly added entries that correspond to my new tokens? Or perhaps I should only allow gradients for those newly added entries?
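
Concretely, the second option I have in mind would look something like this (a rough sketch; the checkpoint and new tokens are just placeholders). It keeps the whole embedding matrix trainable but zeroes out the gradients of the pre-existing rows with a hook:

import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

old_vocab_size = len(tokenizer)
tokenizer.add_tokens(["newWord", "newWord2"])  # hypothetical new tokens
model.resize_token_embeddings(len(tokenizer))

embedding_weight = model.get_input_embeddings().weight

def keep_only_new_rows(grad):
    # zero the gradient for all rows that existed before add_tokens
    mask = torch.zeros_like(grad)
    mask[old_vocab_size:] = 1.0
    return grad * mask

embedding_weight.register_hook(keep_only_new_rows)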
