how can i finetune BertTokenizer? #2691

Closed
raj5287 opened this issue Jan 31, 2020 · 16 comments


raj5287 commented Jan 31, 2020

Is it possible to fine-tune BertTokenizer so that the vocab.txt file it uses gets updated on my custom dataset? Or do I need to retrain the BERT model from scratch for that?

cronoik (Contributor) commented Feb 1, 2020

You can add new words to the tokenizer with add_tokens:
tokenizer.add_tokens(['newWord', 'newWord2'])
After that you need to resize the token embedding matrix of the model with:
model.resize_token_embeddings(len(tokenizer))
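
Putting it together, a minimal runnable sketch (the checkpoint name here is just an example):

from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# add_tokens returns how many of the words were actually new to the vocabulary
num_added = tokenizer.add_tokens(['newWord', 'newWord2'])
print(num_added, "tokens were added")

# grow the embedding matrix so the new token ids get (randomly initialized) rows
model.resize_token_embeddings(len(tokenizer))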

BramVanroy (Collaborator) commented

> You can add new words to the tokenizer with add_tokens:
> tokenizer.add_tokens(['newWord', 'newWord2'])
> After that you need to resize the token embedding matrix of the model with:
> model.resize_token_embeddings(len(tokenizer))

Note that this simply adds a new token to the vocabulary but doesn't train its embedding (obviously). This implies that your results will be quite poor if your training data contains a lot of newly added (untrained) tokens.

raj5287 (Author) commented Feb 4, 2020

@cronoik Once the dictionary is resized, don't I have to train the tokenizer model again?

@BramVanroy Umm.. so what would be a possible solution if I have a custom dataset? How can I retrain this BertTokenizer model to get a new vocab.txt file?

cronoik (Contributor) commented Feb 18, 2020

What do you mean by "tokenizer model"? The tokenizer, in simple terms, is a class which splits your text into tokens from a huge dictionary. What you have to train is the embedding layer of your model, because the weights of the new tokens will be random. This will happen during the training of your model (but it could be undertrained for the new tokens).

In case you have plenty of new words (e.g. technical terms) or even a different language, it might make sense to start from scratch (definitely for the latter). Here is a blog post from Hugging Face which shows you how to train a tokenizer + model for Esperanto: link. It really depends on your data (e.g. number of new tokens, importance of new tokens, relation between the tokens...).
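
For illustration, the tokenizer part of that blog post boils down to roughly the following (a sketch; the corpus file and output directory are placeholders):

from tokenizers import ByteLevelBPETokenizer

# train a byte-level BPE tokenizer on your own corpus
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["my_corpus.txt"], vocab_size=52_000, min_frequency=2,
                special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])

# writes vocab.json and merges.txt, which RobertaTokenizerFast can load later
tokenizer.save_model("my_new_tokenizer")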


stale bot commented Apr 18, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Apr 18, 2020
stale bot closed this as completed Apr 25, 2020

INF800 commented Feb 1, 2021

Hi @cronoik,

I tried replacing RobertaTokenizerFast with DistilBertTokenizerFast

from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

from transformers import RobertaTokenizerFast
tokenizer = RobertaTokenizerFast.from_pretrained("/content/EsperBERTo", max_len=512)

worked absolutely fine. But,

from transformers import DistilBertConfig

config = DistilBertConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    #num_attention_heads=12,
    #num_hidden_layers=6,
    #type_vocab_size=1,
)

from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained("/content/EsperBERTo", max_len=512)

throws error:

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-17-7f80e1d47bf5> in <module>()
      1 from transformers import DistilBertTokenizerFast
      2 
----> 3 tokenizer = DistilBertTokenizerFast.from_pretrained("/content/EsperBERTo", max_len=512)

/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
   1772                 f"- or '{pretrained_model_name_or_path}' is the correct path to a directory containing relevant tokenizer files\n\n"
   1773             )
-> 1774             raise EnvironmentError(msg)
   1775 
   1776         for file_id, file_path in vocab_files.items():

OSError: Can't load tokenizer for '/content/EsperBERTo'. Make sure that:

- '/content/EsperBERTo' is a correct model identifier listed on 'https://huggingface.co/models'

- or '/content/EsperBERTo' is the correct path to a directory containing relevant tokenizer files

What is the best way to add vocabulary to DistilBertTokenizer?

cronoik (Contributor) commented Feb 1, 2021

@rakesh4real What is /content/EsperBERTo? Which files are in this directory? Please keep in mind that RoBERTa uses a BPE tokenizer, while BERT uses a WordPiece tokenizer. You can't simply use different kinds of tokenization with the same configuration files.
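
For example, a RoBERTa-style tokenizer directory contains vocab.json and merges.txt (BPE), while BERT/DistilBERT tokenizers expect a WordPiece vocab.txt. A rough sketch of training a WordPiece vocabulary that DistilBertTokenizerFast should be able to load (corpus and directory names are placeholders):

from tokenizers import BertWordPieceTokenizer
from transformers import DistilBertTokenizerFast

# train a WordPiece vocabulary with BERT-style special tokens
wp = BertWordPieceTokenizer(lowercase=True)
wp.train(files=["esperanto_corpus.txt"], vocab_size=52_000,
         special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"])
wp.save_model("EsperBERTo-wordpiece")  # writes vocab.txt

tokenizer = DistilBertTokenizerFast.from_pretrained("EsperBERTo-wordpiece")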


INF800 commented Feb 2, 2021

Thank you @cronoik. I did not know the tokenizer needed to be changed. Are there any references where I can learn which tokenizers must be used for a given model / task?

And I had to use different special tokens as well. Kindly let me know where to find which special tokens must be used (and when and why).

Using BertWordPieceTokenizer the code runs just fine. Added code here

cronoik (Contributor) commented Feb 16, 2021

@rakesh4real This site [1] gives you a general overview of different tokenization approaches, and the documentation page for each model tells you which tokenization algorithm was used (e.g. [2] for BERT).

[1] https://huggingface.co/transformers/tokenizer_summary.html
[2] https://huggingface.co/transformers/model_doc/bert.html#berttokenizer


tolgayan commented Aug 24, 2021

> What do you mean by "tokenizer model"? The tokenizer, in simple terms, is a class which splits your text into tokens from a huge dictionary. What you have to train is the embedding layer of your model, because the weights of the new tokens will be random. This will happen during the training of your model (but it could be undertrained for the new tokens).
>
> In case you have plenty of new words (e.g. technical terms) or even a different language, it might make sense to start from scratch (definitely for the latter). Here is a blog post from Hugging Face which shows you how to train a tokenizer + model for Esperanto: link. It really depends on your data (e.g. number of new tokens, importance of new tokens, relation between the tokens...).

As far as I understand, there are two options mentioned. The first one is training from scratch using tokenizer.train(files, trainer). But this method requires training the BERT model from scratch too, as mentioned in #747. The second option is extending the vocabulary as @cronoik said, but this leads to the problem @BramVanroy mentioned.

So the options are either to train from scratch, or to randomly initialize the embeddings of the newly added tokens and hope for good performance. Isn't it possible to finetune the model to train the embeddings of these newly added tokens? Why does it have to be either using random embeddings or training from scratch? Am I missing something?

Thanks in advance.

cronoik (Contributor) commented Aug 28, 2021

> So the options are either to train from scratch, or to randomly initialize the embeddings of the newly added tokens and hope for good performance. Isn't it possible to finetune the model to train the embeddings of these newly added tokens? Why does it have to be either using random embeddings or training from scratch? Am I missing something?

resize_token_embeddings does not reset the embedding layer, it just extends it. The new tokens are randomly initialized and you need to train them:

  • In case you have only a few new tokens, you can do it during finetuning.
  • In case you have a lot of new tokens, you should probably train your model with the pretraining objective that was used to train the model the first time. You might want to add an additional word embedding layer for the new tokens and freeze all other layers to save some time (see the sketch after this list).
  • In case you have a very large number of new tokens (like a new language that is not related to the original language of your model), you should probably train a model from scratch.
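
A simplified sketch of that idea (here just freezing everything except the resized input embedding matrix, rather than wiring up a separate embedding module; the checkpoint and token names are placeholders):

from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

tokenizer.add_tokens(["domainTermA", "domainTermB"])  # hypothetical new tokens
model.resize_token_embeddings(len(tokenizer))

# freeze all parameters, then unfreeze only the input word embeddings
for param in model.parameters():
    param.requires_grad = False
model.get_input_embeddings().weight.requires_grad = True

# ... then continue training with the usual masked-language-modeling objective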

@tolgayan

tolgayan commented

Thank you for the nice and clear explanation! @cronoik

ma-batita commented

Hello,
I know it has been a long time since the last comment in this issue, but I couldn't hold back and I have to ask @cronoik.
Could you please explain more what you mean by...

> In case you have only a few new tokens, you can do it during finetuning.

By "during finetuning", do you mean the new tokens will be randomly initialised first and then the embeddings will be updated during the model training?

For my case I have a list of emojis (all the emojis that exist, so 3,633 of them) and the vocab_size of my tokenizer is 32005. Does this count as "a few new tokens" or not? Should I consider training my model from scratch?
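
Concretely, what I am planning looks roughly like this (a sketch; the checkpoint name is just a placeholder for my model):

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my-base-checkpoint")  # placeholder
model = AutoModel.from_pretrained("my-base-checkpoint")          # placeholder

emojis = ["😀", "😂", "🤖"]  # ... extended to the full list of 3,633 emojis

# add_tokens returns how many of the emojis were actually new to the vocabulary
num_added = tokenizer.add_tokens(emojis)
print(num_added, "emojis were added")

model.resize_token_embeddings(len(tokenizer))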

thanks in advance!

naarkhoo commented

I still don't know how one can finetune a tokenizer. By finetuning I don't mean just adding words to the dictionary, but also updating the embeddings.

I am dealing with text classification. Since the text uses informal language (Arabic), e.g. salam vs. saloom or sssaam, a lot of vowels are spelled out differently. Do I have to train a new language model from scratch, or can I use an existing model and finetune it?


tolgayan commented Jun 17, 2022

Tokenizers are nothing but separators. They split sentences into subparts. The most common splitting method is using whitespace. When we say "training a tokenizer", it actually creates a vocabulary from the given text data. It assigns an id to each token so that you can feed these tokens as numbers to a BERT model. When you tokenize a sentence with a so-called "pretrained" tokenizer, it splits the sentence with its splitting algorithm and assigns ids to each token from its vocabulary. Sometimes it encounters unknown words. In this case, it splits that word further into meaningful subparts that are in the vocabulary. These generally look like "Hou ##se". The purpose of this "training" operation is to prevent the tokenizer from splitting important or domain-specific tokens, so that the meaning is kept.

Back to your question. When you have some specific words that need to be in the vocabulary, you can directly add them to the vocabulary, and they will be assigned ids, continuing from the last id in the vocabulary I guess (I would be happy if somebody verified this). But the main problem is that your model does not know what to do with these new ids. In this case, the embeddings for these tokens will be created randomly. Here you have three options, as @cronoik suggested, to train the embedding layer for these new tokens:

  • You can leave them as they are, and while finetuning, the model will figure out what to do with these new tokens by updating the embedding layer.
  • You can add a new embedding layer and freeze all the previous layers. Then finetune the model with the same task as the base model so that the new layer covers your new embeddings.
  • You can start from scratch, adding your tokens to the training corpus, initializing the tokenizer from the ground up, and pretraining a language model from scratch.

don-tpanic commented

> Tokenizers are nothing but separators. They split sentences into subparts. The most common splitting method is using whitespace. When we say "training a tokenizer", it actually creates a vocabulary from the given text data. It assigns an id to each token so that you can feed these tokens as numbers to a BERT model. When you tokenize a sentence with a so-called "pretrained" tokenizer, it splits the sentence with its splitting algorithm and assigns ids to each token from its vocabulary. Sometimes it encounters unknown words. In this case, it splits that word further into meaningful subparts that are in the vocabulary. These generally look like "Hou ##se". The purpose of this "training" operation is to prevent the tokenizer from splitting important or domain-specific tokens, so that the meaning is kept.
>
> Back to your question. When you have some specific words that need to be in the vocabulary, you can directly add them to the vocabulary, and they will be assigned ids, continuing from the last id in the vocabulary I guess (I would be happy if somebody verified this). But the main problem is that your model does not know what to do with these new ids. In this case, the embeddings for these tokens will be created randomly. Here you have three options, as @cronoik suggested, to train the embedding layer for these new tokens:
>
>   • You can leave them as they are, and while finetuning, the model will figure out what to do with these new tokens by updating the embedding layer.
>   • You can add a new embedding layer and freeze all the previous layers. Then finetune the model with the same task as the base model so that the new layer covers your new embeddings.
>   • You can start from scratch, adding your tokens to the training corpus, initializing the tokenizer from the ground up, and pretraining a language model from scratch.

Thanks a lot for your explanation. I suppose, if I go for the first approach where I fine-tune my embedding layer, it would be a good idea to fine-tune the entire embedding layer, not just the newly added entries that correspond to my new tokens? Or perhaps I should only allow gradients for those newly added entries?
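
Concretely, the second option I have in mind would look something like this (a rough sketch; the checkpoint and new tokens are just placeholders). It keeps the whole embedding matrix trainable but zeroes out the gradients of the pre-existing rows with a hook:

import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

old_vocab_size = len(tokenizer)
tokenizer.add_tokens(["newWord", "newWord2"])  # hypothetical new tokens
model.resize_token_embeddings(len(tokenizer))

embedding_weight = model.get_input_embeddings().weight

def keep_only_new_rows(grad):
    # zero the gradient for all rows that existed before add_tokens
    mask = torch.zeros_like(grad)
    mask[old_vocab_size:] = 1.0
    return grad * mask

embedding_weight.register_hook(keep_only_new_rows)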
