how can i finetune BertTokenizer? #2691
Comments
You can add new words to the tokenizer with add_tokens:
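For illustration, a minimal sketch of that call (the checkpoint and the added words are placeholders):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Returns the number of tokens that were actually added (words already in the
# vocabulary are skipped).
num_added = tokenizer.add_tokens(["newword1", "newword2"])  # placeholder words
print(num_added, len(tokenizer))  # tokens added, new vocabulary size
```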
Note that this simply adds a new token to the vocabulary but doesn't train its embedding (obviously). This implies that your results will be quite poor if your training data contains a lot of newly added (untrained) tokens.
@cronoik Once the dictionary is resized, don't I have to train the tokenizer model again? @BramVanroy Umm... so what could be a probable solution if I have a custom dataset? How can I retrain this BertTokenizer model to get a new vocab.txt file?
What do you mean by tokenizer model? The tokenizer, in simple terms, is a class which splits your text into tokens from a huge dictionary. What you have to train is the embedding layer of your model, because the weights of the new tokens will be random. This happens during the training of your model (but the embeddings could be undertrained for the new tokens). In case you have plenty of new words (e.g. technical terms) or even a different language, it might make sense to start from scratch (definitely for the latter). Here is a blog post from Hugging Face which shows you how to train a tokenizer + model for Esperanto: link. It really depends on your data (e.g. number of new tokens, importance of new tokens, relation between the tokens...).
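As a rough sketch of the "from scratch" route, here is how a WordPiece tokenizer could be trained with the `tokenizers` library (the corpus path, vocabulary size, and other settings are placeholders, and the exact API may differ slightly between versions):

```python
from tokenizers import BertWordPieceTokenizer

# "corpus.txt" is a placeholder for your raw text, one sentence/document per line.
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=30_000,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# Writes a new vocab.txt to the given directory; BertTokenizer.from_pretrained
# can then be pointed at that file or directory.
tokenizer.save_model(".")
```

A model trained from scratch would then use this vocabulary; the Esperanto blog post linked above walks through the full tokenizer + model pipeline.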
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi @cronoik, I tried replacing … and it worked absolutely fine. But … throws an error. Could you tell me the best way to add vocabulary to DistilBertTokenizer?
@rakesh4real What is …?
Thank you @cronoik. I did not know the tokenizer needed to be changed. Are there any references where I can learn which tokenizer must be used for a given model/task? I also had to use different special tokens. Kindly let me know where to find which special tokens must be used (when and why). Using …
@rakesh4real This site [1] gives you a general overview of different tokenization approaches, and the site for each model tells you which tokenization algorithm was used (e.g. [2] for BERT). [1] https://huggingface.co/transformers/tokenizer_summary.html
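To make the DistilBERT question above concrete, a hedged sketch of how extra words and extra special tokens are usually registered (all token strings below are placeholders):

```python
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

# Ordinary domain words: added like any other vocabulary entry.
tokenizer.add_tokens(["someword"])

# Marker-style tokens: added as special tokens so they are kept intact and never split.
tokenizer.add_special_tokens({"additional_special_tokens": ["<ent>", "</ent>"]})

# The model's embedding matrix must be resized to match the enlarged vocabulary.
model.resize_token_embeddings(len(tokenizer))
```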
As far as I understand, there are two options mentioned: either train from scratch, or randomly initialize the embeddings of the newly added tokens and hope for good performance. Isn't it possible to finetune the model to train the embeddings of these newly added tokens? Why does it have to be either random embeddings or training from scratch? Am I missing something? Thanks in advance.
resize_token_embeddings does not reset the embedding layer, it just extends it. The new tokens are randomly initialized, and you need to train them.
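A minimal sketch of this extend-then-train flow (the checkpoint and the added words are placeholders):

```python
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

num_added = tokenizer.add_tokens(["covid", "lockdown"])  # placeholder domain words

# Extends the embedding matrix by num_added rows; existing rows are untouched,
# the new rows are randomly initialized and only become useful once the model
# is trained/finetuned.
model.resize_token_embeddings(len(tokenizer))
```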
Thank you for the nice and clear explanation! @cronoik
Hello, by 'during finetuning' do you mean that the new tokens will be randomly initialised first and then their embeddings will be updated during model training? For my case, I have a list of emojis (all the emojis there are, so 3,633 of them) and the vocab_size of my tokenizer is 32005. Does this count as 'a few new tokens' or not? Should I consider training my model from scratch? Thanks in advance!
I still don't know how one can finetune a tokenizer. By finetuning I don't mean just adding words to the dictionary, but also updating the embeddings. I am dealing with a text classification task where the text uses informal language (Arabic), e.g. …
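One commonly suggested way to actually update the embeddings of newly added tokens, rather than leaving them random, is to run some masked-language-model training on unlabeled in-domain text before the classification finetuning. A rough sketch, with a placeholder checkpoint, placeholder tokens, toy data, and toy hyperparameters:

```python
import torch
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

checkpoint = "bert-base-multilingual-cased"  # placeholder; use a suitable (Arabic) checkpoint
tokenizer = BertTokenizerFast.from_pretrained(checkpoint)
tokenizer.add_tokens(["placeholder_slang_1", "placeholder_slang_2"])  # hypothetical informal terms

model = BertForMaskedLM.from_pretrained(checkpoint)
model.resize_token_embeddings(len(tokenizer))

texts = ["unlabeled in-domain sentence 1", "unlabeled in-domain sentence 2"]  # your corpus

class LineDataset(torch.utils.data.Dataset):
    """Tokenizes raw lines once and serves them as individual examples."""
    def __init__(self, lines):
        self.enc = tokenizer(lines, truncation=True, max_length=128)
    def __len__(self):
        return len(self.enc["input_ids"])
    def __getitem__(self, i):
        return {k: v[i] for k, v in self.enc.items()}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm-out", num_train_epochs=1, per_device_train_batch_size=8),
    train_dataset=LineDataset(texts),
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15),
)
trainer.train()

# The resulting backbone (with now-trained embeddings for the new tokens) can be
# saved and reloaded for the downstream text-classification finetuning.
model.save_pretrained("mlm-out")
tokenizer.save_pretrained("mlm-out")
```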
Tokenizers are nothing but separators. They split sentences into subparts. The most common splitting method is using whitespace. When we say "training a tokenizer", it actually means creating a vocabulary from given text data. Each token is assigned an id so that you can feed the tokens as numbers to a BERT model. When you tokenize a sentence with a so-called "pretrained" tokenizer, it splits the sentence with its splitting algorithm and assigns each token an id from its vocabulary. Sometimes it encounters unknown words. In this case, it splits the word further into meaningful subparts that are in the vocabulary. They generally look like "Hou ##se". The purpose of this "training" operation is to prevent the tokenizer from splitting important or domain-specific tokens, so that the meaning is kept. Back to your question: when you have some specific words that need to be in the vocabulary, you can directly add them, and they will be assigned ids continuing from the last id in the vocabulary, I guess (I would be happy if somebody verified this). But the main problem is that your model does not know what to do with these new ids; the embeddings for these tokens are created randomly. Here you have the three options @cronoik suggested to train the embedding layer for these new tokens.
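For illustration, a small sketch of this behaviour, assuming bert-base-uncased (the example word is arbitrary and the exact split depends on the vocabulary):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# An out-of-vocabulary word is split into WordPiece subparts ("##"-prefixed pieces).
print(tokenizer.tokenize("hydroxychloroquine"))

# After adding it explicitly, it is kept as a single token with a new id appended
# at the end of the vocabulary, but its embedding still has to be learned by the model.
tokenizer.add_tokens(["hydroxychloroquine"])
print(tokenizer.tokenize("hydroxychloroquine"))
print(tokenizer.convert_tokens_to_ids("hydroxychloroquine"))
```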
Thanks a lot for your explanation. I suppose that if I go for the first approach, where I fine-tune my embedding layer, it would be a good idea to fine-tune the entire embedding layer, not just the newly added entries that correspond to my new tokens? Or perhaps I should only allow gradients for those newly added entries?
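If you did want to restrict updates to only the newly added rows, one possible trick (not something the thread prescribes; checkpoint and token names are placeholders) is to mask the gradient of the embedding matrix:

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

old_size = len(tokenizer)
tokenizer.add_tokens(["newtoken1", "newtoken2"])  # placeholder additions
model.resize_token_embeddings(len(tokenizer))

embedding = model.get_input_embeddings()  # torch.nn.Embedding

# Zero the gradient of the original rows during the backward pass, so only the
# embeddings of the newly added tokens receive updates.
mask = torch.zeros(len(tokenizer), 1)
mask[old_size:] = 1.0
embedding.weight.register_hook(lambda grad: grad * mask.to(grad.device))
```

Fine-tuning the whole embedding layer along with the rest of the model is often the simpler choice; the mask is only useful if you specifically want to keep the pretrained embeddings frozen.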
Is it possible to fine-tune BertTokenizer so that the vocab.txt file it uses gets updated on my custom dataset? Or do I need to retrain the BERT model from scratch for that?