How can I change the vocab size of a pretrained model? #237
Comments
Hi, if you want to modify the vocabulary, you should refer to this part of the original repo.
If you don't want a completely new vocabulary (which would require training from scratch) but only want to extend the pretrained one with a couple of domain-specific tokens, this comment from Jacob Devlin might help:
I am currently experimenting with approach (a). Since there are 993 unused tokens, this might already be enough for the most important tokens in your domain.
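Approach (a) amounts to overwriting the `[unused...]` placeholder entries in BERT's `vocab.txt` with your own tokens, which keeps the token ids and the embedding matrix unchanged. A minimal sketch of that idea (the function name and file handling are mine, not from the thread):

```python
def replace_unused_tokens(vocab_path, new_tokens, out_path):
    """Overwrite [unused...] placeholder lines in a BERT vocab file
    with domain-specific tokens, preserving all token ids."""
    with open(vocab_path, encoding="utf-8") as f:
        vocab = [line.rstrip("\n") for line in f]

    replacements = iter(new_tokens)
    for i, token in enumerate(vocab):
        if token.startswith("[unused"):
            try:
                vocab[i] = next(replacements)
            except StopIteration:
                break  # fewer new tokens than unused slots: leave the rest

    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(vocab) + "\n")
```

The embedding rows for these ids were never trained on real text, so the new tokens still need fine-tuning on in-domain data to pick up useful representations.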
@tholor's and @rodgzilla's answers are the way to go.
@tholor I have exactly the same situation as you had. I'm wondering if you could tell me how your experiment with approach (a) went. Did it improve the accuracy? I would really appreciate it if you could share your conclusions.
Hi @thomwolf, for implementing models like VideoBERT we need to append thousands of entries to the word-embedding lookup table. How can we do that in PyTorch? Are there any examples of this using the library?
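Appending rows to the lookup table can be done in plain PyTorch by allocating a larger `nn.Embedding` and copying the pretrained weights into its first rows; the new rows start randomly initialized and are learned during fine-tuning. A hedged sketch (the helper name is mine, not part of any library):

```python
import torch
import torch.nn as nn

def expand_embedding(old_emb: nn.Embedding, num_new_tokens: int) -> nn.Embedding:
    """Return a larger embedding whose first rows copy the pretrained weights.

    The appended rows are randomly initialized and should be fine-tuned.
    """
    old_vocab, dim = old_emb.weight.shape
    new_emb = nn.Embedding(old_vocab + num_new_tokens, dim)
    with torch.no_grad():
        new_emb.weight[:old_vocab] = old_emb.weight
    return new_emb

# Usage: grow a BERT-base-sized table (30522 x 768) by 1000 domain tokens.
pretrained = nn.Embedding(30522, 768)
expanded = expand_embedding(pretrained, 1000)
print(expanded.weight.shape)  # torch.Size([31522, 768])
```

After swapping the embedding into the model, remember that any weight-tied output layer (e.g. the masked-LM head) must be resized to match as well.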
@tholor Can you explain how you are counting 993 unused tokens? I only see unused tokens in the first 100 vocabulary positions.
For those finding this on the web, I found the following answer helpful: #1413 (comment)
Is there a way to change (expand) the vocab size of a pretrained model?
When I feed a new token id to the model, it raises:
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
1108 with torch.no_grad():
1109 torch.embedding_renorm_(weight, input, max_norm, norm_type)
-> 1110 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
1111
1112
RuntimeError: index out of range at /pytorch/aten/src/TH/generic/THTensorMath.cpp:352
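This error means a token id greater than or equal to `vocab_size` reached the embedding lookup, which only has `vocab_size` rows. In current versions of the Hugging Face transformers library, `resize_token_embeddings` grows the table in place while keeping the pretrained rows. A sketch using a randomly initialized BERT so no checkpoint download is needed (with a real model you would use `BertModel.from_pretrained(...)` instead):

```python
from transformers import BertConfig, BertModel

# Randomly initialized BERT, just to demonstrate the API.
config = BertConfig()  # default vocab_size=30522, hidden_size=768
model = BertModel(config)

# Grow the embedding table before feeding ids >= 30522;
# existing rows are preserved, the new ones are freshly initialized.
model.resize_token_embeddings(30522 + 1000)
print(model.get_input_embeddings().weight.shape)  # torch.Size([31522, 768])
```

Typically this is paired with `tokenizer.add_tokens([...])` followed by `model.resize_token_embeddings(len(tokenizer))`, so the tokenizer and the embedding matrix stay in sync.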