How can I change the vocab size of a pretrained model? #237
Comments
Hi, if you want to modify the vocabulary, you should refer to this part of the original repo.
If you don't want a completely new vocabulary (which would require training from scratch) but only want to extend the pretrained one with a couple of domain-specific tokens, this comment from Jacob Devlin might help:
I am currently experimenting with approach (a). Since there are 993 unused tokens, this might already be enough for the most important tokens in your domain.
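Approach (a) amounts to overwriting the `[unused...]` placeholder entries in BERT's `vocab.txt` with your own tokens, which keeps the token ids and the embedding matrix unchanged. A minimal sketch of that idea (the function name and file handling are mine, not from the thread):

```python
def replace_unused_tokens(vocab_path, new_tokens, out_path):
    """Overwrite [unused...] placeholder lines in a BERT vocab file
    with domain-specific tokens, preserving all token ids."""
    with open(vocab_path, encoding="utf-8") as f:
        vocab = [line.rstrip("\n") for line in f]

    replacements = iter(new_tokens)
    for i, token in enumerate(vocab):
        if token.startswith("[unused"):
            try:
                vocab[i] = next(replacements)
            except StopIteration:
                break  # fewer new tokens than unused slots: leave the rest

    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(vocab) + "\n")
```

The embedding rows for these ids were never trained on real text, so the new tokens still need fine-tuning on in-domain data to pick up useful representations.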
@tholor's and @rodgzilla's answers are the way to go.
@tholor I have exactly the same situation as you had. I'm wondering if you could tell me how your experiment with approach (a) went. Did it improve the accuracy? I would really appreciate it if you could share your conclusions.
Hi @thomwolf, for implementing models like VideoBERT we need to append thousands of entries to the word-embedding lookup table. How can we do that in PyTorch? Are there any examples of this using the library?
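Appending rows to the lookup table can be done in plain PyTorch by allocating a larger `nn.Embedding` and copying the pretrained weights into its first rows; the new rows start randomly initialized and are learned during fine-tuning. A hedged sketch (the helper name is mine, not part of any library):

```python
import torch
import torch.nn as nn

def expand_embedding(old_emb: nn.Embedding, num_new_tokens: int) -> nn.Embedding:
    """Return a larger embedding whose first rows copy the pretrained weights.

    The appended rows are randomly initialized and should be fine-tuned.
    """
    old_vocab, dim = old_emb.weight.shape
    new_emb = nn.Embedding(old_vocab + num_new_tokens, dim)
    with torch.no_grad():
        new_emb.weight[:old_vocab] = old_emb.weight
    return new_emb

# Usage: grow a BERT-base-sized table (30522 x 768) by 1000 domain tokens.
pretrained = nn.Embedding(30522, 768)
expanded = expand_embedding(pretrained, 1000)
print(expanded.weight.shape)  # torch.Size([31522, 768])
```

After swapping the embedding into the model, remember that any weight-tied output layer (e.g. the masked-LM head) must be resized to match as well.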
@tholor Can you explain how you are counting 993 unused tokens? I only see unused tokens in the first 100 vocabulary positions.
For those finding this on the web, I found the following answer helpful: #1413 (comment)
Is there a way to change (expand) the vocab size of a pretrained model?
When I feed a new token id to the model, it raises:
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
1108 with torch.no_grad():
1109 torch.embedding_renorm_(weight, input, max_norm, norm_type)
-> 1110 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
1111
1112
RuntimeError: index out of range at /pytorch/aten/src/TH/generic/THTensorMath.cpp:352
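This error means a token id greater than or equal to `vocab_size` reached the embedding lookup, which only has `vocab_size` rows. In current versions of the Hugging Face transformers library, `resize_token_embeddings` grows the table in place while keeping the pretrained rows. A sketch using a randomly initialized BERT so no checkpoint download is needed (with a real model you would use `BertModel.from_pretrained(...)` instead):

```python
from transformers import BertConfig, BertModel

# Randomly initialized BERT, just to demonstrate the API.
config = BertConfig()  # default vocab_size=30522, hidden_size=768
model = BertModel(config)

# Grow the embedding table before feeding ids >= 30522;
# existing rows are preserved, the new ones are freshly initialized.
model.resize_token_embeddings(30522 + 1000)
print(model.get_input_embeddings().weight.shape)  # torch.Size([31522, 768])
```

Typically this is paired with `tokenizer.add_tokens([...])` followed by `model.resize_token_embeddings(len(tokenizer))`, so the tokenizer and the embedding matrix stay in sync.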