How do [unused1] tokens affect my model training?
"Use the existing wordpiece vocab and run pre-training for more steps on the in-domain text, and it should learn the compositionality 'for free'." — from Jacob Devlin
- why are there only 100 [unused] tokens?
- how to map [unused1] to its lexical representation? (a quick id-lookup sketch follows the snippet below)
- how do [unused] tokens affect partial evaluation metrics?
- how to import vocabularies to minimize unused tokens?
- modify the vocabulary of a pretrained LM on my domain-specific text dataset -> adjust the LM's embedding matrix to work with the new vocabulary size -> fine-tune the pretrained LM on my domain-specific dataset (from kumar)
- this might be useful: Learning a new wordpiece vocabulary
- here is a paper handling unknown tokens
- The tokens can be added one by one. See details:
```
from transformers import BertModel, BertTokenizer

t = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

print(t.tokenize("This is an example with an emoji 🤗."))  # the emoji is not in the vocab yet
t.add_tokens(['🤗'])
print(t.tokenize("This is an example with an emoji 🤗."))  # now kept as a single token

# !!! remember to resize the token embeddings so the model has a row for the new token
model.resize_token_embeddings(len(t))
```
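As an aside on the [unused1] question above: the [unusedX] entries are ordinary rows in vocab.txt, so a minimal sketch of mapping between the slot and its id (assuming the stock bert-base-uncased vocabulary) would be:

```
from transformers import BertTokenizer

t = BertTokenizer.from_pretrained('bert-base-uncased')

# Look up the vocab id reserved for the slot, then invert the mapping.
unused_id = t.convert_tokens_to_ids('[unused1]')
print(unused_id)
print(t.convert_ids_to_tokens(unused_id))  # -> '[unused1]'
```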
- But if you want to add more vocab you can either:
(a) Just replace the "[unusedX]" tokens with your vocabulary. Since these were not used they are effectively randomly initialized. (See the first sketch below.)
(b) Append it to the end of the vocab, and write a script which generates a new checkpoint that is identical to the pre-trained checkpoint, but with a bigger vocab where the new embeddings are randomly initialized (for initialization we used tf.truncated_normal_initializer(stddev=0.02)). This will likely require mucking around with some tf.concat() and tf.assign() calls. (See the second sketch below.)
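For option (a), here is a rough sketch of what replacing the [unusedX] slots could look like when working from the Hugging Face vocab file. The local directory and the domain terms are made-up placeholders, not anything from this thread:

```
from transformers import BertTokenizer

# Hypothetical placeholders: a local copy of the pretrained vocab file
# and a few made-up domain terms.
vocab_path = "bert-base-uncased-local/vocab.txt"
new_terms = ["term_a", "term_b", "term_c"]

with open(vocab_path, encoding="utf-8") as f:
    vocab = f.read().splitlines()

# Overwrite [unusedX] lines in place. The matching embedding rows already
# exist in the checkpoint (randomly initialized), so no resizing is needed.
replacements = iter(new_terms)
for i, tok in enumerate(vocab):
    if tok.startswith("[unused"):
        try:
            vocab[i] = next(replacements)
        except StopIteration:
            break

with open(vocab_path, "w", encoding="utf-8") as f:
    f.write("\n".join(vocab) + "\n")

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased-local")
```

For option (b), a rough TF1-style sketch of the checkpoint surgery described above. The checkpoint paths, the number of new tokens, and the variable name bert/embeddings/word_embeddings are assumptions based on the public BERT release layout, so treat this as an outline rather than a tested script:

```
import tensorflow as tf  # TF 1.x, to match the original BERT codebase

num_new_tokens = 50                                    # assumed
old_ckpt = "uncased_L-12_H-768_A-12/bert_model.ckpt"   # assumed path
new_ckpt = "uncased_extended/bert_model.ckpt"          # assumed output path
emb_name = "bert/embeddings/word_embeddings"           # name used in the BERT release

reader = tf.train.load_checkpoint(old_ckpt)

with tf.Graph().as_default():
    # Rebuild every variable from the old checkpoint, widening only the
    # word-embedding table with randomly initialized rows.
    for name in reader.get_variable_to_shape_map():
        value = reader.get_tensor(name)
        if name == emb_name:
            extra = tf.truncated_normal(
                [num_new_tokens, value.shape[1]], stddev=0.02)
            init = tf.concat([tf.constant(value), extra], axis=0)
        else:
            init = tf.constant(value)
        tf.Variable(initial_value=init, name=name)

    saver = tf.train.Saver()
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        saver.save(sess, new_ckpt)
```

If this route is taken, the same number of new lines has to be appended to vocab.txt so the tokenizer and the widened embedding table stay in sync.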