How to handle [UNUSED1] tokens in BERT Hugging Face training #2

Open
1 of 5 tasks
XuperX opened this issue Feb 26, 2024 · 0 comments

XuperX commented Feb 26, 2024

  • How do [UNUSED1] tokens affect my model training?
    "Use the existing wordpiece vocab and run pre-training for more steps on the in-domain text, and it should learn the compositionality 'for free'." (from Jacob Devlin; a minimal continued-pretraining sketch appears at the end of this comment)

  • Why are there only 100 such tokens?

  • How do you map [unused1] tokens to their lexical representations?

  • How do [unused] tokens affect partial evaluation metrics?

  • How to import vocabularies to minimize unused tokens?
    - Modify the vocabulary of a pretrained LM on my domain-specific text dataset -> adjust the LM's embedding matrix to work with the new vocabulary size -> fine-tune the pretrained LM on my domain-specific dataset (from kumar)
    - This might be useful: Learning a new wordpiece vocabulary
    - Here is a paper on handling unknown tokens
    - Tokens can also be added one by one. See details:
    ```
    from transformers import BertTokenizer, BertModel

    t = BertTokenizer.from_pretrained('bert-base-uncased')
    print(t.tokenize("This is an example with an emoji 🤗."))  # 🤗 is not in the vocab, so it becomes [UNK]

    # add the missing token to the tokenizer's vocabulary
    t.add_tokens(['🤗'])
    print(t.tokenize("This is an example with an emoji 🤗."))

    # !!! remember to resize the token embeddings so the model matches the new vocab size
    model = BertModel.from_pretrained('bert-base-uncased')
    model.resize_token_embeddings(len(t))
    ```
    - But if you want to add more vocab you can either (minimal sketches of both options follow below):
      (a) Just replace the "[unusedX]" tokens with your vocabulary. Since these were not used, they are effectively randomly initialized.
      (b) Append it to the end of the vocab, and write a script which generates a new checkpoint that is identical to the pre-trained checkpoint, but with a bigger vocab where the new embeddings are randomly initialized (for initialization we used tf.truncated_normal_initializer(stddev=0.02)). This will likely require mucking around with some tf.concat() and tf.assign() calls.
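For option (a), here is a minimal sketch of what replacing the [unusedX] slots could look like on the Hugging Face side. It assumes a locally saved copy of the BERT vocab file; the file names and the domain terms are illustrative placeholders, not something taken from the answers quoted above.

```
# option (a) sketch: overwrite [unusedX] slots in a local vocab.txt with domain terms
from transformers import BertTokenizer

# save the pre-trained vocab locally so it can be edited (writes ./vocab.txt)
BertTokenizer.from_pretrained("bert-base-uncased").save_vocabulary(".")

domain_terms = ["immunohistochemistry", "osteoarthritis"]  # hypothetical in-domain tokens

with open("vocab.txt", encoding="utf-8") as f:
    vocab = f.read().splitlines()

# fill [unusedX] slots with the domain terms, one per slot
terms = iter(domain_terms)
for i, tok in enumerate(vocab):
    if tok.startswith("[unused"):
        try:
            vocab[i] = next(terms)
        except StopIteration:
            break

with open("vocab_domain.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(vocab) + "\n")

# the vocab size is unchanged, so the pre-trained embedding matrix still fits
t = BertTokenizer("vocab_domain.txt", do_lower_case=True)
print(t.tokenize("the immunohistochemistry results"))
```

Because the overall vocabulary size does not change, the pre-trained checkpoint can be loaded as-is and the replaced rows simply keep their (effectively random) pre-trained values.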
    
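For option (b), the TensorFlow checkpoint surgery described above can be side-stepped in Hugging Face: add_tokens plus resize_token_embeddings appends the new rows, and they can be re-initialized with a stddev-0.02 normal to mirror tf.truncated_normal_initializer(stddev=0.02). This also corresponds to the modify-vocab -> adjust-embedding-matrix -> fine-tune flow suggested by kumar above. A minimal sketch with a placeholder token list:

```
# option (b) sketch (PyTorch / Hugging Face): append new tokens and grow the embedding matrix
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

new_tokens = ["immunohistochemistry", "osteoarthritis"]  # hypothetical domain tokens
tokenizer.add_tokens(new_tokens)

old_vocab_size = model.get_input_embeddings().weight.shape[0]
model.resize_token_embeddings(len(tokenizer))

# re-initialize the appended rows with N(0, 0.02), mirroring the quoted initializer
# (recent transformers versions already initialize new rows from the model config,
# so this step is mostly for explicitness)
with torch.no_grad():
    model.get_input_embeddings().weight[old_vocab_size:].normal_(mean=0.0, std=0.02)

# then fine-tune / continue pre-training on the domain-specific text as usual
```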
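Regarding the first task above ("run pre-training for more steps on the in-domain text" with the existing wordpiece vocab), here is a minimal continued-pretraining sketch using the Hugging Face Trainer. The file name in_domain.txt and all hyperparameters are illustrative placeholders.

```
# sketch: continue masked-LM pre-training on in-domain text with the existing vocab
from datasets import load_dataset
from transformers import (BertTokenizerFast, BertForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# one document per line in a plain-text file (placeholder path)
ds = load_dataset("text", data_files={"train": "in_domain.txt"})["train"]
ds = ds.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
            batched=True, remove_columns=["text"])

# standard 15% masking for the MLM objective
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(output_dir="bert-domain-mlm",
                         num_train_epochs=1,
                         per_device_train_batch_size=16)

Trainer(model=model, args=args, train_dataset=ds,
        data_collator=collator).train()
```

This keeps the [unusedX] slots untouched; the point of the quoted advice is that domain terms are learned compositionally from existing wordpieces rather than by adding new tokens.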