How to handle [UNUSED1] tokens in BERT Hugging Face training #2

Open
1 of 5 tasks
XuperX opened this issue Feb 26, 2024 · 0 comments

XuperX commented Feb 26, 2024

  • How do [UNUSED1] tokens affect my model training?
    "Use the existing wordpiece vocab and run pre-training for more steps on the in-domain text, and it should learn the compositionality 'for free'." (from Jacob Devlin; a minimal continued-pretraining sketch appears at the end of this comment)

  • Why are there only 100 such tokens?

  • How do you map [unused1] tokens to their lexical representations?

  • How do [unused] tokens affect partial evaluation metrics?

  • How to import vocabularies to minimize unused tokens?
    - Modify the vocabulary of a pretrained LM on my domain-specific text dataset -> adjust the LM's embedding matrix to work with the new vocabulary size -> fine-tune the pretrained LM on my domain-specific dataset (from kumar)
    - This might be useful: Learning a new wordpiece vocabulary
    - Here is a paper on handling unknown tokens
    - Tokens can also be added one by one. See details:
    ```
    from transformers import BertTokenizer, BertModel

    t = BertTokenizer.from_pretrained('bert-base-uncased')
    print(t.tokenize("This is an example with an emoji 🤗."))  # 🤗 is not in the vocab, so it becomes [UNK]

    # add the missing token to the tokenizer's vocabulary
    t.add_tokens(['🤗'])
    print(t.tokenize("This is an example with an emoji 🤗."))

    # !!! remember to resize the token embeddings so the model matches the new vocab size
    model = BertModel.from_pretrained('bert-base-uncased')
    model.resize_token_embeddings(len(t))
    ```
    - But if you want to add more vocab you can either (minimal sketches of both options follow below):
      (a) Just replace the "[unusedX]" tokens with your vocabulary. Since these were not used, they are effectively randomly initialized.
      (b) Append it to the end of the vocab, and write a script which generates a new checkpoint that is identical to the pre-trained checkpoint, but with a bigger vocab where the new embeddings are randomly initialized (for initialization we used tf.truncated_normal_initializer(stddev=0.02)). This will likely require mucking around with some tf.concat() and tf.assign() calls.
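For option (a), here is a minimal sketch of what replacing the [unusedX] slots could look like on the Hugging Face side. It assumes a locally saved copy of the BERT vocab file; the file names and the domain terms are illustrative placeholders, not something taken from the answers quoted above.

```
# option (a) sketch: overwrite [unusedX] slots in a local vocab.txt with domain terms
from transformers import BertTokenizer

# save the pre-trained vocab locally so it can be edited (writes ./vocab.txt)
BertTokenizer.from_pretrained("bert-base-uncased").save_vocabulary(".")

domain_terms = ["immunohistochemistry", "osteoarthritis"]  # hypothetical in-domain tokens

with open("vocab.txt", encoding="utf-8") as f:
    vocab = f.read().splitlines()

# fill [unusedX] slots with the domain terms, one per slot
terms = iter(domain_terms)
for i, tok in enumerate(vocab):
    if tok.startswith("[unused"):
        try:
            vocab[i] = next(terms)
        except StopIteration:
            break

with open("vocab_domain.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(vocab) + "\n")

# the vocab size is unchanged, so the pre-trained embedding matrix still fits
t = BertTokenizer("vocab_domain.txt", do_lower_case=True)
print(t.tokenize("the immunohistochemistry results"))
```

Because the overall vocabulary size does not change, the pre-trained checkpoint can be loaded as-is and the replaced rows simply keep their (effectively random) pre-trained values.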
    
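For option (b), the TensorFlow checkpoint surgery described above can be side-stepped in Hugging Face: add_tokens plus resize_token_embeddings appends the new rows, and they can be re-initialized with a stddev-0.02 normal to mirror tf.truncated_normal_initializer(stddev=0.02). This also corresponds to the modify-vocab -> adjust-embedding-matrix -> fine-tune flow suggested by kumar above. A minimal sketch with a placeholder token list:

```
# option (b) sketch (PyTorch / Hugging Face): append new tokens and grow the embedding matrix
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

new_tokens = ["immunohistochemistry", "osteoarthritis"]  # hypothetical domain tokens
tokenizer.add_tokens(new_tokens)

old_vocab_size = model.get_input_embeddings().weight.shape[0]
model.resize_token_embeddings(len(tokenizer))

# re-initialize the appended rows with N(0, 0.02), mirroring the quoted initializer
# (recent transformers versions already initialize new rows from the model config,
# so this step is mostly for explicitness)
with torch.no_grad():
    model.get_input_embeddings().weight[old_vocab_size:].normal_(mean=0.0, std=0.02)

# then fine-tune / continue pre-training on the domain-specific text as usual
```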
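Regarding the first task above ("run pre-training for more steps on the in-domain text" with the existing wordpiece vocab), here is a minimal continued-pretraining sketch using the Hugging Face Trainer. The file name in_domain.txt and all hyperparameters are illustrative placeholders.

```
# sketch: continue masked-LM pre-training on in-domain text with the existing vocab
from datasets import load_dataset
from transformers import (BertTokenizerFast, BertForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# one document per line in a plain-text file (placeholder path)
ds = load_dataset("text", data_files={"train": "in_domain.txt"})["train"]
ds = ds.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
            batched=True, remove_columns=["text"])

# standard 15% masking for the MLM objective
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(output_dir="bert-domain-mlm",
                         num_train_epochs=1,
                         per_device_train_batch_size=16)

Trainer(model=model, args=args, train_dataset=ds,
        data_collator=collator).train()
```

This keeps the [unusedX] slots untouched; the point of the quoted advice is that domain terms are learned compositionally from existing wordpieces rather than by adding new tokens.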