
Train tokenizer for Deberta #10723

Closed
avacaondata opened this issue Mar 15, 2021 · 5 comments

@avacaondata

Hi, I would like to know how I can train a DeBERTa tokenizer. From the paper I saw that it uses a BPE tokenizer, but the BPE tokenizer from huggingface/tokenizers doesn't work for this. Could you recommend another implementation or library, or a correct configuration of the huggingface/tokenizers implementation, so that I can train a DeBERTa model from scratch?
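For context, DeBERTa uses GPT-2-style byte-level BPE, which huggingface/tokenizers does support for training. A minimal sketch, assuming a toy in-memory corpus (in practice you would stream your own text files) and illustrative hyperparameters:

```python
# Hedged sketch: train a byte-level BPE tokenizer (the scheme DeBERTa uses,
# per the paper) with the huggingface/tokenizers library. The corpus, vocab
# size, and special tokens below are illustrative assumptions, not DeBERTa's
# official training configuration.
from tokenizers import ByteLevelBPETokenizer

corpus = [
    "Hello world!",
    "Training a byte-level BPE tokenizer from scratch.",
]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    corpus,
    vocab_size=1000,
    min_frequency=1,
    special_tokens=["[CLS]", "[SEP]", "[UNK]", "[PAD]", "[MASK]"],
)

enc = tokenizer.encode("Hello world!")
print(enc.tokens)
```

The open question in this thread is not the BPE training itself but how to turn the trained tokenizer's output into the files the `DebertaTokenizer` class expects.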

@NielsRogge
Contributor

NielsRogge commented Mar 15, 2021

Hugging Face has a separate library called tokenizers especially for this.

@cronoik
Contributor

cronoik commented Mar 15, 2021

Currently, training a DeBERTa tokenizer is not directly supported by Hugging Face. Of course, you can create the required files yourself from the BPE tokenizer's training output, but you could also simply wait until #10703 is merged into the master branch and released. :-)
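A minimal sketch of "creating the required files yourself": `ByteLevelBPETokenizer.save_model()` writes the `vocab.json` and `merges.txt` files that GPT-2-style slow tokenizers (and DeBERTa's byte-level BPE) consume. The training corpus and vocab size here are placeholder assumptions:

```python
# Hedged sketch: after training, save_model() emits vocab.json and merges.txt.
# These are the byte-level BPE artifacts a GPT-2-style tokenizer class loads;
# the corpus and hyperparameters below are illustrative only.
import os
import tempfile

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    ["some training text", "more training text"],
    vocab_size=500,
    min_frequency=1,
)

out_dir = tempfile.mkdtemp()
files = tokenizer.save_model(out_dir)  # list of the written file paths
print(sorted(os.path.basename(f) for f in files))
```

Whether these two files are sufficient for the `DebertaTokenizer` class as patched in #10703 is exactly what the rest of this thread discusses.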

@avacaondata
Author

What would the process of creating the required files from the BPE tokenizer's training output look like? @cronoik I'd really appreciate a bit of explanation, as I tried to do this and failed.

@cronoik
Contributor

cronoik commented Mar 18, 2021

You can save me a lot of time by simply using the patch mentioned above. Just copy the DebertaTokenizer class into your runtime.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
