
Train tokenizer for Deberta #10723

Closed
avacaondata opened this issue Mar 15, 2021 · 5 comments

@avacaondata

Hi, I would like to know how I can train a DeBERTa tokenizer. From the paper I saw that it uses a BPE tokenizer, but the BPE tokenizer from huggingface/tokenizers doesn't work for this. Could you recommend another implementation or library, or a correct configuration of the huggingface/tokenizers implementation, so that I can train a DeBERTa model from scratch?
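For context, DeBERTa uses GPT-2-style byte-level BPE, which huggingface/tokenizers does support for training. A minimal sketch, assuming a toy in-memory corpus (in practice you would stream your own text files) and illustrative hyperparameters:

```python
# Hedged sketch: train a byte-level BPE tokenizer (the scheme DeBERTa uses,
# per the paper) with the huggingface/tokenizers library. The corpus, vocab
# size, and special tokens below are illustrative assumptions, not DeBERTa's
# official training configuration.
from tokenizers import ByteLevelBPETokenizer

corpus = [
    "Hello world!",
    "Training a byte-level BPE tokenizer from scratch.",
]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    corpus,
    vocab_size=1000,
    min_frequency=1,
    special_tokens=["[CLS]", "[SEP]", "[UNK]", "[PAD]", "[MASK]"],
)

enc = tokenizer.encode("Hello world!")
print(enc.tokens)
```

The open question in this thread is not the BPE training itself but how to turn the trained tokenizer's output into the files the `DebertaTokenizer` class expects.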

@NielsRogge
Contributor

NielsRogge commented Mar 15, 2021

Hugging Face has a separate library called tokenizers especially for this.

@cronoik
Contributor

cronoik commented Mar 15, 2021

Currently, training a DeBERTa tokenizer is not directly supported by Hugging Face. Of course, you can create the required files yourself from the BPE tokenizer's training output, but you could also simply wait until #10703 is merged into the master branch and released. :-)
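A minimal sketch of "creating the required files yourself": `ByteLevelBPETokenizer.save_model()` writes the `vocab.json` and `merges.txt` files that GPT-2-style slow tokenizers (and DeBERTa's byte-level BPE) consume. The training corpus and vocab size here are placeholder assumptions:

```python
# Hedged sketch: after training, save_model() emits vocab.json and merges.txt.
# These are the byte-level BPE artifacts a GPT-2-style tokenizer class loads;
# the corpus and hyperparameters below are illustrative only.
import os
import tempfile

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    ["some training text", "more training text"],
    vocab_size=500,
    min_frequency=1,
)

out_dir = tempfile.mkdtemp()
files = tokenizer.save_model(out_dir)  # list of the written file paths
print(sorted(os.path.basename(f) for f in files))
```

Whether these two files are sufficient for the `DebertaTokenizer` class as patched in #10703 is exactly what the rest of this thread discusses.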

@avacaondata
Author

What would the process of creating the required files from the BPE tokenizer's training output look like? @cronoik I'd really appreciate a bit of explanation, as I tried to do this and failed.

@cronoik
Contributor

cronoik commented Mar 18, 2021

You can save me a lot of time by simply using the patch mentioned above. Just copy the DebertaTokenizer class into your runtime.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
