DeBERTa Fast Tokenizer #10498

Closed
brandenchan opened this issue Mar 3, 2021 · 9 comments · Fixed by #11387
Labels: Good First Issue, Good Second Issue (issues that are more difficult to do than "Good First" issues - give it a try if you want!)


brandenchan (Contributor) commented Mar 3, 2021

Hi, I am interested in using the DeBERTa model that was recently implemented here and incorporating it into FARM so that it can also be used in open-domain QA settings through Haystack.

I'm wondering why only a slow tokenizer has been implemented for DeBERTa, and whether there are plans to create a fast tokenizer too. Thanks in advance!

Hi @stefan-it! Do you have any insight on this?

stefan-it (Collaborator) commented:

Hi @brandenchan ,

I think it should be easier with version 2 of DeBERTa, because they now use a "normal" SentencePiece model:

#10018

So having a fast alternative would be great.

(The new 128k vocab size should really boost performance on QA tasks!)

LysandreJik (Member) commented:

Indeed, this would be a very nice addition and way easier to implement than for the first DeBERTa. I'm adding the Good Second Issue label so that a community member may work on it. @brandenchan or @stefan-it feel free to take it too if you feel like it!

@LysandreJik added the Good First Issue and Good Second Issue labels on Mar 3, 2021
ShubhamSanghvi (Contributor) commented:

Hi, I am looking for my first open-source contribution. May I take this if it's still available?

LysandreJik (Member) commented:

Yes, of course! Thank you!

cronoik (Contributor) commented Mar 14, 2021

@ShubhamSanghvi Maybe wait until #10703 is merged.

ShubhamSanghvi (Contributor) commented:

Hi, as far as I understand, I will have to add tokenizer files for deberta_v2 to implement the fast tokenizer?

May I know how I could get the tokenizer files for deberta_v2 models, and how to upload them to the intended destination, which I believe should be (for deberta-v2-xlarge):

https://huggingface.co/microsoft/deberta-v2-xlarge/resolve/main/

Thanks, Shubham

cronoik (Contributor) commented Mar 31, 2021

@ShubhamSanghvi Do you only want to implement the fast tokenizer for DebertaV2, or also for Deberta?

> May I know how I could get the tokenizer files for deberta_v2 models

I think this is what you have to figure out. I would check the other models that have a slow SentencePiece tokenizer.

> how to upload them to the intended destination, which I believe should be (for deberta-v2-xlarge)

You cannot upload them there. Upload them to some kind of public cloud and request an upload.
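For illustration, here is a toy sketch (not the actual transformers converter) of the building blocks such a fast tokenizer assembles, using the Hugging Face `tokenizers` library directly. The vocab here is hand-made; a real DebertaV2 converter would read the (piece, score) pairs from the SentencePiece `spm.model` file instead:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, processors

# Toy (piece, log-probability) vocab; index 0 is the unknown token.
# "▁" is SentencePiece's word-boundary marker.
vocab = [
    ("[UNK]", 0.0),
    ("[CLS]", 0.0),
    ("[SEP]", 0.0),
    ("\u2581hello", -1.0),
    ("\u2581world", -1.5),
]
tokenizer = Tokenizer(models.Unigram(vocab, unk_id=0))

# Metaspace reproduces SentencePiece's whitespace handling ("▁" prefix)
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()

# Wrap sequences in [CLS] ... [SEP], as DeBERTa's slow tokenizer does
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)

enc = tokenizer.encode("hello world")
print(enc.tokens)   # e.g. ['[CLS]', '▁hello', '▁world', '[SEP]']
print(enc.offsets)  # fast tokenizers also track character offsets
```

The character offsets in the last line are the main payoff for QA use cases: they let you map predicted answer spans back to positions in the original text, which the slow tokenizer cannot do.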

mansimane (Contributor) commented:

@ShubhamSanghvi Are you planning to create a PR for this issue soon?

ShubhamSanghvi (Contributor) commented:

Hi @mansimane, I am currently working on it. I am hoping to get it done by next week.
