
Will luke support fast tokenizer #170

Open
TrickyyH opened this issue Nov 28, 2022 · 3 comments

Comments

@TrickyyH

Hello everyone, I am trying to use luke-large for question answering.
I ran into several issues when fine-tuning the model on SQuAD-like data, most of which come from the lack of fast tokenizer support.
So I am wondering whether LUKE will support a fast tokenizer in the future, or whether there is any way to work around these issues.
Thank you so much!

@abebe9849

Hi!
According to the blog post below, it seems that offset_mapping can be used with LUKE. However, I cannot confirm that misalignment never occurs, sorry.

https://srad.jp/~yasuoka/journal/651897/

@tealgreen0503

I have the same question as @TrickyyH. Apart from offset_mapping, the behaviour of return_overflowing_tokens, for instance, differs between slow and fast tokenisers. As a result, it becomes difficult to handle long texts in tasks like NER and QA, which LUKE excels at. I would be pleased if you could support the fast tokeniser.
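
For reference, here is a minimal sketch of the fast-tokenizer behaviour I mean, using a RoBERTa fast tokenizer as a stand-in for LUKE's (the checkpoint name, lengths, and stride are just illustrative assumptions):

```python
from transformers import AutoTokenizer

# RoBERTa fast tokenizer as a stand-in; LUKE's word vocabulary comes from RoBERTa.
tokenizer = AutoTokenizer.from_pretrained("roberta-large", use_fast=True)

question = "Who wrote the paper?"
long_context = "LUKE is an entity-aware language model. " * 200  # placeholder long document

encoding = tokenizer(
    question,
    long_context,
    truncation="only_second",        # truncate only the context, never the question
    max_length=384,
    stride=128,                      # overlap between consecutive chunks
    return_overflowing_tokens=True,  # with a fast tokenizer: one feature per chunk
    return_offsets_mapping=True,     # not available with the slow LukeTokenizer
    padding="max_length",
)

# One entry per chunk; overflow_to_sample_mapping ties each chunk back to its source example.
print(len(encoding["input_ids"]), encoding["overflow_to_sample_mapping"])
```

With the slow tokenizer, this kind of stride-based chunking is not produced, which is what makes long-text QA/NER preprocessing awkward.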

@ryokan0123
Contributor

One possible workaround is to use the fast version of the base tokenizer, i.e. RobertaTokenizerFast, since LukeTokenizer is built on RobertaTokenizer (they share the same subword vocabulary).

However, this approach may not support entity-related outputs, which would require additional code to be written.
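
A minimal sketch of this workaround (the checkpoint names and the plain-text usage are just assumptions for illustration; entity inputs are not covered):

```python
from transformers import LukeModel, RobertaTokenizerFast

# Assumption: roberta-large provides the same subword vocabulary as LukeTokenizer,
# so its fast tokenizer can encode plain text for LUKE (entity-related inputs are not handled).
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-large")
model = LukeModel.from_pretrained("studio-ousia/luke-large")

question = "Who developed LUKE?"
context = "LUKE was developed by Studio Ousia and uses entity-aware self-attention."

inputs = tokenizer(
    question,
    context,
    return_offsets_mapping=True,  # only available with the fast tokenizer
    return_tensors="pt",
)
offset_mapping = inputs.pop("offset_mapping")  # keep offsets to map predictions back to text

outputs = model(**inputs)  # no entity_ids / entity_attention_mask are passed here
print(outputs.last_hidden_state.shape, offset_mapping.shape)
```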
