SQuAD update #1232
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1232 +/- ##
==========================================
- Coverage 56.57% 56.56% -0.02%
==========================================
Files 147 147
Lines 10578 10582 +4
==========================================
+ Hits 5985 5986 +1
- Misses 4593 4596 +3
Continue to review full report at Codecov.
@@ -21,6 +21,9 @@
import logging

# Store the tokenizers which insert 2 separators tokens
MULTI_SEP_TOKENS_TOKENIZERS_SET = {"roberta", "camembert", "bart"}
Thoughts on moving tokenizer-specific wrapping constants to a dedicated file (ext/tokenizers/constants.py or similar)? I'm worried about this logic being scattered across files.
This constant comes from Transformers: https://github.com/huggingface/transformers/blob/969859d5f67c7106de4d1098c4891c9b03694bbe/src/transformers/data/processors/squad.py#L17
Unfortunately, it was added in a later version of transformers than the one we currently require, so we can't import it yet (unless we bump the version; I'd like to wait until after Fall to do that because of the v3.0.2 issue). The upside is that it's defined in the same file that the rest of the SQuAD code comes from, so it will stay consistent for anyone looking to update the SQuAD implementation in the future.
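For context, here is a rough sketch of how this kind of constant is typically consumed when budgeting special tokens for question/context pairs. This is a paraphrase for illustration, not the exact jiant or transformers code, and attribute names such as `model_max_length` / `max_len_single_sentence` vary across transformers versions:

```python
# Hedged sketch: which tokenizers insert two separator tokens between segments.
MULTI_SEP_TOKENS_TOKENIZERS_SET = {"roberta", "camembert", "bart"}


def added_tokens_for_single_sequence(tokenizer):
    """Illustration of the branch the constant enables (pseudocode for the idea)."""
    # e.g. "RobertaTokenizer" -> "roberta"
    tokenizer_type = type(tokenizer).__name__.replace("Tokenizer", "").lower()
    base = tokenizer.model_max_length - tokenizer.max_len_single_sentence
    # RoBERTa-style tokenizers place "</s></s>" between question and context,
    # so one extra slot is reserved when budgeting the maximum sequence length.
    return base + 1 if tokenizer_type in MULTI_SEP_TOKENS_TOKENIZERS_SET else base
```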
The summary for point 2 is that previously (including in the Hugging Face implementation), the context was tokenized word by word rather than as a single string. This is problematic for tokenizers that treat start-of-word tokens differently.
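To make the issue concrete, here is a small, hypothetical illustration (using a RoBERTa tokenizer from transformers; it is not the code changed in this PR) of how per-word tokenization drops the start-of-word marker:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
context = "The quick brown fox"

# Tokenizing the context as one string keeps RoBERTa's start-of-word
# marker ("Ġ") on every word that follows a space.
as_string = tokenizer.tokenize(context)
# -> ['The', 'Ġquick', 'Ġbrown', 'Ġfox']

# Tokenizing word by word loses that marker, so every word is encoded
# as if it began the text, which changes the resulting token ids.
per_word = [t for word in context.split() for t in tokenizer.tokenize(word)]
# -> ['The', 'quick', 'brown', 'fox']

print(as_string != per_word)  # True: the two schemes disagree
```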
Although the cited PR states that there is a large performance improvement from this fix, I did not observe this myself in my testing (both in `jiant` and in the `transformers` examples). Both versions appeared to perform comparably.
For `jiant` users, this will impact tokenization, but training a fresh model with either the old or the new tokenization should work. I recommend deleting and recreating any SQuAD-based caches. However, if you consistently use only the old tokenization caches, that should be fine too.