fix the `tokenizer_config.json` file for the slow tokenizer when a fast version is available #15319
Conversation
The documentation is not available anymore as the PR was closed or merged.
Thanks for working on this! I'm not entirely sure we should remove the field from the global variables XXX_VOCAB_FILES_MAP, as it would be breaking (we actually would like to remove those entirely if we could!).
Same for removing the argument from the slow tokenizer. Can we just leave it in the signature but do nothing with it?
Thank you so much for your review @sgugger! Could you tell me more about why "I'm not entirely sure we should remove the field from the global variables XXX_VOCAB_FILES_MAP as it would be breaking"? I'm afraid I'm missing something. (Note: I'm only proposing to remove it from the global variables of the slow version of DPR - which also has a fast version - and from one of the possible slow tokenizer templates.) As for the signature of the slow tokenizer, this change was mostly about standardizing the code between the different slow tokenizer classes. This change concerns:
You are removing content from a public constant; that is a breaking change. Same for changing the signature of tokenizers. I understand that for the second part it could lead to bugs, so it's OK to break if it fixes something, but for the first change, which is purely cosmetic, maybe we should avoid breaking? cc @LysandreJik, let us know what you think.
I understand your point! I still have a little trouble knowing where to draw the line between a bugfix and a breaking change.
Agreed with @sgugger, but otherwise this looks like a very welcome change.
@sgugger, @LysandreJik, as advised, I have reverted my changes concerning the global variables in the slow files and the changed signatures of the 2 slow tokenizers. 🙂
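For context, the backward-compatible pattern suggested above (keep the argument in the slow tokenizer's signature but do nothing with it) might look like the following minimal sketch; `MySlowTokenizer` is hypothetical, not code from this PR:

```python
from transformers import PreTrainedTokenizer

class MySlowTokenizer(PreTrainedTokenizer):
    # Hypothetical slow tokenizer illustrating the suggestion above.
    def __init__(self, vocab_file, tokenizer_file=None, **kwargs):
        # `tokenizer_file` stays in the signature for backward compatibility,
        # but a slow tokenizer has no use for a serialized `tokenizer.json`,
        # so the value is deliberately ignored.
        super().__init__(**kwargs)
        self.vocab_file = vocab_file
```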
@@ -34,7 +34,7 @@
 )
 from ...tokenization_utils_fast import PreTrainedTokenizerFast
 from ...utils import logging
-from ..xlm_roberta.tokenization_xlm_roberta import (
+from ..xlm_roberta.tokenization_xlm_roberta_fast import (
Here, I think it's a bug fix: the fast tokenizer LayoutXLMTokenizerFast needs the info from the fast files.
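To illustrate why the import matters (the exact values are an assumption, patterned on the XLM-RoBERTa modules in the library): the slow module's `VOCAB_FILES_NAMES` only names the sentencepiece model, while the fast module's also names the serialized `tokenizer.json` that the fast tokenizer needs:

```python
# Sketch of the two constants the import chooses between (values assumed).

# tokenization_xlm_roberta.py (slow): no "tokenizer_file" entry.
VOCAB_FILES_NAMES_SLOW = {"vocab_file": "sentencepiece.bpe.model"}

# tokenization_xlm_roberta_fast.py (fast): includes the fast file.
VOCAB_FILES_NAMES_FAST = {
    "vocab_file": "sentencepiece.bpe.model",
    "tokenizer_file": "tokenizer.json",
}
```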
@@ -110,7 +110,7 @@ def __init__(
     cls_token=cls_token,
     pad_token=pad_token,
     mask_token=mask_token,
-    tokenizer_file=tokenizer_file,
+    tokenizer_file=None,
I propose this change to avoid the `tokenizer_file` argument being recorded in the tokenizer_config file.
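As background, a simplified sketch (an assumed mechanism, not the exact transformers code) of why an init kwarg ends up in `tokenizer_config.json`: kwargs forwarded up to the base `__init__` are remembered and written out on save, so passing `tokenizer_file=None` instead of the resolved path keeps the local file path out of the saved config:

```python
import json

class TokenizerConfigSketch:
    """Simplified stand-in for the kwarg-recording behavior (assumed)."""

    def __init__(self, **kwargs):
        # Everything forwarded to the base __init__ is remembered here ...
        self.init_kwargs = dict(kwargs)

    def save_tokenizer_config(self, path):
        # ... and serialized into tokenizer_config.json when saving, which is
        # why passing `tokenizer_file=None` keeps the resolved path out of it.
        with open(path, "w") as f:
            json.dump(self.init_kwargs, f)
```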
Thanks for adapting your PR!
Thank you for your changes, @SaulLu
…st version is available (huggingface#15319)
* add new test
* update test
* remove `tokenizer_file` from `additional_files_names` in `tokenization_utils_base.py`
* add `tokenizer_file` for the fast only tokenizer
* change global variables layoutxml
* remove `"tokenizer_file"` from DPR tokenizer's Global variables
* remove `tokenizer_file` from herbert slow tokenizer init
* `"tokenizer_file"` from LED tokenizer's Global variables
* remove `tokenizer_file` from mbart slow tokenizer init
* remove `tokenizer_file` from slow tokenizer template
* adapt to versioning
* adapt the `test_tokenizer_mismatch_warning` test
* clean test
* clarify `VOCAB_FILES_NAMES` in tokenization_utils_fast.py
* Revert "remove `tokenizer_file` from mbart slow tokenizer init" (reverts commit 0dbb723)
* Revert "`"tokenizer_file"` from LED tokenizer's Global variables" (reverts commit 5a3f879)
* Revert "remove `tokenizer_file` from herbert slow tokenizer init" (reverts commit f5e1000)
* Revert "remove `"tokenizer_file"` from DPR tokenizer's Global variables" (reverts commit da08953)
* set `tokenizer_file` in super `__init__` of mbart
What does this PR do?
Following the diagnosis discussed and validated in issue #15283, this PR proposes to modify `PreTrainedTokenizerBase` so that the `tokenizer_file` is no longer retrieved if the calling tokenizer class is of a slow type.

This PR also contains these changes:
* remove `"tokenizer_file"` from the global variables such as `VOCAB_FILES_NAMES` when it is a slow version, or add it to the fast version when it was missing
* remove the `tokenizer_file` argument from the init of some slow tokenizers
* adapt the `test_tokenizer_mismatch_warning` test, because now when someone tries to load files with the wrong tokenizer, an error can be returned before the warning is run
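The core idea of the fix can be sketched as follows (a simplified illustration based on this PR's description, not the exact patch): `tokenizer.json` is only added to the files that `from_pretrained` tries to resolve when the calling class is a fast tokenizer:

```python
from transformers import PreTrainedTokenizerFast

FULL_TOKENIZER_FILE = "tokenizer.json"

def files_to_resolve(tokenizer_cls, vocab_files_names):
    """Sketch: which files should `from_pretrained` try to fetch?"""
    files = {
        "added_tokens_file": "added_tokens.json",
        "special_tokens_map_file": "special_tokens_map.json",
        "tokenizer_config_file": "tokenizer_config.json",
        **vocab_files_names,
    }
    # Before the fix, "tokenizer_file" was requested unconditionally; after
    # it, only fast tokenizer classes ask for the serialized tokenizer.json.
    if issubclass(tokenizer_cls, PreTrainedTokenizerFast):
        files.setdefault("tokenizer_file", FULL_TOKENIZER_FILE)
    return files
```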
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Would love to have your feedback @LysandreJik and @sgugger