
fix the tokenizer_config.json file for the slow tokenizer when a fast version is available #15319

Merged

Conversation

SaulLu
Contributor

@SaulLu SaulLu commented Jan 24, 2022

What does this PR do?

Following the diagnosis discussed and validated in issue #15283, this PR modifies PreTrainedTokenizerBase so that the tokenizer_file is no longer retrieved when the calling tokenizer class is a slow one.
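In code, the idea looks roughly like this (a minimal sketch with simplified file names, not the exact PR diff; `additional_files_for` is a hypothetical helper standing in for the `additional_files_names` logic in tokenization_utils_base.py):

```python
# Minimal sketch, assuming simplified file names; not the exact PR diff.
from transformers import PreTrainedTokenizerFast

def additional_files_for(cls):
    # Hypothetical helper mirroring `additional_files_names`
    # in tokenization_utils_base.py.
    files = {
        "added_tokens_file": "added_tokens.json",
        "special_tokens_map_file": "special_tokens_map.json",
        "tokenizer_config_file": "tokenizer_config.json",
    }
    # Only fast tokenizers can consume the serialized `tokenizers` JSON,
    # so a slow class should neither fetch nor re-save `tokenizer_file`.
    if issubclass(cls, PreTrainedTokenizerFast):
        files["tokenizer_file"] = "tokenizer.json"
    return files
```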

This PR also contains different changes:

  • remove the "tokenizer_file" key from global variables such as VOCAB_FILES_NAMES for slow versions, or add it to the fast version when it was missing
  • remove the tokenizer_file argument from the __init__ of some slow tokenizers
  • adapt the test_tokenizer_mismatch_warning test, because an error can now be raised before the warning is triggered when someone tries to load files with the wrong tokenizer
  • add a new test (a hypothetical sketch of the idea follows this list)
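
A hypothetical sketch of the kind of round-trip the new test guards against (the actual test added by the PR may differ; bert-base-uncased is used only as an example checkpoint):

```python
# Hypothetical sketch; the actual test added by the PR may differ.
from transformers import BertTokenizer, BertTokenizerFast

fast = BertTokenizerFast.from_pretrained("bert-base-uncased")
fast.save_pretrained("/tmp/bert-ckpt")  # writes tokenizer_config.json, vocab.txt, tokenizer.json

# Reloading with the slow class should not pick up `tokenizer_file`
# from the saved files anymore.
slow = BertTokenizer.from_pretrained("/tmp/bert-ckpt")
assert "tokenizer_file" not in slow.init_kwargs
```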

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Would love to have your feedback @LysandreJik and @sgugger

@HuggingFaceDocBuilder

HuggingFaceDocBuilder commented Jan 24, 2022

The documentation is not available anymore as the PR was closed or merged.

@SaulLu SaulLu force-pushed the fix-load-slow-from-files-saved-with-fast branch from 2169c68 to 3455855 Compare January 27, 2022 15:27
@SaulLu SaulLu requested review from sgugger and LysandreJik and removed request for sgugger January 27, 2022 16:50
@SaulLu SaulLu changed the title [WIP] Fix the tokenizer_config.json file for the slow tokenizer when a fast version is available fix the tokenizer_config.json file for the slow tokenizer when a fast version is available Jan 27, 2022
@SaulLu SaulLu requested a review from sgugger January 27, 2022 17:10
Collaborator

@sgugger sgugger left a comment


Thanks for working on this! I'm not entirely sure we should remove the field from the global variables XXX_VOCAB_FILES_MAP, as it would be breaking (we actually would like to remove those entirely if we could!)

Same for removing the argument from the slow tokenizers. Can we just leave it in the signature but do nothing with it?

@SaulLu
Contributor Author

SaulLu commented Jan 28, 2022

Thank you so much for your review @sgugger !

Could you tell me more about why "I'm not entirely sure we should remove the field from the global variables XXX_VOCAB_FILES_MAP, as it would be breaking"? I'm afraid I'm missing something. (Note: I'm only proposing to remove it from the global variables of the slow version of DPR - which also has a fast version - and from one of the possible slow tokenizer templates.)

About the signature of the slow tokenizers: this change was mostly about standardizing the code between the different slow tokenizer classes. It concerns:

  • mbart: if we leave tokenizer_file in the signature of mbart and this argument is given when the object is initialized, the info could be saved in tokenizer_config.json (and result in the same problem as the one pointed out in the issue) - see the sketch after this list
  • herbert: it's only for standardization. Here, as the argument isn't passed to the __init__ of the super class (PreTrainedTokenizer), it can't be saved in tokenizer_config.json
  • all the other slow tokenizers don't have tokenizer_file in their signature
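
To illustrate the mechanism with a toy sketch (not the actual transformers implementation): unused keyword arguments passed up to the base __init__ are kept in init_kwargs and later serialized into tokenizer_config.json on save, so a stray tokenizer_file value would be written out and picked up again on the next load.

```python
# Toy sketch of the mechanism, not the actual transformers implementation.
import json
import os

class ToySlowTokenizer:
    def __init__(self, **kwargs):
        # The base class remembers the init kwargs it was given...
        self.init_kwargs = dict(kwargs)

    def save_pretrained(self, path):
        # ...and dumps them into tokenizer_config.json on save, so a stray
        # `tokenizer_file` kwarg would be persisted and reloaded later.
        os.makedirs(path, exist_ok=True)
        with open(os.path.join(path, "tokenizer_config.json"), "w") as f:
            json.dump(self.init_kwargs, f)

tok = ToySlowTokenizer(bos_token="<s>", tokenizer_file="tokenizer.json")
tok.save_pretrained("ckpt")  # ckpt/tokenizer_config.json now records "tokenizer_file"
```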

@sgugger
Collaborator

sgugger commented Jan 28, 2022

You are removing content from a public constant, which is a breaking change. Same for changing the signature of the tokenizers. I understand that the second part could lead to bugs, so it's OK to break if it fixes something, but for the first change, which is purely cosmetic, maybe we should avoid breaking?

cc @LysandreJik let us know what you think.

@SaulLu
Contributor Author

SaulLu commented Jan 28, 2022

I understand your point! I still have a little trouble knowing where to draw the line between a bugfix and a breaking change.

@LysandreJik
Member

Agreed with @sgugger, but otherwise this looks like a very welcome change.

@SaulLu
Contributor Author

SaulLu commented Feb 1, 2022

@sgugger, @LysandreJik, as advised, I have reverted my changes concerning the global variables in the slow files and the signatures of the 2 slow tokenizers. 🙂

@@ -34,7 +34,7 @@
 )
 from ...tokenization_utils_fast import PreTrainedTokenizerFast
 from ...utils import logging
-from ..xlm_roberta.tokenization_xlm_roberta import (
+from ..xlm_roberta.tokenization_xlm_roberta_fast import (
Contributor Author


Here, I think it's a bug fix because LayoutXLMTokenizerFast needs the info from the fast files.

@@ -110,7 +110,7 @@ def __init__(
             cls_token=cls_token,
             pad_token=pad_token,
             mask_token=mask_token,
-            tokenizer_file=tokenizer_file,
+            tokenizer_file=None,
Contributor Author


I propose this change to prevent the tokenizer_file argument from being recorded in the tokenizer_config.json file.

Collaborator

@sgugger sgugger left a comment


Thanks for adapting your PR!

Member

@LysandreJik LysandreJik left a comment


Thank you for your changes, @SaulLu

@SaulLu SaulLu merged commit 7b8bdd8 into huggingface:master Feb 1, 2022
gante pushed a commit to gante/transformers that referenced this pull request Feb 1, 2022
…st version is available (huggingface#15319)

* add new test

* update test

* remove `tokenizer_file` from `additional_files_names` in `tokenization_utils_base.py`

* add `tokenizer_file` for the fast only tokenizer

* change global variables layoutxml

* remove `"tokenizer_file"` from DPR tokenizer's Global variables

* remove `tokenizer_file` from herbert slow tokenizer init

* `"tokenizer_file"` from LED tokenizer's Global variables

* remove `tokenizer_file` from mbart slow tokenizer init

* remove `tokenizer_file` from slow tokenizer template

* adapt to versioning

* adapt the `test_tokenizer_mismatch_warning` test

* clean test

* clarify `VOCAB_FILES_NAMES` in tokenization_utils_fast.py

* Revert "remove `tokenizer_file` from mbart slow tokenizer init"

This reverts commit 0dbb723.

* Revert "`"tokenizer_file"` from LED tokenizer's Global variables"

This reverts commit 5a3f879.

* Revert "remove `tokenizer_file` from herbert slow tokenizer init"

This reverts commit f5e1000.

* Revert "remove `"tokenizer_file"` from DPR tokenizer's Global variables"

This reverts commit da08953.

* set `tokenizer_file` in super `__init__` of mbart
ManuelFay pushed a commit to ManuelFay/transformers that referenced this pull request Mar 31, 2022
…st version is available (huggingface#15319)