
fix the tokenizer_config.json file for the slow tokenizer when a fast version is available #15319

Merged

Conversation

SaulLu
Contributor

@SaulLu SaulLu commented Jan 24, 2022

What does this PR do?

Following the diagnosis discussed and validated in issue #15283, this PR modifies PreTrainedTokenizerBase so that the tokenizer_file is no longer retrieved when the calling tokenizer class is a slow one.
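In code, the idea looks roughly like this (a minimal sketch with simplified file names, not the exact PR diff; `additional_files_for` is a hypothetical helper standing in for the `additional_files_names` logic in tokenization_utils_base.py):

```python
# Minimal sketch, assuming simplified file names; not the exact PR diff.
from transformers import PreTrainedTokenizerFast

def additional_files_for(cls):
    # Hypothetical helper mirroring `additional_files_names`
    # in tokenization_utils_base.py.
    files = {
        "added_tokens_file": "added_tokens.json",
        "special_tokens_map_file": "special_tokens_map.json",
        "tokenizer_config_file": "tokenizer_config.json",
    }
    # Only fast tokenizers can consume the serialized `tokenizers` JSON,
    # so a slow class should neither fetch nor re-save `tokenizer_file`.
    if issubclass(cls, PreTrainedTokenizerFast):
        files["tokenizer_file"] = "tokenizer.json"
    return files
```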

This PR also contains different changes:

  • remove the "tokenizer_file" key from global variables such as VOCAB_FILES_NAMES for slow versions, or add it to the fast version when it was missing
  • remove the tokenizer_file argument from the __init__ of some slow tokenizers
  • adapt the test_tokenizer_mismatch_warning test, because an error can now be raised before the warning is triggered when someone tries to load files with the wrong tokenizer
  • add a new test (a hypothetical sketch of the idea follows this list)
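
A hypothetical sketch of the kind of round-trip the new test guards against (the actual test added by the PR may differ; bert-base-uncased is used only as an example checkpoint):

```python
# Hypothetical sketch; the actual test added by the PR may differ.
from transformers import BertTokenizer, BertTokenizerFast

fast = BertTokenizerFast.from_pretrained("bert-base-uncased")
fast.save_pretrained("/tmp/bert-ckpt")  # writes tokenizer_config.json, vocab.txt, tokenizer.json

# Reloading with the slow class should not pick up `tokenizer_file`
# from the saved files anymore.
slow = BertTokenizer.from_pretrained("/tmp/bert-ckpt")
assert "tokenizer_file" not in slow.init_kwargs
```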

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Would love to have your feedback @LysandreJik and @sgugger

@HuggingFaceDocBuilder

HuggingFaceDocBuilder commented Jan 24, 2022

The documentation is not available anymore as the PR was closed or merged.

@SaulLu SaulLu force-pushed the fix-load-slow-from-files-saved-with-fast branch from 2169c68 to 3455855 Compare January 27, 2022 15:27
@SaulLu SaulLu requested review from sgugger and LysandreJik and removed request for sgugger January 27, 2022 16:50
@SaulLu SaulLu changed the title [WIP] Fix the tokenizer_config.json file for the slow tokenizer when a fast version is available fix the tokenizer_config.json file for the slow tokenizer when a fast version is available Jan 27, 2022
@SaulLu SaulLu requested a review from sgugger January 27, 2022 17:10
Collaborator

@sgugger sgugger left a comment


Thanks for working on this! I'm not entirely sure we should remove the field from the global variables XXX_VOCAB_FILES_MAP, as it would be breaking (we actually would like to remove those entirely if we could!)

Same for removing the argument from the slow tokenizers. Can we just leave it in the signature but do nothing with it?

@SaulLu
Contributor Author

SaulLu commented Jan 28, 2022

Thank you so much for your review @sgugger !

Could you tell me more about why "I'm not entirely sure we should remove the field from the global variables XXX_VOCAB_FILES_MAP, as it would be breaking"? I'm afraid I'm missing something. (Note: I'm only proposing to remove it from the global variables of the slow version of DPR - which also has a fast version - and from one of the possible slow tokenizer templates.)

About the signature of the slow tokenizers: this change was mostly about standardizing the code between the different slow tokenizer classes. It concerns:

  • mbart: if we leave tokenizer_file in the signature of mbart and this argument is given when the object is initialized, the info could be saved in tokenizer_config.json (and result in the same problem as the one pointed out in the issue) - see the sketch after this list
  • herbert: it's only for standardization. Here, as the argument isn't passed to the __init__ of the super class (PreTrainedTokenizer), it can't be saved in tokenizer_config.json
  • all the other slow tokenizers don't have tokenizer_file in their signature
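
To illustrate the mechanism with a toy sketch (not the actual transformers implementation): unused keyword arguments passed up to the base __init__ are kept in init_kwargs and later serialized into tokenizer_config.json on save, so a stray tokenizer_file value would be written out and picked up again on the next load.

```python
# Toy sketch of the mechanism, not the actual transformers implementation.
import json
import os

class ToySlowTokenizer:
    def __init__(self, **kwargs):
        # The base class remembers the init kwargs it was given...
        self.init_kwargs = dict(kwargs)

    def save_pretrained(self, path):
        # ...and dumps them into tokenizer_config.json on save, so a stray
        # `tokenizer_file` kwarg would be persisted and reloaded later.
        os.makedirs(path, exist_ok=True)
        with open(os.path.join(path, "tokenizer_config.json"), "w") as f:
            json.dump(self.init_kwargs, f)

tok = ToySlowTokenizer(bos_token="<s>", tokenizer_file="tokenizer.json")
tok.save_pretrained("ckpt")  # ckpt/tokenizer_config.json now records "tokenizer_file"
```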

@sgugger
Collaborator

sgugger commented Jan 28, 2022

You are removing content from a public constant, which is a breaking change. Same for changing the signature of the tokenizers. I understand that the second part could lead to bugs, so it's OK to break if it fixes something, but for the first change, which is purely cosmetic, maybe we should avoid breaking?

cc @LysandreJik let us know what you think.

@SaulLu
Contributor Author

SaulLu commented Jan 28, 2022

I understand your point! I still have a little trouble knowing where to draw the line between a bugfix and a breaking change.

@LysandreJik
Member

Agreed with @sgugger, but otherwise this looks like a very welcome change.

@SaulLu
Contributor Author

SaulLu commented Feb 1, 2022

@sgugger, @LysandreJik, as advised, I have reverted my changes concerning the global variables in the slow files and the signatures of the 2 slow tokenizers. 🙂

@@ -34,7 +34,7 @@
 )
 from ...tokenization_utils_fast import PreTrainedTokenizerFast
 from ...utils import logging
-from ..xlm_roberta.tokenization_xlm_roberta import (
+from ..xlm_roberta.tokenization_xlm_roberta_fast import (
Contributor Author


Here, I think it's a bug fix because LayoutXLMTokenizerFast needs the info from the fast files.

@@ -110,7 +110,7 @@ def __init__(
             cls_token=cls_token,
             pad_token=pad_token,
             mask_token=mask_token,
-            tokenizer_file=tokenizer_file,
+            tokenizer_file=None,
Contributor Author


I propose this change to prevent the tokenizer_file argument from being recorded in the tokenizer_config.json file.

Collaborator

@sgugger sgugger left a comment


Thanks for adapting your PR!

Member

@LysandreJik LysandreJik left a comment


Thank you for your changes, @SaulLu

@SaulLu SaulLu merged commit 7b8bdd8 into huggingface:master Feb 1, 2022
gante pushed a commit to gante/transformers that referenced this pull request Feb 1, 2022
…st version is available (huggingface#15319)

* add new test

* update test

* remove `tokenizer_file` from `additional_files_names` in `tokenization_utils_base.py`

* add `tokenizer_file` for the fast only tokenizer

* change global variables layoutxml

* remove `"tokenizer_file"` from DPR tokenizer's Global variables

* remove `tokenizer_file` from herbert slow tokenizer init

* `"tokenizer_file"` from LED tokenizer's Global variables

* remove `tokenizer_file` from mbart slow tokenizer init

* remove `tokenizer_file` from slow tokenizer template

* adapt to versioning

* adapt the `test_tokenizer_mismatch_warning` test

* clean test

* clarify `VOCAB_FILES_NAMES` in tokenization_utils_fast.py

* Revert "remove `tokenizer_file` from mbart slow tokenizer init"

This reverts commit 0dbb723.

* Revert "`"tokenizer_file"` from LED tokenizer's Global variables"

This reverts commit 5a3f879.

* Revert "remove `tokenizer_file` from herbert slow tokenizer init"

This reverts commit f5e1000.

* Revert "remove `"tokenizer_file"` from DPR tokenizer's Global variables"

This reverts commit da08953.

* set `tokenizer_file` in super `__init__` of mbart
ManuelFay pushed a commit to ManuelFay/transformers that referenced this pull request Mar 31, 2022
…st version is available (huggingface#15319)