
Warning about too long input for fast tokenizers too #8799

Merged
merged 5 commits into huggingface:master from the warning_for_too_long_input branch on Dec 2, 2020

Conversation

@Narsil Narsil (Contributor) commented Nov 26, 2020

What does this PR do?

If truncation is not set in the tokenizer but the tokenized input is longer than the model's maximum length (model_max_length), we used to trigger a warning that the input would probably fail (which it most likely will).

This PR re-enables that warning for fast tokenizers too and uses common code for the trigger to make sure it is consistent across slow and fast tokenizers.
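
A minimal sketch of the behavior this PR restores (the checkpoint, input text, and exact warning wording below are illustrative assumptions, not taken from the PR itself):

```python
# Minimal sketch, assuming a standard transformers install; the checkpoint and
# input text are illustrative, not taken from the PR itself.
from transformers import AutoTokenizer

# Load a fast tokenizer for a checkpoint whose model_max_length is 512.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

# No truncation is requested, so the encoded sequence exceeds model_max_length.
long_text = "hello world " * 1000
encoding = tokenizer(long_text)

# Before this PR only the slow (Python) tokenizer warned here; with this change
# the fast tokenizer also logs a warning that the sequence is longer than the
# model's maximum and will likely cause errors when run through the model.
print(len(encoding["input_ids"]))  # > 512, nothing was truncated
```

Passing truncation=True (or an explicit max_length) avoids the warning, since the output then fits within model_max_length.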

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@LysandreJik
@thomwolf

@Narsil Narsil requested review from thomwolf and LysandreJik and removed request for thomwolf November 26, 2020 12:57
@Narsil Narsil (Contributor, Author) commented Nov 26, 2020

The failing tests seem to come from other code (seq2seq).

Comment on lines 3165 to 3170
"""
Clean up a list of simple English tokenization artifacts like spaces before punctuations and abbreviated forms.
clean up a list of simple english tokenization artifacts like spaces before punctuations and abbreviated forms.

Args:
out_string (:obj:`str`): The text to clean up.
args: out_string (:obj:`str`): the text to clean up.

Returns:
:obj:`str`: The cleaned-up string.
returns: :obj:`str`: the cleaned-up string.
A member commented:

The docstring was in the correct style before the changes

@Narsil (Contributor, Author) replied:

But I simply ran the documentation fixer :(

@LysandreJik (Member) commented:

@thomwolf could you review this PR as you're the mastermind behind this code?

@LysandreJik LysandreJik requested a review from thomwolf November 27, 2020 17:38
@thomwolf thomwolf (Member) left a comment

LGTM

@Narsil Narsil force-pushed the warning_for_too_long_input branch from 201db80 to a4ecb3a on December 2, 2020 10:04
@Narsil Narsil (Contributor, Author) commented Dec 2, 2020

@LysandreJik May I merge? The failing tests and the quality check are linked to unrelated finetune.py code; I tried to rebase, but that does not seem to be enough.

@LysandreJik LysandreJik merged commit a8c3f9a into huggingface:master Dec 2, 2020