Fix flax whisper tokenizer bug #33151

hannan72 · 2024-08-27T19:08:57Z

What does this PR do?

Fixes a bug when using the Whisper tokenizer with the Flax Whisper model, as reported in issue #32936.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

@ArthurZucker
@sanchit-gandhi

Fix issue with flax whisper model

amyeroberts

Thanks for opening a PR @hannan72!

Please make sure when opening a PR to only tag a minimal subset of the most relevant people. This ensures the PR is reviewed quickly (no bystander effect), is good practice, and keeps the number of notifications for everyone in check.

amyeroberts · 2024-08-27T19:15:05Z

src/transformers/models/whisper/tokenization_whisper.py

@@ -849,7 +849,7 @@ def _strip_prompt(self, token_ids: List[int], prompt_token_id: int, decoder_star

        # handle case of empty token_ids for decoding with timestamps.
        # at this point token_ids is a list, so it is safe to use if not check.
-        if not token_ids:
+        if token_ids is None or len(token_ids) == 0:


This doesn't match the comment above, which indicates unexpected behaviour in the `_convert_to_list` method.

The linked issue indicates a problem with using `not` on np arrays, but that doesn't correspond to the `None` check added here.

Good points!
I also updated the comment
Please take a look again @amyeroberts

The comment still doesn't make sense wrt the change. If `token_ids` really is a list, then checking `not token_ids` should be safe.

@amyeroberts The problem is that `token_ids` is not a list for Flax models; it is a JAX array, so `not token_ids` raises an error.
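The failure described here can be illustrated without JAX installed. The sketch below uses a hypothetical stand-in class, `ArrayLike`, that mimics how array libraries reject truth-value tests on arrays with more than one element. The class and both helper functions are assumptions made for illustration, not code from the PR.

```python
class ArrayLike:
    """Hypothetical stand-in for a NumPy/JAX array (illustration only)."""

    def __init__(self, values):
        self._values = list(values)

    def __len__(self):
        return len(self._values)

    def __bool__(self):
        # Array libraries refuse truth tests on multi-element arrays,
        # because "is this array true?" has no single answer.
        if len(self._values) > 1:
            raise ValueError("truth value of an array is ambiguous")
        return bool(self._values[0]) if self._values else False


def is_empty_via_not(token_ids):
    # Safe for plain lists, but calls __bool__ on array-like objects.
    return not token_ids


def is_empty_via_len(token_ids):
    # Avoids __bool__, but hides the fact that token_ids is not a list.
    return token_ids is None or len(token_ids) == 0


ids = ArrayLike([50258, 50259, 50359])
print(is_empty_via_len(ids))  # False
try:
    is_empty_via_not(ids)
except ValueError as err:
    print("not token_ids failed:", err)
```

The `len`-based check sidesteps the error, which is why it was the first change proposed, but it leaves the "token_ids is a list" comment untrue.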

@amyeroberts Do you have any other suggestion for resolving the issue for flax models?

The main thing is that the function names and comments should be consistent with the logic. The comment above says the object is a list, which appears not to be true. In addition, the issue indicates that `_convert_to_list` isn't actually converting to a list. So the issue to resolve is: why isn't `_convert_to_list` converting to a list? Should the method be updated or changed?

@amyeroberts You're right! So we should update the `_convert_to_list` method in order to convert JAX arrays to a list, as well as numpy, torch and tf arrays.
I updated the PR: I reverted the check on `token_ids` back to `not token_ids` and added a couple of lines to `_convert_to_list` to convert a JAX array to a list.
Is it OK now?
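A minimal sketch of the approach agreed on here: normalize the input to a plain list first, so the existing `not token_ids` check stays valid. The names `to_plain_list`, `strip_if_empty` and `FakeJaxArray` are hypothetical. The sketch assumes only that the array object exposes `tolist()`, which NumPy and JAX arrays do, and it is not the code merged in the PR.

```python
from typing import Any, List


class FakeJaxArray:
    """Hypothetical stand-in for a JAX array; only tolist() is modeled."""

    def __init__(self, values):
        self._values = list(values)

    def __bool__(self):
        # Mimics the ambiguous-truth-value error raised by real arrays.
        raise ValueError("truth value of an array is ambiguous")

    def tolist(self):
        return list(self._values)


def to_plain_list(token_ids: Any) -> List[int]:
    # Lists pass through untouched; array-likes are converted via tolist().
    if isinstance(token_ids, list):
        return token_ids
    if hasattr(token_ids, "tolist"):
        return token_ids.tolist()
    # Fall back to plain iteration, e.g. for tuples.
    return list(token_ids)


def strip_if_empty(token_ids: Any) -> List[int]:
    token_ids = to_plain_list(token_ids)
    # token_ids really is a list now, so `not` is a safe emptiness check.
    if not token_ids:
        return []
    return token_ids


print(strip_if_empty(FakeJaxArray([50258, 50259])))  # [50258, 50259]
```

Converting at the boundary keeps the comment in `_strip_prompt` accurate, which was the reviewer's main concern.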

just check len of token_ids

just use len of token_ids

hannan72 · 2024-08-27T20:12:37Z

Any other points @amyeroberts ?


hannan72 · 2024-09-02T15:30:32Z

Is it ready to merge @amyeroberts?

amyeroberts · 2024-09-02T15:33:21Z

@hannan72 Thanks for iterating - change now looks OK - final thing is to add a test, which would fail on current main but passes with this fix


hannan72 · 2024-09-05T10:05:56Z

@amyeroberts Test code has been added and passes the automated tests. Could you please do the final review?

amyeroberts

Thanks for iterating and adding tests!

General comment that unrelated formatting changes should be removed from the diff. Once the tests are split up we should be good to go

tests/models/whisper/test_tokenization_whisper.py


hannan72 · 2024-09-07T04:33:36Z

Thanks @amyeroberts for your suggestion! It is applied and tests are split up. Just needs your approval.
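The commit history for this PR mentions using `require_xxx` decorators instead of `is_xxx_available()` checks, with one test per framework. A rough sketch of that pattern using only the standard library follows. The decorator `require_module`, the function `to_plain_list` and the test class are hypothetical names for illustration, not the helpers or tests used in the repository.

```python
import importlib.util
import unittest


def require_module(name):
    """Hypothetical decorator: skip a test unless `name` is importable."""
    available = importlib.util.find_spec(name) is not None
    return unittest.skipUnless(available, f"test requires {name}")


def to_plain_list(token_ids):
    # Simplified conversion under test (an assumption, not the PR's code).
    return token_ids.tolist() if hasattr(token_ids, "tolist") else list(token_ids)


class ConvertToListTest(unittest.TestCase):
    def test_list_input(self):
        self.assertEqual(to_plain_list([1, 2, 3]), [1, 2, 3])

    @require_module("numpy")
    def test_numpy_input(self):
        import numpy as np

        self.assertEqual(to_plain_list(np.array([1, 2, 3])), [1, 2, 3])

    @require_module("jax")
    def test_jax_input(self):
        import jax.numpy as jnp

        self.assertEqual(to_plain_list(jnp.array([1, 2, 3])), [1, 2, 3])
```

Splitting the cases this way means a missing framework shows up as a skipped test rather than silently shrinking one large test.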

hannan72 · 2024-09-09T15:50:08Z

Thanks for iterating and adding tests!

General comment that unrelated formatting changes should be removed from the diff. Once the tests are split up we should be good to go

Anything else?

amyeroberts · 2024-09-09T17:29:50Z

tests/models/whisper/test_tokenization_whisper.py

@@ -204,11 +217,21 @@ def test_skip_special_tokens_skips_prompt_ids(self):
        # fmt: on
        expected_with_special_tokens = "<|startofprev|> Mr. Quilter<|startoftranscript|><|en|><|transcribe|><|notimestamps|> On the general principles of art, Mr. Quilter writes with equal lucidity.<|endoftext|>"
        expected_without_special_tokens = " On the general principles of art, Mr. Quilter writes with equal lucidity."
-        self.assertEqual(tokenizer.decode(encoded_input, skip_special_tokens=False), expected_with_special_tokens)


Can you remove all these changes which shouldn't be applied (our line length is 120 and this is a formatting change unrelated to the PR)

@amyeroberts All unrelated changes have been reverted. Is the PR ready to merge now?

HuggingFaceDocBuilderDev · 2024-09-09T17:47:22Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


hannan72 · 2024-09-12T08:36:05Z

@amyeroberts All changes have been made. I'd appreciate it if you could merge the PR!

amyeroberts

Great - thanks for fixing and iterating on this!

* Update tokenization_whisper.py Fix issue with flax whisper model
* Update tokenization_whisper_fast.py Fix issue with flax whisper model
* Update tokenization_whisper.py just check len of token_ids
* Update tokenization_whisper_fast.py just use len of token_ids
* Update tokenization_whisper_fast.py and revert changes in _strip_prompt and add support to jax arrays in _convert_to_list
* Update tokenization_whisper.py and revert changes in _strip_prompt and add support to jax arrays in _convert_to_list
* Update test_tokenization_whisper.py to add test for _convert_to_list method
* Update test_tokenization_whisper.py to fix code style issues
* Fix code style
* Fix code check again
* Update test_tokenization)whisper.py to Improve code style
* Update test_tokenization_whisper.py to run each of jax, tf and flax modules if available
* Update tests/models/whisper/test_tokenization_whisper.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update test_tokenization_whisper.py and use require_xxx decorators instead of `is_xxx_available()` method
* Revert the changes automatically applied by formatter and was unrelated to PR
* Format for minimal changes
---------
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

hannan72 added 2 commits August 27, 2024 21:03

Update tokenization_whisper.py

d7f393d

Fix issue with flax whisper model

Update tokenization_whisper_fast.py

624c981

Fix issue with flax whisper model

hannan72 mentioned this pull request Aug 27, 2024

Bug in WhisperTokenizer batch_decode, when set skip_special_tokens=True for FlaxWhisper model output #32936

Closed

4 tasks

amyeroberts reviewed Aug 27, 2024


hannan72 added 2 commits August 27, 2024 21:29

Update tokenization_whisper.py

7f5b42c

just check len of token_ids

Update tokenization_whisper_fast.py

d41a440

just use len of token_ids

hannan72 added 2 commits August 30, 2024 00:13

Update tokenization_whisper_fast.py and revert changes in _strip_prom…

44e6f1f

…pt and add support to jax arrays in _convert_to_list

Update tokenization_whisper.py and revert changes in _strip_prompt an…

0cd19d3

…d add support to jax arrays in _convert_to_list

hannan72 added 6 commits September 4, 2024 16:11

Update test_tokenization_whisper.py to add test for _convert_to_list …

6ba66e6

…method

Update test_tokenization_whisper.py to fix code style issues

f05078c

Fix code style

969be37

Fix code check again

64093a6

Update test_tokenization)whisper.py to Improve code style

fa1c3b0

Update test_tokenization_whisper.py to run each of jax, tf and flax m…

ce9cd4f

…odules if available

amyeroberts reviewed Sep 5, 2024


hannan72 and others added 2 commits September 7, 2024 05:17

Update tests/models/whisper/test_tokenization_whisper.py

00b2190

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

Update test_tokenization_whisper.py and use require_xxx decorators in…

afcd3a2

…stead of `is_xxx_available()` method

amyeroberts reviewed Sep 9, 2024


hannan72 added 2 commits September 11, 2024 21:22

Revert the changes automatically applied by formatter and was unrelat…

a0cd5d9

…ed to PR

Format for minimal changes

2775277

amyeroberts approved these changes Sep 12, 2024


amyeroberts merged commit 8ed6352 into huggingface:main Sep 12, 2024
18 checks passed

hannan72 deleted the fix_flax_whisper_tokenizer_bug branch September 12, 2024 15:03

