🚨🚨[Whisper Tok] Update integration test #29368

sanchit-gandhi · 2024-02-29T11:04:58Z

What does this PR do?

The merges for the Whisper tokenizers were updated on the Hub in this PR. While this is a breaking change, it is a required fix to ensure we have parity with the original OpenAI repo.

This PR updates the integration tests for the Whisper tokenizer to reflect the merge changes.

sanchit-gandhi · 2024-02-29T11:06:05Z

tests/models/whisper/test_tokenization_whisper.py


        self.assertListEqual(
            tokenizer.convert_tokens_to_ids(tokens),
-            [5723, 307, 257, 220, 31636],
+            [5723, 307, 257, 1500],


This now gives equivalent results to the original:

    from whisper.tokenizer import get_tokenizer

    tokenizer = get_tokenizer(True)
    tokens = tokenizer.encode("This is a test")
    print(tokens)

Print Output:

[5723, 307, 257, 1500]
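For readers unfamiliar with why a change to the merges file alters token IDs, here is a minimal toy sketch of greedy BPE merging. It is not the Whisper tokenizer: the `bpe` helper, the merge tables, and the symbols are all hypothetical and chosen only to illustrate how adding a single merge rule can collapse two trailing tokens into one, which is consistent with the shape of the diff above (two IDs replaced by a single ID).

```python
from typing import List, Tuple

# Byte-level BPE vocabularies conventionally render a leading space as U+0120.
SPACE = "\u0120"


def bpe(symbols: List[str], merges: List[Tuple[str, str]]) -> List[str]:
    """Toy greedy BPE: repeatedly apply the lowest-rank merge that is present."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(symbols)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no applicable merge is left
        merged = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge tables (NOT the real Whisper merges).
word = [SPACE, "t", "e", "s", "t"]
old_merges = [("e", "s"), ("t", "es"), ("tes", "t")]
new_merges = old_merges + [(SPACE, "test")]

old_tokens = bpe(word, old_merges)  # the space stays a separate token
new_tokens = bpe(word, new_merges)  # the space is merged into the word
print(old_tokens, new_tokens)
```

With the old table the leading space remains its own token, so the word maps to two IDs; with the extra merge rule it becomes one token and therefore one ID.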

sanchit-gandhi · 2024-02-29T11:06:36Z

tests/models/whisper/test_tokenization_whisper.py

@@ -499,25 +499,3 @@ def test_offset_decoding(self):

        output = multilingual_tokenizer.decode(INPUT_TOKENS, output_offsets=True)["offsets"]
        self.assertEqual(output, [])
-
-    @require_jinja
-    def test_tokenization_for_chat(self):


Chat template doesn't make sense for Whisper (a speech recognition model) - have removed the test to keep the CI lightweight (cc @Rocketknight1)

Rocketknight1

Fine with me!

sanchit-gandhi · 2024-02-29T11:25:45Z

Also cc @ydshieh as this PR will prevent a red CI on main

HuggingFaceDocBuilderDev · 2024-02-29T11:35:33Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

ArthurZucker

Thanks for the prompt fix. It's breaking, so I'll probably update the PR title with ⚠️

sanchit-gandhi · 2024-03-01T09:22:28Z

The GH PR itself is not strictly breaking (there's no change to the code), but rather it's the Hub PR which is breaking. Fine for me to leave the 🚨 in the title though to book-log this!

* [Whisper Tok] Update integration test
* make style

sanchit-gandhi added 2 commits February 29, 2024 11:02

[Whisper Tok] Update integration test

f82c997

make style

196f072

sanchit-gandhi commented Feb 29, 2024

sanchit-gandhi requested a review from ArthurZucker February 29, 2024 11:07

ArthurZucker approved these changes Mar 1, 2024

ArthurZucker changed the title from "[Whisper Tok] Update integration test" to "🚨🚨[Whisper Tok] Update integration test" on Mar 1, 2024

sanchit-gandhi merged commit 0a0a279 into huggingface:main Mar 1, 2024
18 checks passed

sanchit-gandhi deleted the whisper-tokenizer branch March 1, 2024 09:22

itazap pushed a commit that referenced this pull request May 14, 2024

🚨🚨[Whisper Tok] Update integration test (#29368)

551e245

* [Whisper Tok] Update integration test
* make style
