
fix prompt strip to support tensors and np arrays #27818

Merged · 16 commits · Jul 12, 2024

Conversation

Contributor

@AvivSham AvivSham commented Dec 3, 2023

What does this PR do?

WhisperTokenizer does not strip the prompt ids when decode is called on a torch.Tensor, tf.Tensor, or np.ndarray. The slicing in _strip_prompt is never performed for these types because the condition isinstance(token_ids, list) is not met.
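To make the failure mode concrete, here is a minimal sketch of the pre-fix logic (simplified; the exact slicing details in _strip_prompt may differ):

def _strip_prompt(self, token_ids, prompt_token_id, decoder_start_token_id):
    # only plain Python lists pass this guard, so torch.Tensor, tf.Tensor,
    # and np.ndarray inputs skip the prompt-stripping branch entirely
    has_prompt = isinstance(token_ids, list) and token_ids and token_ids[0] == prompt_token_id
    if has_prompt:
        if decoder_start_token_id in token_ids:
            # drop everything before the decoder start token
            return token_ids[token_ids.index(decoder_start_token_id):]
        return []
    return token_ids  # tensors and arrays fall through unchanged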

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@sanchit-gandhi
@ArthurZucker

@AvivSham
Contributor Author

AvivSham commented Dec 3, 2023

I was not able to make it pass the CI/CD tests due to import issues. @sanchit-gandhi can you please guide?

@huggingface huggingface deleted a comment from github-actions bot Jan 3, 2024
Contributor

@sanchit-gandhi sanchit-gandhi left a comment


Hey @AvivSham! Thanks for opening this PR - the first draft looks in decent nick!

One thing we need to change is how we check for Torch/TF tensors: the tokenization file should be agnostic to framework. Imagine we have a user running Whisper in PyTorch. We do not force them to have TF installed, since there's no need to have both PyTorch and TF installed if just using PyTorch. Therefore, we cannot assume TF is an available import, and thus can't leverage it in the tokenization file. The converse is true for someone using TF, where we don't expect them to have PyTorch installed.

=> what we need to do here is refactor the changes such that they are agnostic to framework. I've left some hints on how to do this below! Let me know if you have any questions - happy to help!

if has_prompt:
    if not isinstance(token_ids, list):
        token_ids = self._convert_to_list(token_ids)

Let's convert the token ids to a list before checking whether we have a prompt. If we convert the token ids to a list first, then we don't need to check:

isinstance(token_ids, (List[int], np.ndarray, torch.Tensor, tf.Tensor))

=> since we know token_ids will be a list!
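For reference, a framework-agnostic conversion can avoid importing torch or TF entirely by duck-typing on the tensor object. A sketch of such a _convert_to_list helper (assuming both tensor types expose a .numpy() method, which they do):

import numpy as np

def _convert_to_list(token_ids):
    # torch.Tensor and tf.Tensor both expose .numpy(); inspect the type name
    # as a string so that neither framework needs to be importable here
    if hasattr(token_ids, "numpy"):
        if "torch" in str(type(token_ids)):
            token_ids = token_ids.cpu().numpy()  # move off the GPU first
        else:
            token_ids = token_ids.numpy()
    # token_ids is now an ndarray, or was already a plain list
    if isinstance(token_ids, np.ndarray):
        token_ids = token_ids.tolist()
    return token_ids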


This is still valid! We need to convert to list before checking whether we have a prompt

src/transformers/models/whisper/tokenization_whisper.py (outdated review comment, resolved)
@AvivSham
Contributor Author

Thank you for reviewing, @sanchit-gandhi!
I was not aware of the framework-agnostic constraint. I think we are OK now; can you please review again?


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@AvivSham
Contributor Author

This PR is not stale; waiting for @sanchit-gandhi to approve.

@huggingface huggingface deleted a comment from github-actions bot Mar 8, 2024
@amyeroberts
Collaborator

Gentle ping @sanchit-gandhi

@AvivSham
Contributor Author

@amyeroberts @sanchit-gandhi
Any news about this PR?

@AvivSham
Contributor Author

@amyeroberts This issue is not stale; still waiting for @sanchit-gandhi.

@huggingface huggingface deleted a comment from github-actions bot Apr 25, 2024
@amyeroberts
Collaborator

@sanchit-gandhi @ylacombe Can one of you review this?

@huggingface huggingface deleted a comment from github-actions bot May 20, 2024
Contributor

@sanchit-gandhi sanchit-gandhi left a comment


Thanks for the last round of commits @AvivSham! Super sorry for the late review here

I would recommend using the following script to quickly debug your changes and confirm they're working (currently the following code fails since we don't strip the prompt for numpy inputs):

import numpy as np
from transformers import WhisperTokenizer

prompt_text = "the cat sat on the mat"
input_text = "the quick brown fox"

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny")
prompt_ids = tokenizer.get_prompt_ids(prompt_text, return_tensors="np")

input_ids = tokenizer(input_text, return_tensors="np").input_ids[0]

input_ids = np.hstack([prompt_ids, input_ids])

# check with prompt in output
pred_text = tokenizer.decode(input_ids, skip_special_tokens=False)
assert pred_text.strip() == prompt_text + input_text

# check stripping prompt from output
pred_text = tokenizer.decode(input_ids, skip_special_tokens=True)
assert pred_text.strip() == input_text

Along with the subtle changes to the order of operations (which will fix the above example), it would be great to add a slow/fast tokenizer test to confirm we get the correct behaviour after this change. To do this, you can add a test in the file test_tokenization_whisper.py. I would follow the existing logic to load pre-trained slow and fast (or "rust") tokenizers, then run a similar check to the one above. You can use the following test as an example:

def test_basic_normalizer(self):
    tokenizer = self.get_tokenizer()
    rust_tokenizer = self.get_rust_tokenizer()
Let me know if you have any questions! Happy to help with the final changes before merge!
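A hypothetical sketch of such a test (the method name is illustrative, and it assumes numpy is imported as np at the top of test_tokenization_whisper.py):

def test_decode_prompt_ids(self):
    tokenizer = self.get_tokenizer()
    rust_tokenizer = self.get_rust_tokenizer()

    prompt_text = "the cat sat on the mat"
    input_text = "the quick brown fox"

    for tok in (tokenizer, rust_tokenizer):
        prompt_ids = tok.get_prompt_ids(prompt_text, return_tensors="np")
        input_ids = tok(input_text, return_tensors="np").input_ids[0]
        merged = np.hstack([prompt_ids, input_ids])
        # after the fix, the prompt should be stripped from the decoded text
        self.assertEqual(tok.decode(merged, skip_special_tokens=True).strip(), input_text)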

@AvivSham
Contributor Author

AvivSham commented Jun 12, 2024

Hi @sanchit-gandhi,
Added the requested changes (including the new test). Can you review again (hopefully for the last time 😄)?
BTW, I also pulled the transformers main branch so we're aligned.

Contributor

@sanchit-gandhi sanchit-gandhi left a comment


Hey @AvivSham! The functionality looks great here! Just a few clean-up comments below. Once done, could you run:

make style

to run the linter and format the code? If you hit issues with missing packages, you can install them with the following from the root of the Transformers repo:

pip install -e ".[quality]"

Many thanks!

AvivSham and others added 6 commits June 13, 2024 18:27
Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
@AvivSham
Copy link
Contributor Author

@sanchit-gandhi
I've made all the required changes, including the revert and running the style checks. Please check again.

Contributor

@sanchit-gandhi sanchit-gandhi left a comment


Looks like there was a minor typo (I've suggested a fix below); running make fix-copies should confirm the fix is correct. By the way, there's great documentation on the style checks we use in Transformers in this contributing guide.

It looks like a Whisper tokenizer test is failing on the CI for this PR (traceback). You can run the test locally from the root of the repo with the following command:

pytest -sv tests/models/whisper/test_tokenization_whisper.py::WhisperTokenizerTest::test_skip_special_tokens_with_timestamps

Could you have a look to see why this test is failing after these PR changes? Happy to help if it's unclear! If you have any dependency issues when running the test, you can install the necessary packages with:

pip install -e ".[quality]"

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
@AvivSham
Contributor Author

Hi @sanchit-gandhi,
This test fails because token_ids is an empty list. IMO this should be handled elsewhere and much earlier, maybe in this line; I cannot think of a case where a user would try to decode an empty list. Maybe _preprocess_token_ids would be a better place to add it.
WDYT?
If you agree, I can add this to the current PR:

if not isinstance(token_ids, int):
    if len(token_ids) == 0:
        raise ValueError("No token ids provided for decoding.")

Or, if you see a reason to allow decoding an empty list, we can return token_ids instead of raising the exception.

@sanchit-gandhi
Contributor

Hey @AvivSham - sorry I don't fully follow! If we look at the test test_skip_special_tokens_with_timestamps, we see that we always pass a non-empty array of token-ids to the tokeniser's .decode method. Running this test on main, we see that we always get the expected decoding results. For this PR, we need to make sure the changes we implement do not change the current decoding behaviour, which means that the test should pass. I would recommend using the debugger to check the behaviour of _preprocess_token_ids and _strip_prompt before and after the changes, and align this PR with the original behaviour. If you get stuck, let me know and I can take a deeper look!

@AvivSham
Contributor Author

Ok, so I debugged it: the empty token_ids comes from the _decode_with_timestamps method (see line 553 in the tokenization_whisper.py file).
I made a small adjustment to address this issue, and all the tests now pass.
@sanchit-gandhi please have another look 😄
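For readers following along, the adjusted helper plausibly ends up looking like this (a sketch reconstructed from the review thread and commit messages, not a verbatim copy of the merged diff):

def _strip_prompt(self, token_ids, prompt_token_id, decoder_start_token_id):
    if not isinstance(token_ids, list):
        token_ids = self._convert_to_list(token_ids)

    # _decode_with_timestamps can pass an empty segment; token_ids is
    # guaranteed to be a list here, so a plain truthiness check is safe
    if not token_ids:
        return token_ids

    if token_ids[0] == prompt_token_id:
        if decoder_start_token_id in token_ids:
            return token_ids[token_ids.index(decoder_start_token_id):]
        return []
    return token_ids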

@AvivSham
Contributor Author

AvivSham commented Jul 8, 2024

@amyeroberts
Collaborator

amyeroberts commented Jul 8, 2024

Hi @AvivSham, you need to run make fix-copies to resolve the failing quality checks.

cc @kamilakesbi for first review as @sanchit-gandhi is off

@AvivSham
Contributor Author

AvivSham commented Jul 8, 2024

@amyeroberts I ran it and committed. Let's see if the CI/CD passes this time.

@amyeroberts amyeroberts requested a review from kamilakesbi July 9, 2024 22:12
@AvivSham
Contributor Author

@amyeroberts @kamilakesbi
Gentle reminder 😃

Contributor

@kamilakesbi kamilakesbi left a comment


Hi @AvivSham, thanks for working on this!

LGTM! We can merge this PR, @amyeroberts :)

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Collaborator

@amyeroberts amyeroberts left a comment


Thanks for adding! Just one comment

@AvivSham
Contributor Author

Hi @amyeroberts,
This was a mistake; we can resolve this comment. See these lines.

Collaborator

@amyeroberts amyeroberts left a comment


Thanks for adding!

@amyeroberts amyeroberts merged commit 7f79a97 into huggingface:main Jul 12, 2024
21 checks passed
@AvivSham
Contributor Author

@amyeroberts I keep receiving emails about failed CI/CD runs; how can we prevent that? I thought they would stop once the PR was merged.

@amyeroberts
Collaborator

Hi @AvivSham, this is because of the settings on your fork.

You can disable the CI runs by going to the repo, then Settings > Actions > Disable actions


@AvivSham
Contributor Author

@amyeroberts Thank you very much for helping.

amyeroberts pushed a commit to amyeroberts/transformers that referenced this pull request Jul 19, 2024
* fix prompt strip to support tensors and np arrays

* framework agnostic

* change logic check before converting prompt into list

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>

* adding _convert_to_list to tokenization_whisper_fast

* adding tests for prompt decoding

* adding comment

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>

* adding comment

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>

* revert minor

* make style formatting

* style formatting after update

* Update src/transformers/models/whisper/tokenization_whisper_fast.py

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>

* fixing _strip_prompt to handle _decode_with_timestamps

* fix copies

---------

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
MHRDYN7 pushed a commit to MHRDYN7/transformers that referenced this pull request Jul 23, 2024
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Jul 24, 2024
itazap pushed a commit that referenced this pull request Jul 25, 2024