
fix prompt strip to support tensors and np arrays #27818

Merged · 16 commits · Jul 12, 2024

Conversation

Contributor

@AvivSham AvivSham commented Dec 3, 2023

What does this PR do?

WhisperTokenizer does not strip the prompt ids when decode is called on a torch.Tensor, tf.Tensor, or np.ndarray. The slicing in _strip_prompt is never performed for these types because the condition isinstance(token_ids, list) is not met.
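To make the failure mode concrete, here is a minimal sketch of the pre-fix logic (simplified; the exact slicing details in _strip_prompt may differ):

def _strip_prompt(self, token_ids, prompt_token_id, decoder_start_token_id):
    # only plain Python lists pass this guard, so torch.Tensor, tf.Tensor,
    # and np.ndarray inputs skip the prompt-stripping branch entirely
    has_prompt = isinstance(token_ids, list) and token_ids and token_ids[0] == prompt_token_id
    if has_prompt:
        if decoder_start_token_id in token_ids:
            # drop everything before the decoder start token
            return token_ids[token_ids.index(decoder_start_token_id):]
        return []
    return token_ids  # tensors and arrays fall through unchanged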

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@sanchit-gandhi
@ArthurZucker

@AvivSham
Contributor Author

AvivSham commented Dec 3, 2023

I was not able to make it pass the CI/CD tests due to import issues. @sanchit-gandhi can you please guide?

@huggingface huggingface deleted a comment from github-actions bot Jan 3, 2024
Contributor

@sanchit-gandhi sanchit-gandhi left a comment


Hey @AvivSham! Thanks for opening this PR - the first draft looks in decent nick!

One thing we need to change is how we check for Torch/TF tensors: the tokenization file should be agnostic to framework. Imagine we have a user running Whisper in PyTorch. We do not force them to have TF installed, since there's no need to have both PyTorch and TF installed if just using PyTorch. Therefore, we cannot assume TF is an available import, and thus can't leverage it in the tokenization file. The converse is true for someone using TF, where we don't expect them to have PyTorch installed.

=> what we need to do here is refactor the changes such that they are agnostic to framework. I've left some hints on how to do this below! Let me know if you have any questions - happy to help!

if has_prompt:
    if not isinstance(token_ids, list):
        token_ids = self._convert_to_list(token_ids)

Let's convert the token ids to a list before checking whether we have a prompt. If we convert the token ids to a list first, then we don't need to check:

isinstance(token_ids, (List[int], np.ndarray, torch.Tensor, tf.Tensor))

=> since we know token_ids will be a list!
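For reference, a framework-agnostic conversion can avoid importing torch or TF entirely by duck-typing on the tensor object. A sketch of such a _convert_to_list helper (assuming both tensor types expose a .numpy() method, which they do):

import numpy as np

def _convert_to_list(token_ids):
    # torch.Tensor and tf.Tensor both expose .numpy(); inspect the type name
    # as a string so that neither framework needs to be importable here
    if hasattr(token_ids, "numpy"):
        if "torch" in str(type(token_ids)):
            token_ids = token_ids.cpu().numpy()  # move off the GPU first
        else:
            token_ids = token_ids.numpy()
    # token_ids is now an ndarray, or was already a plain list
    if isinstance(token_ids, np.ndarray):
        token_ids = token_ids.tolist()
    return token_ids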


This is still valid! We need to convert to list before checking whether we have a prompt

src/transformers/models/whisper/tokenization_whisper.py (outdated review comment, resolved)
@AvivSham
Contributor Author

Thank you for reviewing, @sanchit-gandhi!
I was not aware of the framework-agnostic constraint. I think we are OK now; can you please review again?


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@AvivSham
Contributor Author

This PR is not stale; waiting for @sanchit-gandhi to approve.

@huggingface huggingface deleted a comment from github-actions bot Mar 8, 2024
@amyeroberts
Collaborator

Gentle ping @sanchit-gandhi

@AvivSham
Contributor Author

@amyeroberts @sanchit-gandhi
Any news about this PR?

@AvivSham
Contributor Author

@amyeroberts This issue is not stale; still waiting for @sanchit-gandhi.

@huggingface huggingface deleted a comment from github-actions bot Apr 25, 2024
@amyeroberts
Collaborator

@sanchit-gandhi @ylacombe Can one of you review this?

@huggingface huggingface deleted a comment from github-actions bot May 20, 2024
Contributor

@sanchit-gandhi sanchit-gandhi left a comment


Thanks for the last round of commits @AvivSham! Super sorry for the late review here

I would recommend using the following script to quickly debug your changes and confirm they're working (currently the following code fails since we don't strip the prompt for numpy inputs):

import numpy as np
from transformers import WhisperTokenizer

prompt_text = "the cat sat on the mat"
input_text = "the quick brown fox"

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny")
prompt_ids = tokenizer.get_prompt_ids(prompt_text, return_tensors="np")

input_ids = tokenizer(input_text, return_tensors="np").input_ids[0]

input_ids = np.hstack([prompt_ids, input_ids])

# check with prompt in output
pred_text = tokenizer.decode(input_ids, skip_special_tokens=False)
assert pred_text.strip() == prompt_text + input_text

# check stripping prompt from output
pred_text = tokenizer.decode(input_ids, skip_special_tokens=True)
assert pred_text.strip() == input_text

Along with the subtle changes to the order of operations (which will fix the above example), it would be great to add a slow/fast tokenizer test to confirm we get the correct behaviour after this change. To do this, you can add a test in the file test_tokenization_whisper.py. I would follow the existing logic to load pre-trained slow and fast (or "rust") tokenizers, then run a similar check to the one above. You can use the following test as an example:

def test_basic_normalizer(self):
    tokenizer = self.get_tokenizer()
    rust_tokenizer = self.get_rust_tokenizer()
Let me know if you have any questions! Happy to help with the final changes before merge!
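A hypothetical sketch of such a test (the method name is illustrative, and it assumes numpy is imported as np at the top of test_tokenization_whisper.py):

def test_decode_prompt_ids(self):
    tokenizer = self.get_tokenizer()
    rust_tokenizer = self.get_rust_tokenizer()

    prompt_text = "the cat sat on the mat"
    input_text = "the quick brown fox"

    for tok in (tokenizer, rust_tokenizer):
        prompt_ids = tok.get_prompt_ids(prompt_text, return_tensors="np")
        input_ids = tok(input_text, return_tensors="np").input_ids[0]
        merged = np.hstack([prompt_ids, input_ids])
        # after the fix, the prompt should be stripped from the decoded text
        self.assertEqual(tok.decode(merged, skip_special_tokens=True).strip(), input_text)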

@AvivSham
Contributor Author

AvivSham commented Jun 12, 2024

Hi @sanchit-gandhi,
Added the requested changes (including the new test). Can you review again (hopefully for the last time 😄)?
BTW, I also pulled the transformers main branch so we're aligned.

Contributor

@sanchit-gandhi sanchit-gandhi left a comment


Hey @AvivSham! The functionality looks great here! Just a few clean-up comments below. Once done, could you run:

make style

to run the linter and format the code? If you hit issues with missing packages, you can install them with the following from the root of the Transformers repo:

pip install -e ".[quality]"

Many thanks!

AvivSham and others added 6 commits June 13, 2024 18:27
Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
@AvivSham
Copy link
Contributor Author

@sanchit-gandhi
I've made all the required changes, including the revert and running the style checks. Please check again.

Contributor

@sanchit-gandhi sanchit-gandhi left a comment


Looks like there was a minor typo (I've suggested a fix below); running make fix-copies should confirm the fix is correct. By the way, there's great documentation on the style checks we use in Transformers in this contributing guide.

It looks like a Whisper tokenizer test is failing on the CI for this PR (traceback). You can run the test locally from the root of the repo with the following command:

pytest -sv tests/models/whisper/test_tokenization_whisper.py::WhisperTokenizerTest::test_skip_special_tokens_with_timestamps

Could you have a look to see why this test is failing after these PR changes? Happy to help if it's unclear! If you have any dependency issues when running the test, you can install the necessary packages with:

pip install -e ".[quality]"

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
@AvivSham
Contributor Author

Hi @sanchit-gandhi,
This test fails because token_ids is an empty list. IMO this should be handled elsewhere and much earlier, maybe in this line; I cannot think of a case where a user would try to decode an empty list. Maybe _preprocess_token_ids would be a better place to add it.
WDYT?
If you agree, I can add this to the current PR:

if not isinstance(token_ids, int):
    if len(token_ids) == 0:
        raise ValueError("No token ids provided for decoding.")

Or, if you see a reason to allow decoding an empty list, we can return token_ids instead of raising the exception.

@sanchit-gandhi
Contributor

Hey @AvivSham - sorry I don't fully follow! If we look at the test test_skip_special_tokens_with_timestamps, we see that we always pass a non-empty array of token-ids to the tokeniser's .decode method. Running this test on main, we see that we always get the expected decoding results. For this PR, we need to make sure the changes we implement do not change the current decoding behaviour, which means that the test should pass. I would recommend using the debugger to check the behaviour of _preprocess_token_ids and _strip_prompt before and after the changes, and align this PR with the original behaviour. If you get stuck, let me know and I can take a deeper look!

@AvivSham
Contributor Author

Ok, so I debugged it: the empty token_ids comes from the _decode_with_timestamps method (see line 553 in the tokenization_whisper.py file).
I made a small adjustment to address this issue, and all the tests now pass.
@sanchit-gandhi please have another look 😄
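For readers following along, the adjusted helper plausibly ends up looking like this (a sketch reconstructed from the review thread and commit messages, not a verbatim copy of the merged diff):

def _strip_prompt(self, token_ids, prompt_token_id, decoder_start_token_id):
    if not isinstance(token_ids, list):
        token_ids = self._convert_to_list(token_ids)

    # _decode_with_timestamps can pass an empty segment; token_ids is
    # guaranteed to be a list here, so a plain truthiness check is safe
    if not token_ids:
        return token_ids

    if token_ids[0] == prompt_token_id:
        if decoder_start_token_id in token_ids:
            return token_ids[token_ids.index(decoder_start_token_id):]
        return []
    return token_ids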

@AvivSham
Contributor Author

AvivSham commented Jul 8, 2024

@amyeroberts
Collaborator

amyeroberts commented Jul 8, 2024

Hi @AvivSham, you need to run make fix-copies to resolve the failing quality checks.

cc @kamilakesbi for first review as @sanchit-gandhi is off

@AvivSham
Contributor Author

AvivSham commented Jul 8, 2024

@amyeroberts I ran it and committed. Let's see if the CI/CD passes this time.

@amyeroberts amyeroberts requested a review from kamilakesbi July 9, 2024 22:12
@AvivSham
Contributor Author

@amyeroberts @kamilakesbi
Gentle reminder 😃

Contributor

@kamilakesbi kamilakesbi left a comment


Hi @AvivSham, thanks for working on this!

LGTM! We can merge this PR, @amyeroberts :)

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Collaborator

@amyeroberts amyeroberts left a comment


Thanks for adding! Just one comment

@AvivSham
Contributor Author

Hi @amyeroberts,
This was a mistake; we can resolve this comment. See these lines.

Collaborator

@amyeroberts amyeroberts left a comment


Thanks for adding!

@amyeroberts amyeroberts merged commit 7f79a97 into huggingface:main Jul 12, 2024
21 checks passed
@AvivSham
Contributor Author

@amyeroberts I keep receiving emails about failed CI/CD runs; how can we prevent that? I thought they would stop once the PR was merged.

@amyeroberts
Collaborator

Hi @AvivSham, this is because of the settings on your fork.

You can disable the CI runs by going to the repo, then Settings > Actions > Disable actions


@AvivSham
Contributor Author

@amyeroberts Thank you very much for helping.

amyeroberts pushed a commit to amyeroberts/transformers that referenced this pull request Jul 19, 2024
* fix prompt strip to support tensors and np arrays

* framework agnostic

* change logic check before converting prompt into list

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>

* adding _convert_to_list to tokenization_whisper_fast

* adding tests for prompt decoding

* adding comment

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>

* adding comment

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>

* revert minor

* make style formatting

* style formatting after update

* Update src/transformers/models/whisper/tokenization_whisper_fast.py

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>

* fixing _strip_prompt to handle _decode_with_timestamps

* fix copies

---------

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
MHRDYN7 pushed a commit to MHRDYN7/transformers that referenced this pull request Jul 23, 2024
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Jul 24, 2024
itazap pushed a commit that referenced this pull request Jul 25, 2024