Add validation for maximum sequence length in modeling_whisper.py #33196

AmirMohammadFakhimi · 2024-08-29T12:01:33Z

What does this PR do?

Added a validation check to ensure that the sequence length of labels does not exceed the maximum allowed length of 448 tokens. If the sequence length exceeds this limit, a ValueError is raised with a descriptive error message.

This change prevents the model from encountering errors or unexpected behavior due to excessively long sequences during training or fine-tuning, ensuring consistent input dimensions and improving overall robustness.

While training, I encountered the following error multiple times:

../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1167,0,0], thread: [0,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1167,0,0], thread: [1,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.

Eventually, this led to the following error message:

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging, consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

It took me almost a week to diagnose and understand the root cause of this problem. This simple validation can save others from facing similar debugging challenges and wasted time.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

speech models: @sanchit-gandhi

Added a validation check to ensure that the sequence length of labels does not exceed the maximum allowed length of 448 tokens. If the sequence length exceeds this limit, a ValueError is raised with a descriptive error message. This change prevents the model from encountering errors or unexpected behavior due to excessively long sequences during training or fine-tuning, ensuring consistent input dimensions and improving overall robustness.

LysandreJik · 2024-08-30T13:30:53Z

cc @ylacombe and @eustlb when back from leave!

ylacombe

Hey @AmirMohammadFakhimi, many thanks for opening this PR!

Do you think you could give a minimum code snippet to reproduce the issue you have? It doesn't have to be a training code, I think a forward pass with handmade labels should do the trick!

Thanks for your help

cc @eustlb for visibility

src/transformers/models/whisper/modeling_whisper.py

ylacombe · 2024-09-04T08:34:26Z

Hey @AmirMohammadFakhimi, thanks for providing the code snippet!

So, according to Whisper configuration, there's two hard limits on tokens and spectrograms lengths: max_source_positions and max_target_positions. The latter is where you can find the hard 448 token limits.

If you wanted to modify that limit, you'd have to modify the architecture, and thus to re-train the model from scratch.

Your PR is justified, I'd just modify the mention to 448 to config.max_target_positions, e.g:

-            if labels.shape[1] > 448:
+           if labels.shape[1] > config.max_target_positions:

ylacombe

Could you also add a test here that makes sure the model works with labels sequence length smaller than the limit, and that makes sure it throws an error if not ?

src/transformers/models/whisper/modeling_whisper.py

…whisper.py The exception message is for whisper's label's sequence max length. Co-authored-by: Yoach Lacombe <52246514+ylacombe@users.noreply.github.com>

…whisper/modeling_whisper.py It's for whisper's config.max_target_positions. Co-authored-by: Yoach Lacombe <52246514+ylacombe@users.noreply.github.com>

…ling_whisper.py

…ality

…de_quality

…nerate

…de_quality

…er.py

…de_quality

AmirMohammadFakhimi · 2024-09-04T15:38:01Z

Could you also add a test here that makes sure the model works with labels sequence length smaller than the limit, and that makes sure it throws an error if not ?

I added three tests to the file. I also merged branch 'huggingface:main' into patch-1.
For testing, I made a hardcode here because I think it is hardcoded in the whisper model, too. Do you have any better idea?

And finally, just for info, there are these lines in test_modeling_whisper.py:

config, inputs_dict = self.prepare_config_and_inputs()

config, input_dict = self.prepare_config_and_inputs()

config, inputs = self.model_tester.prepare_config_and_inputs()

I used the second format. I think having a convention for it will help.

ylacombe

Thanks for iterating ! Let's not hardcode the max length !

tests/models/whisper/test_modeling_whisper.py

ylacombe · 2024-09-04T16:26:33Z

tests/models/whisper/test_modeling_whisper.py

+            model = model_class(config)
+            input_features = input_dict["input_features"]
+
+            labels_length = 449


For example here it would be labels_length = config.max_target_positions

tests/models/whisper/test_modeling_whisper.py

…er.py

…de_quality

…nerate

ylacombe

Thanks for iterating!
This is a bit complicate, you can do it like propose in this review!

tests/models/whisper/test_modeling_whisper.py

HuggingFaceDocBuilderDev · 2024-09-04T17:18:59Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

… test_modeling_whisper.py

…modeling_whisper.py

ylacombe

Thanks for iterating quickly on this !

cc @LysandreJik for review!

src/transformers/models/whisper/modeling_whisper.py

…ggingface#33196) * Add validation for maximum sequence length in modeling_whisper.py Added a validation check to ensure that the sequence length of labels does not exceed the maximum allowed length of 448 tokens. If the sequence length exceeds this limit, a ValueError is raised with a descriptive error message. This change prevents the model from encountering errors or unexpected behavior due to excessively long sequences during training or fine-tuning, ensuring consistent input dimensions and improving overall robustness. * Change exception message in src/transformers/models/whisper/modeling_whisper.py The exception message is for whisper's label's sequence max length. Co-authored-by: Yoach Lacombe <52246514+ylacombe@users.noreply.github.com> * Change 448 to config.max_target_positions in src/transformers/models/whisper/modeling_whisper.py It's for whisper's config.max_target_positions. Co-authored-by: Yoach Lacombe <52246514+ylacombe@users.noreply.github.com> * Change method's documentation in src/transformers/models/whisper/modeling_whisper.py * Add test for maximum label's sequence length in test_modeling_whisper.py * Add self to modeling_whisper.py * Update test_modeling_whisper.py with respect to automatic validations * Update modeling_whisper.py with respect to ci/circleci: check_code_quality * Update test_modeling_whisper.py with respect to ci/circleci: check_code_quality * Update test_modeling_whisper.py with respect to ci/circleci: tests_generate * Update test_modeling_whisper.py with respect to ci/circleci: tests_generate * Update test_modeling_whisper.py with respect to ci/circleci: check_code_quality * Separate test_labels_sequence_max_length tests in test_modeling_whisper.py * Update test_modeling_whisper.py with respect to ci/circleci: check_code_quality * Remove assert from test_modeling_whisper.py * Add max_target_positions to WhisperModelTester in test_modeling_whisper.py * Update test_modeling_whisper.py with respect to ci/circleci: check_code_quality * Update test_modeling_whisper.py with respect to ci/circleci: tests_generate * Update test_modeling_whisper.py * Change test_labels_sequence_max_length_error_after_changing_config in test_modeling_whisper.py * Change self.config.max_target_positions to self.max_target_positions modeling_whisper.py * Add new tests in test_modeling_whisper.py * Update test_modeling_whisper.py --------- Co-authored-by: Yoach Lacombe <52246514+ylacombe@users.noreply.github.com>

jugal-sheth · 2024-11-15T10:25:02Z

try below

# Initialize parameters and tokenizer
transcripts_folder = "path_to_transcripts"
max_target_length = 448
show_target_length = 400
tokenizer = InitializeTokenizer("model_name", language="en", task="transcribe")

# Define the function to check and delete faulty transcripts
function check_and_delete_transcripts():
    files = ListAllFiles(transcripts_folder, extension=".txt")  # Get all .txt files

    for file in ProgressBar(files, description="Checking transcripts"):
        content = ReadFile(file)  # Read the transcript content
        tokenized = TokenizeContent(tokenizer, content)  # Tokenize the content
        token_length = GetTokenLength(tokenized)  # Get the length of tokens

        if token_length > max_target_length:
            Print(f"Faulty file: {file}, Tokens: {token_length}")
            DeleteFile(file)  # Delete if too long
        elif token_length > show_target_length:
            Print(f"File exceeds show limit: {file}")
            PrintContent(content)  # Optionally delete or inspect further

# Execute the function
check_and_delete_transcripts()

this will help you get rid of faulty data then finetune the model

…ggingface#33196) * Add validation for maximum sequence length in modeling_whisper.py Added a validation check to ensure that the sequence length of labels does not exceed the maximum allowed length of 448 tokens. If the sequence length exceeds this limit, a ValueError is raised with a descriptive error message. This change prevents the model from encountering errors or unexpected behavior due to excessively long sequences during training or fine-tuning, ensuring consistent input dimensions and improving overall robustness. * Change exception message in src/transformers/models/whisper/modeling_whisper.py The exception message is for whisper's label's sequence max length. Co-authored-by: Yoach Lacombe <52246514+ylacombe@users.noreply.github.com> * Change 448 to config.max_target_positions in src/transformers/models/whisper/modeling_whisper.py It's for whisper's config.max_target_positions. Co-authored-by: Yoach Lacombe <52246514+ylacombe@users.noreply.github.com> * Change method's documentation in src/transformers/models/whisper/modeling_whisper.py * Add test for maximum label's sequence length in test_modeling_whisper.py * Add self to modeling_whisper.py * Update test_modeling_whisper.py with respect to automatic validations * Update modeling_whisper.py with respect to ci/circleci: check_code_quality * Update test_modeling_whisper.py with respect to ci/circleci: check_code_quality * Update test_modeling_whisper.py with respect to ci/circleci: tests_generate * Update test_modeling_whisper.py with respect to ci/circleci: tests_generate * Update test_modeling_whisper.py with respect to ci/circleci: check_code_quality * Separate test_labels_sequence_max_length tests in test_modeling_whisper.py * Update test_modeling_whisper.py with respect to ci/circleci: check_code_quality * Remove assert from test_modeling_whisper.py * Add max_target_positions to WhisperModelTester in test_modeling_whisper.py * Update test_modeling_whisper.py with respect to ci/circleci: check_code_quality * Update test_modeling_whisper.py with respect to ci/circleci: tests_generate * Update test_modeling_whisper.py * Change test_labels_sequence_max_length_error_after_changing_config in test_modeling_whisper.py * Change self.config.max_target_positions to self.max_target_positions modeling_whisper.py * Add new tests in test_modeling_whisper.py * Update test_modeling_whisper.py --------- Co-authored-by: Yoach Lacombe <52246514+ylacombe@users.noreply.github.com>

ylacombe reviewed Sep 3, 2024

View reviewed changes

src/transformers/models/whisper/modeling_whisper.py Outdated Show resolved Hide resolved

ylacombe reviewed Sep 4, 2024

View reviewed changes

src/transformers/models/whisper/modeling_whisper.py Outdated Show resolved Hide resolved

src/transformers/models/whisper/modeling_whisper.py Outdated Show resolved Hide resolved

src/transformers/models/whisper/modeling_whisper.py Outdated Show resolved Hide resolved

AmirMohammadFakhimi and others added 14 commits September 4, 2024 14:26

Change exception message in src/transformers/models/whisper/modeling_…

4330cc3

…whisper.py The exception message is for whisper's label's sequence max length. Co-authored-by: Yoach Lacombe <52246514+ylacombe@users.noreply.github.com>

Change 448 to config.max_target_positions in src/transformers/models/…

f12a3c2

…whisper/modeling_whisper.py It's for whisper's config.max_target_positions. Co-authored-by: Yoach Lacombe <52246514+ylacombe@users.noreply.github.com>

Change method's documentation in src/transformers/models/whisper/mode…

264cd5b

…ling_whisper.py

Add test for maximum label's sequence length in test_modeling_whisper.py

cd4db32

Merge branch 'huggingface:main' into patch-1

03db15f

Add self to modeling_whisper.py

53dea49

Update test_modeling_whisper.py with respect to automatic validations

933a058

Update modeling_whisper.py with respect to ci/circleci: check_code_qu…

2be9b27

…ality

Update test_modeling_whisper.py with respect to ci/circleci: check_co…

9c15e68

…de_quality

Update test_modeling_whisper.py with respect to ci/circleci: tests_ge…

81d704e

…nerate

Update test_modeling_whisper.py with respect to ci/circleci: tests_ge…

f5ae7ec

…nerate

Update test_modeling_whisper.py with respect to ci/circleci: check_co…

b586686

…de_quality

Separate test_labels_sequence_max_length tests in test_modeling_whisp…

8aa4832

…er.py

Update test_modeling_whisper.py with respect to ci/circleci: check_co…

8bf582f

…de_quality

ylacombe reviewed Sep 4, 2024

View reviewed changes

AmirMohammadFakhimi added 4 commits September 4, 2024 20:10

Remove assert from test_modeling_whisper.py

13baea3

Add max_target_positions to WhisperModelTester in test_modeling_whisp…

be3d4d5

…er.py

Update test_modeling_whisper.py with respect to ci/circleci: check_co…

6cf4d8d

…de_quality

Update test_modeling_whisper.py with respect to ci/circleci: tests_ge…

930a933

…nerate

ylacombe reviewed Sep 4, 2024

View reviewed changes

AmirMohammadFakhimi added 3 commits September 5, 2024 00:02

Update test_modeling_whisper.py

fb1cd5e

Change test_labels_sequence_max_length_error_after_changing_config in…

88cd92f

… test_modeling_whisper.py

Change self.config.max_target_positions to self.max_target_positions …

4b677d5

…modeling_whisper.py

AmirMohammadFakhimi added 2 commits September 5, 2024 00:32

Add new tests in test_modeling_whisper.py

bab1098

Update test_modeling_whisper.py

de273c0

ylacombe approved these changes Sep 5, 2024

View reviewed changes

ylacombe requested a review from LysandreJik September 5, 2024 09:30

LysandreJik reviewed Sep 5, 2024

View reviewed changes

src/transformers/models/whisper/modeling_whisper.py Show resolved Hide resolved

LysandreJik approved these changes Sep 5, 2024

View reviewed changes

LysandreJik merged commit 3314fe1 into huggingface:main Sep 6, 2024
18 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add validation for maximum sequence length in modeling_whisper.py #33196

Add validation for maximum sequence length in modeling_whisper.py #33196

AmirMohammadFakhimi commented Aug 29, 2024

LysandreJik commented Aug 30, 2024

ylacombe left a comment

ylacombe commented Sep 4, 2024

ylacombe left a comment

AmirMohammadFakhimi commented Sep 4, 2024 •

edited

Loading

ylacombe left a comment

ylacombe Sep 4, 2024

ylacombe left a comment

HuggingFaceDocBuilderDev commented Sep 4, 2024

ylacombe left a comment

jugal-sheth commented Nov 15, 2024

Add validation for maximum sequence length in modeling_whisper.py #33196

Add validation for maximum sequence length in modeling_whisper.py #33196

Conversation

AmirMohammadFakhimi commented Aug 29, 2024

What does this PR do?

Who can review?

LysandreJik commented Aug 30, 2024

ylacombe left a comment

Choose a reason for hiding this comment

ylacombe commented Sep 4, 2024

ylacombe left a comment

Choose a reason for hiding this comment

AmirMohammadFakhimi commented Sep 4, 2024 • edited Loading

ylacombe left a comment

Choose a reason for hiding this comment

ylacombe Sep 4, 2024

Choose a reason for hiding this comment

ylacombe left a comment

Choose a reason for hiding this comment

HuggingFaceDocBuilderDev commented Sep 4, 2024

ylacombe left a comment

Choose a reason for hiding this comment

jugal-sheth commented Nov 15, 2024

AmirMohammadFakhimi commented Sep 4, 2024 •

edited

Loading