IndexError: list index out of range in add_word_timestamps function #1118

Closed
formater opened this issue Nov 6, 2024 · 10 comments · Fixed by #1157

Comments

@formater

formater commented Nov 6, 2024

Hi,
I found a rare condition: with a specific WAV file, language, and prompt, transcribing with word_timestamps=True raises a list index out of range error in the add_word_timestamps function:

  File "/usr/local/src/transcriber/lib/python3.11/site-packages/faster_whisper/transcribe.py", line 1574, in add_word_timestamps
    median_duration, max_duration = median_max_durations[segment_idx]
                                    ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
IndexError: list index out of range

It seems the median_max_durations list has fewer elements than the segments list.

I'm using the large-v3-turbo model with these transcribe settings:

segments, _ = asr_model.transcribe(audio_to_analize, language="fr", condition_on_previous_text=False, initial_prompt="Free", task='transcribe', word_timestamps=True, suppress_tokens=[-1, 12], beam_size=5) 
segments = list(segments)  # The transcription will actually run here.

As far as I can see, median_max_durations is populated from alignments, so maybe something is wrong there? If I change the language or prompt, or use another sound file, the issue does not occur.

Thank you

@MahmoudAshraf97
Collaborator

I'm aware that this error exists, but I have had no luck reproducing it. Can you write the exact steps to reproduce it and upload the audio file?

@formater
Author

formater commented Nov 6, 2024

Yes. Here is the sample Python code that triggers the issue:

import torch
from faster_whisper import WhisperModel

asr_model = WhisperModel("large-v3-turbo", device="cuda", compute_type="int8", download_root="./models")
segments, _ = asr_model.transcribe('test.wav',  language='fr', condition_on_previous_text=False, initial_prompt='Free', task='transcribe', word_timestamps=True, suppress_tokens=[-1, 12], beam_size=5)
segments = list(segments)  # The transcription will actually run here.

And the audio sample is attached.
test.zip

@MahmoudAshraf97
Collaborator

I was not able to reproduce it on my machine or on Colab.

@formater
Author

formater commented Nov 6, 2024

Maybe the Python version, Debian, PyTorch, or something else is slightly different between our setups. Is there anything I can do on my side to get more debug logs and see what the issue is?
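
One way to compare setups is to post the interpreter, platform, and package versions from each machine. The following is a minimal sketch using only the standard library; the helper names `collect_versions` and `environment_report` and the default package list are assumptions made for illustration, not part of faster-whisper.

```python
import platform
import sys
from importlib import metadata


def collect_versions(packages):
    """Return a dict mapping each package name to its installed version."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages explicitly instead of failing.
            versions[name] = "not installed"
    return versions


def environment_report(packages=("faster-whisper", "ctranslate2", "torch", "numpy")):
    """Build a plain-text report that can be pasted into an issue comment."""
    lines = [
        f"python: {sys.version.split()[0]}",
        f"platform: {platform.platform()}",
    ]
    for name, version in collect_versions(packages).items():
        lines.append(f"{name}: {version}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(environment_report())
```

Running the script on both machines and diffing the two reports should show whether the environments differ in any of the listed packages.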

@MahmoudAshraf97
Collaborator

Are you using the master branch?
median_max_durations is initialized as an empty list, and since you are using sequential transcription, it should end up with a single value. The only way to get this error is for it to still be an empty list, which means the for loop at line 1565 was never executed. That happens when alignments is an empty list, so you need to figure out why alignments is empty:

alignments = self.find_alignment(
    tokenizer, text_tokens, encoder_output, num_frames
)
median_max_durations = []
for alignment in alignments:
    word_durations = np.array(
        [word["end"] - word["start"] for word in alignment]
    )
    word_durations = word_durations[word_durations.nonzero()]
    median_duration = (
        np.median(word_durations) if len(word_durations) > 0 else 0.0
    )
    median_duration = min(0.7, float(median_duration))
    max_duration = median_duration * 2

    # hack: truncate long words at sentence boundaries.
    # a better segmentation algorithm based on VAD should be able to replace this.
    if len(word_durations) > 0:
        sentence_end_marks = ".。!！?？"
        # ensure words at sentence boundaries
        # are not longer than twice the median word duration.
        for i in range(1, len(alignment)):
            if alignment[i]["end"] - alignment[i]["start"] > max_duration:
                if alignment[i]["word"] in sentence_end_marks:
                    alignment[i]["end"] = alignment[i]["start"] + max_duration
                elif alignment[i - 1]["word"] in sentence_end_marks:
                    alignment[i]["start"] = alignment[i]["end"] - max_duration

    merge_punctuations(alignment, prepend_punctuations, append_punctuations)
    median_max_durations.append((median_duration, max_duration))

for segment_idx, segment in enumerate(segments):
    word_index = 0
    time_offset = segment[0]["start"]
    median_duration, max_duration = median_max_durations[segment_idx]
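
The failure mode described here can be shown in isolation. The sketch below is a simplified stand-in for the quoted loop, not the library code: `median_max_durations_for` and `lookup_all` are hypothetical names, and `statistics.median` replaces the NumPy call. It shows that an empty `alignments` list leaves nothing to index, while a list holding one empty alignment still yields one pair.

```python
import statistics


def median_max_durations_for(alignments):
    """Build one (median, max) duration pair per alignment."""
    result = []
    for alignment in alignments:
        durations = [word["end"] - word["start"] for word in alignment]
        nonzero = [d for d in durations if d != 0]
        median = statistics.median(nonzero) if nonzero else 0.0
        median = min(0.7, float(median))
        result.append((median, median * 2))
    return result


def lookup_all(segments, alignments):
    """Index the duration pairs by segment position, as the second loop does."""
    pairs = median_max_durations_for(alignments)
    return [pairs[segment_idx] for segment_idx, _ in enumerate(segments)]


segments = [["placeholder segment"]]

# One (empty) alignment per segment: the lookup succeeds.
print(lookup_all(segments, [[]]))

# No alignments at all: the lookup raises IndexError.
try:
    lookup_all(segments, [])
except IndexError as error:
    print("IndexError:", error)
```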

@krmao

krmao commented Nov 14, 2024

Same here, while testing whisper_streaming:

Traceback (most recent call last):
  File "C:\Users\kr.mao\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 187, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "C:\Users\kr.mao\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 110, in _get_module_details
    __import__(pkg_name)
  File "F:\Workspace\skills\python3\whisper_streaming\whisper_online_server.py", line 183, in <module>
    proc.process()
  File "F:\Workspace\skills\python3\whisper_streaming\whisper_online_server.py", line 162, in process
    o = online.process_iter()
  File "F:\Workspace\skills\python3\whisper_streaming\whisper_online.py", line 378, in process_iter
    res = self.asr.transcribe(self.audio_buffer, init_prompt=prompt)
  File "F:\Workspace\skills\python3\whisper_streaming\whisper_online.py", line 138, in transcribe
    return list(segments)
  File "F:\Workspace\skills\python3\whisper_streaming\venv\lib\site-packages\faster_whisper\transcribe.py", line 2016, in restore_speech_timestamps
    for segment in segments:
  File "F:\Workspace\skills\python3\whisper_streaming\venv\lib\site-packages\faster_whisper\transcribe.py", line 1256, in generate_segments
    self.add_word_timestamps(
  File "F:\Workspace\skills\python3\whisper_streaming\venv\lib\site-packages\faster_whisper\transcribe.py", line 1595, in add_word_timestamps
    median_duration, max_duration = median_max_durations[segment_idx]
IndexError: list index out of range

faster_whisper version.py:

"""Version information."""

__version__ = "1.1.0rc0"

@MahmoudAshraf97
Collaborator

This problem is still not reproducible with any of the methods provided, and it will not be solved without a reproduction. Someone who has the problem needs to create a Colab notebook that reproduces it. If they cannot reproduce it on Colab, they need to isolate what in their environment causes the problem. Without that, there is nothing that can be done.

@OliveSerg

OliveSerg commented Nov 19, 2024

https://gist.github.com/OliveSerg/cc6c409126567a40c94eb94339a13bae

I was able to reproduce it on Colab with the following files: test.zip. I was not able to reproduce it with @formater's test file, though. The files are just a French Bible verse from LibriVox and a YouTube short.

I used ctranslate2==4.4.0 because of 1806.

The error occurs only when compute_type="int8" or int8_float16, task="translate", and word_timestamps=True are all set. I did no further debugging of the parameters beyond changing these three.

@Purfview
Contributor

@MahmoudAshraf97

Maybe this is related to weird output like the following (from the pre-bug revision 193 of faster-whisper):

    {
        "id": 279,
        "seek": 132430,
        "start": 1542.84,
        "end": 1545.14,
        "text": " Nuðarr你可以 það hverðesskj af april",
        "tokens": [51225, 13612, 23436, 289, 81, 42766, 43219, 64, 23436, 276, 331, 23436, 442, 74, 73, 3238, 10992, 388, 51350],
        "temperature": 1.0,
        "avg_logprob": -4.741359252929687,
        "compression_ratio": 1.335164835164835,
        "no_speech_prob": 0.12347412109375,
        "words": [
            {"start": 1542.84, "end": 1542.84, "word": "af", "probability": 0.002758026123046875},
            {"start": 1542.84, "end": 1542.84, "word": "aprilð", "probability": 0.057145535945892334},
            {"start": 1542.84, "end": 1542.84, "word": "jævîr", "probability": 0.1567896842956543},
            {"start": 1542.84, "end": 1542.84, "word": "til", "probability": 0.0018939971923828125},
            {"start": 1542.84, "end": 1542.84, "word": "det", "probability": 0.0033779144287109375},
            {"start": 1542.84, "end": 1543.44, "word": "bældat", "probability": 0.11750292778015137},
            {"start": 1543.44, "end": 1544.36, "word": "brilliant", "probability": 7.152557373046875e-07},
            {"start": 1544.36, "end": 1545.14, "word": "með", "probability": 0.2783784866333008}
        ]
    },
    {
        "id": 280,
        "seek": 132430,
        "start": 1541.32,
        "end": 1543.04,
        "text": "ð jævîr til det bældat brilliant með",
        "tokens": [51350, 23436, 361, 7303, 85, 7517, 81, 8440, 1141, 272, 7303, 348, 267, 10248, 385, 23436, 51436],
        "temperature": 1.0,
        "avg_logprob": -4.741359252929687,
        "compression_ratio": 1.335164835164835,
        "no_speech_prob": 0.12347412109375,
        "words": []
    },
    {
        "id": 281,
        "seek": 135430,
        "start": 1545.14,
        "end": 1546.3,
        "text": " Duð ena porgna prákankenin.",
        "tokens": [50364, 5153, 23436, 465, 64, 1515, 70, 629, 582, 842, 5225, 2653, 259, 13, 50431],
        "temperature": 1.0,
        "avg_logprob": -4.655551255031784,
        "compression_ratio": 1.3051771117166213,
        "no_speech_prob": 0.036651611328125,
        "words": [
            {"start": 1545.14, "end": 1545.36, "word": "Duð", "probability": 0.051422119140625},
            {"start": 1545.36, "end": 1545.36, "word": "ena", "probability": 0.010187149047851562},
            {"start": 1545.36, "end": 1545.44, "word": "porgna", "probability": 0.004482746124267578},
            {"start": 1545.44, "end": 1546.3, "word": "prákankenin.", "probability": 0.04590331315994263}
        ]
    }
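
Output like this can be scanned automatically. The following is a minimal sketch, assuming segments are plain dicts shaped like the ones above; `find_suspicious_segments` is a hypothetical helper, and the two checks (text without words, a start time earlier than the previous segment's end) are just the anomalies visible in segment 280.

```python
def find_suspicious_segments(segments):
    """Return (segment id, reason) pairs for segments that look inconsistent."""
    findings = []
    previous_end = None
    for segment in segments:
        # A segment with text should normally carry word-level entries.
        if segment["text"].strip() and not segment["words"]:
            findings.append((segment["id"], "text but no words"))
        # Segments are expected to move forward in time.
        if previous_end is not None and segment["start"] < previous_end:
            findings.append((segment["id"], "starts before previous segment ends"))
        previous_end = segment["end"]
    return findings


# Trimmed copies of the three segments shown above.
sample = [
    {"id": 279, "start": 1542.84, "end": 1545.14, "text": " af april", "words": [{"word": "af"}]},
    {"id": 280, "start": 1541.32, "end": 1543.04, "text": "ð jævîr til det", "words": []},
    {"id": 281, "start": 1545.14, "end": 1546.3, "text": " Duð ena", "words": [{"word": "Duð"}]},
]
for segment_id, reason in find_suspicious_segments(sample):
    print(segment_id, reason)
```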

@MahmoudAshraf97
Collaborator

I managed to reproduce it consistently on Colab. I also reproduced it on my machine, but not consistently. The reason for the inconsistency is that reproduction needs the exact encoder input and generated tokens, and int8 does not guarantee that, at least on my hardware (RTX 3070 Ti), so I have to try transcribing several times to reproduce it.
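
Since a single run may not trigger the error, a retry loop can help when reproducing it. This is a minimal sketch: `attempts_until_index_error` is a hypothetical helper, and `run_once` stands for any zero-argument callable that performs one full transcription, for example a function that calls `list(segments)` on the result of `transcribe` with the settings reported above.

```python
def attempts_until_index_error(run_once, max_attempts=10):
    """Call run_once() repeatedly.

    Returns the 1-based number of the attempt that raised IndexError,
    or None if no attempt raised it within max_attempts.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            run_once()
        except IndexError:
            return attempt
    return None
```

Other exception types are deliberately not caught, so an unrelated failure still surfaces immediately.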

The cause is that some segments produce a single timestamp token with no text tokens and nothing else. The find_alignment function returned an empty list when no words were found, which was fine before #856. After #856, we expect find_alignment to return a list of lists, which it does as long as there are text tokens. In the edge case where there are none, it returned a single list and skipped the rest of the loop over the other segments in the batch, so it returned fewer alignments than segments, causing the list index out of range error.
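
The edge case can be modeled with a simplified sketch. The two functions below are hypothetical stand-ins that only mirror the control flow described here; they are not the real find_alignment and not the change made in the eventual PR. Each input item represents the text tokens of one segment in the batch.

```python
def find_alignment_early_return(text_tokens_per_segment):
    """Model of the bug: an empty segment ends the whole loop."""
    results = []
    for text_tokens in text_tokens_per_segment:
        if len(text_tokens) == 0:
            # Returns a single flat list and drops all later segments.
            return []
        results.append([{"token": token} for token in text_tokens])
    return results


def find_alignment_per_segment(text_tokens_per_segment):
    """Model of the expected behavior: one alignment list per segment."""
    results = []
    for text_tokens in text_tokens_per_segment:
        if len(text_tokens) == 0:
            # Keep the position of the empty segment and continue.
            results.append([])
            continue
        results.append([{"token": token} for token in text_tokens])
    return results


batch = [[101, 102], [], [103]]
print(len(find_alignment_early_return(batch)))  # fewer alignments than segments
print(len(find_alignment_per_segment(batch)))   # one alignment per segment
```

With the second variant, the caller's lookup by segment index always finds an entry, even when that entry is an empty list.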

I'll open a PR to solve the problem soon
