No punctuation for the first 75 minutes of the video. What could be the error / bug? #194

FurkanGozukara · 2022-09-29T20:54:48Z

FurkanGozukara
Sep 29, 2022

Hello. I am generating subtitles for this video of mine: https://www.youtube.com/watch?v=77iDUQd4x90

I provided the video file directly to Whisper with language en and model large.

However, as can be seen from the screenshots below, there is no punctuation in the transcription for the first 75 minutes of the video:

But after that, the punctuation starts.

I am using the latest version of Whisper on a Windows computer with Python 3.9.9.

Answered by jongwook


FurkanGozukara · 2022-09-29T20:57:00Z

FurkanGozukara
Sep 29, 2022
Author

@jongwook @drdaxxy @cool-RR @gglanzani If you could check I would appreciate very much

0 replies

jongwook · 2022-09-29T20:59:20Z

jongwook
Sep 29, 2022
Maintainer

Being an autoregressive model, Whisper has a certain chance of getting stuck in a "no-punctuation mode". It also seems to be correlated with the tendency to create either the precise timestamps you see in the first screenshot or the integer timestamps you see in the second. You could try giving --initial_prompt "Hello, welcome to my lecture." to nudge the model toward the "with-punctuation mode".
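For anyone scripting this, here is a minimal sketch of passing a punctuated initial prompt to the whisper command line from Python. The helper name `build_whisper_command` and the file name `input.mp3` are made up for illustration; the flags `--model`, `--language` and `--initial_prompt` are the ones used in this thread.

```python
import subprocess


def build_whisper_command(audio_path, model="large", language="en",
                          initial_prompt="Hello, welcome to my lecture."):
    """Return the argument list for the whisper CLI with a punctuated initial prompt."""
    return [
        "whisper", audio_path,
        "--model", model,
        "--language", language,
        "--initial_prompt", initial_prompt,
    ]


cmd = build_whisper_command("input.mp3")  # "input.mp3" is a placeholder file name
# Passing a list avoids shell quoting problems with the spaces in the prompt.
# Uncomment to actually run it (requires whisper to be installed):
# subprocess.run(cmd, check=True)
```

The prompt only nudges the model, so this does not guarantee punctuation for the whole file.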

8 replies

jongwook Sep 29, 2022
Maintainer

It's hacky, but it's the simplest interface for steering the model's behavior. Welcome to the world of prompt engineering 😂. A more serious way to do that would be to fine-tune the model on transcripts with punctuation, but that's more cumbersome.

FurkanGozukara Sep 29, 2022
Author

Yes, it broke again after some time and then got fixed again after a while. I wish there was a way to force it to always produce punctuation.

ryanheise Oct 26, 2022

That's really fascinating. It's almost as though Whisper has a split personality disorder :-)

I wonder, would it be possible to include non-integer timestamps in the prompt to nudge it toward accurate timestamps, and to also include punctuation in the prompt to get it to do both punctuation and accurate timestamps at the same time? Or would accurate timestamps tend to switch off the punctuation anyway?

HaeChan0305 Dec 22, 2022

@jongwook I'm struggling with the same problem, the no-punctuation mode. Why do you think that the punctuation mode is correlated with the tendency to create precise timestamps?

Sogl Sep 21, 2023

This solution helps for a while, and then it's back to the mode without punctuation and sentences.
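To see where a transcript falls back into the no-punctuation mode, one option is to scan the segments for long stretches with almost no punctuation marks. This is only a rough sketch: it assumes segments shaped like the `result["segments"]` entries returned by whisper's `transcribe()` (dicts with `start`, `end` and `text`), and the function names and thresholds are arbitrary choices, not anything from Whisper itself.

```python
import re

# Punctuation marks we expect to see in a normally punctuated transcript.
_PUNCT = re.compile(r"[.,!?;:]")


def punctuation_per_word(text):
    """Number of punctuation marks divided by the number of words."""
    words = text.split()
    if not words:
        return 0.0
    return len(_PUNCT.findall(text)) / len(words)


def unpunctuated_spans(segments, min_words=8, threshold=0.02):
    """Return (start, end) for segments that are long enough but barely punctuated."""
    spans = []
    for seg in segments:
        text = seg["text"]
        if len(text.split()) >= min_words and punctuation_per_word(text) < threshold:
            spans.append((seg["start"], seg["end"]))
    return spans


segments = [
    {"start": 0.0, "end": 4.0,
     "text": " Hello, welcome to my lecture. Today we cover sorting."},
    {"start": 4.0, "end": 9.0,
     "text": " so now we will look at the first example and see what happens"},
]
print(unpunctuated_spans(segments))
```

The flagged time ranges could then be re-transcribed with a different prompt or settings.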

FurkanGozukara · 2022-09-30T08:05:29Z

FurkanGozukara
Sep 30, 2022
Author

@jongwook I am 100% sure there must be a way to force it to stay in punctuation mode 100% of the time.

Yesterday I ran various tests and finally made it stay.

After converting the audio to MP3, the medium.en model with a hypothetical sentence processed this video of mine with 100% punctuation:

https://www.youtube.com/watch?v=77iDUQd4x90

If only I knew Python, I am sure I could figure it out, but I am a C# guy.

Here are 2 transcription files for you to compare, to show what I mean:

1: MP3 input, large model, with a hypothetical sentence. Punctuation is on in the beginning, then turns off, then turns on again: https://docs.google.com/document/d/12lo_Utex7dpM1qLHnxYjsYTcLYzzYuCOLkbhP6KZH3U/edit?usp=sharing

2: MP3 input, medium.en model, with a hypothetical sentence. Punctuation is on 100% of the time: https://docs.google.com/document/d/1j1fTf_h-086mHHfCp74GbW2e6DIqo9GCqUlkKZAG0nQ/edit?usp=sharing

Both models got exactly the same input; one of them worked with punctuation on and the other didn't.

I really need help with this. Thank you very much.

1 reply

brendankntb Oct 20, 2024

I have a similar situation. Transcribing with the turbo model and the initial prompt, there is no punctuation. Using the exact same code changed to the medium model produces a transcription with punctuation. I haven't been able to find a way to get the turbo model to work with punctuation.

FurkanGozukara · 2022-09-30T12:06:33Z

FurkanGozukara
Sep 30, 2022
Author

I also noticed something very weird. In some parts of the video, I have the computer read some text aloud.

For example, at 11:42 I start the computer voice. As you can imagine, it speaks perfect English. Just turn on the subtitles; they are generated by Whisper.

https://youtu.be/77iDUQd4x90?t=702

However, the medium.en model didn't generate any text for that part. It generated almost perfect text for my own speech, but that part is missing.

This behaviour repeats. Sometimes the computer's entire speech is missing, and sometimes only part of it.

I used this command to generate the transcription:

whisper input.mp3 --model medium.en --language en --initial_prompt "Welcome to the Software Engineering Courses channel."

@jongwook

4 replies

jongwook Sep 30, 2022
Maintainer

The model sometimes gets confused by a sudden change of voice and may drop the second speaker's part, due to training data that didn't contain captions for interviewees or chatter in the background. There is unfortunately no surefire way to fix this, but you might get different behavior by adjusting the beam size (to have more candidates, which are more likely to contain the second speaker) and the length penalty (to penalize longer transcriptions less), or even by giving --suppress_tokens "", which may output unwanted symbols but is possibly more likely to transcribe the second speaker.
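Since there is no single right value, one way to try this is to run the same audio with a few beam sizes and compare the outputs by hand. The sketch below assumes the Python package forwards `beam_size` and `length_penalty` from `transcribe()` to the decoder, as the matching CLI flags suggest; the helper names `decoding_variants` and `transcribe_variants` are made up, and `--suppress_tokens` is left out on purpose.

```python
def decoding_variants(beam_sizes=(5, 10), length_penalty=None):
    """Build one set of keyword arguments per beam size to pass to transcribe()."""
    variants = []
    for beam_size in beam_sizes:
        options = {"beam_size": beam_size}
        if length_penalty is not None:
            options["length_penalty"] = length_penalty
        variants.append(options)
    return variants


def transcribe_variants(audio_path, model_name="medium.en", **kwargs):
    """Transcribe the same file once per variant and return all results."""
    import whisper  # imported lazily so decoding_variants() works without the package

    model = whisper.load_model(model_name)
    return [model.transcribe(audio_path, **options)
            for options in decoding_variants(**kwargs)]


print(decoding_variants(beam_sizes=(5, 10, 15)))
```

Each extra variant costs a full transcription pass, so this is slow for long recordings.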

FurkanGozukara Sep 30, 2022
Author

The second speaker is speaking alone in my video, so either speaker 1 speaks or speaker 2 speaks. I will test --suppress_tokens "" and the beam size. What beam size do you suggest, like 10, 15 or 20? Thank you very much. Also, when we run the default command it is 5, right?

jongwook Sep 30, 2022
Maintainer

So either speaker 1 speaks or speaker 2 speaks.

Right, even if they are not speaking simultaneously, the model might be caught off guard and (wrongly) decide to drop the second speaker.

What beam size you suggest like 10 15 20?

I know it sounds disappointing, but again there's no correct answer here; we observed that for some audio a larger beam size is better, while for others 5 worked best and quality began to degrade with larger beam sizes.

when we run default cmd it is 5 right?

Yes

FurkanGozukara Sep 30, 2022
Author

Dear @jongwook, could --suppress_tokens "" be causing a bug? Because now the medium model uses much more RAM and it generated no output 🗡️

Edit: beam size 10 captured the missing part mentioned above. Still processing. I will compare once it is done. Thanks.

Edit 2: beam size 10 seriously decreased punctuation quality, so it's not useful for me :(

FurkanGozukara · 2022-10-01T17:40:35Z

FurkanGozukara
Oct 1, 2022
Author

Perhaps we could optionally integrate this into Whisper? It could process Whisper's output and save the result as another output.

https://github.com/xashru/punctuation-restoration

0 replies

ANonEntity · 2022-10-01T21:04:30Z

ANonEntity
Oct 1, 2022

I suspect the reason it drops the punctuation is this:

whisper/whisper/transcribe.py

Lines 235 to 237 in 0b1ba3d

    
    if not condition_on_previous_text or result.temperature > 0.5:
        # do not feed the prompt tokens if a high temperature was used
        prompt_reset_since = len(all_tokens)

The model struggles with a segment, resets the prompt, and then decides to go without punctuation from there. I've made a pull request (#220) that I think might solve it.

6 replies

FurkanGozukara Oct 1, 2022
Author

unfortunately it failed for me :(

ANonEntity Oct 1, 2022

No, that's not the fix. You need to add these two lines after "decode_options["prompt"] = all_tokens[prompt_reset_since:]":

whisper/whisper/transcribe.py

Lines 178 to 179 in 34f971e

    
    if len(decode_options["prompt"]) == 0 and initial_prompt and condition_on_previous_text:
        decode_options["prompt"] = initial_prompt
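To make the idea easier to follow, here is a toy model of the two pieces of logic quoted in this sub-thread: the reset after a high-temperature decode, and the proposed fallback to the initial prompt when the prompt ends up empty. It uses plain integer lists instead of real tokens, and the function names `update_reset_point` and `next_prompt` are invented for illustration; it is not the actual transcribe loop.

```python
def update_reset_point(all_tokens, prompt_reset_since, temperature,
                       condition_on_previous_text=True):
    """Mirror the quoted check: drop earlier context after a high-temperature decode."""
    if not condition_on_previous_text or temperature > 0.5:
        return len(all_tokens)
    return prompt_reset_since


def next_prompt(all_tokens, prompt_reset_since, initial_prompt_tokens,
                condition_on_previous_text=True):
    """Prompt for the next window: tokens since the last reset, else the initial prompt."""
    prompt = all_tokens[prompt_reset_since:]
    if len(prompt) == 0 and initial_prompt_tokens and condition_on_previous_text:
        prompt = list(initial_prompt_tokens)  # the proposed fallback
    return prompt


initial = [1, 2, 3]             # stand-in for the tokenized initial prompt
all_tokens = [1, 2, 3, 10, 11]  # initial prompt plus one decoded segment

reset = update_reset_point(all_tokens, 0, temperature=0.8)
print(next_prompt(all_tokens, reset, initial))  # falls back to the initial prompt
print(next_prompt(all_tokens, reset, []))       # no initial prompt: stays empty
```

Without the fallback, the window after a reset is decoded with no prompt at all, which is the situation the suspicion above is about.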

FurkanGozukara Oct 1, 2022
Author

I copy-pasted the entire file from your link and it is still not working.

Here is the transcribe file I am using:

transcribe.txt

ANonEntity Oct 1, 2022

Looks like that's not the issue, then. Oh well 🙁

FurkanGozukara Oct 1, 2022
Author

Here is the MP3 file I am testing: https://drive.google.com/file/d/1HrT_CnZzifQ9JFihGLWk7YQo0WvEO-Wq/view?usp=sharing

And this is the command: whisper --model medium.en --lang en

It breaks after about 40 minutes. Actually, it is a mix of broken and working parts.

It also has some glitches.

gudh · 2022-10-14T09:35:28Z

gudh
Oct 14, 2022

Sorry to interrupt

But is it possible for Whisper to punctuate a transcript on its own?

I mean, if I downloaded the auto-generated subtitles of a YouTube video with youtube-dl, is it possible to punctuate those subtitles with Whisper?

2 replies

jongwook Oct 17, 2022
Maintainer

Not out of the box. If the two transcripts are sufficiently similar, you could match them using an edit distance metric and apply the punctuation. Alternatively, you could try a language model like GPT-3 to do the job without having to deal with the audio:
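For the first suggestion (matching two similar transcripts), a rough sketch could align the words of both transcripts and copy the punctuated form wherever the words agree. Note the swap: this uses the matching blocks from Python's standard `difflib.SequenceMatcher` as a stand-in for a true edit-distance alignment, and the function name `transfer_punctuation` is made up.

```python
import difflib
import re


def _normalize(word):
    """Lowercase a word and drop everything except word characters and apostrophes."""
    return re.sub(r"[^\w']", "", word.lower())


def transfer_punctuation(plain_text, punctuated_text):
    """Copy casing and punctuation from punctuated_text onto matching words of plain_text."""
    plain_words = plain_text.split()
    punct_words = punctuated_text.split()
    matcher = difflib.SequenceMatcher(
        a=[_normalize(w) for w in plain_words],
        b=[_normalize(w) for w in punct_words],
        autojunk=False,
    )
    result = list(plain_words)
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            # Words that align keep the punctuated spelling; the rest stay unchanged.
            result[block.a + offset] = punct_words[block.b + offset]
    return " ".join(result)


plain = "hello welcome to my lecture today we talk about whisper"
punctuated = "Hello, welcome to my lecture. Today we'll talk about Whisper."
print(transfer_punctuation(plain, punctuated))
```

Words that differ between the two transcripts (here "we" versus "we'll") are left as they were, so punctuation attached to mismatched words is lost.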

FurkanGozukara Oct 17, 2022
Author

only if it was free :(

mayeaux · 2023-04-12T15:22:56Z

mayeaux
Apr 12, 2023

Strangely enough, if I pass an initial prompt of "Hello." for English content, it fixes previously punctuation-less content even if the word "Hello" never appears. It seems that with some content the ASR recognizes it as "crazy punctuation-free content" (this happens to me with content that jumps around in the narration or has bad quality). Perhaps some of the training data was punctuation-free like that, and it determines that it's "one of those types of content". Adding the "Hello." seems to indicate to Whisper that this isn't a crazy punctuation-free piece of content but one that should be punctuated.

2 replies

otakutyrant Jan 31, 2024

Now "Hello." has become a mysterious dark magic that worked for me.

ThioJoe Mar 8, 2024

Found this thread through google, but the Hello. initial prompt seemed to work for me as well.

mayeaux · 2023-04-25T20:29:04Z

mayeaux
Apr 25, 2023

This is a hack solution I came up with today:

mayeaux/faster-whisper@dda1795

It could use some refining but honestly it works well for me.

2 replies

4drawing95 Apr 27, 2023

I have a question. My coding skills are pretty limited, so I'm not sure. How do you reflect that update? I modified that translate.py, but it only increases the printed output, no change in the actual output. do I need to do something extra?

Saccarab Jan 7, 2024

This works fairly well when conditioning on previous text is enabled. However, I've come across cases like the one below where Whisper actually decides to hallucinate the supplied tokens, and the punctuation marks show up over and over again. So overall it still seems a little problematic.

FurkanGozukara · 2024-01-07T12:50:27Z

FurkanGozukara
Jan 7, 2024
Author

I started using GPT4 to fix punctuation :D

2 replies

ejentos May 24, 2024

@FurkanGozukara do you use GPT4 with Whisper, or do you take the already transcribed audio without punctuation and ask GPT4 to add it?

FurkanGozukara May 24, 2024
Author

I transcribe first with Whisper,
then I fix the transcription.
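The post does not say how the fixing step is done, so the following is only one possible shape for it: split the unpunctuated transcript into chunks and pass each chunk through some punctuation-restoring function. Everything here is hypothetical, including the names `chunk_words` and `restore_punctuation`; the `punctuate` argument stands for whatever callable wraps the language model, and no real API call is shown.

```python
def chunk_words(text, max_words=300):
    """Split text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def restore_punctuation(text, punctuate, max_words=300):
    """Run each chunk through punctuate (any callable str -> str) and rejoin the results."""
    return " ".join(punctuate(chunk) for chunk in chunk_words(text, max_words))


def fake_punctuate(chunk):
    """Stand-in for a language model call: capitalize and add a full stop."""
    return chunk.capitalize() + "."


print(restore_punctuation("one two three four five", fake_punctuate, max_words=2))
```

Fixed-size chunks can cut a sentence in half, so a real pipeline would probably want overlapping chunks or a smarter split.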

sjtu-hxj · 2024-04-15T06:41:15Z

sjtu-hxj
Apr 15, 2024

This may work: --initial_prompt "Please do not forget the punctuations!"

0 replies

Zigunov · 2024-08-11T18:26:30Z

Zigunov
Aug 11, 2024

Hey all! I found that my "--initial_prompt" would work for a short time and then stop after a while. I found that adjusting the prompt by adding one more sentence or word that has punctuation or a full stop managed to circumvent the repetitive failure loop of getting the same transcript. It would be almost an entirely new transcript.

Hope this helps!

1 reply

Zigunov Aug 12, 2024

Update on this: I've had a very high success rate by reading the initial prompt from a text file called "initialprompt.txt". I've automated it so that any time there's a failure, it takes a new sentence from a "sentence bank" text file (lots of varied sentences with punctuation, taken from previous transcripts) and adds it to the initial prompt file, so that Whisper sees the new initial prompt as if it were brand new. Then it re-transcribes, and so far I've had 100% success with this tactic.
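The retry idea described above could be sketched roughly like this. It keeps the prompt and the sentence bank in memory instead of in text files, and the names `looks_punctuated` and `transcribe_with_retries`, the punctuation check and the attempt limit are all assumptions made for the example. The `transcribe` argument is any function that takes a prompt and returns the transcript text.

```python
import re


def looks_punctuated(text, min_ratio=0.02):
    """Crude check: at least min_ratio punctuation marks per word."""
    words = text.split()
    if not words:
        return False
    return len(re.findall(r"[.,!?]", text)) / len(words) >= min_ratio


def transcribe_with_retries(transcribe, initial_prompt, sentence_bank, max_attempts=5):
    """Call transcribe(prompt) until the text looks punctuated, growing the prompt each time."""
    prompt = initial_prompt
    bank = iter(sentence_bank)
    text = ""
    for _ in range(max_attempts):
        text = transcribe(prompt)
        if looks_punctuated(text):
            break
        extra = next(bank, None)
        if extra is None:
            break  # sentence bank exhausted
        prompt = f"{prompt} {extra}"  # vary the prompt before retrying
    return text, prompt


def fake_transcribe(prompt):
    """Stand-in for a real transcription call, used only to exercise the loop."""
    if "It is sunny." in prompt:
        return "Hello, this works. Great!"
    return "hello this does not work at all"


text, prompt = transcribe_with_retries(
    fake_transcribe, "Hello.", ["Nice day.", "It is sunny."])
print(prompt)
```

With the real package, `transcribe` could be something like `lambda p: model.transcribe("input.mp3", initial_prompt=p)["text"]`, where `model` comes from `whisper.load_model` and the file name is a placeholder.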
