Incorrect timestamps #2271

thewh1teagle · 2024-06-30T22:09:34Z

When transcribing the following file, the timestamps are incorrect.
As you can see, the start timestamp of the second segment is the same as the end timestamp of the previous one, although there is a gap of a few seconds between them.

never.give.you.up.mp4

transcript.srt

1
00:00:00,000 --> 00:00:08,700
*music* I just wanna tell you how I'm feeling. Gotta make you understand.

2
00:00:08,700 --> 00:00:18,080
Never gonna give you up, never gonna let you down.

3
00:00:18,080 --> 00:00:25,300
Never gonna run around and...

transcript.json

[
    {
        "start": 0,
        "stop": 870,
        "text": " *music* I just wanna tell you how I'm feeling. Gotta make you understand."
    },
    {
        "start": 870,
        "stop": 1808,
        "text": " Never gonna give you up, never gonna let you down."
    },
    {
        "start": 1808,
        "stop": 2530,
        "text": " Never gonna run around and..."
    }
]
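For anyone comparing the two outputs: the JSON values appear to be in units of 10 ms (for example, 870 in transcript.json corresponds to 00:00:08,700 in transcript.srt). A minimal sketch of that conversion, where the helper name `centis_to_srt` is made up for illustration and is not part of whisper.cpp:

```python
def centis_to_srt(t: int) -> str:
    """Convert a timestamp in 10 ms units to an SRT 'HH:MM:SS,mmm' string."""
    ms = t * 10
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    seconds, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{ms:03d}"


# Segment boundaries copied from transcript.json above.
segments = [
    {"start": 0, "stop": 870},
    {"start": 870, "stop": 1808},
    {"start": 1808, "stop": 2530},
]

for seg in segments:
    print(centis_to_srt(seg["start"]), "-->", centis_to_srt(seg["stop"]))
```

The printed ranges should match the three ranges in transcript.srt.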

word_timestamps.json

[
    {
        "start": 0,
        "stop": 3,
        "text": ""
    },
    {
        "start": 3,
        "stop": 200,
        "text": " *music*"
    },
    {
        "start": 200,
        "stop": 211,
        "text": " I"
    },
    {
        "start": 211,
        "stop": 257,
        "text": " just"
    },
    {
        "start": 257,
        "stop": 314,
        "text": " wanna"
    },
    {
        "start": 314,
        "stop": 360,
        "text": " tell"
    },
    {
        "start": 360,
        "stop": 394,
        "text": " you"
    },
    {
        "start": 394,
        "stop": 428,
        "text": " how"
    },
    {
        "start": 428,
        "stop": 462,
        "text": " I'm"
    },
    {
        "start": 462,
        "stop": 576,
        "text": " feeling."
    },
    {
        "start": 576,
        "stop": 633,
        "text": " Gotta"
    },
    {
        "start": 633,
        "stop": 679,
        "text": " make"
    },
    {
        "start": 679,
        "stop": 713,
        "text": " you"
    },
    {
        "start": 713,
        "stop": 870,
        "text": " understand."
    },
    {
        "start": 870,
        "stop": 976,
        "text": " Never"
    },
    {
        "start": 976,
        "stop": 1082,
        "text": " gonna"
    },
    {
        "start": 1082,
        "stop": 1167,
        "text": " give"
    },
    {
        "start": 1167,
        "stop": 1231,
        "text": " you"
    },
    {
        "start": 1231,
        "stop": 1417,
        "text": " up,"
    },
    {
        "start": 1417,
        "stop": 1421,
        "text": " never"
    },
    {
        "start": 1421,
        "stop": 1527,
        "text": " gonna"
    },
    {
        "start": 1527,
        "stop": 1591,
        "text": " let"
    },
    {
        "start": 1591,
        "stop": 1655,
        "text": " you"
    },
    {
        "start": 1655,
        "stop": 1808,
        "text": " down."
    },
    {
        "start": 1808,
        "stop": 1924,
        "text": " Never"
    },
    {
        "start": 1924,
        "stop": 2040,
        "text": " gonna"
    },
    {
        "start": 2040,
        "stop": 2109,
        "text": " run"
    },
    {
        "start": 2109,
        "stop": 2266,
        "text": " around"
    },
    {
        "start": 2266,
        "stop": 2530,
        "text": " and..."
    }
]
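One way to see the problem in the word-level output is to look for pauses between consecutive entries. A rough sketch (the helper `find_gaps` is hypothetical, and the sample is just three entries copied from the listing above):

```python
def find_gaps(entries, min_gap=1):
    """Return (index, gap) pairs where an entry starts at least
    min_gap units after the previous entry's stop."""
    gaps = []
    for i in range(1, len(entries)):
        gap = entries[i]["start"] - entries[i - 1]["stop"]
        if gap >= min_gap:
            gaps.append((i, gap))
    return gaps


# Entries around the boundary between the first and second segment.
words = [
    {"start": 713, "stop": 870, "text": " understand."},
    {"start": 870, "stop": 976, "text": " Never"},
    {"start": 976, "stop": 1082, "text": " gonna"},
]

print(find_gaps(words))  # [] because every start equals the previous stop
```

An empty result here means the output contains no pause at all at the segment boundary, even though the audio has one.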

Fixes ggerganov#2271
- Adds consecutive timestamps after the end of the last segment as the new starting ts
- Adds these timestamps to the output when "print-special" is enabled
- Fixes fflush usage in live reporting

I was not able to test this with the special "token_timestamps" option.

SimpleVictor · 2024-07-05T17:20:37Z

@thewh1teagle How did you generate the word_timestamps.json? Is there a specific param I need to pass?

thewh1teagle · 2024-07-05T17:33:08Z

@SimpleVictor
See tazz4843/whisper-rs#156 (comment)
Basically, you need to set max_len to the number of characters you want and enable split_on_word, so that it keeps whole words instead of cutting them in the middle. Then just read the text segments.

thewh1teagle · 2024-08-02T22:01:53Z

I found another case of wrong timestamps when word timestamps are enabled.

sam_altman.mp4

Open the details and search for "start": 764; the segment right after it has a smaller start timestamp.

transcript.json

[
    {
        "start": 0,
        "stop": 19,
        "text": ""
    },
    {
        "start": 19,
        "stop": 34,
        "text": " What"
    },
    {
        "start": 34,
        "stop": 48,
        "text": " do"
    },
    {
        "start": 48,
        "stop": 72,
        "text": " you"
    },
    {
        "start": 72,
        "stop": 112,
        "text": " think"
    },
    {
        "start": 112,
        "stop": 151,
        "text": " about"
    },
    {
        "start": 151,
        "stop": 191,
        "text": " like"
    },
    {
        "start": 191,
        "stop": 216,
        "text": " when"
    },
    {
        "start": 216,
        "stop": 248,
        "text": " Elon"
    },
    {
        "start": 248,
        "stop": 272,
        "text": " was"
    },
    {
        "start": 272,
        "stop": 336,
        "text": " causing"
    },
    {
        "start": 336,
        "stop": 384,
        "text": " calling"
    },
    {
        "start": 384,
        "stop": 408,
        "text": " for"
    },
    {
        "start": 408,
        "stop": 416,
        "text": " a"
    },
    {
        "start": 416,
        "stop": 456,
        "text": " pause"
    },
    {
        "start": 456,
        "stop": 474,
        "text": " on"
    },
    {
        "start": 474,
        "stop": 494,
        "text": " AI"
    },
    {
        "start": 764,
        "stop": 514,
        "text": " He"
    },
    {
        "start": 514,
        "stop": 530,
        "text": " was"
    },
    {
        "start": 530,
        "stop": 587,
        "text": " like"
    },
    {
        "start": 587,
        "stop": 670,
        "text": " starting"
    },
    {
        "start": 670,
        "stop": 711,
        "text": " then"
    },
    {
        "start": 711,
        "stop": 721,
        "text": " a"
    },
    {
        "start": 721,
        "stop": 762,
        "text": " GI"
    },
    {
        "start": 762,
        "stop": 815,
        "text": " company"
    },
    {
        "start": 815,
        "stop": 867,
        "text": " while"
    },
    {
        "start": 867,
        "stop": 888,
        "text": " he"
    },
    {
        "start": 888,
        "stop": 919,
        "text": " was"
    },
    {
        "start": 919,
        "stop": 971,
        "text": " doing"
    },
    {
        "start": 971,
        "stop": 1018,
        "text": " that"
    },
    {
        "start": 1104,
        "stop": 1257,
        "text": " Yeah,"
    },
    {
        "start": 1257,
        "stop": 1272,
        "text": " so"
    },
    {
        "start": 1272,
        "stop": 1310,
        "text": " didn't"
    },
    {
        "start": 1310,
        "stop": 1323,
        "text": " he"
    },
    {
        "start": 1323,
        "stop": 1357,
        "text": " start"
    },
    {
        "start": 1357,
        "stop": 1367,
        "text": " it"
    },
    {
        "start": 1367,
        "stop": 1414,
        "text": " like"
    },
    {
        "start": 1414,
        "stop": 1431,
        "text": " after"
    },
    {
        "start": 1431,
        "stop": 1446,
        "text": " he"
    },
    {
        "start": 1446,
        "stop": 1464,
        "text": " was"
    },
    {
        "start": 1464,
        "stop": 1512,
        "text": " calling"
    },
    {
        "start": 1512,
        "stop": 1532,
        "text": " for"
    },
    {
        "start": 1532,
        "stop": 1553,
        "text": " the"
    },
    {
        "start": 1553,
        "stop": 1605,
        "text": " pause."
    },
    {
        "start": 1605,
        "stop": 1628,
        "text": " I"
    },
    {
        "start": 1694,
        "stop": 1658,
        "text": " Think"
    },
    {
        "start": 1658,
        "stop": 1694,
        "text": " before"
    },
    {
        "start": 1694,
        "stop": 1712,
        "text": " but"
    },
    {
        "start": 1712,
        "stop": 1718,
        "text": " I"
    },
    {
        "start": 1718,
        "stop": 1748,
        "text": " don't"
    },
    {
        "start": 1748,
        "stop": 1772,
        "text": " know"
    },
    {
        "start": 1772,
        "stop": 1784,
        "text": " in"
    },
    {
        "start": 1784,
        "stop": 1803,
        "text": " any"
    },
    {
        "start": 1803,
        "stop": 1832,
        "text": " cases"
    },
    {
        "start": 1832,
        "stop": 1850,
        "text": " one"
    },
    {
        "start": 1850,
        "stop": 1866,
        "text": " of"
    },
    {
        "start": 1866,
        "stop": 1892,
        "text": " those"
    },
    {
        "start": 1892,
        "stop": 1910,
        "text": " you"
    },
    {
        "start": 1910,
        "stop": 1940,
        "text": " can't"
    },
    {
        "start": 1940,
        "stop": 1964,
        "text": " beat"
    },
    {
        "start": 1964,
        "stop": 1981,
        "text": " him"
    },
    {
        "start": 1981,
        "stop": 2006,
        "text": " join"
    },
    {
        "start": 2006,
        "stop": 2030,
        "text": " them"
    },
    {
        "start": 2030,
        "stop": 2084,
        "text": " things."
    },
    {
        "start": 2084,
        "stop": 2108,
        "text": " Um,"
    },
    {
        "start": 2108,
        "stop": 2126,
        "text": " I"
    },
    {
        "start": 2410,
        "stop": 2185,
        "text": " Think"
    },
    {
        "start": 2185,
        "stop": 2220,
        "text": " the"
    },
    {
        "start": 2220,
        "stop": 2315,
        "text": " instinct"
    },
    {
        "start": 2315,
        "stop": 2338,
        "text": " of"
    },
    {
        "start": 2338,
        "stop": 2430,
        "text": " saying"
    },
    {
        "start": 2430,
        "stop": 2461,
        "text": " like"
    },
    {
        "start": 2461,
        "stop": 2518,
        "text": " we've"
    },
    {
        "start": 2518,
        "stop": 2585,
        "text": " really"
    },
    {
        "start": 2585,
        "stop": 2620,
        "text": " got"
    },
    {
        "start": 2620,
        "stop": 2643,
        "text": " to"
    },
    {
        "start": 2643,
        "stop": 2714,
        "text": " figure"
    },
    {
        "start": 2714,
        "stop": 2756,
        "text": " out"
    },
    {
        "start": 2756,
        "stop": 2784,
        "text": " how"
    },
    {
        "start": 2784,
        "stop": 2820,
        "text": " to"
    },
    {
        "start": 2840,
        "stop": 2872,
        "text": " Make"
    },
    {
        "start": 2872,
        "stop": 2920,
        "text": " this"
    },
    {
        "start": 2920,
        "stop": 2970,
        "text": " safe"
    },
    {
        "start": 2970,
        "stop": 3008,
        "text": " and"
    },
    {
        "start": 3008,
        "stop": 3058,
        "text": " good"
    },
    {
        "start": 3058,
        "stop": 3096,
        "text": " and"
    },
    {
        "start": 3096,
        "stop": 3164,
        "text": " like"
    },
    {
        "start": 3164,
        "stop": 3222,
        "text": " widely"
    },
    {
        "start": 3222,
        "stop": 3272,
        "text": " good"
    },
    {
        "start": 3272,
        "stop": 3306,
        "text": " is"
    },
    {
        "start": 3306,
        "stop": 3454,
        "text": " really"
    },
    {
        "start": 3454,
        "stop": 3486,
        "text": " important"
    },
    {
        "start": 3486,
        "stop": 3524,
        "text": " but"
    },
    {
        "start": 3524,
        "stop": 3535,
        "text": " I"
    },
    {
        "start": 3535,
        "stop": 3606,
        "text": " think"
    },
    {
        "start": 3816,
        "stop": 3845,
        "text": " Calling"
    },
    {
        "start": 3845,
        "stop": 3977,
        "text": " for"
    },
    {
        "start": 3977,
        "stop": 4016,
        "text": " a"
    },
    {
        "start": 4108,
        "stop": 4078,
        "text": " Pause"
    },
    {
        "start": 4078,
        "stop": 4091,
        "text": " is"
    },
    {
        "start": 4091,
        "stop": 4133,
        "text": " like"
    },
    {
        "start": 4133,
        "stop": 4188,
        "text": " naive"
    },
    {
        "start": 4188,
        "stop": 4209,
        "text": " it"
    },
    {
        "start": 4209,
        "stop": 4230,
        "text": " at"
    },
    {
        "start": 4230,
        "stop": 4273,
        "text": " best"
    },
    {
        "start": 4273,
        "stop": 4337,
        "text": " for"
    },
    {
        "start": 4337,
        "stop": 4337,
        "text": " the"
    },
    {
        "start": 4337,
        "stop": 4402,
        "text": " latest"
    },
    {
        "start": 4402,
        "stop": 4446,
        "text": " tech"
    },
    {
        "start": 4446,
        "stop": 4543,
        "text": " insights"
    },
    {
        "start": 4543,
        "stop": 4586,
        "text": " visit"
    },
    {
        "start": 4586,
        "stop": 4693,
        "text": " em"
    },
    {
        "start": 4693,
        "stop": 4704,
        "text": " 360"
    },
    {
        "start": 4704,
        "stop": 4750,
        "text": " tech"
    },
    {
        "start": 4750,
        "stop": 4800,
        "text": " calm"
    },
    {
        "start": 4800,
        "stop": 4800,
        "text": ""
    },
    {
        "start": 4800,
        "stop": 4843,
        "text": " visit"
    },
    {
        "start": 4843,
        "stop": 5054,
        "text": " EM360tech.com."
    },
    {
        "start": 5054,
        "stop": 5054,
        "text": ""
    },
    {
        "start": 5054,
        "stop": 6054,
        "text": " [BLANK_AUDIO]"
    }
]
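A small check like the following can flag these entries automatically. It is only a sketch: the helper names are made up, and the sample is four entries copied from the listing above. By my reading of the full listing, four entries have a start later than their own stop: " He", both " Think" entries, and " Pause".

```python
def find_inverted(entries):
    """Return entries whose start lies after their own stop."""
    return [e for e in entries if e["start"] > e["stop"]]


def find_backward_steps(entries):
    """Return indices of entries that start earlier than the previous entry starts."""
    return [
        i
        for i in range(1, len(entries))
        if entries[i]["start"] < entries[i - 1]["start"]
    ]


# Entries around "start": 764, copied from the output above.
words = [
    {"start": 474, "stop": 494, "text": " AI"},
    {"start": 764, "stop": 514, "text": " He"},
    {"start": 514, "stop": 530, "text": " was"},
    {"start": 530, "stop": 587, "text": " like"},
]

print([e["text"] for e in find_inverted(words)])  # [' He']
print(find_backward_steps(words))                 # [2], the entry ' was'
```

To run it on a whole file, the list can be loaded with `json.load` instead of being written inline.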

@ggerganov

Is there a way we can 'tell' whisper the segments instead of letting it segment the audio itself?
I'm trying to add diarization, but currently the timestamps from whisper.cpp are not entirely accurate. I already have accurate segmentation, but I'm not sure it would be efficient to run whisper on the individual segments (speech turns), which will often be shorter than 30s. Wouldn't that make the whole transcription slower?

The diarization is actually pretty simple, and once I find an approach to use it along with whisper.cpp, I can add it to whisper.cpp or implement it in Rust.

https://github.com/thewh1teagle/ort-diarize/blob/main/main.py

majisama · 2024-10-31T04:57:41Z

What if I want to use it via JNI? That is, what if I want to use it on Android?

thewh1teagle · 2024-11-07T07:11:06Z

@ggerganov

Can you take a look at this issue? It happens with all the models. You can easily understand the timestamp issue just by running this and listening to the few seconds of audio:

cd $(mktemp -d)
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
cmake -B build . 
cmake --build build
wget https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-tiny.bin
wget https://github.com/thewh1teagle/vibe/releases/download/v2.6.3/normal.wav
./build/bin/main -m ggml-tiny.bin -f normal.wav
ffplay normal.wav

It seems the issue is here: t0 is updated to the latest t1, but the next start timestamp shouldn't always be the end timestamp of the last segment:

whisper.cpp/src/whisper.cpp

Lines 6210 to 6217 in 31aea56

    
    text = "";
    while (i < (int) tokens_cur.size() && tokens_cur[i].id > whisper_token_beg(ctx)) {
        i++;
    }
    i--;
    t0 = t1;
    i0 = i + 1;
    speaker_turn_next = false;

Maybe it's a limitation of the model:
#330
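To make the `t0 = t1` point concrete, here is a toy model in Python. It is not the whisper.cpp implementation and not a tested fix: the token stream, the value 1120, and the `chain_start` switch are all hypothetical. Text tokens are strings and timestamp tokens are integers in 10 ms units. With `chain_start=True` the next segment always starts where the previous one ended; with `chain_start=False` it starts at the last timestamp of a run of consecutive timestamp tokens.

```python
def segment(tokens, chain_start=True):
    """Split a toy token stream into (t0, t1, text) segments.

    tokens: list of str (text token) or int (timestamp token, 10 ms units).
    """
    segments = []
    t0 = 0
    text = ""
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if isinstance(tok, str):
            text += tok
            i += 1
            continue
        # A timestamp token closes the current segment.
        t1 = tok
        if text:
            segments.append((t0, t1, text))
        text = ""
        # Skip the run of consecutive timestamp tokens, remembering the last one.
        last = t1
        i += 1
        while i < len(tokens) and not isinstance(tokens[i], str):
            last = tokens[i]
            i += 1
        # chain_start=True mimics `t0 = t1` from the snippet above.
        t0 = t1 if chain_start else last
    return segments


# Hypothetical stream: the first segment ends at 870, the next one starts at 1120.
tokens = [0, " first", 870, 1120, " second", 1808]

print(segment(tokens))                     # second segment starts at 870
print(segment(tokens, chain_start=False))  # second segment starts at 1120
```

Whether the model actually emits a separate start timestamp for the next segment in cases like normal.wav is something I have not verified; if it does not, the limitation would be in the model rather than in this loop.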

thewh1teagle mentioned this issue Jun 30, 2024

[Feature Request]: Mark pauses thewh1teagle/vibe#152

Closed

bviksoe linked a pull request Jul 3, 2024 that will close this issue

Incorrect timestamps #2279

Open

thewh1teagle mentioned this issue Jul 5, 2024

How to compute token-level timestamps? #2283

Closed
