
Incorrect timestamps #2271

Open
thewh1teagle opened this issue Jun 30, 2024 · 5 comments · May be fixed by #2279

Comments

@thewh1teagle
Contributor

thewh1teagle commented Jun 30, 2024

When transcribing the following file, the timestamps are incorrect.
As you can see, the start timestamp of the second segment is the same as the end timestamp of the previous one, even though there is a gap of a few seconds between them.

never.give.you.up.mp4
transcript.srt
1
00:00:00,000 --> 00:00:08,700
*music* I just wanna tell you how I'm feeling. Gotta make you understand.

2
00:00:08,700 --> 00:00:18,080
Never gonna give you up, never gonna let you down.

3
00:00:18,080 --> 00:00:25,300
Never gonna run around and...
transcript.json
[
    {
        "start": 0,
        "stop": 870,
        "text": " *music* I just wanna tell you how I'm feeling. Gotta make you understand."
    },
    {
        "start": 870,
        "stop": 1808,
        "text": " Never gonna give you up, never gonna let you down."
    },
    {
        "start": 1808,
        "stop": 2530,
        "text": " Never gonna run around and..."
    }
]
word_timestamps.json
[
    {
        "start": 0,
        "stop": 3,
        "text": ""
    },
    {
        "start": 3,
        "stop": 200,
        "text": " *music*"
    },
    {
        "start": 200,
        "stop": 211,
        "text": " I"
    },
    {
        "start": 211,
        "stop": 257,
        "text": " just"
    },
    {
        "start": 257,
        "stop": 314,
        "text": " wanna"
    },
    {
        "start": 314,
        "stop": 360,
        "text": " tell"
    },
    {
        "start": 360,
        "stop": 394,
        "text": " you"
    },
    {
        "start": 394,
        "stop": 428,
        "text": " how"
    },
    {
        "start": 428,
        "stop": 462,
        "text": " I'm"
    },
    {
        "start": 462,
        "stop": 576,
        "text": " feeling."
    },
    {
        "start": 576,
        "stop": 633,
        "text": " Gotta"
    },
    {
        "start": 633,
        "stop": 679,
        "text": " make"
    },
    {
        "start": 679,
        "stop": 713,
        "text": " you"
    },
    {
        "start": 713,
        "stop": 870,
        "text": " understand."
    },
    {
        "start": 870,
        "stop": 976,
        "text": " Never"
    },
    {
        "start": 976,
        "stop": 1082,
        "text": " gonna"
    },
    {
        "start": 1082,
        "stop": 1167,
        "text": " give"
    },
    {
        "start": 1167,
        "stop": 1231,
        "text": " you"
    },
    {
        "start": 1231,
        "stop": 1417,
        "text": " up,"
    },
    {
        "start": 1417,
        "stop": 1421,
        "text": " never"
    },
    {
        "start": 1421,
        "stop": 1527,
        "text": " gonna"
    },
    {
        "start": 1527,
        "stop": 1591,
        "text": " let"
    },
    {
        "start": 1591,
        "stop": 1655,
        "text": " you"
    },
    {
        "start": 1655,
        "stop": 1808,
        "text": " down."
    },
    {
        "start": 1808,
        "stop": 1924,
        "text": " Never"
    },
    {
        "start": 1924,
        "stop": 2040,
        "text": " gonna"
    },
    {
        "start": 2040,
        "stop": 2109,
        "text": " run"
    },
    {
        "start": 2109,
        "stop": 2266,
        "text": " around"
    },
    {
        "start": 2266,
        "stop": 2530,
        "text": " and..."
    }
]
bviksoe added a commit to bviksoe/whisper.cpp that referenced this issue Jul 3, 2024
Fixes ggerganov#2271

- Adds consecutive timestamps after end of last segment as the new starting ts
- Add these timestamp to output when "print-special" enabled
- Fixes fflush usage in live reporting

I was not able to test this with the special "token_timestamps" option.
@bviksoe bviksoe linked a pull request Jul 3, 2024 that will close this issue
@SimpleVictor

SimpleVictor commented Jul 5, 2024

@thewh1teagle How did you generate the word_timestamps.json? Was there a specific param I need to pass?

@thewh1teagle
Contributor Author

@SimpleVictor
See tazz4843/whisper-rs#156 (comment)
Basically you need to set max_len to the maximum number of characters you want per segment and enable split_on_word so the split happens on word boundaries instead of in the middle of a word; then just read the resulting text segments.
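
For reference, the equivalent settings through the whisper.cpp C API look roughly like the sketch below. This is a minimal sketch under a few assumptions (a recent whisper.h with whisper_init_from_file_with_params, and a pcmf32 vector already holding 16 kHz mono float samples loaded elsewhere), not the exact code I run; timestamps come back in 10 ms units, matching the start/stop values in the JSON above.

#include <cstdio>
#include <vector>
#include "whisper.h"

// Minimal sketch: word-level segments via token_timestamps + max_len + split_on_word.
// Assumes pcmf32 already contains 16 kHz mono float PCM loaded elsewhere.
int transcribe_words(const char * model_path, const std::vector<float> & pcmf32) {
    struct whisper_context * ctx = whisper_init_from_file_with_params(
            model_path, whisper_context_default_params());
    if (ctx == nullptr) return 1;

    whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    params.token_timestamps = true;  // enable per-token timing
    params.max_len          = 1;     // force very short segments ...
    params.split_on_word    = true;  // ... but only split on word boundaries

    if (whisper_full(ctx, params, pcmf32.data(), (int) pcmf32.size()) != 0) {
        whisper_free(ctx);
        return 1;
    }

    for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
        // t0/t1 are in units of 10 ms, matching the start/stop values in the JSON above
        printf("{\"start\": %lld, \"stop\": %lld, \"text\": \"%s\"}\n",
               (long long) whisper_full_get_segment_t0(ctx, i),
               (long long) whisper_full_get_segment_t1(ctx, i),
               whisper_full_get_segment_text(ctx, i));
    }

    whisper_free(ctx);
    return 0;
}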

palana pushed a commit to palana/whisper.cpp that referenced this issue Jul 30, 2024
guitarflow pushed a commit to guitarflow/whisper.cpp that referenced this issue Aug 1, 2024
@thewh1teagle
Contributor Author

I found another case of wrong timestamps when word timestamps are enabled.

sam_altman.mp4

Open the details and search for "start": 764 and see that the segment right after it has a smaller start timestamp.

transcript.json
[
    {
        "start": 0,
        "stop": 19,
        "text": ""
    },
    {
        "start": 19,
        "stop": 34,
        "text": " What"
    },
    {
        "start": 34,
        "stop": 48,
        "text": " do"
    },
    {
        "start": 48,
        "stop": 72,
        "text": " you"
    },
    {
        "start": 72,
        "stop": 112,
        "text": " think"
    },
    {
        "start": 112,
        "stop": 151,
        "text": " about"
    },
    {
        "start": 151,
        "stop": 191,
        "text": " like"
    },
    {
        "start": 191,
        "stop": 216,
        "text": " when"
    },
    {
        "start": 216,
        "stop": 248,
        "text": " Elon"
    },
    {
        "start": 248,
        "stop": 272,
        "text": " was"
    },
    {
        "start": 272,
        "stop": 336,
        "text": " causing"
    },
    {
        "start": 336,
        "stop": 384,
        "text": " calling"
    },
    {
        "start": 384,
        "stop": 408,
        "text": " for"
    },
    {
        "start": 408,
        "stop": 416,
        "text": " a"
    },
    {
        "start": 416,
        "stop": 456,
        "text": " pause"
    },
    {
        "start": 456,
        "stop": 474,
        "text": " on"
    },
    {
        "start": 474,
        "stop": 494,
        "text": " AI"
    },
    {
        "start": 764,
        "stop": 514,
        "text": " He"
    },
    {
        "start": 514,
        "stop": 530,
        "text": " was"
    },
    {
        "start": 530,
        "stop": 587,
        "text": " like"
    },
    {
        "start": 587,
        "stop": 670,
        "text": " starting"
    },
    {
        "start": 670,
        "stop": 711,
        "text": " then"
    },
    {
        "start": 711,
        "stop": 721,
        "text": " a"
    },
    {
        "start": 721,
        "stop": 762,
        "text": " GI"
    },
    {
        "start": 762,
        "stop": 815,
        "text": " company"
    },
    {
        "start": 815,
        "stop": 867,
        "text": " while"
    },
    {
        "start": 867,
        "stop": 888,
        "text": " he"
    },
    {
        "start": 888,
        "stop": 919,
        "text": " was"
    },
    {
        "start": 919,
        "stop": 971,
        "text": " doing"
    },
    {
        "start": 971,
        "stop": 1018,
        "text": " that"
    },
    {
        "start": 1104,
        "stop": 1257,
        "text": " Yeah,"
    },
    {
        "start": 1257,
        "stop": 1272,
        "text": " so"
    },
    {
        "start": 1272,
        "stop": 1310,
        "text": " didn't"
    },
    {
        "start": 1310,
        "stop": 1323,
        "text": " he"
    },
    {
        "start": 1323,
        "stop": 1357,
        "text": " start"
    },
    {
        "start": 1357,
        "stop": 1367,
        "text": " it"
    },
    {
        "start": 1367,
        "stop": 1414,
        "text": " like"
    },
    {
        "start": 1414,
        "stop": 1431,
        "text": " after"
    },
    {
        "start": 1431,
        "stop": 1446,
        "text": " he"
    },
    {
        "start": 1446,
        "stop": 1464,
        "text": " was"
    },
    {
        "start": 1464,
        "stop": 1512,
        "text": " calling"
    },
    {
        "start": 1512,
        "stop": 1532,
        "text": " for"
    },
    {
        "start": 1532,
        "stop": 1553,
        "text": " the"
    },
    {
        "start": 1553,
        "stop": 1605,
        "text": " pause."
    },
    {
        "start": 1605,
        "stop": 1628,
        "text": " I"
    },
    {
        "start": 1694,
        "stop": 1658,
        "text": " Think"
    },
    {
        "start": 1658,
        "stop": 1694,
        "text": " before"
    },
    {
        "start": 1694,
        "stop": 1712,
        "text": " but"
    },
    {
        "start": 1712,
        "stop": 1718,
        "text": " I"
    },
    {
        "start": 1718,
        "stop": 1748,
        "text": " don't"
    },
    {
        "start": 1748,
        "stop": 1772,
        "text": " know"
    },
    {
        "start": 1772,
        "stop": 1784,
        "text": " in"
    },
    {
        "start": 1784,
        "stop": 1803,
        "text": " any"
    },
    {
        "start": 1803,
        "stop": 1832,
        "text": " cases"
    },
    {
        "start": 1832,
        "stop": 1850,
        "text": " one"
    },
    {
        "start": 1850,
        "stop": 1866,
        "text": " of"
    },
    {
        "start": 1866,
        "stop": 1892,
        "text": " those"
    },
    {
        "start": 1892,
        "stop": 1910,
        "text": " you"
    },
    {
        "start": 1910,
        "stop": 1940,
        "text": " can't"
    },
    {
        "start": 1940,
        "stop": 1964,
        "text": " beat"
    },
    {
        "start": 1964,
        "stop": 1981,
        "text": " him"
    },
    {
        "start": 1981,
        "stop": 2006,
        "text": " join"
    },
    {
        "start": 2006,
        "stop": 2030,
        "text": " them"
    },
    {
        "start": 2030,
        "stop": 2084,
        "text": " things."
    },
    {
        "start": 2084,
        "stop": 2108,
        "text": " Um,"
    },
    {
        "start": 2108,
        "stop": 2126,
        "text": " I"
    },
    {
        "start": 2410,
        "stop": 2185,
        "text": " Think"
    },
    {
        "start": 2185,
        "stop": 2220,
        "text": " the"
    },
    {
        "start": 2220,
        "stop": 2315,
        "text": " instinct"
    },
    {
        "start": 2315,
        "stop": 2338,
        "text": " of"
    },
    {
        "start": 2338,
        "stop": 2430,
        "text": " saying"
    },
    {
        "start": 2430,
        "stop": 2461,
        "text": " like"
    },
    {
        "start": 2461,
        "stop": 2518,
        "text": " we've"
    },
    {
        "start": 2518,
        "stop": 2585,
        "text": " really"
    },
    {
        "start": 2585,
        "stop": 2620,
        "text": " got"
    },
    {
        "start": 2620,
        "stop": 2643,
        "text": " to"
    },
    {
        "start": 2643,
        "stop": 2714,
        "text": " figure"
    },
    {
        "start": 2714,
        "stop": 2756,
        "text": " out"
    },
    {
        "start": 2756,
        "stop": 2784,
        "text": " how"
    },
    {
        "start": 2784,
        "stop": 2820,
        "text": " to"
    },
    {
        "start": 2840,
        "stop": 2872,
        "text": " Make"
    },
    {
        "start": 2872,
        "stop": 2920,
        "text": " this"
    },
    {
        "start": 2920,
        "stop": 2970,
        "text": " safe"
    },
    {
        "start": 2970,
        "stop": 3008,
        "text": " and"
    },
    {
        "start": 3008,
        "stop": 3058,
        "text": " good"
    },
    {
        "start": 3058,
        "stop": 3096,
        "text": " and"
    },
    {
        "start": 3096,
        "stop": 3164,
        "text": " like"
    },
    {
        "start": 3164,
        "stop": 3222,
        "text": " widely"
    },
    {
        "start": 3222,
        "stop": 3272,
        "text": " good"
    },
    {
        "start": 3272,
        "stop": 3306,
        "text": " is"
    },
    {
        "start": 3306,
        "stop": 3454,
        "text": " really"
    },
    {
        "start": 3454,
        "stop": 3486,
        "text": " important"
    },
    {
        "start": 3486,
        "stop": 3524,
        "text": " but"
    },
    {
        "start": 3524,
        "stop": 3535,
        "text": " I"
    },
    {
        "start": 3535,
        "stop": 3606,
        "text": " think"
    },
    {
        "start": 3816,
        "stop": 3845,
        "text": " Calling"
    },
    {
        "start": 3845,
        "stop": 3977,
        "text": " for"
    },
    {
        "start": 3977,
        "stop": 4016,
        "text": " a"
    },
    {
        "start": 4108,
        "stop": 4078,
        "text": " Pause"
    },
    {
        "start": 4078,
        "stop": 4091,
        "text": " is"
    },
    {
        "start": 4091,
        "stop": 4133,
        "text": " like"
    },
    {
        "start": 4133,
        "stop": 4188,
        "text": " naive"
    },
    {
        "start": 4188,
        "stop": 4209,
        "text": " it"
    },
    {
        "start": 4209,
        "stop": 4230,
        "text": " at"
    },
    {
        "start": 4230,
        "stop": 4273,
        "text": " best"
    },
    {
        "start": 4273,
        "stop": 4337,
        "text": " for"
    },
    {
        "start": 4337,
        "stop": 4337,
        "text": " the"
    },
    {
        "start": 4337,
        "stop": 4402,
        "text": " latest"
    },
    {
        "start": 4402,
        "stop": 4446,
        "text": " tech"
    },
    {
        "start": 4446,
        "stop": 4543,
        "text": " insights"
    },
    {
        "start": 4543,
        "stop": 4586,
        "text": " visit"
    },
    {
        "start": 4586,
        "stop": 4693,
        "text": " em"
    },
    {
        "start": 4693,
        "stop": 4704,
        "text": " 360"
    },
    {
        "start": 4704,
        "stop": 4750,
        "text": " tech"
    },
    {
        "start": 4750,
        "stop": 4800,
        "text": " calm"
    },
    {
        "start": 4800,
        "stop": 4800,
        "text": ""
    },
    {
        "start": 4800,
        "stop": 4843,
        "text": " visit"
    },
    {
        "start": 4843,
        "stop": 5054,
        "text": " EM360tech.com."
    },
    {
        "start": 5054,
        "stop": 5054,
        "text": ""
    },
    {
        "start": 5054,
        "stop": 6054,
        "text": " [BLANK_AUDIO]"
    }
]

@ggerganov

Is there a way we can 'tell' whisper the segments instead of letting it segment the audio itself?
I'm trying to add diarization, but currently the timestamps from whisper.cpp are not entirely accurate. I already have accurate segmentation, but I'm not sure it will be efficient to run whisper on the individual segments (speech turns), which will often be shorter than 30s, making the whole transcription slower.

The diarization itself is actually pretty simple, and once I find an approach to use it along with whisper.cpp I can add it to whisper.cpp / implement it in Rust.

https://github.com/thewh1teagle/ort-diarize/blob/main/main.py
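
To illustrate the approach, here is a rough sketch (under assumptions: DiarSegment is a hypothetical struct standing in for whatever the diarizer returns, pcmf32 is 16 kHz mono float PCM of the full recording, and ctx/params are already set up) of transcribing each diarized region with whisper.cpp and shifting the timestamps back into the timeline of the full recording:

#include <cstdint>
#include <cstdio>
#include <vector>
#include "whisper.h"

// Hypothetical diarizer output: sample offsets into the full recording.
struct DiarSegment {
    size_t start_sample;
    size_t end_sample;
    int    speaker;
};

// Sketch: transcribe each speech turn separately, then shift the returned
// timestamps (10 ms units) by the turn's offset in the original audio.
void transcribe_diarized(whisper_context * ctx, whisper_full_params params,
                         const std::vector<float> & pcmf32,
                         const std::vector<DiarSegment> & turns) {
    const int64_t samples_per_cs = WHISPER_SAMPLE_RATE / 100; // samples per 10 ms

    for (const auto & turn : turns) {
        const float * chunk   = pcmf32.data() + turn.start_sample;
        const int     n_chunk = (int) (turn.end_sample - turn.start_sample);

        if (whisper_full(ctx, params, chunk, n_chunk) != 0) {
            continue; // skip turns that fail to transcribe
        }

        const int64_t offset_cs = (int64_t) turn.start_sample / samples_per_cs;
        for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
            printf("speaker %d [%lld - %lld]: %s\n", turn.speaker,
                   (long long) (whisper_full_get_segment_t0(ctx, i) + offset_cs),
                   (long long) (whisper_full_get_segment_t1(ctx, i) + offset_cs),
                   whisper_full_get_segment_text(ctx, i));
        }
    }
}

As far as I understand, whisper_full still evaluates each chunk against a 30-second window internally, which is exactly the efficiency concern above; merging adjacent turns of the same speaker into one chunk before transcribing could reduce that overhead.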

@majisama

What if I want to use it via JNI, i.e. on Android?

@thewh1teagle
Contributor Author

thewh1teagle commented Nov 7, 2024

@ggerganov

Can you take a look at this issue? It happens with all the models. You can easily reproduce the timestamp issue by running the following and listening to the few seconds of audio:

cd $(mktemp -d)
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
cmake -B build .
cmake --build build
wget https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-tiny.bin
wget https://github.com/thewh1teagle/vibe/releases/download/v2.6.3/normal.wav
./build/bin/main -m ggml-tiny.bin -f normal.wav
ffplay normal.wav

Seems like the issue is here: t0 is updated to the latest t1, but the next start timestamp shouldn't always be the end timestamp of the last segment:

whisper.cpp/src/whisper.cpp

Lines 6210 to 6217 in 31aea56

text = "";
while (i < (int) tokens_cur.size() && tokens_cur[i].id > whisper_token_beg(ctx)) {
i++;
}
i--;
t0 = t1;
i0 = i + 1;
speaker_turn_next = false;

Maybe it's a limitation in the model; see #330.
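
Until the segment-level fix lands, one possible workaround (a sketch, assuming token_timestamps is enabled and the token-level times are themselves trustworthy, which the report above shows is not always the case) is to derive the segment start from its first non-special token instead of the inherited t0 = t1 value:

#include <cstdint>
#include "whisper.h"

// Sketch of a possible workaround while the segment-level fix is pending:
// with params.token_timestamps enabled, take the start of the segment from
// its first non-special token instead of the inherited t0 (= previous t1).
static int64_t segment_start_from_tokens(whisper_context * ctx, int i_segment) {
    const int n_tokens = whisper_full_n_tokens(ctx, i_segment);
    for (int j = 0; j < n_tokens; ++j) {
        const whisper_token_data td = whisper_full_get_token_data(ctx, i_segment, j);
        if (td.id < whisper_token_eot(ctx)) { // text token, not a special token
            return td.t0; // 10 ms units
        }
    }
    // no text tokens: fall back to the segment-level timestamp
    return whisper_full_get_segment_t0(ctx, i_segment);
}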
