Detecting timestamp for the start of speech within a file #237

gtebbutt · 2022-10-03T21:01:59Z


We've been having some trouble syncing the output timestamps to the start of speech when it doesn't coincide precisely with the beginning of the file, or when it follows a pause within the file. Does anyone have insight into why this happens, or ideas for how to work around it? Overall results have been very impressive, both in accuracy and in sync within speech; it's just the starting points that are proving to be an issue.

As a hypothetical example: take a file that begins with ten seconds of music, followed by the words "Four score and seven years ago" from timestamp 00:00:10 to 00:00:14. The output we're seeing would look like [00:00:00 --> 00:00:14] Four score and seven years ago - but we actually want that first timestamp to be 00:00:10 to avoid the subtitle hanging on screen for ten seconds before anyone begins talking. We're seeing the same after pauses that happen within the file, but the start and end times for a given line within continuous speech are spot on.

Looking into the token-level timestamps with https://github.com/jianfch/stable-ts is interesting (big thank you to @jianfch!) - we'll see something like the following:

[00:00:00.500] Four
[00:00:06.000] score
[00:00:10.300] and
[00:00:10.870] seven
...

So the first couple of words are individually being timestamped before the start of speech, with gaps far too large, but then the later words come more or less into sync, and by the end of the line or phrase the timestamp of the last word is pretty much dead on.

A potential workaround is just removing any non-speech audio with something like Silero VAD (as per #29) and stitching the files back together afterwards, but I'd still like to try and understand the issue first. Any insight would be very much appreciated!
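For the "stitching back together" part of that workaround, the main bookkeeping is mapping timestamps from the trimmed (speech-only) audio back onto the original timeline. Below is a minimal sketch of that mapping. It assumes the VAD step has already produced a sorted list of non-overlapping `(start, end)` speech intervals in seconds on the original timeline; the function names and the interval format are hypothetical, not part of Silero VAD or Whisper.

```python
from bisect import bisect_right
from itertools import accumulate


def build_offsets(speech_intervals):
    """Start position of each speech interval on the trimmed timeline."""
    durations = [end - start for start, end in speech_intervals]
    # The first interval starts at 0.0; each later one starts where the
    # previous intervals' combined duration ends.
    return [0.0] + list(accumulate(durations))[:-1]


def to_original_time(t, speech_intervals):
    """Map a time on the trimmed (speech-only) audio to the original file.

    Assumption: speech_intervals is sorted, non-overlapping, and was used
    to cut the trimmed audio by simple concatenation.
    """
    if not speech_intervals:
        raise ValueError("need at least one speech interval")
    offsets = build_offsets(speech_intervals)
    # Find the last interval whose trimmed start is <= t.
    i = max(0, bisect_right(offsets, t) - 1)
    start, end = speech_intervals[i]
    # Clamp so times past the end of the audio stay inside the last interval.
    return min(start + (t - offsets[i]), end)


# Example from the question: ten seconds of music, speech from 10 s to 14 s,
# then (hypothetically) a pause and more speech from 20 s to 25 s.
intervals = [(10.0, 14.0), (20.0, 25.0)]
first_word_start = to_original_time(0.0, intervals)   # 10.0
after_pause = to_original_time(6.0, intervals)        # 22.0
```

One caveat: a time that falls exactly on a cut point (4.0 in the example) maps to the start of the next interval, so segment end times that land on a boundary may need to be handled separately.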

Answered by jianfch


jianfch · 2022-10-03T21:33:26Z


The script uses the start and end of each segment as the min and max when selecting the token-level timestamps. Looking at unstable_word_timestamps for that segment might reveal more about what's going on. 'Four' and 'score' might even have candidate timestamps that start after 10 seconds.
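To illustrate why the segment bounds matter, here is a toy sketch of bound-based selection. It is not the stable-ts implementation: `pick_timestamp` is a hypothetical helper, and the candidate values are made up to match the example in the question. The point is only that a segment start of 0 lets early, wrong candidates through, while a start at the true speech onset filters them out.

```python
def pick_timestamp(candidates, seg_start, seg_end):
    """Return the first candidate inside [seg_start, seg_end], else None."""
    for t in candidates:
        if seg_start <= t <= seg_end:
            return t
    return None


# Made-up candidates for illustration: each word has an early guess
# (during the music) and a later guess (after speech begins).
candidates = {"Four": [0.5, 10.1], "score": [6.0, 10.2]}

# Segment reported as [0, 14]: the early guesses are inside the bounds.
loose = {w: pick_timestamp(c, 0.0, 14.0) for w, c in candidates.items()}

# Segment start corrected to 10 s: only the later guesses survive.
tight = {w: pick_timestamp(c, 10.0, 14.0) for w, c in candidates.items()}
```

Under these assumptions, `loose` reproduces the early timestamps from the question (0.5 and 6.0), and `tight` picks the candidates after the 10-second mark.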

14 replies

a-ruban Nov 15, 2022

@arnavmehta7 why? It gives you an opportunity to improve the accuracy of the timecodes before Aeneas stumbles upon a noisy fragment and breaks them. In the worst case, you get the same timecodes that Whisper gives you; in the general case, you improve the timecodes for part of the audio; in the best case, you improve them for the whole audio.

usmanagha125 Jan 25, 2023

@a-ruban Could you please share how you used Aeneas (maybe the code) to get word-level timestamps on the plain-text Whisper transcription? I've tried getting word-level timestamps using stable_whisper, but the results are not very accurate: it splits some words in half. I've been trying to use Aeneas as a Python library to get word-level timestamps. I am basically building automatic dubbing software as my final-year project.

andupotorac Jan 26, 2023


Use this for word-level timestamps: https://github.com/m-bain/whisperX

luisbnzsa Oct 18, 2023

I used one feature from the tips section of the repository: "use demucs=True to isolate vocals with Demucs; it is also effective at isolating vocals even if there is no music". In one of my tests, there was music for about 40 seconds and the timestamps at the end were 10 seconds off. After using demucs=True, they were correct.

LaurinmyReha Sep 6, 2024

This delivered the best word level timestamps (including fillers and pauses) in our experiments.

https://github.com/nyrahealth/CrisperWhisper
