We've been having some trouble syncing the output timestamps to the start of speech when speech doesn't begin exactly at the start of the file, or when it resumes after a pause within the file. Does anyone have insight into why this happens, or ideas for how to work around it? Overall results have been very impressive, both in accuracy and in sync within speech; it's just the starting points that are proving to be an issue.

As a hypothetical example: take a file that begins with ten seconds of music, followed by the words "Four score and seven years ago" from timestamp

Looking into the token-level timestamps with https://github.com/jianfch/stable-ts is interesting (big thank you to @jianfch!). We'll see something like the following:
So the first couple of words are each timestamped before the actual start of speech, with gaps between them that are far too large, but the later words come more or less into sync, and by the end of the line or phrase the timestamp of the last word is pretty much dead on. A potential workaround is to remove any non-speech audio with something like Silero VAD (as per #29) and stitch the files back together afterwards, but I'd still like to understand the issue first. Any insight would be very much appreciated!
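For the VAD workaround, the fiddly part is mapping timestamps from the trimmed (speech-only) audio back onto the original timeline. Here is a minimal sketch of that bookkeeping, assuming the VAD step has already produced sorted, non-overlapping speech intervals in seconds. The function names `build_offsets` and `to_original_time` are hypothetical, and no Silero VAD or Whisper calls are shown:

```python
from bisect import bisect_right


def build_offsets(speech_segments):
    """Build a lookup table from trimmed-audio time to original time.

    speech_segments: sorted, non-overlapping (start, end) pairs in seconds
    on the ORIGINAL timeline (assumed to come from a VAD pass).
    Returns a list of (trimmed_start, original_start, duration) tuples.
    """
    mapping = []
    trimmed = 0.0
    for start, end in speech_segments:
        mapping.append((trimmed, start, end - start))
        trimmed += end - start
    return mapping


def to_original_time(t, mapping):
    """Map a timestamp t (seconds in the trimmed audio) back to the original file."""
    starts = [m[0] for m in mapping]
    # Find the last kept chunk that begins at or before t.
    # A timestamp exactly on a chunk boundary is assigned to the later chunk.
    i = max(bisect_right(starts, t) - 1, 0)
    trimmed_start, original_start, duration = mapping[i]
    # Clamp so a timestamp past the end of the last chunk stays inside it.
    return original_start + min(t - trimmed_start, duration)


# Hypothetical example: speech at 10-13 s and 20-25 s in the original file.
mapping = build_offsets([(10.0, 13.0), (20.0, 25.0)])
print(to_original_time(1.5, mapping))  # 1.5 s into the trimmed audio
print(to_original_time(4.0, mapping))  # 1.0 s into the second speech chunk
```

With a table like this, the model only ever sees speech, and each word timestamp it returns can be shifted back to where that speech sits in the full file.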
Replies: 1 comment 14 replies
The script uses the start and end of the segments as min and max to select the token-level timestamps.
The `unstable_word_timestamps` for that segment might reveal more as to what's going on. 'Four' and 'score' might even have timestamps that start after 10 seconds.
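To make the min/max idea concrete, here is a rough sketch of how bounding by the segment could let an early timestamp through. This is not the actual stable-ts code: `pick_timestamp` is a hypothetical helper, the idea that each token carries a ranked list of candidate timestamps is an assumption, and every number below is invented for illustration.

```python
def pick_timestamp(candidates, seg_start, seg_end):
    """Pick the first candidate timestamp inside [seg_start, seg_end].

    candidates: candidate timestamps in seconds for one token, ordered from
    most to least likely (an assumed structure, not the library's own).
    If none fall inside the segment, clamp the top candidate to the bounds.
    """
    for t in candidates:
        if seg_start <= t <= seg_end:
            return t
    return min(max(candidates[0], seg_start), seg_end)


# Invented candidates for the token 'Four': one during the music, one at speech onset.
four_candidates = [2.4, 10.1]

# If the segment is reported as starting at 0 s (music included),
# the early candidate passes the min/max filter.
print(pick_timestamp(four_candidates, 0.0, 13.0))

# If the segment start is tightened to the real speech onset,
# the early candidate is rejected and the later one is chosen.
print(pick_timestamp(four_candidates, 10.0, 13.0))
```

If something like this is what's happening, the fix would be on the segment bounds rather than on the token timestamps themselves, which is also why trimming non-speech audio first could help.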