word-level timestamps? #27

Open
antiboredom opened this issue Dec 14, 2023 · 7 comments

@antiboredom

Hi - thanks for making this. I was trying to get word-level timestamps, but haven't been able to figure out how to. Any tips? Thanks again!

@absadiki
Owner

Hi @antiboredom,
You are welcome :) Glad you found it useful.

To achieve word-level timestamps, you will need to enable token_timestamps and set max_len to 1, like the following:

from pywhispercpp.model import Model

model = Model('base.en', n_threads=6)
words = model.transcribe('file.mp3', token_timestamps=True, max_len=1)
for word in words:
    print(word.text)

@antiboredom
Author

Thank you! Not sure why I was having trouble sorting that out myself!!

One more thing, and I'm not sure if this is just a whisper thing or related to your project, but I'm seeing one longer word being broken up. In my test case, "Enormous" is becoming "En", "orm", "ous". Any ideas why that might be happening?

@absadiki
Owner

It's a bit tricky to figure out, as this is not an exact word-level timestamp per se. max_len is the maximum segment length in characters, and you can set it to whatever number you want. When you set max_len to 1, every token ends up on its own line, which gives results similar to word-level timestamps.

I think this is the problem with your test case: it seems "Enormous" is tokenized into 3 tokens, and you get each token on its own. That said, I've never come across such a case!

Can you try changing max_len to 8, for example?

@antiboredom
Author

Interesting! When I set max_len to 8, I get "Enorm" and "ous", and then occasionally multiple words like "and if" appearing on the same line... I have also tried faster-whisper, which does work as expected for word-level timestamps, but is significantly slower than your implementation...

@absadiki
Owner

So you still get two separate pieces from "Enormous" even with max_len set to 8. Interesting test case!
Could you please share the audio file with me? I would like to test it on my end.

Yes, faster-whisper is great and should give you good results. It should be just as fast as well, at least it was when I tested it a while ago! But to be honest, I haven't compared the performance of the two implementations.

@dkakaie

dkakaie commented Aug 19, 2024

@antiboredom @abdeladim-s I think you might want to try the sow (split on word) option from whisper.cpp. I'm not entirely sure, but I think it concatenates tokens that don't start with a whitespace, thus keeping the tokens that form a single word together. So you may want to use the -ml 1 and -sow options together.

Shortened output from ./main -h in the whisper.cpp repo:
-sow, --split-on-word [false ] split on word rather than on token
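
To illustrate the idea, here is a minimal pure-Python sketch of the "glue tokens that don't start with a space onto the previous word" heuristic. It is not whisper.cpp's actual implementation; the function name merge_tokens_into_words and the (t0, t1, text) tuple layout are made up for this example, and the timestamps are arbitrary.

```python
def merge_tokens_into_words(tokens):
    """Merge token-level pieces into words.

    tokens: ordered iterable of (t0, t1, text) tuples (hypothetical layout).
    A token whose text does not start with a space is treated as a
    continuation of the previous word.
    """
    words = []
    for t0, t1, text in tokens:
        if words and not text.startswith(" "):
            # Continuation: extend the previous word's text and end time.
            prev_t0, _, prev_text = words[-1]
            words[-1] = (prev_t0, t1, prev_text + text)
        else:
            # A leading space (or the very first token) starts a new word.
            words.append((t0, t1, text.strip()))
    return words


# Made-up tokens mirroring the "En" / "orm" / "ous" case from this thread.
tokens = [
    (0, 20, " En"),
    (20, 45, "orm"),
    (45, 70, "ous"),
    (70, 90, " and"),
    (90, 100, " if"),
]
for t0, t1, word in merge_tokens_into_words(tokens):
    print(t0, t1, word)
```

Note that with this rule a punctuation token such as "," would also be attached to the preceding word, which may or may not be what you want.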

@absadiki
Owner

@dkakaie, I couldn't reproduce the issue with my test files at the time, but yes, you're probably right.
split_on_word can be passed as a parameter to the transcribe function as well.

Thanks @dkakaie for pointing that out!
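
For anyone landing here later, a rough sketch of what that could look like from Python, combining the parameters mentioned in this thread (token_timestamps, max_len, split_on_word). The helper name transcribe_words is made up, and it assumes each returned segment exposes t0, t1 and text attributes, so check that against your installed version.

```python
def transcribe_words(path, model_name="base.en"):
    """Return (t0, t1, word) tuples for an audio file.

    Hypothetical helper; assumes segments expose t0, t1 and text.
    """
    # Imported lazily so this sketch can be loaded without pywhispercpp installed.
    from pywhispercpp.model import Model

    model = Model(model_name)
    segments = model.transcribe(
        path,
        token_timestamps=True,  # enable per-token timing
        max_len=1,              # keep segments as short as possible
        split_on_word=True,     # split on words rather than on tokens
    )
    return [(seg.t0, seg.t1, seg.text.strip()) for seg in segments]


# Example usage (needs pywhispercpp installed and a real audio file):
# for t0, t1, word in transcribe_words("file.mp3"):
#     print(t0, t1, word)
```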
