Closed
Description
We use an audio chunker to run Whisper inference in a streaming manner.
A good chunker is important. At a high level, it should:
- Respect the 30 sec maximum (a Whisper constraint). Users want to see results at a faster tempo, so we are targeting around 12 sec. This might need some scoring mechanism, such as VAD_prob * buffer_length.
- Split on silence and strip as much silence as possible (Filter out silences #662). Whisper tends to hallucinate a lot on empty audio.
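To make the scoring idea concrete, here is a rough sketch. It assumes `VAD_prob` means the probability that the current frame is silence, which the issue does not specify. The function names and the threshold value are hypothetical, not taken from the codebase.

```rust
/// Hard upper bound on chunk length imposed by Whisper.
const MAX_CHUNK_SECS: f32 = 30.0;

/// Hypothetical score: grows with both the silence probability of the
/// current frame and the amount of audio already buffered.
fn split_score(silence_prob: f32, buffered_secs: f32) -> f32 {
    silence_prob.clamp(0.0, 1.0) * buffered_secs
}

/// Cut the buffer when the score crosses `threshold`, or unconditionally
/// once the buffer reaches the 30 sec cap.
fn should_split(silence_prob: f32, buffered_secs: f32, threshold: f32) -> bool {
    buffered_secs >= MAX_CHUNK_SECS
        || split_score(silence_prob, buffered_secs) >= threshold
}
```

With a threshold of about 10, a confident pause at 12 sec triggers a split, while the same pause at 2 sec does not. Short buffers therefore need very strong evidence of silence, and long buffers split on weaker evidence.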
Our current approach:
chunker/stream.rs works with a pluggable predictor.
Currently we use a very simple RMS-based predictor:
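A minimal sketch of what a pluggable RMS-based predictor could look like. The trait name, method name, and threshold field here are assumptions for illustration and may not match what chunker/stream.rs actually defines.

```rust
/// Hypothetical plug-in point: anything that can classify a frame as speech.
trait Predictor {
    fn is_speech(&self, frame: &[f32]) -> bool;
}

/// Energy-based predictor: a frame counts as speech when its RMS
/// exceeds a fixed threshold.
struct RmsPredictor {
    threshold: f32,
}

/// Root mean square of a frame of samples; 0.0 for an empty frame.
fn rms(frame: &[f32]) -> f32 {
    if frame.is_empty() {
        return 0.0;
    }
    let sum_sq: f32 = frame.iter().map(|s| s * s).sum();
    (sum_sq / frame.len() as f32).sqrt()
}

impl Predictor for RmsPredictor {
    fn is_speech(&self, frame: &[f32]) -> bool {
        rms(frame) > self.threshold
    }
}
```

Keeping the predictor behind a trait means a model-based VAD could be swapped in later without touching the chunking logic.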
Max-length constraint:
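One possible way to enforce the cap, sketched under assumptions: the `Chunker` type, its fields, and the frame-based `push` API below are hypothetical and not the actual implementation. Silent frames are dropped entirely, which is a simplification of "strip silence as much as possible".

```rust
/// Accumulates frames and emits a chunk on a long enough pause
/// or when the hard length cap is reached.
struct Chunker {
    buffer: Vec<f32>,
    max_samples: usize,        // hard cap, e.g. 30 sec * sample rate
    silence_frames: usize,     // consecutive silent frames seen so far
    min_silence_frames: usize, // silent frames required to trigger a split
}

impl Chunker {
    fn new(sample_rate: usize, max_secs: usize, min_silence_frames: usize) -> Self {
        Chunker {
            buffer: Vec::new(),
            max_samples: sample_rate * max_secs,
            silence_frames: 0,
            min_silence_frames,
        }
    }

    /// Push one frame with its speech/silence label.
    /// Returns a finished chunk when a split is triggered.
    fn push(&mut self, frame: &[f32], is_speech: bool) -> Option<Vec<f32>> {
        if is_speech {
            self.silence_frames = 0;
            self.buffer.extend_from_slice(frame);
        } else {
            // Silent frames are counted but never buffered.
            self.silence_frames += 1;
        }

        let pause = self.silence_frames >= self.min_silence_frames
            && !self.buffer.is_empty();
        let full = self.buffer.len() >= self.max_samples;

        if pause || full {
            self.silence_frames = 0;
            // Never emit more than the cap; keep any overflow for the next chunk.
            let cut = self.buffer.len().min(self.max_samples);
            let rest = self.buffer.split_off(cut);
            return Some(std::mem::replace(&mut self.buffer, rest));
        }
        None
    }
}
```

The sketch assumes frames are much shorter than the cap, so the overflow carried into the next chunk is always small.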
silero-rs is a well-tested implementation (Blog post).
We are not using it because it is hard to enforce the 30 sec max constraint (emotechlab/silero-rs#31).
We have a dataset to test whether the chunker works well.