Silero-based audio chunker #857

We use an audio chunker to run Whisper inference in a streaming manner.

https://github.com/fastrepl/hyprnote/blob/573d5ef602b93638e7dc73a87ed64917d18bfc8a/plugins/local-stt/src/server.rs#L148-L150

A good chunker is important. At a high level, it should satisfy the following:
- Chunks are at most 30 seconds long (Whisper constraint). Users want to see results at a faster tempo, so we are targeting around 12 seconds. This might need some scoring mechanism, e.g. `VAD_prob * buffer_length`.
- It should split on silence and strip as much silence as possible (#662), because Whisper tends to hallucinate a lot on empty audio.
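
One possible reading of the `VAD_prob * buffer_length` idea, as a minimal sketch rather than the actual implementation. It assumes the VAD reports a speech probability per frame and that we want to split when the frame is likely silence and the buffer is already long; the names `split_score` and `should_split` and the threshold value are hypothetical.

```rust
/// Hypothetical split score: grows with the buffered duration and with how
/// likely the current frame is silence (1 - speech probability).
fn split_score(speech_prob: f32, buffered_secs: f32) -> f32 {
    let silence_prob = 1.0 - speech_prob.clamp(0.0, 1.0);
    silence_prob * buffered_secs
}

/// Split when the score crosses `threshold`, or unconditionally once the
/// buffer reaches `max_secs` (30 s for Whisper).
fn should_split(speech_prob: f32, buffered_secs: f32, threshold: f32, max_secs: f32) -> bool {
    buffered_secs >= max_secs || split_score(speech_prob, buffered_secs) >= threshold
}
```

With a threshold of about 10, a clearly silent frame triggers a split near the 12-second target, while a short buffer or an ongoing utterance does not.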

---

Our current approach:

[chunker/stream.rs](https://github.com/fastrepl/hyprnote/blob/573d5ef602b93638e7dc73a87ed64917d18bfc8a/crates/chunker/src/stream.rs#) works with a pluggable predictor.

Currently we use a very simple RMS-based predictor:

https://github.com/fastrepl/hyprnote/blob/573d5ef602b93638e7dc73a87ed64917d18bfc8a/crates/chunker/src/predictor.rs#L14-L25
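
For readers who do not want to open the link, an RMS-based predictor can be sketched roughly as follows. This is an illustration, not a copy of `predictor.rs`: the `Predictor` trait, `RmsPredictor`, and the threshold field are assumed names, and samples are assumed to be `f32` in the range [-1.0, 1.0].

```rust
/// Root-mean-square energy of one frame of samples.
fn rms(samples: &[f32]) -> f32 {
    if samples.is_empty() {
        return 0.0;
    }
    let sum_sq: f32 = samples.iter().map(|s| s * s).sum();
    (sum_sq / samples.len() as f32).sqrt()
}

/// Hypothetical interface for a pluggable predictor.
trait Predictor {
    fn is_speech(&self, samples: &[f32]) -> bool;
}

/// Treats a frame as speech when its RMS energy exceeds a fixed threshold.
struct RmsPredictor {
    threshold: f32,
}

impl Predictor for RmsPredictor {
    fn is_speech(&self, samples: &[f32]) -> bool {
        rms(samples) > self.threshold
    }
}
```

An energy threshold cannot tell speech from loud non-speech noise, which is one reason to consider a model-based VAD such as Silero behind the same interface.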

Max-length constraint: 

https://github.com/fastrepl/hyprnote/blob/573d5ef602b93638e7dc73a87ed64917d18bfc8a/crates/chunker/src/stream.rs#L70
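
A minimal sketch of how a max-length constraint can sit next to silence-based splitting. It is not the code in `stream.rs`; `Chunker`, `push`, and `energy_predictor` are hypothetical names, and the predictor is a plain function pointer to keep the example small. For Whisper at 16 kHz, `max_samples` would be `30 * 16_000`.

```rust
/// Stand-in predictor: any sample above a fixed amplitude counts as speech.
fn energy_predictor(frame: &[f32]) -> bool {
    frame.iter().any(|s| s.abs() > 0.1)
}

struct Chunker {
    is_speech: fn(&[f32]) -> bool,
    max_samples: usize,
    buffer: Vec<f32>,
}

impl Chunker {
    fn new(is_speech: fn(&[f32]) -> bool, max_samples: usize) -> Self {
        Self { is_speech, max_samples, buffer: Vec::new() }
    }

    /// Feed one frame; returns a finished chunk when one is ready.
    /// Silent frames are dropped and close the current chunk.
    /// A speech frame that would push the buffer past `max_samples`
    /// flushes the buffer first and then starts the next chunk.
    fn push(&mut self, frame: &[f32]) -> Option<Vec<f32>> {
        if (self.is_speech)(frame) {
            let mut out = None;
            if !self.buffer.is_empty() && self.buffer.len() + frame.len() > self.max_samples {
                out = Some(std::mem::take(&mut self.buffer));
            }
            self.buffer.extend_from_slice(frame);
            out
        } else if self.buffer.is_empty() {
            None
        } else {
            Some(std::mem::take(&mut self.buffer))
        }
    }
}
```

As long as a single frame is no longer than `max_samples`, no emitted chunk exceeds the limit. A forced flush can still cut in the middle of a word, which is where a scoring mechanism would help pick a better split point.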

---

[silero-rs](https://github.com/emotechlab/silero-rs) is a well-tested implementation. ([Blog post](https://xd009642.github.io/2024/08/23/snapshot-testing-neural-networks.html))

We are not using it because it is hard to enforce the 30-second max constraint with it. (https://github.com/emotechlab/silero-rs/issues/31)


https://github.com/emotechlab/silero-rs/blob/6e8637b9d06cac41bbfe47e9933289f16ecbf87f/src/lib.rs#L85-L99

https://github.com/emotechlab/silero-rs/blob/6e8637b9d06cac41bbfe47e9933289f16ecbf87f/src/lib.rs#L387-L399

---

We have a dataset to test whether the chunker works well.

https://github.com/fastrepl/hyprnote/blob/573d5ef602b93638e7dc73a87ed64917d18bfc8a/crates/chunker/src/lib.rs#L33
