Add pyannote vad (segmentation) model #1197

Closed
thewh1teagle opened this issue Jul 31, 2024 · 9 comments

@thewh1teagle
Contributor

I would like to use sherpa-onnx for speaker diarization. However, the current VAD model (silero) doesn't work well and doesn't detect speech correctly.
I tried the model from the pengzhendong/pyannote-onnx project and it detects speech much better. It is also ONNX-based.
Can we add this model to sherpa-onnx?

@csukuangfj
Collaborator

Would you like to contribute?

@thewh1teagle
Contributor Author

> Would you like to contribute?

Unfortunately, I haven't worked with onnxruntime before, so I'm not sure how to implement it. I assume it should work similarly to the implementation of silero vad?

@csukuangfj
Collaborator

> > Would you like to contribute?
>
> Unfortunately, I haven't worked with onnxruntime before, so I'm not sure how to implement it. I assume it should work similarly to the implementation of silero vad?

OK, we can take a look, but not this week. It may take some time to add it.

@thewh1teagle
Contributor Author

thewh1teagle commented Aug 2, 2024

> OK, we can take a look, but not this week. It may take some time to add it.

Meanwhile, I created a basic implementation in Python. It looks accurate:

```python
# python3 -m venv venv
# source venv/bin/activate
# pip3 install onnxruntime numpy librosa
# wget https://github.com/pengzhendong/pyannote-onnx/raw/master/pyannote_onnx/segmentation-3.0.onnx
# wget https://github.com/thewh1teagle/sherpa-rs/releases/download/v0.1.0/motivation.wav -O test.wav
# python3 main.py

import librosa
import numpy as np
import onnxruntime as ort


def init_session(model_path: str) -> ort.InferenceSession:
    opts = ort.SessionOptions()
    opts.inter_op_num_threads = 1
    opts.intra_op_num_threads = 1
    opts.log_severity_level = 3
    return ort.InferenceSession(model_path, sess_options=opts)


def read_wav(path: str):
    # The model expects 16 kHz mono audio.
    return librosa.load(path, sr=16000)


if __name__ == '__main__':
    session = init_session('segmentation-3.0.onnx')
    samples, sample_rate = read_wav('test.wav')

    # Hop and offset (in input samples) of one output frame, determined by the
    # Conv1d / MaxPool1d strides of the model's SincNet front end:
    # https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html
    # https://pytorch.org/docs/stable/generated/torch.nn.MaxPool1d.html
    # https://github.com/pyannote/pyannote-audio/blob/develop/pyannote/audio/models/blocks/sincnet.py#L50-L71
    frame_size = 270
    frame_start = 721
    window_size = sample_rate * 10  # the model processes 10-second windows

    # Segment state and the current position in input samples.
    is_speaking = False
    offset = frame_start
    start_offset = 0

    # Pad the end with silence so the last window is complete.
    samples = np.pad(samples, (0, window_size), 'constant')

    for start in range(0, len(samples), window_size):
        window = samples[start:start + window_size]
        # Input shape: (batch, channel, samples); output: per-frame class probabilities.
        ort_outs: np.ndarray = session.run(None, {'input': window[None, None, :]})[0][0]
        for probs in ort_outs:
            predicted_id = np.argmax(probs)
            if predicted_id != 0:  # class 0 is treated as "no speech"
                if not is_speaking:
                    start_offset = offset
                    is_speaking = True
            elif is_speaking:
                seg_start = round(start_offset / sample_rate, 3)
                seg_end = round(offset / sample_rate, 3)
                print(f'{seg_start}s - {seg_end}s')
                is_speaking = False
            offset += frame_size
```
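As a side note, the `frame_size` and `frame_start` constants above follow from the kernel sizes and strides of the SincNet front end linked in the comment. Here is a small sketch of that arithmetic, assuming the standard pyannote-audio SincNet layers (SincConv kernel 251 / stride 10, two Conv1d layers with kernel 5 / stride 1, and three MaxPool1d(kernel 3, stride 3) stages):

```python
# Sketch: deriving frame_size / frame_start from assumed SincNet layer shapes.
# Each entry is (kernel, stride) for one Conv1d/MaxPool1d layer, in order.
layers = [(251, 10), (3, 3), (5, 1), (3, 3), (5, 1), (3, 3)]

receptive_field, jump = 1, 1
for kernel, stride in layers:
    receptive_field += (kernel - 1) * jump  # receptive field in input samples
    jump *= stride                          # hop between adjacent output frames

print(jump)                    # 270 -> frame_size
print(receptive_field - jump)  # 721 -> frame_start
```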

@diyism
Contributor

diyism commented Sep 13, 2024

@thewh1teagle, how accurate is it? Could you do me a favor and test the sherpa-onnx/sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01/test_wavs/4.wav file with your code and paste all of the "start - end" times here?
I'm comparing syllable segmentation tools in this issue: #920 (comment)

```bash
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/kws-models/sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01.tar.bz2
```

@diyism
Contributor

diyism commented Sep 14, 2024

Sorry, I misunderstood: segmentation-3.0.onnx is not a syllable-level VAD; it can only detect the beginning and end of a sentence:
[screenshot of the segmentation result]

The thetaOscillator-syllable-segmentation is better (but not good enough):
#920 (comment)

@csukuangfj
Collaborator

Fixed in the latest master

@diyism
Contributor

diyism commented Nov 3, 2024

@thewh1teagle @csukuangfj
I was wrong: the pyannote segmentation-3.0.onnx can indeed segment syllables (Mandarin pinyin).
For sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01_test_wavs_4.wav, it segments the first 7 syllables well, but the last 5 are not so accurate:
https://github.com/diyism/pyannote_segment_syllables

```bash
$ git clone https://github.com/diyism/pyannote_segment_syllables
$ cd pyannote_segment_syllables/
$ python main.py sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01_test_wavs_4.wav
Found 12 syllables:
0.560s - 0.742s
0.742s - 1.066s
1.066s - 1.298s
1.645s - 1.920s
2.035s - 2.203s
2.203s - 2.470s
2.555s - 2.725s
2.725s - 2.960s
3.150s - 3.250s
3.250s - 3.475s
3.550s - 3.760s
3.760s - 3.975s
Saved syllable 001: 0.560s - 0.742s (duration: 0.182s)
Saved syllable 002: 0.742s - 1.066s (duration: 0.324s)
Saved syllable 003: 1.066s - 1.298s (duration: 0.232s)
Saved syllable 004: 1.645s - 1.920s (duration: 0.275s)
Saved syllable 005: 2.035s - 2.203s (duration: 0.167s)
Saved syllable 006: 2.203s - 2.470s (duration: 0.267s)
Saved syllable 007: 2.555s - 2.725s (duration: 0.170s)
Saved syllable 008: 2.725s - 2.960s (duration: 0.235s)
Saved syllable 009: 3.150s - 3.250s (duration: 0.100s)
Saved syllable 010: 3.250s - 3.475s (duration: 0.225s)
Saved syllable 011: 3.550s - 3.760s (duration: 0.210s)
Saved syllable 012: 3.760s - 3.975s (duration: 0.215s)

$ aplay syllables/001.wav
$ aplay syllables/002.wav
$ aplay syllables/003.wav
```

Maybe you can help me to improve it.

I guess that since segmentation-3.0.onnx can segment syllables (Mandarin pinyin), maybe a very small model could recognize all ~1300 monosyllabic pinyins, given that segmentation-3.0.onnx is only 5.8 MB.

And if this tool gets improved, maybe it can help build a streamlined custom training process for sherpa-onnx-kws (#1371), so that users only need to record their own voices (covering all ~1300 pinyins) to train a custom model.

@thewh1teagle
Contributor Author

@diyism
I recommend running your tests with the pyannote-audio library.
If the issues persist, it's likely a problem with the model itself; at the very least, you can open an issue there or investigate it further.
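
For reference, here's a minimal sketch of such a test using pyannote.audio directly (assuming pyannote.audio 3.x and a Hugging Face access token that has been granted access to pyannote/segmentation-3.0); its output can be compared against the ONNX script above:

```python
# pip3 install pyannote.audio
# Minimal VAD sketch with pyannote.audio 3.x; "HF_TOKEN" is a placeholder for
# a real Hugging Face access token with access to pyannote/segmentation-3.0.
from pyannote.audio import Model
from pyannote.audio.pipelines import VoiceActivityDetection

model = Model.from_pretrained("pyannote/segmentation-3.0", use_auth_token="HF_TOKEN")
pipeline = VoiceActivityDetection(segmentation=model)
pipeline.instantiate({
    "min_duration_on": 0.0,   # drop speech regions shorter than this (seconds)
    "min_duration_off": 0.0,  # fill non-speech gaps shorter than this (seconds)
})

for segment in pipeline("test.wav").get_timeline().support():
    print(f"{segment.start:.3f}s - {segment.end:.3f}s")
```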
