Add pyannote vad (segmentation) model #1197

Closed
thewh1teagle opened this issue Jul 31, 2024 · 9 comments

@thewh1teagle
Contributor

I would like to use sherpa-onnx for speaker diarization. However, the current VAD model (silero) doesn't work well and doesn't detect speech correctly.
I tried the model from the pengzhendong/pyannote-onnx project and it detects speech much better. It is also ONNX-based.
Can we add this model to sherpa-onnx?

@csukuangfj
Collaborator

Would you like to contribute?

@thewh1teagle
Contributor Author

> Would you like to contribute?

Unfortunately, I haven't worked with onnxruntime before, so I'm not sure how to implement it. I assume it should work similarly to the implementation of silero vad?

@csukuangfj
Collaborator

> > Would you like to contribute?
>
> Unfortunately, I haven't worked with onnxruntime before, so I'm not sure how to implement it. I assume it should work similarly to the implementation of silero vad?

OK, we can take a look, but not this week. It may take some time to add it.

@thewh1teagle
Contributor Author

thewh1teagle commented Aug 2, 2024

> OK, we can take a look, but not this week. It may take some time to add it.

Meanwhile, I created a basic implementation in Python. It looks accurate:

```python
# python3 -m venv venv
# source venv/bin/activate
# pip3 install onnxruntime numpy librosa
# wget https://github.com/pengzhendong/pyannote-onnx/raw/master/pyannote_onnx/segmentation-3.0.onnx
# wget https://github.com/thewh1teagle/sherpa-rs/releases/download/v0.1.0/motivation.wav -O test.wav
# python3 main.py

import librosa
import numpy as np
import onnxruntime as ort


def init_session(model_path: str) -> ort.InferenceSession:
    opts = ort.SessionOptions()
    opts.inter_op_num_threads = 1
    opts.intra_op_num_threads = 1
    opts.log_severity_level = 3
    return ort.InferenceSession(model_path, sess_options=opts)


def read_wav(path: str):
    # The model expects 16 kHz mono audio.
    return librosa.load(path, sr=16000)


if __name__ == '__main__':
    session = init_session('segmentation-3.0.onnx')
    samples, sample_rate = read_wav('test.wav')

    # Hop and offset (in input samples) of one output frame, determined by the
    # Conv1d / MaxPool1d strides of the model's SincNet front end:
    # https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html
    # https://pytorch.org/docs/stable/generated/torch.nn.MaxPool1d.html
    # https://github.com/pyannote/pyannote-audio/blob/develop/pyannote/audio/models/blocks/sincnet.py#L50-L71
    frame_size = 270
    frame_start = 721
    window_size = sample_rate * 10  # the model processes 10-second windows

    # Segment state and the current position in input samples.
    is_speaking = False
    offset = frame_start
    start_offset = 0

    # Pad the end with silence so the last window is complete.
    samples = np.pad(samples, (0, window_size), 'constant')

    for start in range(0, len(samples), window_size):
        window = samples[start:start + window_size]
        # Input shape: (batch, channel, samples); output: per-frame class probabilities.
        ort_outs: np.ndarray = session.run(None, {'input': window[None, None, :]})[0][0]
        for probs in ort_outs:
            predicted_id = np.argmax(probs)
            if predicted_id != 0:  # class 0 is treated as "no speech"
                if not is_speaking:
                    start_offset = offset
                    is_speaking = True
            elif is_speaking:
                seg_start = round(start_offset / sample_rate, 3)
                seg_end = round(offset / sample_rate, 3)
                print(f'{seg_start}s - {seg_end}s')
                is_speaking = False
            offset += frame_size
```
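As a side note, the `frame_size` and `frame_start` constants above follow from the kernel sizes and strides of the SincNet front end linked in the comment. Here is a small sketch of that arithmetic, assuming the standard pyannote-audio SincNet layers (SincConv kernel 251 / stride 10, two Conv1d layers with kernel 5 / stride 1, and three MaxPool1d(kernel 3, stride 3) stages):

```python
# Sketch: deriving frame_size / frame_start from assumed SincNet layer shapes.
# Each entry is (kernel, stride) for one Conv1d/MaxPool1d layer, in order.
layers = [(251, 10), (3, 3), (5, 1), (3, 3), (5, 1), (3, 3)]

receptive_field, jump = 1, 1
for kernel, stride in layers:
    receptive_field += (kernel - 1) * jump  # receptive field in input samples
    jump *= stride                          # hop between adjacent output frames

print(jump)                    # 270 -> frame_size
print(receptive_field - jump)  # 721 -> frame_start
```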

@diyism
Contributor

diyism commented Sep 13, 2024

@thewh1teagle, how accurate is it? Could you do me a favor and test the sherpa-onnx/sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01/test_wavs/4.wav file with your code and paste all of the "start - end" times here?
I'm comparing syllable segmentation tools in this issue: #920 (comment)

```bash
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/kws-models/sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01.tar.bz2
```

@diyism
Contributor

diyism commented Sep 14, 2024

Sorry, I misunderstood: segmentation-3.0.onnx is not a syllable-level VAD; it can only detect the beginning and end of a sentence:
[screenshot of the segmentation result]

The thetaOscillator-syllable-segmentation is better (but not good enough):
#920 (comment)

@csukuangfj
Collaborator

Fixed in the latest master

@diyism
Contributor

diyism commented Nov 3, 2024

@thewh1teagle @csukuangfj
I was wrong: the pyannote segmentation-3.0.onnx can indeed segment syllables (Mandarin pinyin).
For sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01_test_wavs_4.wav, it segments the first 7 syllables well, but the last 5 are not so accurate:
https://github.com/diyism/pyannote_segment_syllables

```bash
$ git clone https://github.com/diyism/pyannote_segment_syllables
$ cd pyannote_segment_syllables/
$ python main.py sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01_test_wavs_4.wav
Found 12 syllables:
0.560s - 0.742s
0.742s - 1.066s
1.066s - 1.298s
1.645s - 1.920s
2.035s - 2.203s
2.203s - 2.470s
2.555s - 2.725s
2.725s - 2.960s
3.150s - 3.250s
3.250s - 3.475s
3.550s - 3.760s
3.760s - 3.975s
Saved syllable 001: 0.560s - 0.742s (duration: 0.182s)
Saved syllable 002: 0.742s - 1.066s (duration: 0.324s)
Saved syllable 003: 1.066s - 1.298s (duration: 0.232s)
Saved syllable 004: 1.645s - 1.920s (duration: 0.275s)
Saved syllable 005: 2.035s - 2.203s (duration: 0.167s)
Saved syllable 006: 2.203s - 2.470s (duration: 0.267s)
Saved syllable 007: 2.555s - 2.725s (duration: 0.170s)
Saved syllable 008: 2.725s - 2.960s (duration: 0.235s)
Saved syllable 009: 3.150s - 3.250s (duration: 0.100s)
Saved syllable 010: 3.250s - 3.475s (duration: 0.225s)
Saved syllable 011: 3.550s - 3.760s (duration: 0.210s)
Saved syllable 012: 3.760s - 3.975s (duration: 0.215s)

$ aplay syllables/001.wav
$ aplay syllables/002.wav
$ aplay syllables/003.wav
```

Maybe you can help me to improve it.

I guess that since segmentation-3.0.onnx can segment syllables (Mandarin pinyin), maybe a very small model could recognize all ~1300 monosyllabic pinyins, given that segmentation-3.0.onnx is only 5.8 MB.

And if this tool gets improved, maybe it can help build a streamlined custom training process for sherpa-onnx-kws (#1371), so that users only need to record their own voices (covering all ~1300 pinyins) to train a custom model.

@thewh1teagle
Contributor Author

@diyism
I recommend running your tests with the pyannote-audio library.
If the issues persist, it's likely a problem with the model itself; at the very least, you can open an issue there or investigate it further.
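
For reference, here's a minimal sketch of such a test using pyannote.audio directly (assuming pyannote.audio 3.x and a Hugging Face access token that has been granted access to pyannote/segmentation-3.0); its output can be compared against the ONNX script above:

```python
# pip3 install pyannote.audio
# Minimal VAD sketch with pyannote.audio 3.x; "HF_TOKEN" is a placeholder for
# a real Hugging Face access token with access to pyannote/segmentation-3.0.
from pyannote.audio import Model
from pyannote.audio.pipelines import VoiceActivityDetection

model = Model.from_pretrained("pyannote/segmentation-3.0", use_auth_token="HF_TOKEN")
pipeline = VoiceActivityDetection(segmentation=model)
pipeline.instantiate({
    "min_duration_on": 0.0,   # drop speech regions shorter than this (seconds)
    "min_duration_off": 0.0,  # fill non-speech gaps shorter than this (seconds)
})

for segment in pipeline("test.wav").get_timeline().support():
    print(f"{segment.start:.3f}s - {segment.end:.3f}s")
```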
