Add pyannote vad (segmentation) model #1197
Would you like to contribute?
Unfortunately, I haven't worked with onnxruntime before, so I'm not sure how to implement it. I assume it would work similarly to the silero VAD implementation?
OK, we can take a look, but not this week. It may take some time to add it.
Meanwhile, I created a basic implementation in Python. It looks accurate:

```python
# python3 -m venv venv
# source venv/bin/activate
# pip3 install onnxruntime numpy librosa
# (fetch the raw model file, not the HTML page for it)
# wget 'https://github.com/pengzhendong/pyannote-onnx/raw/master/pyannote_onnx/segmentation-3.0.onnx'
# wget https://github.com/thewh1teagle/sherpa-rs/releases/download/v0.1.0/motivation.wav -O test.wav
# python3 main.py
import librosa
import numpy as np
import onnxruntime as ort


def init_session(model_path):
    opts = ort.SessionOptions()
    opts.inter_op_num_threads = 1
    opts.intra_op_num_threads = 1
    opts.log_severity_level = 3
    return ort.InferenceSession(model_path, sess_options=opts)


def read_wav(path: str):
    # The model expects 16 kHz mono input.
    return librosa.load(path, sr=16000)


if __name__ == '__main__':
    session = init_session('segmentation-3.0.onnx')
    samples, sample_rate = read_wav('test.wav')

    # Frame timing constants come from the model's SincNet front end
    # (Conv1d / MaxPool1d receptive field):
    # https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html
    # https://pytorch.org/docs/stable/generated/torch.nn.MaxPool1d.html
    # https://github.com/pyannote/pyannote-audio/blob/develop/pyannote/audio/models/blocks/sincnet.py#L50-L71
    frame_size = 270   # samples per output frame (~17 ms at 16 kHz)
    frame_start = 721  # offset of the first frame (receptive-field size)
    window_size = sample_rate * 10  # process 10 s of audio at a time

    # State and offset
    is_speeching = False
    offset = frame_start
    start_offset = 0

    # Pad the end with silence so the last window is full-length.
    samples = np.pad(samples, (0, window_size), 'constant')

    for start in range(0, len(samples), window_size):
        window = samples[start:start + window_size]
        # Input shape is (batch, channel, samples); keep the per-frame
        # class probabilities of the first (only) batch item.
        ort_outs: np.ndarray = session.run(None, {'input': window[None, None, :]})[0][0]
        for probs in ort_outs:
            # Class 0 is silence; any other class means at least one
            # active speaker (the model predicts speaker combinations).
            predicted_id = np.argmax(probs)
            if predicted_id != 0:
                if not is_speeching:
                    start_offset = offset
                    is_speeching = True
            elif is_speeching:
                start_sec = round(start_offset / sample_rate, 3)
                end_sec = round(offset / sample_rate, 3)
                print(f'{start_sec}s - {end_sec}s')
                is_speeching = False
            offset += frame_size
```
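As a possible refinement (not part of the script above): per-frame argmax decisions can flicker at segment boundaries, so a small post-processing pass that merges segments separated by short gaps and drops very short segments may help. The function and threshold values below are a hypothetical sketch, not anything from pyannote-onnx:

```python
def smooth_segments(segments, min_speech=0.25, min_gap=0.1):
    """Merge segments separated by short gaps and drop very short ones.

    `segments` is a list of (start, end) pairs in seconds; the default
    thresholds are illustrative, not tuned constants.
    """
    merged = []
    for start, end in segments:
        if merged and start - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], end)  # close a short silence gap
        else:
            merged.append((start, end))
    return [(s, e) for s, e in merged if e - s >= min_speech]
```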
@thewh1teagle, how accurate is it? Could you do me a favor and test the sherpa-onnx/sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01/test_wavs/4.wav file with your code, and paste all the "start-end" ranges here?
Sorry, I misunderstood: segmentation-3.0.onnx is not a syllable-level VAD; it can only detect the beginning and end of a sentence. The thetaOscillator syllable segmentation is better (but still not good enough).
Fixed in the latest master |
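For anyone landing here later, here is a minimal sketch of driving the pyannote segmentation model through sherpa-onnx's offline speaker diarization API. The class names and model paths follow the repo's Python examples as of this writing and should be treated as assumptions; check the current examples for the exact API:

```python
import librosa
import sherpa_onnx

# Placeholder model paths; download the actual models from the
# sherpa-onnx release pages.
config = sherpa_onnx.OfflineSpeakerDiarizationConfig(
    segmentation=sherpa_onnx.OfflineSpeakerSegmentationModelConfig(
        pyannote=sherpa_onnx.OfflineSpeakerSegmentationPyannoteModelConfig(
            model="./sherpa-onnx-pyannote-segmentation-3-0/model.onnx",
        ),
    ),
    embedding=sherpa_onnx.SpeakerEmbeddingExtractorConfig(
        model="./speaker-embedding-model.onnx",
    ),
    clustering=sherpa_onnx.FastClusteringConfig(num_clusters=2),
)

sd = sherpa_onnx.OfflineSpeakerDiarization(config)
samples, _ = librosa.load("test.wav", sr=sd.sample_rate)

# process() returns segments with start/end times and a speaker index.
for seg in sd.process(samples).sort_by_start_time():
    print(f"{seg.start:.3f}s - {seg.end:.3f}s speaker_{seg.speaker:02d}")
```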
@thewh1teagle @csukuangfj
Maybe you can help me improve it. Since segmentation-3.0.onnx can segment syllables (Mandarin pinyin) and is only 5.8 MB, I guess a very small model could recognize all 1300 mono-syllable pinyins. And if this tool gets improved, maybe it could help build a streamlined custom training process for sherpa-onnx-kws (#1371), so that users would only need to record their own voices (covering all 1300 pinyins) to train a custom model.
@diyism |
I would like to use sherpa-onnx for speaker diarization. However, the current VAD model (silero) doesn't work well and doesn't detect speech correctly.
I tried another ONNX model from the project pengzhendong/pyannote-onnx, and it detects speech much better.
It's based on ONNX too.
Can we add this model to sherpa-onnx?
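For context, this is roughly how the existing silero VAD is driven through the Python API (a sketch based on the repo's VAD examples; field names are assumptions and may differ across versions):

```python
import librosa
import sherpa_onnx

config = sherpa_onnx.VadModelConfig()
config.silero_vad.model = "./silero_vad.onnx"  # placeholder path
config.sample_rate = 16000

vad = sherpa_onnx.VoiceActivityDetector(config, buffer_size_in_seconds=30)

samples, _ = librosa.load("test.wav", sr=config.sample_rate)

# Feed the audio in windows of the size the model expects.
window = config.silero_vad.window_size
for i in range(0, len(samples), window):
    vad.accept_waveform(samples[i:i + window])

# Drain the detected speech segments.
while not vad.empty():
    seg = vad.front
    start = seg.start / config.sample_rate
    end = (seg.start + len(seg.samples)) / config.sample_rate
    print(f"{start:.3f}s - {end:.3f}s")
    vad.pop()
```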