
Speech not detected by silero vad #1084

Closed
thewh1teagle opened this issue Jul 7, 2024 · 3 comments · Fixed by #1099

Comments

@thewh1teagle
Contributor

Hey
First of all, thanks for this great library! I like it a lot.

I created Rust bindings, and while working on the bindings for voice activity detection, I noticed that it sometimes fails to detect speech even though it is there, loud and clear.

So I checked the original silero-vad repository and compared it with sherpa-onnx on the same audio file. It turned out that the speech is missed only with sherpa-onnx; torch detects it.

Reproduce:

  1. Download audio file
wget https://github.com/thewh1teagle/sherpa-rs/raw/main/samples/motivation.wav
  2. Test silero vad with sherpa-onnx on the file:
main.py
# wget https://github.com/snakers4/silero-vad/raw/master/files/silero_vad.onnx
# wget https://github.com/thewh1teagle/sherpa-rs/raw/main/samples/motivation.wav
# pip3 install soundfile numpy sherpa_onnx
# python3 main.py

from pathlib import Path
from typing import Tuple

import numpy as np
import sherpa_onnx
import soundfile as sf 


def load_audio(filename: str) -> Tuple[np.ndarray, int]:
    data, sample_rate = sf.read(
        filename,
        always_2d=True,
        dtype="float32",
    )
    data = data[:, 0]  # use only the first channel
    samples = np.ascontiguousarray(data)
    
    # Add 1 seconds of padding to the end of the samples
    padding_samples = int(sample_rate * 1)
    samples = np.concatenate((samples, np.zeros(padding_samples, dtype=samples.dtype)))
    
    return samples, sample_rate

def main():
    samples, sample_rate = load_audio("motivation.wav")
    config = sherpa_onnx.VadModelConfig()
    config.silero_vad.model = "silero_vad.onnx"
    config.sample_rate = sample_rate

    window_size = config.silero_vad.window_size

    vad = sherpa_onnx.VoiceActivityDetector(config, buffer_size_in_seconds=3)
    while len(samples) > window_size:
        vad.accept_waveform(samples[:window_size])
        samples = samples[window_size:]
        if vad.is_speech_detected():
            while not vad.empty():
                start_sec = vad.front.start / sample_rate
                duration_sec = len(vad.front.samples) / sample_rate
                print(f"start={start_sec}s duration={duration_sec}s")
                vad.pop()

main()

Output:

$ python main.py
start=0.926s duration=1.678s
start=3.774s duration=2.414s
start=7.518s duration=1.998s
  3. Test it again with torch
main.py
# wget https://github.com/thewh1teagle/sherpa-rs/raw/main/samples/motivation.wav
# pip install torch torchaudio
# python3 main.py

import torch 
torch.set_num_threads(1)

model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad', model='silero_vad')
(get_speech_timestamps, _, read_audio, _, _) = utils

wav = read_audio('motivation.wav')
sample_rate = 16000

def convert_samples_to_seconds(timestamps, sample_rate):
    return [{'start': ts['start'] / sample_rate, 'end': ts['end'] / sample_rate} for ts in timestamps]

speech_timestamps = get_speech_timestamps(wav, model)
readable_timestamps = convert_samples_to_seconds(speech_timestamps, sample_rate)
for timestamp in readable_timestamps:
    print(f"start={timestamp['start']} end={timestamp['end']}")

Output:

$ python main.py
start=0.738 end=2.43
start=3.81 end=6.174
start=7.554 end=9.15
start=12.77 end=20.0

Expected behavior:
The sherpa-onnx VAD result should include the segment from 12.77s to 20.0s.

Actual behavior:
That segment is missing, although it is detected when running inference on the model with torch.

@thewh1teagle
Contributor Author

thewh1teagle commented Jul 9, 2024

I solved the issue.
There were three different problems:

  1. I had to pad the samples with zeros at the end so that the final speech segment is detected.
  2. I had to change the loop logic to also process the remaining samples at the end (the loop in the examples only advances in whole windows).
  3. I had to increase the buffer_size_in_seconds parameter; at 3.0 the buffer overflowed and segments were lost, so I increased it to 60.0.

A sketch combining these fixes is shown below.

By the way, it might be better to document that results are lost when the buffer overflows.
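Here is a minimal sketch of the adjusted reproduction loop with the three fixes combined (the detect_speech helper name and the 60-second buffer value are illustrative choices, not the exact code from the fix):

import numpy as np
import sherpa_onnx
import soundfile as sf


def detect_speech(filename: str) -> None:
    data, sample_rate = sf.read(filename, always_2d=True, dtype="float32")
    samples = np.ascontiguousarray(data[:, 0])  # use only the first channel

    config = sherpa_onnx.VadModelConfig()
    config.silero_vad.model = "silero_vad.onnx"
    config.sample_rate = sample_rate
    window_size = config.silero_vad.window_size

    # Fix 3: use a larger buffer so segments are not dropped when it overflows.
    vad = sherpa_onnx.VoiceActivityDetector(config, buffer_size_in_seconds=60)

    # Fix 1: append one second of silence so a segment at the very end is closed.
    samples = np.concatenate((samples, np.zeros(sample_rate, dtype=samples.dtype)))

    # Fix 2: round the length up to a whole number of windows so no trailing
    # samples are skipped by the window-sized loop below.
    remainder = len(samples) % window_size
    if remainder:
        padding = np.zeros(window_size - remainder, dtype=samples.dtype)
        samples = np.concatenate((samples, padding))

    for offset in range(0, len(samples), window_size):
        vad.accept_waveform(samples[offset:offset + window_size])
        while not vad.empty():
            start_sec = vad.front.start / sample_rate
            duration_sec = len(vad.front.samples) / sample_rate
            print(f"start={start_sec:.3f}s duration={duration_sec:.3f}s")
            vad.pop()


detect_speech("motivation.wav")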

@csukuangfj
Collaborator

By the way, the overflow log is just for your information. We will increase the buffer size internally.

@csukuangfj
Collaborator

Fixed in #1099
