Remove the usage of `transformers.pipeline` from `BatchedInferencePipeline` and fix word timestamps for batched inference #921

Conversation
@@ -621,10 +509,27 @@ def transcribe(
            all_language_probs=all_language_probs,
        )

        audio_segments, segments_metadata = self.audio_split(
            audio, vad_segments, sampling_rate
Can you make the check for shorter audio the first condition after `if not vad_segments` on line 439? This is to avoid VAD errors such as m-bain/whisperX#844 by first checking the condition:

    if duration < self.chunk_length:
        vad_segments = [
            {"start": 0.0, "end": duration, "segments": [(0.0, duration)]}
        ]
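The guard suggested above can be sketched as a small helper (`plan_vad_segments` is a hypothetical name for illustration, not the actual faster-whisper API):

```python
def plan_vad_segments(duration, chunk_length, run_vad):
    """Return VAD segments, skipping the VAD model for short audio.

    When the audio is shorter than one chunk, treat the whole file as a
    single voiced segment instead of calling the VAD model, which can
    fail on short clips (cf. m-bain/whisperX#844).
    """
    if duration < chunk_length:
        return [{"start": 0.0, "end": duration, "segments": [(0.0, duration)]}]
    return run_vad()  # fall back to the real VAD for longer audio
```

The VAD callable is only invoked on the long-audio path, so short files never touch the model at all.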
I think the problem originates from the VAD itself. Doing this will solve it for short audio files, but what about long audio files where the issue occurs in a later window? I'm waiting for the user who reported the problem to upload the test file so I can confirm.
The issue is that the VAD is not able to detect it as a voiced region (it considers it noise). We could make `vad_offset` and `vad_onset` user-configurable to aid it. The audio file, for example, is: https://litter.catbox.moe/kyu2q8.wav
The file is 404; I've asked the user to re-upload it. I'm going to see if Silero detects it correctly, and then decide how to move forward.
As I expected, the problem is in the VAD model.

This is the sequential output:

Hey, Cardage, do you know Candace? If you don't know who I am, tell me! BOOM! I GOT DANGER!

This is the batched output using Silero VAD instead of pyannote:

Cardage, do you know Candace?

The batched version using pyannote finds no speech in the audio.

If you listen carefully to the audio, you'll find that only the first sentence makes sense, while the rest is so noisy that the model is just hallucinating.

This is the ground truth:

Hey Carnage, do you know Candice? YOUR MOTHER HUNG HERSELF, I got that adrenaline momentum
We could address it in detail in the Silero VAD PR. For the time being, we can skip VAD computation for shorter files with `duration < self.chunk_length`.
The VAD refactor PR is almost ready, so I suggest we merge this and address the issue in the next PR if it's still a concern.
To enhance flexibility, I think we could add a new argument to specify the VAD model (pyannote/silero). This argument would apply in both sequential and batched modes.

There has also been some recent feedback that Silero v5 is worse than v4 for some languages, so we could also let users choose a specific version of the VAD model by directly specifying its path.
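One way this proposal could look as a sketch (the `get_vad_model` name, backend keys, and default paths are illustrative assumptions, not the existing faster-whisper API):

```python
def get_vad_model(backend: str = "pyannote", model_path: str = None):
    """Hypothetical dispatcher for a user-selectable VAD backend.

    `backend` picks the VAD family, and `model_path` optionally pins a
    specific checkpoint (e.g. Silero v4 instead of v5); the default
    paths below are placeholders.
    """
    defaults = {
        "pyannote": "<default-pyannote-checkpoint>",
        "silero": "<default-silero-checkpoint>",
    }
    if backend not in defaults:
        raise ValueError(f"unknown VAD backend: {backend!r}")
    return backend, model_path or defaults[backend]
```

Passing an explicit `model_path` would override the bundled checkpoint, which covers the "pin Silero v4" use case.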
I have tested the newer version for speed. It is slightly slower on the YouTube-Commons subset but has similar results on an internal benchmarking dataset. Have a look at the comment on the VAD issue.
Force-pushed from 3ab66c4 to 16e85e5
@trungkienbkhn can I get a final review and merge if no changes are needed?
@MahmoudAshraf97, LGTM. Thanks for your contribution.
…rencePipeline` and fix word timestamps for batched inference (SYSTRAN#921)" This reverts commit d57c5b4.
…eline` and fix word timestamps for batched inference (SYSTRAN#921) * fix word timestamps for batched inference * remove hf pipeline
This PR removes the usage of `transformers.pipeline` from `BatchedInferencePipeline` because there is no need to use it in the first place; this simplifies the code and removes a dependency.

It also fixes #919, which was caused by a wrong `num_frames` argument when finding the alignments: it was assumed that inferring it from the encoder output size was sufficient, but this turned out to cause issues such as #919 when the actual segment size is much smaller than the inferred size.
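The fix can be illustrated with a sketch, assuming Whisper's standard 16 kHz sample rate and 160-sample hop (100 mel frames per second); `segment_num_frames` is an illustrative helper, not the PR's actual code:

```python
SAMPLE_RATE = 16000
HOP_LENGTH = 160  # one mel frame per 10 ms in Whisper's feature extractor

def segment_num_frames(num_samples: int) -> int:
    """Derive num_frames from the segment's real audio length.

    Inferring num_frames from the padded encoder output instead would
    report the full 30 s window (3000 frames) even for much shorter
    segments, skewing the word-timestamp alignment (#919).
    """
    return num_samples // HOP_LENGTH
```

For a 2-second segment this yields 200 frames rather than the 3000 the padded window would suggest, which is why the alignment drifted before the fix.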