Remove the usage of transformers.pipeline from BatchedInferencePipeline and fix word timestamps for batched inference #921

Merged · 3 commits · Jul 27, 2024

Conversation

@MahmoudAshraf97 (Collaborator) commented on Jul 21, 2024:

This PR removes the usage of `transformers.pipeline` from `BatchedInferencePipeline`, because there is no need to use it in the first place. This simplifies the code and removes a requirement.
It also fixes #919, which was caused by a wrong `num_frames` argument when finding the alignments: it was assumed that inferring it from the encoder output size was sufficient, but that turned out to cause issues such as #919 when the actual segment size is much smaller than the inferred size.
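For context, here is a minimal sketch of the idea behind the fix (the constants and the helper name are illustrative, not the merged code): the frame count passed to the alignment step should be derived from the real length of each segment, not from the padded encoder output.

# Illustrative sketch only, not the merged implementation.
# Whisper mel features use a 160-sample hop at 16 kHz (100 frames per second),
# and every input window is padded to 30 s (3000 frames) before encoding.
SAMPLE_RATE = 16000
HOP_LENGTH = 160

def content_frames(segment_num_samples: int) -> int:
    # Frames that contain actual audio, excluding the 30-second padding.
    return segment_num_samples // HOP_LENGTH

Passing a value like content_frames(...) as num_frames keeps word timestamps inside the real segment instead of letting them drift into the padding.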

@MahmoudAshraf97 changed the title from "Remove the usage of transformers.pipeline from BatchedInferencePipeline" to "Remove the usage of transformers.pipeline from BatchedInferencePipeline and fix word timestamps for batched inference" on Jul 21, 2024
@@ -621,10 +509,27 @@ def transcribe(
            all_language_probs=all_language_probs,
        )

        audio_segments, segments_metadata = self.audio_split(
            audio, vad_segments, sampling_rate
Contributor:
Can you make the check for shorter audio the first condition after `if not vad_segments` at line 439? This is to avoid VAD errors such as m-bain/whisperX#844 by first checking the condition:

if duration < self.chunk_length:
    vad_segments = [
        {"start": 0.0, "end": duration, "segments": [(0.0, duration)]}
    ]

Collaborator (Author):
I think the problem originates from the VAD itself. Doing this will solve it for short audio files, but what about long audio files where this issue occurs in a later window? I'm waiting for the user who reported the problem to upload the test file so I can confirm this.

Contributor:
The issue is that the VAD is not able to detect it as a voiced region (considering it a noisy region). We can make vad_onset and vad_offset user-configurable to aid it. The audio file, for example, is: https://litter.catbox.moe/kyu2q8.wav
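A possible shape for that option, as a sketch only (the class name, the whisperX-style parameter names, and the default values are assumptions, not this repo's API):

# Hypothetical sketch: expose the pyannote binarization thresholds
# as user-configurable options.
from dataclasses import dataclass

@dataclass
class PyannoteVadOptions:
    vad_onset: float = 0.5     # speech starts when probability rises above this
    vad_offset: float = 0.363  # speech ends when probability falls below this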

Collaborator (Author):
The file is 404; I've asked the user to re-upload it.
I'm going to see if Silero detects it correctly, and then decide how to move forward.

Collaborator (Author):
As I expected, the problem is in the VAD model.

This is the sequential output:
"Hey, Cardage, do you know Candace? If you don't know who I am, tell me! BOOM! I GOT DANGER!"

This is the batched output using Silero VAD instead of pyannote:
"Cardage, do you know Candace?"

The batched version using pyannote finds no speech in the audio.

If you listen carefully to the audio, you'll find that only the first sentence makes sense, while the rest is so noisy that the model is just hallucinating.
This is the ground truth:
"Hey Carnage, do you know Candice? YOUR MOTHER HUNG HERSELF, I got that adrenaline momentum"

Contributor:
We could address it in detail in the Silero VAD PR. For the time being, we can avoid VAD computation for shorter files with duration < self.chunk_length.
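For reference, a sketch of where such a guard could sit in transcribe; the surrounding names (audio, sampling_rate, self.chunk_length) are taken from the snippets above, and the placement itself is an assumption:

# Hypothetical placement inside BatchedInferencePipeline.transcribe.
duration = audio.shape[0] / sampling_rate
if duration < self.chunk_length:
    # Skip VAD entirely: treat the whole clip as a single voiced segment.
    vad_segments = [
        {"start": 0.0, "end": duration, "segments": [(0.0, duration)]}
    ]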

Collaborator (Author):
The VAD refactor PR is almost ready, so I suggest we merge this and address the issue in the next PR if it's still a concern.

Collaborator:
To enhance flexibility, I think we could add a new argument that allows specifying the VAD model (pyannote/silero). This argument would apply in both sequential and multi-batch modes.
There has also been some recent feedback that Silero v5 is worse than v4 in some languages, so we could also allow choosing a specific version of the VAD model by directly specifying its path.
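Illustratively, the suggested argument could look like the sketch below; the function and loader names are hypothetical, and none of them exist in the library as-is:

# Hypothetical sketch of the suggested option.
def get_vad_model(vad_model: str = "pyannote", vad_model_path: str | None = None):
    if vad_model == "silero":
        return load_silero_vad(vad_model_path)    # hypothetical loader
    if vad_model == "pyannote":
        return load_pyannote_vad(vad_model_path)  # hypothetical loader
    raise ValueError(f"Unknown vad_model: {vad_model}")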

@Jiltseb (Contributor) left a comment:

I have tested the newer version for speed. It is slightly slower on the YouTube-Commons subset but has similar results on an internal benchmarking dataset. Have a look at the comment on the VAD issue.

@MahmoudAshraf97 (Collaborator, Author):

@trungkienbkhn, can I get a final review, and a merge if no changes are needed?

@trungkienbkhn (Collaborator):

@MahmoudAshraf97, LGTM. Thanks for your contribution.

@trungkienbkhn merged commit d57c5b4 into SYSTRAN:master on Jul 27, 2024. 3 checks passed.
@MahmoudAshraf97 deleted the no_pipeline branch on Jul 27, 2024 at 16:36.
aligokalppeker added a commit to aligokalppeker/faster-whisper that referenced this pull request on Jul 29, 2024:
Revert "Remove the usage of `transformers.pipeline` from `BatchedInferencePipeline` and fix word timestamps for batched inference (SYSTRAN#921)"

This reverts commit d57c5b4.
shinlw added a commit to shinlw/faster-whisper that referenced this pull request on Sep 6, 2024:
Revert "Remove the usage of `transformers.pipeline` from `BatchedInferencePipeline` and fix word timestamps for batched inference (SYSTRAN#921)"

This reverts commit d57c5b4.
Jiltseb pushed a commit to mobiusml/faster-whisper that referenced this pull request on Oct 8, 2024:
Remove the usage of `transformers.pipeline` from `BatchedInferencePipeline` and fix word timestamps for batched inference (SYSTRAN#921)

* fix word timestamps for batched inference

* remove hf pipeline
Successfully merging this pull request may close these issues.

Segment timestamps are buggy in BatchedInferencePipeline