removing the need for `jsons` dependency #18
Conversation
* …leak in the resampler
* use GPU in feature extraction for 35x speedup
PR looks good in general. The following changes are important:

- Since torchaudio is not well-maintained (your comment), it poses a risk to remove PyAV completely from the system. Also, PyAV can still be used for media formats not supported by torchaudio.
- We should keep the `combine_words` function in transcribe to split the 30-second chunks into smaller segments based on VAD, providing better visualization and usability for subtitling (a rough illustration follows this list).
- Make sure to re-run the benchmark and get the WER numbers right (based on comments).
- Minor (see comments).
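For context only, a rough, hypothetical illustration of the kind of pause/VAD-based splitting meant in the second bullet; the actual `combine_words` function in transcribe may work quite differently:

```python
def split_by_pauses(words, max_gap=0.5):
    """Group word objects (with .start/.end/.word) into smaller segments,
    cutting whenever the silence between consecutive words exceeds max_gap seconds.
    Hypothetical helper for illustration only."""
    segments, current = [], []
    for word in words:
        if current and word.start - current[-1].end > max_gap:
            segments.append(current)
            current = []
        current.append(word)
    if current:
        segments.append(current)
    return segments
```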
faster_whisper/feature_extractor.py (Outdated)

```
whisper's original torch implementation with 1e-5 tolerance. Additionally, faster
feature extraction option using kaldi fbank features are available if torchaudio is
available.

whisper's original torch implementation with 1e-5 tolerance.
```
This is not whisper's original torch implementation. Update the docstring to specify the torchaudio-based FE.
When testing, I encountered a problem with word timestamps; they are a bit weird:

```python
model = WhisperModel("large-v3", device="cuda")
segments, info = model.transcribe(audio_path, word_timestamps=True)
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
```

Could you take a look?
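As a side note, the word-level timestamps themselves can be printed by replacing the segment-level loop with a nested one (`segment.words` is populated when `word_timestamps=True` in upstream faster-whisper; assuming the fork keeps that API):

```python
for segment in segments:
    for word in segment.words:
        # word.start / word.end are the per-word timestamps being discussed here
        print("[%.2fs -> %.2fs] %s" % (word.start, word.end, word.word))
```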
Yes, the error in timestamps appears to repeat after every batch; it starts a lot earlier than the final timestamp of the previous batch.
I managed to reproduce this on several audio files, but after pulling a new copy of the repo I couldn't anymore. Can someone test again and provide a minimal example?
Used the example.

Batched version: did not see the timestamps issue for this test. The timestamps are still not sparse; are you yet to commit those changes?

Sequential version: surprisingly, the timestamps here are wrong, as it ends at 230 sec, longer than the total length of the audio.
Ah, I had to remove it and install the latest one again; it shows similar results now. I will compare with previous versions and get back.
Yes, I managed to get the batched version to work with …
It throws an error:
I still have uncommitted changes, so maybe it's fixed in them.
@hargunmujral running multiple transcribe calls using the same pipeline will cause this for sure; the correct method is to use either a single pipeline with the maximum batch size or multiple pipelines referencing a single model. @Jiltseb I'm ready
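For illustration, a minimal sketch of the second option, assuming the fork exposes a `BatchedInferencePipeline` that wraps an existing `WhisperModel` (class name, constructor arguments, and the placeholder file path are assumptions; they may differ in this repo):

```python
from faster_whisper import WhisperModel, BatchedInferencePipeline

# Load the model once on the GPU...
model = WhisperModel("large-v3", device="cuda")

# ...and give each concurrent consumer its own pipeline referencing that single model,
# instead of reusing one pipeline object for parallel transcribe calls.
pipelines = [BatchedInferencePipeline(model=model) for _ in range(2)]
segments, info = pipelines[0].transcribe("audio_a.wav", batch_size=16)
```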
Technically speaking, the words are not missing, because they were never generated. Given the same encoder input:

```python
result = self.model.generate(encoder_output, [prompt] * batch_size)
tokenizer.decode(result[0].sequences_ids[0])  # 62 tokens
>>> ' Now, I am such a strong believer of the conservation of mechanical energy that I am willing to put my life on the line. If I release that bulb from a certain height, then that bulb can never come back to a point where the height is any larger.'
```

```python
result = self.model.generate(encoder_output, [prompt + [50364]] * batch_size)  # no-timestamps token added to the prompt
tokenizer.decode(result[0].sequences_ids[0])  # 61 tokens
>>> ' Now, I am such a strong believer of the conservation of mechanical energy that I am willing to put my life on the line. If I release that bulb from a certain height, then that bulb can never come back to a point where the height is any larger. If I release it from this height,'
```

Note that the length of prompt + generation is 65 tokens in both cases. I guess this is a problem on …
I see the issue on the sequential side as well (it hallucinates in some places even though it does not miss them). But the original faster-whisper version (sequential) does a perfect job; it does not matter if it is …
reimplementing the original feature extraction in …
We did internal benchmarking to compare them already, and the results are the reason why we proposed having both FEs in the package, based on the speed vs. WER trade-off. I would still recommend keeping both versions to avoid new transcription errors popping up once the main PR is merged. @trungkienbkhn what is your opinion?
I suggest rerunning the benchmarks (9f78b36 vs d95c7a6), as there should be no speed difference between …
Makes sense. I am waiting for confirmation from @trungkienbkhn before a final review.
LGTM, some cosmetic changes are added in comments.
```python
        feature_size=80,
        sampling_rate=16000,
        hop_length=160,
        chunk_length=30,
        n_fft=400,
    ):
        if device == "auto":
```
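A minimal sketch of what the `auto` branch presumably resolves to (assumed behavior; the PR's exact logic is not visible in this excerpt):

```python
import torch

def resolve_device(device: str = "auto") -> str:
    # Prefer CUDA when a GPU is available, otherwise fall back to CPU.
    if device == "auto":
        return "cuda" if torch.cuda.is_available() else "cpu"
    return device
```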
What is the difference in speed if FE is performed on CPU vs GPU? This needs to be evaluated before setting the default (for both short and long audio) in the batched and sequential cases.
```
20 s audio
5.51 ms ± 106 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # CPU
1.1 ms ± 506 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)      # GPU

10 min audio
76.3 ms ± 2.62 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  # CPU
8.06 ms ± 335 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # GPU
```

Around 5x speedup for short audio and 10x for long audio.
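For reproducibility, numbers like these can be gathered roughly as follows (assuming the fork's `FeatureExtractor` accepts a `device` argument and a float32 waveform, as the diff above suggests; the `%timeit` lines are meant for IPython/Jupyter):

```python
import numpy as np
from faster_whisper.feature_extractor import FeatureExtractor

audio = np.random.randn(16000 * 20).astype(np.float32)  # 20 s of 16 kHz audio

fe_cpu = FeatureExtractor(device="cpu")
fe_gpu = FeatureExtractor(device="cuda")

# In IPython/Jupyter:
# %timeit fe_cpu(audio)
# %timeit fe_gpu(audio)
```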
I ran benchmarks for this with an H100 GPU.

Sequential FW:

Batched FW:
And WER with Batched FW?
I just updated the WER benchmark. For torchaudio kaldi, I replaced the current logic with the logic in here.
Are these numbers using …?

edit: This is a fork with updated parameters, ready to run directly.
I think it is …

The only change was for WER with torchaudio-kaldi, from 6.5 to 6.2, but I would consider it a negligible difference, as a seasoned speech researcher (many reasons behind this intuition). In my opinion, the performance is similar for both short and long-form audio, and we should be able to choose torchaudio.
Yes, I used …
I have done this comparison before the PR and now reran the batched whisper on our youtube-commons version, just to make sure the speed and WER stay the same (in the original batched whisper PR, the default feature extraction was torchaudio).

I also did the comparison between sequential versions:

Note: HF whisper also now uses torchaudio-based FE whenever it is available.
I ran the WER benchmark with the youtube-commons-asr dataset. There is not much difference.

Yes, HF also uses torchaudio kaldi. So I think we should keep both versions of FE: the current one and a new option …
HF doesn't use that implementation for whisper; it's used for other models. These are the implementations used for whisper, which are a numpy and a torch implementation of the original whisper implementation.

Regardless of that, HF can maintain multiple implementations because it's a multi-framework library and whisper can still be used without having pytorch installed, hence the need for a numpy implementation. This is not the case for faster-whisper, where torch must exist. Having two implementations is counterintuitive, since they serve the same purpose with very close performance and no clear guide on when to use one over the other; one can think of that as functionality duplication, which is not good practice.

Another point that got my attention is that HF kaldi uses 16-bit signed integers while we use 32-bit floats. Do you think that would make a difference?

This is the result without scaling:

This is the result with scaling:
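For reference, the scaling difference in question looks roughly like this (a sketch around `torchaudio.compliance.kaldi.fbank`; the exact parameters used in the PR may differ):

```python
import torch
import torchaudio.compliance.kaldi as kaldi

waveform = torch.randn(1, 16000)  # 1 s of float32 audio in [-1, 1]

# Without scaling: fbank computed directly on the float32 samples.
feats_float = kaldi.fbank(waveform, num_mel_bins=80, sample_frequency=16000)

# With scaling: emulate the 16-bit signed-integer range that Kaldi/HF assume.
feats_scaled = kaldi.fbank(waveform * 32768.0, num_mel_bins=80, sample_frequency=16000)
```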
As @MahmoudAshraf97 mentioned, my opinion is to go with ONLY torchaudio since it has the same functionality at an increased speed. I would say we can stick with …
After several different benchmark runs, I noticed that torchaudio kaldi is faster, while WER doesn't change much (it increased slightly but not significantly compared with the current torch FE). So I think speed should be a priority; that's also why the project name is faster-whisper. In my opinion, I agree with just using torchaudio.
Speed is indeed a priority, but to a certain extent IMO, …
@trungkienbkhn I think there is a misunderstanding here:

Now we have to select between torchaudio FE and torch FE; both have similar WER and speed (negligible differences). I don't see any problem using the current implementation as it is. If there are no objections, let's finalize on this and move to the main PR for any further reviews/comments before merging.
@trungkienbkhn I see that the main PR is still not merged. When can this be done?
@trungkienbkhn Discussing the further changes here. There is a set of different opinions regarding the latest PR we added. The reason was the apparent slowness for CPU versions when there is only a limited number of cores, and there was dissatisfaction owing to the introduction of additional dependencies. As you already know, @MahmoudAshraf97 has already removed several of them, including transformers and pyannote (via an in-progress PR, not referencing it here now). @MahmoudAshraf97 and I also implemented versions without torch, so it finally has the same dependencies as the original FW. The slowness is easily addressable by setting cpu_threads back to 0 as the default. So a set of follow-up PRs should stabilize the big PR that got merged.
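For reference, the `cpu_threads` setting mentioned above is an existing `WhisperModel` constructor argument; a trivial example of restoring the old default:

```python
from faster_whisper import WhisperModel

# cpu_threads=0 leaves the CPU thread count to the backend's default behavior.
model = WhisperModel("large-v3", device="cpu", cpu_threads=0)
```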
This PR removes:

- the `jsons` dependency
- the `encode_batched` function
- `PyAV` (using `torchaudio` instead), which fixes the memory leak in the resampler

The diff seems to be messed up for a reason, but the changes are minimal.