
removing the need for jsons dependency #18

Merged
merged 25 commits into from
Jul 1, 2024

Conversation


@MahmoudAshraf97 MahmoudAshraf97 commented Jun 20, 2024

  • removing the need for the jsons dependency
  • remove tokenizer reinitialization when the language or task is changed, since they can be changed directly without creating a new tokenizer
  • remove the need for a separate encode_batched function
  • enable word timestamps using the original functions
  • remove PyAV and use torchaudio instead; this fixes the memory leak in the resampler
  • use GPU in feature extraction for a 35x speedup. I also removed the old feature extraction since it has no use case; there's no need to keep it when we have a faster one with the same accuracy
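On the tokenizer point: a minimal sketch of the idea, using a hypothetical `Tokenizer` class (not the actual faster-whisper tokenizer wrapper) to show updating language/task in place instead of rebuilding the object:

```python
class Tokenizer:
    """Hypothetical stand-in for a tokenizer that carries language/task state."""

    def __init__(self, language, task):
        self.language = language
        self.task = task


tok = Tokenizer("en", "transcribe")

# Instead of constructing a new tokenizer whenever the language or task
# changes, the attributes can be updated directly on the existing instance:
tok.language = "fr"
tok.task = "translate"
```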

the diff seems to be messed up for some reason, but the changes are minimal

@MahmoudAshraf97
Author

@Jiltseb

@Jiltseb Jiltseb self-requested a review June 24, 2024 07:21
Collaborator

@Jiltseb Jiltseb left a comment

PR looks good in general. The following changes are important:

  1. Since torchaudio is not well maintained (per your comment), it poses a risk to remove PyAV completely from the system. Also, PyAV can still be used for media formats not supported by torchaudio.

  2. We should keep the combine_words function in transcribe to split the 30-second chunks into smaller segments based on VAD, providing better visualization and usability for subtitling.

  3. Make sure to re-run the benchmark and get the WER numbers right (based on comments).

  4. Minor (see comments).

faster_whisper/audio.py (review thread, resolved)

Docstring excerpt under discussion:

    whisper's original torch implementation with 1e-5 tolerance. Additionally, faster
    feature extraction option using kaldi fbank features are available if torchaudio is
    available.

    whisper's original torch implementation with 1e-5 tolerance.
Collaborator

This is not whisper's original torch implementation. Update the docstring to specify the torchaudio-based FE.

faster_whisper/transcribe.py (review thread, resolved)
faster_whisper/transcribe.py (review thread, resolved)
tests/test_transcribe.py (review thread, resolved)
faster_whisper/transcribe.py (review thread, outdated, resolved)
@Jiltseb Jiltseb requested a review from trungkienbkhn June 25, 2024 07:35
@trungkienbkhn

trungkienbkhn commented Jun 25, 2024

When testing, I encountered a problem with word timestamps; they are a bit weird:
[68.42s -> 68.92s] => [41.48s -> 42.58s] => [68.92s -> 71.76s]
[140.60s -> 146.26s] => [121.52s -> 122.52s] => [126.24s -> 127.54s]
...

[0.00s -> 1.88s]  Would you see what you can get for this, please?
[5.36s -> 7.44s]  Not the royal ring, your highness.
[10.50s -> 10.94s]  Shh!
[12.12s -> 12.60s]  Do you want to help out?
[27.38s -> 28.44s]  Excuse me.
[45.36s -> 49.24s]  Is that man actually royalty?
[52.74s -> 54.02s]  No, madame.
[55.54s -> 57.40s]  But you called him your highness.
[59.42s -> 60.74s]  It was a faux pas.
[62.42s -> 63.34s]  Please forget about it.
[64.20s -> 67.12s]  You can trust me.
[68.42s -> 68.92s]  I won't tell.
[41.48s -> 42.58s]  Madame, I am...
[68.92s -> 71.76s]  Extraordinary man of destiny.
[92.46s -> 96.00s]  Your highness.
[100.20s -> 101.82s]  Your highness, don't be alarmed.
[105.18s -> 106.26s]  I can be trusted.
[108.82s -> 110.56s]  Are you one of my subjects?
[112.96s -> 113.60s]  No.
[114.38s -> 118.64s]  I'm an American, Fanny Eubanks of Omaha.
[121.44s -> 122.66s]  I couldn't help overhearing.
[126.06s -> 126.38s]  If you're in trouble and there's some way I can help,
[98.92s -> 101.24s]  thank you, but I cannot accept.
[105.72s -> 108.84s]  You've already risked too much just in speaking to me.
[112.22s -> 118.18s]  I still want to help.
[123.90s -> 127.10s]  You must understand, I have powerful enemies.
[131.92s -> 135.42s]  They may be watching even as we...
[140.60s -> 146.26s]  My God, you're attractive.
[121.52s -> 122.52s]  It's late.
[126.24s -> 127.54s]  I must go.
[128.92s -> 134.88s]  As he left?
[141.82s -> 143.78s]  Yes, just a moment to calm.
[148.94s -> 149.74s]  Good.
[151.90s -> 152.28s]  Please.
[153.72s -> 158.90s]  You must tell me where he lives.
[164.14s -> 166.36s]  I feel it only fair to wonder.
[168.64s -> 169.78s]  I know.
[170.68s -> 174.12s]  He told me he has powerful enemies.
[176.62s -> 178.06s]  There may also be an emotional risk.
[182.24s -> 182.56s]  You see, his highness has been a widower for five years.
[158.92s -> 160.20s]  For five years?
[174.20s -> 177.00s]  Please, your highness.
[181.04s -> 187.32s]  Fanny, the Freedom Fighters thank you.
[195.52s -> 197.20s]  This is for the overhead.
[200.52s -> 202.24s]  This goes to you, Arthur.
[206.08s -> 208.70s]  This goes to you, Andre.
[211.58s -> 212.50s]  This goes to me, which means it's time to go to Zurich.
[188.92s -> 190.18s]  Thank you.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda")
segments, info = model.transcribe(audio_path, word_timestamps=True)
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

Could you take a look?

@Jiltseb
Collaborator

Jiltseb commented Jun 25, 2024

Yes, the error in timestamps seems to repeat after every batch: each batch starts a lot earlier than the final timestamp of the previous batch.
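The symptom described above is consistent with chunk-relative timestamps not being shifted back into absolute time. A minimal sketch of the bookkeeping involved (hypothetical data layout, not the actual faster-whisper internals):

```python
def shift_to_absolute(batch_segments, chunk_starts):
    """Shift per-chunk segment times by each chunk's start offset.

    batch_segments: one list of (start, end, text) tuples per decoded chunk;
    chunk_starts: where each chunk begins in the original audio, in seconds.
    If a chunk's offset is dropped, its timestamps restart near zero, which
    looks exactly like the jumps reported above.
    """
    absolute = []
    for segments, chunk_start in zip(batch_segments, chunk_starts):
        for start, end, text in segments:
            absolute.append((start + chunk_start, end + chunk_start, text))
    return absolute


result = shift_to_absolute(
    [[(0.0, 2.5, "first chunk")], [(0.4, 1.9, "second chunk")]],
    [0.0, 30.0],
)
```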

@MahmoudAshraf97
Author

I managed to reproduce this on several audio files, but after pulling a fresh copy of the repo I couldn't anymore. Can someone test again and provide a minimal example?

@Jiltseb
Collaborator

Jiltseb commented Jun 25, 2024

I used the example faster-whisper/tests/data/physicsworks.wav to test both the batched and sequential versions.

Batched version:

[0.00s -> 27.44s] Now I want to return to the conservation of mechanical energy. I have here a pendulum. I have an object that weighs 15 kilograms, and I can lift it up one meter, which I have done now. That means I've done work. MGH is the work I have done, believe me. I've increased the potential energy of this object. 15 times 10 is about 150 joules. If I let it fall,
[28.11s -> 53.91s] then that will be converted to kinetic energy. If I would let it swing from one meter height, and you would be there, and it would hit you, you'd be dead. 150 joules is enough to kill you. They use these devices. It's called a wrecker ball. They use them to demolish buildings. You lift up a very heavy object, even heavier than this,
[54.35s -> 82.93s] and then you let it go, you swing it, thereby converting gravitational potential energy into kinetic energy, and that way you can demolish a building. You just let it hit... and it breaks a building. And that's the whole idea of wrecking. So you're using, then, the conversion of gravitational potential energy to kinetic energy.
[84.21s -> 113.45s] Now, I am such a strong believer of the conservation of mechanical energy that I am willing to put my life on the line. If I release that bulb from a certain height, then that bulb can never come back to a point where the height is any larger. If I release it from this height,
[114.17s -> 137.71s] and it swings, then when it reaches here, it could not be higher. There is a conversion from gravitational potential energy to kinetic energy back to gravitational potential energy, and it will come to a stop here. And when it swings back, it should not be able to reach any higher, provided that I do not give this object an initial speed when I stand here.
[141.28s -> 169.74s] the conservation of mechanical energy for 100%. I may not trust myself. I'm going to release this object, and I hope I will be able to do it at zero speed, so that when it comes back, it may touch my chin, but it may not crush my chin. I want you to be extremely quiet, because this is no joke. If I don't succeed in giving it zero speed,
[170.18s -> 187.62s] Then, this will be my last lecture. I will close my eyes. I don't want to see this. So please be very quiet. I almost didn't sleep all night. 3, 2, 1, 0.
[200.73s -> 202.73s] Physics works, and I'm still alive.

I did not see the timestamps issue in this test. The timestamps are still not sparse, though; are you yet to commit those changes?

Sequential version:

[0.00s -> 8.98s]  Now I want to return to the conservation of mechanical energy. I have here a pendulum.
[20.48s -> 25.00s]  I have an object that weighs 15 kilograms, and I can lift it up one meter, which I have
[30.22s -> 36.12s]  done now. That means I've done work. MGH is the work I have done, believe me. I've increased
[42.38s -> 48.14s]  the potential energy of this object. 15 times 10 is about 150 joules. If I let it fall,
[48.14s -> 53.56s]  I use them to demolish buildings. You lift up a very heavy object, even heavier than
[59.98s -> 66.12s]  this, and then you let it go. You swing it, thereby converting gravitational potential
[73.08s -> 82.42s]  energy into kinetic energy, and that way you can demolish a building. You just let it hit,
[93.42s -> 99.78s]  and it breaks a building. And that's the whole idea of wrecking. So you are using then the
[100.10s -> 110.40s]  that bob from a certain height, then that bob can never come back to a point where the height
[120.98s -> 128.02s]  is any larger. If I release it from this height, and it swings, then when it reaches here,
[135.30s -> 140.78s]  it could not be higher. There is a conversion from gravitational potential energy to kinetic
[146.44s -> 149.92s]  energy back to gravitational potential energy, and it will come to a stop here. And when it
[156.32s -> 163.90s]  comes back, it may touch my chin, but it may not crush my chin. I want you to be extremely
[178.48s -> 185.58s]  quiet, because this is no joke. If I don't succeed in giving it zero speed, then this
[193.10s -> 200.52s]  will be my last lecture. I will close my eyes. I don't want to see this. So please be very
[201.40s -> 201.58s]  quiet.
[230.28s -> 230.76s]  Physics works, and I'm still alive.

Surprisingly, the timestamps here are wrong: the output ends at 230 s, longer than the total length of the audio.

@Jiltseb
Collaborator

Jiltseb commented Jun 25, 2024

Ah, I had to remove and install the latest one again; it shows similar results now. I will compare with previous versions and get back.

@MahmoudAshraf97
Author

Yes, I managed to get the batched version to work with without_timestamps=False, and it outputs the same segments as the non-batched version, but it's much slower, so I'm still investigating. From what I understand, the generation step is three times or more slower when timestamp tokens are generated, and this slowdown increases with batch size.

@Jiltseb
Collaborator

Jiltseb commented Jun 25, 2024

It throws an error at `[token for token in subsegment["tokens"] if token < tokenizer.eot]`: `TypeError: string indices must be integers` when word_timestamps is set to True in the batched version. Can you check that?
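For reference, that `TypeError` usually means the loop variable is a string rather than a dict, e.g. when iterating over a single subsegment dict instead of a list of subsegments. A minimal reproduction (generic Python, not the actual faster-whisper code path):

```python
subsegments = {"tokens": [50365, 440, 50866]}  # a single dict, not a list of dicts

try:
    # Iterating over a dict yields its keys (strings), so
    # subsegment["tokens"] indexes a *string* and raises TypeError.
    bad = [[t for t in subsegment["tokens"]] for subsegment in subsegments]
except TypeError as exc:
    message = str(exc)

# Wrapping the dict in a list restores the intended iteration over dicts.
good = [[t for t in subsegment["tokens"]] for subsegment in [subsegments]]
```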

@MahmoudAshraf97
Author

MahmoudAshraf97 commented Jun 25, 2024

I still have uncommitted changes, so maybe it's fixed in them.
This is the result for the batched version with both segment timestamps and word timestamps, large-v3 model.
Notice that the last subsegment in each segment is skipped, so I'm still working on it.

[0.00s -> 8.86s]  Now I want to return to the conservation of mechanical energy. I have here a pendulum.
[20.36s -> 24.88s]  I have an object that weighs 15 kilograms and I can lift it up one meter, which I have
[29.90s -> 35.12s]  done now. That means I've done work. MGH is the work I have done, believe me. I've increased
[28.11s -> 30.57s]  then that will be converted to kinetic energy.
[35.83s -> 40.91s]  If I would let it swing from one meter height,
[46.21s -> 50.51s]  and you would be there and it would hit you, you'd be dead.
[54.17s -> 57.41s]  150 joules is enough to kill you.
[61.63s -> 64.99s]  They use these devices, they're called a racquetball,
[68.59s -> 68.91s]  they use them to demolish buildings.
[54.35s -> 59.77s]  And then you let it go, you swing it, thereby converting gravitational potential energy
[66.69s -> 71.85s]  into kinetic energy and that way you can demolish a building.
[77.51s -> 85.07s]  You just let it hit and it breaks a building and that's the whole idea of wrecking.
[83.87s -> 93.17s]  Now, I am such a strong believer of the conservation of mechanical energy that I am willing to
[103.93s -> 108.49s]  put my life on the line.
[113.61s -> 124.47s]  If I release that bulb from a certain height, then that bulb can never come back to a point
[135.53s -> 136.53s]  where the height is any larger.
[114.17s -> 119.85s]  and it swings, then when it reaches here it could not be higher. There is a conversion
[126.39s -> 131.03s]  from gravitational potential energy to kinetic energy back to gravitational potential energy
[135.99s -> 140.47s]  and it will come to a stop here. And when it swings back it should not be able to reach
[141.28s -> 144.38s]  the conservation of mechanical energy, for 100%.
[150.30s -> 154.32s]  I may not trust myself.
[158.32s -> 160.72s]  I'm going to release this object,
[163.32s -> 167.16s]  and I hope I will be able to do it at zero speed,
[171.20s -> 174.24s]  so that when it comes back, it may touch my chin,
[177.56s -> 181.00s]  but it may not crush my chin.
[185.04s -> 186.74s]  I want you to be extremely quiet, because this is no joke.
[170.18s -> 178.52s]  this will be my last lecture. I will close my eyes. I don't want to see this. So please
[200.73s -> 202.49s]  Physics works, and I'm still alive.

without_timestamps=True

[0.00s -> 27.24s]  Now I want to return to the conservation of mechanical energy. I have here a pendulum. I have an object that weighs 15 kilograms and I can lift it up one meter, which I have done now. That means I have done work. MGH is the work I have done, believe me. I have increased the potential energy of this object. Fifteen times ten is about 150 joules. If I let it fall,
[28.11s -> 53.79s]  then that will be converted to kinetic energy. If I would let it swing from one meter height and you would be there and it would hit you, you'd be dead. 150 joules is enough to kill you. They use these devices, it's called a racquetball, they use them to demolish buildings. You lift up a very heavy object, even heavier than this,
[54.35s -> 82.81s]  And then you let it go, you swing it, thereby converting gravitational potential energy into kinetic energy. And that way you can demolish a building. You just let it hit and it breaks a building. And that's the whole idea of wrecking. So you are using then the conversion of gravitational potential energy to kinetic energy.
[83.87s -> 113.27s]  Now, I am such a strong believer of the conservation of mechanical energy that I am willing to put my life on the line. If I release that bulb from a certain height, then that bulb can never come back to a point where the height is any larger. If I release it from this height,
[114.17s -> 140.35s]  and it swings then when it reaches here it could not be higher. There is a conversion from gravitational potential energy to kinetic energy back to gravitational potential energy and it will come to a stop here. And when it swings back it should not be able to reach any higher provided that I do not give this object an initial speed when I stand here. I trust
[141.28s -> 169.40s]  the conservation of mechanical energy for 100%. I may not trust myself. I'm going to release this object, and I hope I will be able to do it at zero speed, so that when it comes back, it may touch my chin, but it may not crush my chin. I want you to be extremely quiet, because this is no joke. If I don't succeed in giving it zero speed,
[170.44s -> 187.50s]  then this will be my last lecture. I will close my eyes. I don't want to see this. So please be very quiet. I almost didn't sleep all night. Three, two, one, zero.
[200.73s -> 202.49s]  Physics works, and I'm still alive.

@MahmoudAshraf97
Author

@hargunmujral running multiple transcribe calls using the same pipeline will cause this for sure; the correct method is to use either a single pipeline with the maximum batch size, or multiple pipelines referencing a single model.

@Jiltseb I'm ready

@Jiltseb Jiltseb self-requested a review June 27, 2024 20:03
@Jiltseb
Collaborator

Jiltseb commented Jun 28, 2024

@MahmoudAshraf97 I just compared the default batched transcription and its variant with both word_timestamps=True and without_timestamps=False set. There are some missing words in the second version. Can you check?

Default:

[0.00s -> 27.79s]  Now I want to return to the conservation of mechanical energy. I have here a pendulum. I have an object that weighs 15 kilograms and I can lift it up one meter, which I have done now. That means I have done work. mgh is the work I have done, believe me. I have increased the potential energy of this object. Fifteen times ten is about 150 joules. If I let it fall,
[28.11s -> 54.20s]  then that will be converted to kinetic energy. If I would let it swing from one meter height and you would be there and it would hit you, you'd be dead. 150 joules is enough to kill you. They use these devices, it's called a racquetball, they use them to demolish buildings. You lift up a very heavy object, even heavier than this,
[54.35s -> 83.17s]  And then you let it go, you swing it, thereby converting gravitational potential energy into kinetic energy. And that way you can demolish a building. You just let it hit and it breaks a building. And that's the whole idea of wrecking. So you are using then the conversion of gravitational potential energy to kinetic energy.
[83.87s -> 113.77s]  Now, I am such a strong believer of the conservation of mechanical energy that I am willing to put my life on the line. If I release that bulb from a certain height, then that bulb can never come back to a point where the height is any larger. If I release it from this height,
[114.17s -> 140.95s]  and it swings then when it reaches here it could not be higher. There is a conversion from gravitational potential energy to kinetic energy back to gravitational potential energy and it will come to a stop here. And when it swings back it should not be able to reach any higher provided that I do not give this object an initial speed when I stand here. I trust
[141.28s -> 170.04s]  the conservation of mechanical energy for 100%. I may not trust myself. I'm going to release this object, and I hope I will be able to do it at zero speed, so that when it comes back, it may touch my chin, but it may not crush my chin. I want you to be extremely quiet, because this is no joke. If I don't succeed in giving it zero speed,
[170.18s -> 187.90s]  then this will be my last lecture. I will close my eyes. I don't want to see this. So please be very quiet. I almost didn't sleep all night. Three, two, one, zero.
[200.73s -> 202.90s]  Physics works, and I'm still alive.
Elapsed time: 3.1877

Default with word_timestamps=True and without_timestamps=False:

[0.00s -> 8.86s]  Now I want to return to the conservation of mechanical energy. I have here a pendulum.
[9.74s -> 14.52s]  I have an object that weighs 15 kilograms and I can lift it up one meter, which I have
[14.52s -> 20.10s]  done now. That means I have done work. MGH is the work I have done, believe me. I have
[20.10s -> 27.26s]  increased the potential energy of this object. 15 x 10 is about 150 joules. If I let it fall,
[28.11s -> 30.57s]  then that will be converted to kinetic energy.
[31.59s -> 36.35s]  If I would let it swing from one meter height,
[37.03s -> 39.49s]  and you would be there and it would hit you, you'd be dead.
[40.65s -> 42.53s]  150 joules is enough to kill you.
[43.89s -> 46.91s]  They use these devices, it's called a racquetball,
[47.47s -> 49.21s]  they use them to demolish buildings.
[49.97s -> 53.79s]  You lift up a very heavy object, even heavier than this,
[54.35s -> 59.79s]  And then you let it go, you swing it, thereby converting gravitational potential energy
[60.35s -> 64.95s]  into kinetic energy and that way you can demolish a building.
[65.39s -> 73.53s]  You just let it hit and it breaks a building and that's the whole idea of wrecking.
[74.95s -> 82.81s]  So you are using then the conversion of gravitational potential energy to kinetic energy.
[83.87s -> 93.17s]  Now, I am such a strong believer of the conservation of mechanical energy that I am willing to
[93.83s -> 95.73s]  put my life on the line.
[97.85s -> 108.71s]  If I release that bulb from a certain height, then that bulb can never come back to a point
[109.47s -> 110.93s]  where the height is any larger.
**[114.17s -> 119.85s]  and it swings,** then when it reaches here it could not be higher. There is a conversion : MISSING WORD
[119.85s -> 124.95s]  from gravitational potential energy to kinetic energy back to gravitational potential energy
[124.95s -> 129.65s]  and it will come to a stop here. And when it swings back it should not be able to reach
[130.22s -> 137.55s]  any higher, provided that I do not give this object an initial speed when I stand here.
**[141.28s -> 144.36s]  the conservation of mechanical energy for 100%** : MISSING WORD
[145.42s -> 146.84s]  I may not trust myself.
[149.44s -> 151.16s]  I'm going to release this object,
[151.74s -> 154.54s]  and I hope I will be able to do it at zero speed,
[155.62s -> 158.34s]  so that when it comes back, it may touch my chin,
[159.02s -> 160.32s]  but it may not crush my chin.
[162.24s -> 165.44s]  I want you to be extremely quiet, because this is no joke.
[166.30s -> 169.40s]  If I don't succeed in giving it zero speed,
[170.42s -> 178.52s]  this will be my last lecture. I will close my eyes. I don't want to see this. So please
[178.52s -> 187.50s]  be very quiet. I almost didn't sleep all night. Three, two, one, zero.
[200.73s -> 202.51s]  Physics works, and I'm still alive.
Elapsed time: 5.2835

@MahmoudAshraf97
Author

Technically speaking, the words are not missing, because they were never generated, given the same encoder input:

result = self.model.generate(encoder_output, [prompt] * batch_size)
tokenizer.decode(result[0].sequences_ids[0]) # 62 tokens
>>> ' Now, I am such a strong believer of the conservation of mechanical energy that I am willing to put my life on the line. If I release that bulb from a certain height, then that bulb can never come back to a point where the height is any larger.'
result = self.model.generate(encoder_output, [prompt+[50364]] * batch_size) # no timestamps token added to the prompt
tokenizer.decode(result[0].sequences_ids[0]) # 61 tokens
>>> ' Now, I am such a strong believer of the conservation of mechanical energy that I am willing to put my life on the line. If I release that bulb from a certain height, then that bulb can never come back to a point where the height is any larger. If I release it from this height,'

Note that the length of prompt + generation is 65 tokens in both cases.
The strange thing is that when timestamps are enabled, appending the tokenizer.language or tokenizer.language token to the prompt seems to fix the issue in this specific segment, but not in all segments.

I guess this is a problem on the ctranslate2 side.
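If helpful, here is the token-budget reading of those two runs (prompt lengths inferred from the stated totals, not measured):

```python
# Both runs above end with prompt + generation = 65 tokens, which points
# at a shared decoding length budget rather than missing audio content.
total_tokens = 65
generated = {"plain prompt": 62, "prompt with timestamp token": 61}

inferred_prompt_len = {name: total_tokens - n for name, n in generated.items()}
# Every token added to the prompt appears to cost one generated token.
```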

@Jiltseb
Collaborator

Jiltseb commented Jun 28, 2024

I see the issue on the sequential side as well (it hallucinates in some places, even though it does not miss them). But the original faster-whisper version (sequential) does a perfect job; it does not matter whether without_timestamps is set or not. Interestingly, if I enable torchaudio feature extraction, it still hallucinates in sequential mode. The features have different dynamic ranges.

  1. I would suggest keeping the enable_ta_fe flag and the original feature extraction for the sequential version.
  2. For the batched version, the issue can be alleviated via VAD-based segmentation instead of using without_timestamps=True, or by still using the original feature extraction (with a speed trade-off).
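The VAD-based segmentation idea in point 2 can be sketched as greedy packing of detected speech spans into chunks no longer than 30 s (a hypothetical helper, not the library's actual VAD code):

```python
def chunk_by_vad(speech_spans, max_len=30.0):
    """Greedily pack VAD speech spans [(start, end), ...] into chunks whose
    total extent never exceeds max_len seconds, so each chunk can be decoded
    independently in a batch while keeping timestamps chunk-relative."""
    chunks, current = [], []
    for start, end in speech_spans:
        # start a new chunk if adding this span would exceed the budget
        if current and end - current[0][0] > max_len:
            chunks.append(current)
            current = []
        current.append((start, end))
    if current:
        chunks.append(current)
    return chunks


chunks = chunk_by_vad([(0.0, 10.0), (12.0, 25.0), (27.0, 40.0)])
```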

@MahmoudAshraf97
Author

Reimplementing the original feature extraction in torch instead of numpy can be faster than the kaldi implementation, but it results in a slightly higher WER on the included benchmark, so we keep the current transcription performance of faster-whisper without the speed trade-off. By the way, the kaldi implementation might perform worse on the included file, but it performed better on other files, so I guess the final choice between the two should be based on extensive benchmarking.
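For context on what "the original feature extraction" computes after the mel filterbank: to my understanding, whisper's reference implementation log-compresses, clamps the dynamic range, and rescales. A stdlib-only sketch of just that post-processing step (the STFT and mel projection are omitted; the different dynamic ranges mentioned above come from exactly this kind of normalization):

```python
import math


def log_mel_postprocess(mel_power):
    """Post-process mel power values the way whisper's reference FE does
    (a sketch over a flat list; the real code operates on 2-D tensors)."""
    # log10 with a floor to avoid log(0)
    log_spec = [math.log10(max(p, 1e-10)) for p in mel_power]
    # clamp the dynamic range to 8 (in log10 units) below the maximum
    floor = max(log_spec) - 8.0
    log_spec = [max(v, floor) for v in log_spec]
    # rescale to roughly [-1, 1]
    return [(v + 4.0) / 4.0 for v in log_spec]


features = log_mel_postprocess([1e-12, 1.0, 100.0])
```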

@Jiltseb
Collaborator

Jiltseb commented Jun 29, 2024

We already did internal benchmarking to compare them, and the results are the reason we proposed having both FEs in the package, based on the speed vs. WER trade-off:

Batched whisper, torchaudio: WER 6.5, speed 86.6x
Batched whisper, default: WER 6.14, speed 65.8x

I would still recommend keeping both versions to avoid new transcription errors popping up once the main PR is merged.

@trungkienbkhn what is your opinion?

@MahmoudAshraf97
Author

MahmoudAshraf97 commented Jun 29, 2024

I suggest rerunning the benchmarks (9f78b36 vs d95c7a6), as there should be no speed difference between torchaudio and the implementation in this PR, and going with the one with the lowest WER. I'd vote against keeping two feature extraction implementations: the difference in performance is negligible, and they would be a hassle to maintain (ignoring the fact that most users will not bother testing both and choosing the better one).

@Jiltseb
Collaborator

Jiltseb commented Jul 1, 2024

Makes sense. I am waiting for confirmation from @trungkienbkhn before a final review.

Collaborator

@Jiltseb Jiltseb left a comment

LGTM, some cosmetic changes are suggested in the comments.

feature_size=80,
sampling_rate=16000,
hop_length=160,
chunk_length=30,
n_fft=400,
):
if device == "auto":
Collaborator

What is the difference in speed if FE is performed on CPU vs. GPU? This needs to be evaluated before setting the default (for both short and long audio) in the batched and sequential cases.

Author

20s audio
5.51 ms ± 106 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) # CPU
1.1 ms ± 506 µs per loop (mean ± std. dev. of 7 runs, 1 loop each) # GPU

10min audio
76.3 ms ± 2.62 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) # CPU
8.06 ms ± 335 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) # GPU

around 5x speedup for short audio and 10x for long audio
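Working out the speedups from those timings (numbers copied from the timeit runs above):

```python
# mean ms per loop, from the timeit runs quoted above
cpu_ms = {"20s audio": 5.51, "10min audio": 76.3}
gpu_ms = {"20s audio": 1.10, "10min audio": 8.06}

speedup = {name: cpu_ms[name] / gpu_ms[name] for name in cpu_ms}
# roughly 5x for the short clip and closer to 9.5x for the long one
```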

faster_whisper/transcribe.py (review thread, resolved)
faster_whisper/feature_extractor.py (review thread, resolved)
@Jiltseb Jiltseb merged commit eff81f5 into mobiusml:master Jul 1, 2024
@trungkienbkhn

trungkienbkhn commented Jul 2, 2024

I suggest rerunning the benchmarks (9f78b36 vs d95c7a6), as there should be no speed difference between torchaudio and the implementation in this PR, and going with the one with the lowest WER. I'd vote against keeping two feature extraction implementations: the difference in performance is negligible, and they would be a hassle to maintain (ignoring the fact that most users will not bother testing both and choosing the better one).

I ran benchmarks for this with a GPU H100:

Sequential FW

  1. Speed
  • Current: 36.459s
  • Torchaudio kaldi: 35.749s
  2. WER
  • Current: 3.212
  • Torchaudio kaldi: 2.002

Batched FW

  1. Speed
  • Current: 7.084s
  • Torchaudio kaldi: 7.275s
  2. WER
  • Current: 1.658
  • Torchaudio kaldi: 1.773

@Jiltseb
Collaborator

Jiltseb commented Jul 2, 2024

And the WER with batched FW?
Note the typo: "speech" for total processing time.

@trungkienbkhn

And the WER with batched FW? Note the typo: "speech" for total processing time.

I just updated the WER benchmark. For torchaudio kaldi, I replaced the current logic with the logic here.

@MahmoudAshraf97
Author

MahmoudAshraf97 commented Jul 2, 2024

Are these numbers using librispeech_asr? If so, I guess we need to test on long-form audio too. So far the conclusion is that kaldi is superior in sequential, almost the same in batched, and both have the same speed.
I suggest running the whole benchmark suite here.

Edit: this is a fork with updated parameters, ready to run directly.

@Jiltseb
Collaborator

Jiltseb commented Jul 2, 2024

I think it is librispeech_asr, looking at the numbers. I have tested it on a long-form internal benchmark and the open-source benchmark I shared in the original PR.

The only change was for WER with torchaudio-kaldi, from 6.5 to 6.2. But I would consider that a negligible difference, as a seasoned speech researcher (many reasons behind this intuition).

open_asr_eval is NOT a long-form audio dataset; in fact, the average sample duration is under 10 seconds in most of its subsets. It is ONLY good for evaluating short-form audio. For long-form audio, we can use a subset of YouTube-Commons. It provides a realistic measure of speed, although the WER values need to be taken with a pinch of salt (owing to errors in the human annotation).

In my opinion, the performance is similar for both short and long-form audio, and we should be able to choose torchaudio.

@trungkienbkhn

Yes, I used librispeech_asr. I will try with mobiuslabsgmbh/youtube-commons-asr-eval.

@Jiltseb
Collaborator

Jiltseb commented Jul 3, 2024

I have done this comparison before the PR, and now reran the batched whisper on our YouTube-Commons version, just to make sure the speed and WER stay the same (in the original batched whisper PR, the default feature extraction was torchaudio):

# Before (torchaudio)
# average speed: 104x
# Test WER: 0.13164

# Before (systran FE)
# average speed: 63.5x
# Test WER: 0.13143

# After changes (torchaudio)
# average speed: 103.7x
# Test WER: 0.13125

I also did the comparison between sequential versions:

#with systran FE
# average speed: 20.1x
#WER: 14.6

#with torchaudio
#average speed: 28.5x
#WER: 14.4

Note: HF whisper also now uses torchaudio based FE whenever it is available

@trungkienbkhn

I ran the WER benchmark with the youtube-common-asr dataset:

  • WER: 13.287 (torch kaldi)
  • WER: 13.120 (current)

There is not much difference.

Note: HF whisper also now uses torchaudio based FE whenever it is available

Yes, HF also uses torchaudio kaldi. So I think we should keep both versions of FE: the current one and a new option enable_ta_fe (maybe defaulting to True). Do you have any other opinions?

@MahmoudAshraf97
Author

MahmoudAshraf97 commented Jul 3, 2024

I ran the WER benchmark with the youtube-common-asr dataset:

  • WER: 13.287 (torch kaldi)
  • WER: 13.120 (current)

There is not much difference.

Note: HF whisper also now uses torchaudio based FE whenever it is available

Yes, HF also uses torchaudio kaldi. So I think we should keep both versions of FE: the current one and a new option enable_ta_fe (maybe defaulting to True). Do you have any other opinions?

HF doesn't use that implementation for whisper; it's used for other models. These are the implementations used for whisper, which are numpy and torch implementations of the original whisper implementation.

Regardless of that, HF can maintain multiple implementations because it's a multi-framework library, and whisper can still be used without having PyTorch installed, hence the need for a numpy implementation. But this is not the case for faster-whisper, where torch must exist. Having two implementations is counterintuitive since they serve the same purpose with very close performance and no clear guidance on when to use one over the other; one can think of that as functionality duplication, which is not good practice.

Another point that caught my attention is that HF kaldi uses 16-bit signed integers while we use 32-bit floats; do you think that would make a difference?
To test, we should add this line in the kaldi branch just before the fbank call:
waveform = waveform * (2**15)

this is the result without scaling

Evaluating...: 1023it [09:30,  1.79it/s]
WER: 2.374

this is the result with scaling

Evaluating...: 1023it [09:27,  1.80it/s]
WER: 2.656
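The scaling experiment above boils down to mapping float32 samples in [-1.0, 1.0] to the signed 16-bit integer range that Kaldi-style frontends historically assumed; a minimal sketch of that one step:

```python
def scale_to_int16_range(samples):
    """Map float samples in [-1.0, 1.0] to the signed 16-bit range
    (the scale kaldi-compliance feature extractors historically expect).
    The values stay float; only the amplitude scale changes."""
    return [s * (2 ** 15) for s in samples]
```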

@Jiltseb

Jiltseb commented Jul 3, 2024

As @MahmoudAshraf97 mentioned, my opinion is to go with ONLY torchaudio since it has the same functionality at an increased speed.

I would say we can stick with fp32 for the time being, as int16 does not add any reasonable benefit and would require further benchmarking and visual inspection.

@trungkienbkhn

After several different benchmark runs, I noticed that torchaudio kaldi is faster, while WER doesn't change much (it increases slightly, but not significantly, compared with the current torch FE). So I think speed should be a priority; that's also why the project is named faster-whisper. In my opinion, I agree with just using torchaudio.
@Jiltseb, could you roll back the code to use torchaudio for FE? Thanks.

@MahmoudAshraf97

Speed is indeed a priority, but only to a certain extent IMO. faster-whisper is already the fastest implementation available after TensorRT-LLM, so squeezing out more speed while trading off WER isn't a wise decision if you ask me, given that the speed difference between the two is 104x vs. 103.7x, as you stated in an earlier comment.

@Jiltseb

Jiltseb commented Jul 4, 2024

@trungkienbkhn I think there is a misunderstanding here:

  1. We all agree original faster whisper feature extraction is slower and there is no great benefit in WER.
  2. Then there is torchaudio-kaldi based feature extraction that has faster speed and similar WER.
  3. The current FE implementation provided by @MahmoudAshraf97 uses torch.stft (like in whisperx), which is also faster than the original faster whisper implementation, with similar WER.

Now we have to select between torchaudio FE and torch FE; both have similar WER and speed (negligible differences).

I don't see any problem using the current implementation as it is. If there are no objections, let's finalize on this and move to the main PR for any further reviews/comments before merging.
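For reference, the torch.stft-style log-mel extraction discussed above can be sketched as follows. This is a simplified, numpy-only illustration, not the faster-whisper code: the real implementations use torch.stft / torchaudio on GPU, reflect padding, and normalized mel filters; only the parameter defaults (400-point FFT, 160-sample hop, 80 mel bins at 16 kHz) follow Whisper's published configuration.

```python
import numpy as np

def log_mel_spectrogram(audio, sr=16000, n_fft=400, hop=160, n_mels=80):
    """Simplified Whisper-style log-mel features (no reflect padding,
    un-normalized triangular mel filters)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2          # (frames, n_fft//2+1)

    # Triangular mel filterbank: FFT-bin centre frequencies on the mel scale
    bin_mels = 2595.0 * np.log10(1.0 + np.linspace(0, sr / 2, n_fft // 2 + 1) / 700.0)
    pts = np.linspace(bin_mels[0], bin_mels[-1], n_mels + 2)  # filter edge points
    rising = (bin_mels[None, :] - pts[:-2, None]) / (pts[1:-1] - pts[:-2])[:, None]
    falling = (pts[2:, None] - bin_mels[None, :]) / (pts[2:] - pts[1:-1])[:, None]
    fbank = np.maximum(0.0, np.minimum(rising, falling))      # (n_mels, bins)

    log_spec = np.log10(np.maximum(power @ fbank.T, 1e-10))
    log_spec = np.maximum(log_spec, log_spec.max() - 8.0)     # Whisper's dynamic-range clamp
    return (log_spec + 4.0) / 4.0
```

The torchaudio-kaldi path replaces the STFT-plus-filterbank part with `torchaudio.compliance.kaldi.fbank`, which is where the int16-vs-float scaling question above comes from.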

@Jiltseb

Jiltseb commented Jul 8, 2024

@trungkienbkhn I see that the main PR is still not merged. When can this be done?

@Jiltseb

Jiltseb commented Aug 14, 2024

@trungkienbkhn Discussing here on the further changes.

There is a set of different opinions regarding the latest PR we added. The reason was the apparent slowdown of the CPU versions when only a limited number of cores is available, along with dissatisfaction over the additional dependencies that were introduced.

As you already know, @MahmoudAshraf97 has already removed several of them, including transformers and pyannote (via an in-progress PR, not referenced here). @MahmoudAshraf97 and I also implemented versions without torch, so it finally has the same dependencies as the original FW. The slowness can easily be addressed by setting cpu_threads back to 0 as the default.

So a set of follow-up PRs should stabilize the big PR that got merged.
Can we discuss the feasibility here?
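As a sketch of what the cpu_threads=0 convention means (a hypothetical helper, not faster-whisper code): 0 defers the choice to the runtime rather than pinning an explicit thread count, which is why it avoids the slowdown on machines with few cores.

```python
import os

def resolve_cpu_threads(cpu_threads: int = 0) -> int:
    # Hypothetical illustration of the cpu_threads=0 convention:
    # 0 means "let the runtime decide"; any positive value pins
    # the thread count explicitly.
    if cpu_threads > 0:
        return cpu_threads
    return os.cpu_count() or 1
```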
