
removing the need for jsons dependency #18

Merged
merged 25 commits into from
Jul 1, 2024

Conversation


@MahmoudAshraf97 MahmoudAshraf97 commented Jun 20, 2024

  • removing the need for the jsons dependency
  • remove tokenizer reinitialization when the language or task is changed, since they can be changed directly without creating a new tokenizer
  • remove the need for a separate encode_batched function
  • enable word timestamps using the original functions
  • remove PyAV and use torchaudio instead; this fixes the memory leak in the resampler
  • use GPU in feature extraction for a 35x speedup. I also removed the old feature extraction since it has no use case; there's no need to keep it when we have a faster one with the same accuracy
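On the tokenizer point: a minimal sketch of the idea, using a hypothetical `Tokenizer` class (not the actual faster-whisper tokenizer wrapper) to show updating language/task in place instead of rebuilding the object:

```python
class Tokenizer:
    """Hypothetical stand-in for a tokenizer that carries language/task state."""

    def __init__(self, language, task):
        self.language = language
        self.task = task


tok = Tokenizer("en", "transcribe")

# Instead of constructing a new tokenizer whenever the language or task
# changes, the attributes can be updated directly on the existing instance:
tok.language = "fr"
tok.task = "translate"
```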

the diff seems to be messed up for some reason, but the changes are minimal

@MahmoudAshraf97
Author

@Jiltseb

@Jiltseb Jiltseb self-requested a review June 24, 2024 07:21
Collaborator

@Jiltseb Jiltseb left a comment

PR looks good in general. The following changes are important:

  1. Since torchaudio is not well maintained (per your comment), it poses a risk to remove PyAV completely from the system. Also, PyAV can still be used for media formats not supported by torchaudio.

  2. We should keep the combine_words function in transcribe to split the 30-second chunks into smaller segments based on VAD, providing better visualization and usability for subtitling.

  3. Make sure to re-run the benchmark and get the WER numbers right (based on comments).

  4. Minor (see comments).

faster_whisper/audio.py (review thread, resolved)

Docstring excerpt under discussion:

    whisper's original torch implementation with 1e-5 tolerance. Additionally, faster
    feature extraction option using kaldi fbank features are available if torchaudio is
    available.

    whisper's original torch implementation with 1e-5 tolerance.
Collaborator

This is not whisper's original torch implementation. Update the docstring to specify the torchaudio-based FE.

faster_whisper/transcribe.py (review thread, resolved)
faster_whisper/transcribe.py (review thread, resolved)
tests/test_transcribe.py (review thread, resolved)
faster_whisper/transcribe.py (review thread, outdated, resolved)
@Jiltseb Jiltseb requested a review from trungkienbkhn June 25, 2024 07:35
@trungkienbkhn

trungkienbkhn commented Jun 25, 2024

When testing, I encountered a problem with word timestamps; they are a bit weird:
[68.42s -> 68.92s] => [41.48s -> 42.58s] => [68.92s -> 71.76s]
[140.60s -> 146.26s] => [121.52s -> 122.52s] => [126.24s -> 127.54s]
...

[0.00s -> 1.88s]  Would you see what you can get for this, please?
[5.36s -> 7.44s]  Not the royal ring, your highness.
[10.50s -> 10.94s]  Shh!
[12.12s -> 12.60s]  Do you want to help out?
[27.38s -> 28.44s]  Excuse me.
[45.36s -> 49.24s]  Is that man actually royalty?
[52.74s -> 54.02s]  No, madame.
[55.54s -> 57.40s]  But you called him your highness.
[59.42s -> 60.74s]  It was a faux pas.
[62.42s -> 63.34s]  Please forget about it.
[64.20s -> 67.12s]  You can trust me.
[68.42s -> 68.92s]  I won't tell.
[41.48s -> 42.58s]  Madame, I am...
[68.92s -> 71.76s]  Extraordinary man of destiny.
[92.46s -> 96.00s]  Your highness.
[100.20s -> 101.82s]  Your highness, don't be alarmed.
[105.18s -> 106.26s]  I can be trusted.
[108.82s -> 110.56s]  Are you one of my subjects?
[112.96s -> 113.60s]  No.
[114.38s -> 118.64s]  I'm an American, Fanny Eubanks of Omaha.
[121.44s -> 122.66s]  I couldn't help overhearing.
[126.06s -> 126.38s]  If you're in trouble and there's some way I can help,
[98.92s -> 101.24s]  thank you, but I cannot accept.
[105.72s -> 108.84s]  You've already risked too much just in speaking to me.
[112.22s -> 118.18s]  I still want to help.
[123.90s -> 127.10s]  You must understand, I have powerful enemies.
[131.92s -> 135.42s]  They may be watching even as we...
[140.60s -> 146.26s]  My God, you're attractive.
[121.52s -> 122.52s]  It's late.
[126.24s -> 127.54s]  I must go.
[128.92s -> 134.88s]  As he left?
[141.82s -> 143.78s]  Yes, just a moment to calm.
[148.94s -> 149.74s]  Good.
[151.90s -> 152.28s]  Please.
[153.72s -> 158.90s]  You must tell me where he lives.
[164.14s -> 166.36s]  I feel it only fair to wonder.
[168.64s -> 169.78s]  I know.
[170.68s -> 174.12s]  He told me he has powerful enemies.
[176.62s -> 178.06s]  There may also be an emotional risk.
[182.24s -> 182.56s]  You see, his highness has been a widower for five years.
[158.92s -> 160.20s]  For five years?
[174.20s -> 177.00s]  Please, your highness.
[181.04s -> 187.32s]  Fanny, the Freedom Fighters thank you.
[195.52s -> 197.20s]  This is for the overhead.
[200.52s -> 202.24s]  This goes to you, Arthur.
[206.08s -> 208.70s]  This goes to you, Andre.
[211.58s -> 212.50s]  This goes to me, which means it's time to go to Zurich.
[188.92s -> 190.18s]  Thank you.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda")
segments, info = model.transcribe(audio_path, word_timestamps=True)
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

Could you take a look?

@Jiltseb
Collaborator

Jiltseb commented Jun 25, 2024

Yes, the error in timestamps seems to repeat after every batch: each batch starts a lot earlier than the final timestamp of the previous batch.
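The symptom described above is consistent with chunk-relative timestamps not being shifted back into absolute time. A minimal sketch of the bookkeeping involved (hypothetical data layout, not the actual faster-whisper internals):

```python
def shift_to_absolute(batch_segments, chunk_starts):
    """Shift per-chunk segment times by each chunk's start offset.

    batch_segments: one list of (start, end, text) tuples per decoded chunk;
    chunk_starts: where each chunk begins in the original audio, in seconds.
    If a chunk's offset is dropped, its timestamps restart near zero, which
    looks exactly like the jumps reported above.
    """
    absolute = []
    for segments, chunk_start in zip(batch_segments, chunk_starts):
        for start, end, text in segments:
            absolute.append((start + chunk_start, end + chunk_start, text))
    return absolute


result = shift_to_absolute(
    [[(0.0, 2.5, "first chunk")], [(0.4, 1.9, "second chunk")]],
    [0.0, 30.0],
)
```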

@MahmoudAshraf97
Author

I managed to reproduce this on several audio files, but after pulling a fresh copy of the repo I couldn't anymore. Can someone test again and provide a minimal example?

@Jiltseb
Collaborator

Jiltseb commented Jun 25, 2024

I used the example faster-whisper/tests/data/physicsworks.wav to test both the batched and sequential versions.

Batched version:

[0.00s -> 27.44s] Now I want to return to the conservation of mechanical energy. I have here a pendulum. I have an object that weighs 15 kilograms, and I can lift it up one meter, which I have done now. That means I've done work. MGH is the work I have done, believe me. I've increased the potential energy of this object. 15 times 10 is about 150 joules. If I let it fall,
[28.11s -> 53.91s] then that will be converted to kinetic energy. If I would let it swing from one meter height, and you would be there, and it would hit you, you'd be dead. 150 joules is enough to kill you. They use these devices. It's called a wrecker ball. They use them to demolish buildings. You lift up a very heavy object, even heavier than this,
[54.35s -> 82.93s] and then you let it go, you swing it, thereby converting gravitational potential energy into kinetic energy, and that way you can demolish a building. You just let it hit... and it breaks a building. And that's the whole idea of wrecking. So you're using, then, the conversion of gravitational potential energy to kinetic energy.
[84.21s -> 113.45s] Now, I am such a strong believer of the conservation of mechanical energy that I am willing to put my life on the line. If I release that bulb from a certain height, then that bulb can never come back to a point where the height is any larger. If I release it from this height,
[114.17s -> 137.71s] and it swings, then when it reaches here, it could not be higher. There is a conversion from gravitational potential energy to kinetic energy back to gravitational potential energy, and it will come to a stop here. And when it swings back, it should not be able to reach any higher, provided that I do not give this object an initial speed when I stand here.
[141.28s -> 169.74s] the conservation of mechanical energy for 100%. I may not trust myself. I'm going to release this object, and I hope I will be able to do it at zero speed, so that when it comes back, it may touch my chin, but it may not crush my chin. I want you to be extremely quiet, because this is no joke. If I don't succeed in giving it zero speed,
[170.18s -> 187.62s] Then, this will be my last lecture. I will close my eyes. I don't want to see this. So please be very quiet. I almost didn't sleep all night. 3, 2, 1, 0.
[200.73s -> 202.73s] Physics works, and I'm still alive.

I did not see the timestamps issue in this test. The timestamps are still not sparse, though; are you yet to commit those changes?

Sequential version:

[0.00s -> 8.98s]  Now I want to return to the conservation of mechanical energy. I have here a pendulum.
[20.48s -> 25.00s]  I have an object that weighs 15 kilograms, and I can lift it up one meter, which I have
[30.22s -> 36.12s]  done now. That means I've done work. MGH is the work I have done, believe me. I've increased
[42.38s -> 48.14s]  the potential energy of this object. 15 times 10 is about 150 joules. If I let it fall,
[48.14s -> 53.56s]  I use them to demolish buildings. You lift up a very heavy object, even heavier than
[59.98s -> 66.12s]  this, and then you let it go. You swing it, thereby converting gravitational potential
[73.08s -> 82.42s]  energy into kinetic energy, and that way you can demolish a building. You just let it hit,
[93.42s -> 99.78s]  and it breaks a building. And that's the whole idea of wrecking. So you are using then the
[100.10s -> 110.40s]  that bob from a certain height, then that bob can never come back to a point where the height
[120.98s -> 128.02s]  is any larger. If I release it from this height, and it swings, then when it reaches here,
[135.30s -> 140.78s]  it could not be higher. There is a conversion from gravitational potential energy to kinetic
[146.44s -> 149.92s]  energy back to gravitational potential energy, and it will come to a stop here. And when it
[156.32s -> 163.90s]  comes back, it may touch my chin, but it may not crush my chin. I want you to be extremely
[178.48s -> 185.58s]  quiet, because this is no joke. If I don't succeed in giving it zero speed, then this
[193.10s -> 200.52s]  will be my last lecture. I will close my eyes. I don't want to see this. So please be very
[201.40s -> 201.58s]  quiet.
[230.28s -> 230.76s]  Physics works, and I'm still alive.

Surprisingly, the timestamps here are wrong: the output ends at 230 s, longer than the total length of the audio.

@Jiltseb
Collaborator

Jiltseb commented Jun 25, 2024

Ah, I had to remove and install the latest one again; it shows similar results now. I will compare with previous versions and get back.

@MahmoudAshraf97
Author

Yes, I managed to get the batched version to work with without_timestamps=False, and it outputs the same segments as the non-batched version, but it's much slower, so I'm still investigating. From what I understand, the generation step is three times or more slower when timestamp tokens are generated, and this slowdown increases with batch size.

@Jiltseb
Collaborator

Jiltseb commented Jun 25, 2024

It throws an error at `[token for token in subsegment["tokens"] if token < tokenizer.eot]`: `TypeError: string indices must be integers` when word_timestamps is set to True in the batched version. Can you check that?
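For reference, that `TypeError` usually means the loop variable is a string rather than a dict, e.g. when iterating over a single subsegment dict instead of a list of subsegments. A minimal reproduction (generic Python, not the actual faster-whisper code path):

```python
subsegments = {"tokens": [50365, 440, 50866]}  # a single dict, not a list of dicts

try:
    # Iterating over a dict yields its keys (strings), so
    # subsegment["tokens"] indexes a *string* and raises TypeError.
    bad = [[t for t in subsegment["tokens"]] for subsegment in subsegments]
except TypeError as exc:
    message = str(exc)

# Wrapping the dict in a list restores the intended iteration over dicts.
good = [[t for t in subsegment["tokens"]] for subsegment in [subsegments]]
```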

@MahmoudAshraf97
Author

MahmoudAshraf97 commented Jun 25, 2024

I still have uncommitted changes, so maybe it's fixed in them.
This is the result for the batched version with both segment timestamps and word timestamps, large-v3 model.
Notice that the last subsegment in each segment is skipped, so I'm still working on it.

[0.00s -> 8.86s]  Now I want to return to the conservation of mechanical energy. I have here a pendulum.
[20.36s -> 24.88s]  I have an object that weighs 15 kilograms and I can lift it up one meter, which I have
[29.90s -> 35.12s]  done now. That means I've done work. MGH is the work I have done, believe me. I've increased
[28.11s -> 30.57s]  then that will be converted to kinetic energy.
[35.83s -> 40.91s]  If I would let it swing from one meter height,
[46.21s -> 50.51s]  and you would be there and it would hit you, you'd be dead.
[54.17s -> 57.41s]  150 joules is enough to kill you.
[61.63s -> 64.99s]  They use these devices, they're called a racquetball,
[68.59s -> 68.91s]  they use them to demolish buildings.
[54.35s -> 59.77s]  And then you let it go, you swing it, thereby converting gravitational potential energy
[66.69s -> 71.85s]  into kinetic energy and that way you can demolish a building.
[77.51s -> 85.07s]  You just let it hit and it breaks a building and that's the whole idea of wrecking.
[83.87s -> 93.17s]  Now, I am such a strong believer of the conservation of mechanical energy that I am willing to
[103.93s -> 108.49s]  put my life on the line.
[113.61s -> 124.47s]  If I release that bulb from a certain height, then that bulb can never come back to a point
[135.53s -> 136.53s]  where the height is any larger.
[114.17s -> 119.85s]  and it swings, then when it reaches here it could not be higher. There is a conversion
[126.39s -> 131.03s]  from gravitational potential energy to kinetic energy back to gravitational potential energy
[135.99s -> 140.47s]  and it will come to a stop here. And when it swings back it should not be able to reach
[141.28s -> 144.38s]  the conservation of mechanical energy, for 100%.
[150.30s -> 154.32s]  I may not trust myself.
[158.32s -> 160.72s]  I'm going to release this object,
[163.32s -> 167.16s]  and I hope I will be able to do it at zero speed,
[171.20s -> 174.24s]  so that when it comes back, it may touch my chin,
[177.56s -> 181.00s]  but it may not crush my chin.
[185.04s -> 186.74s]  I want you to be extremely quiet, because this is no joke.
[170.18s -> 178.52s]  this will be my last lecture. I will close my eyes. I don't want to see this. So please
[200.73s -> 202.49s]  Physics works, and I'm still alive.

without_timestamps=True

[0.00s -> 27.24s]  Now I want to return to the conservation of mechanical energy. I have here a pendulum. I have an object that weighs 15 kilograms and I can lift it up one meter, which I have done now. That means I have done work. MGH is the work I have done, believe me. I have increased the potential energy of this object. Fifteen times ten is about 150 joules. If I let it fall,
[28.11s -> 53.79s]  then that will be converted to kinetic energy. If I would let it swing from one meter height and you would be there and it would hit you, you'd be dead. 150 joules is enough to kill you. They use these devices, it's called a racquetball, they use them to demolish buildings. You lift up a very heavy object, even heavier than this,
[54.35s -> 82.81s]  And then you let it go, you swing it, thereby converting gravitational potential energy into kinetic energy. And that way you can demolish a building. You just let it hit and it breaks a building. And that's the whole idea of wrecking. So you are using then the conversion of gravitational potential energy to kinetic energy.
[83.87s -> 113.27s]  Now, I am such a strong believer of the conservation of mechanical energy that I am willing to put my life on the line. If I release that bulb from a certain height, then that bulb can never come back to a point where the height is any larger. If I release it from this height,
[114.17s -> 140.35s]  and it swings then when it reaches here it could not be higher. There is a conversion from gravitational potential energy to kinetic energy back to gravitational potential energy and it will come to a stop here. And when it swings back it should not be able to reach any higher provided that I do not give this object an initial speed when I stand here. I trust
[141.28s -> 169.40s]  the conservation of mechanical energy for 100%. I may not trust myself. I'm going to release this object, and I hope I will be able to do it at zero speed, so that when it comes back, it may touch my chin, but it may not crush my chin. I want you to be extremely quiet, because this is no joke. If I don't succeed in giving it zero speed,
[170.44s -> 187.50s]  then this will be my last lecture. I will close my eyes. I don't want to see this. So please be very quiet. I almost didn't sleep all night. Three, two, one, zero.
[200.73s -> 202.49s]  Physics works, and I'm still alive.

@MahmoudAshraf97
Author

@hargunmujral running multiple transcribe calls using the same pipeline will cause this for sure; the correct method is to use either a single pipeline with the maximum batch size, or multiple pipelines referencing a single model.

@Jiltseb I'm ready

@Jiltseb Jiltseb self-requested a review June 27, 2024 20:03
@Jiltseb
Collaborator

Jiltseb commented Jun 28, 2024

@MahmoudAshraf97 I just compared the default batched transcription and its variant with both word_timestamps=True and without_timestamps=False set. There are some missing words in the second version. Can you check?

Default:

[0.00s -> 27.79s]  Now I want to return to the conservation of mechanical energy. I have here a pendulum. I have an object that weighs 15 kilograms and I can lift it up one meter, which I have done now. That means I have done work. mgh is the work I have done, believe me. I have increased the potential energy of this object. Fifteen times ten is about 150 joules. If I let it fall,
[28.11s -> 54.20s]  then that will be converted to kinetic energy. If I would let it swing from one meter height and you would be there and it would hit you, you'd be dead. 150 joules is enough to kill you. They use these devices, it's called a racquetball, they use them to demolish buildings. You lift up a very heavy object, even heavier than this,
[54.35s -> 83.17s]  And then you let it go, you swing it, thereby converting gravitational potential energy into kinetic energy. And that way you can demolish a building. You just let it hit and it breaks a building. And that's the whole idea of wrecking. So you are using then the conversion of gravitational potential energy to kinetic energy.
[83.87s -> 113.77s]  Now, I am such a strong believer of the conservation of mechanical energy that I am willing to put my life on the line. If I release that bulb from a certain height, then that bulb can never come back to a point where the height is any larger. If I release it from this height,
[114.17s -> 140.95s]  and it swings then when it reaches here it could not be higher. There is a conversion from gravitational potential energy to kinetic energy back to gravitational potential energy and it will come to a stop here. And when it swings back it should not be able to reach any higher provided that I do not give this object an initial speed when I stand here. I trust
[141.28s -> 170.04s]  the conservation of mechanical energy for 100%. I may not trust myself. I'm going to release this object, and I hope I will be able to do it at zero speed, so that when it comes back, it may touch my chin, but it may not crush my chin. I want you to be extremely quiet, because this is no joke. If I don't succeed in giving it zero speed,
[170.18s -> 187.90s]  then this will be my last lecture. I will close my eyes. I don't want to see this. So please be very quiet. I almost didn't sleep all night. Three, two, one, zero.
[200.73s -> 202.90s]  Physics works, and I'm still alive.
Elapsed time: 3.1877

Default with word_timestamps=True and without_timestamps=False:

[0.00s -> 8.86s]  Now I want to return to the conservation of mechanical energy. I have here a pendulum.
[9.74s -> 14.52s]  I have an object that weighs 15 kilograms and I can lift it up one meter, which I have
[14.52s -> 20.10s]  done now. That means I have done work. MGH is the work I have done, believe me. I have
[20.10s -> 27.26s]  increased the potential energy of this object. 15 x 10 is about 150 joules. If I let it fall,
[28.11s -> 30.57s]  then that will be converted to kinetic energy.
[31.59s -> 36.35s]  If I would let it swing from one meter height,
[37.03s -> 39.49s]  and you would be there and it would hit you, you'd be dead.
[40.65s -> 42.53s]  150 joules is enough to kill you.
[43.89s -> 46.91s]  They use these devices, it's called a racquetball,
[47.47s -> 49.21s]  they use them to demolish buildings.
[49.97s -> 53.79s]  You lift up a very heavy object, even heavier than this,
[54.35s -> 59.79s]  And then you let it go, you swing it, thereby converting gravitational potential energy
[60.35s -> 64.95s]  into kinetic energy and that way you can demolish a building.
[65.39s -> 73.53s]  You just let it hit and it breaks a building and that's the whole idea of wrecking.
[74.95s -> 82.81s]  So you are using then the conversion of gravitational potential energy to kinetic energy.
[83.87s -> 93.17s]  Now, I am such a strong believer of the conservation of mechanical energy that I am willing to
[93.83s -> 95.73s]  put my life on the line.
[97.85s -> 108.71s]  If I release that bulb from a certain height, then that bulb can never come back to a point
[109.47s -> 110.93s]  where the height is any larger.
**[114.17s -> 119.85s]  and it swings,** then when it reaches here it could not be higher. There is a conversion : MISSING WORD
[119.85s -> 124.95s]  from gravitational potential energy to kinetic energy back to gravitational potential energy
[124.95s -> 129.65s]  and it will come to a stop here. And when it swings back it should not be able to reach
[130.22s -> 137.55s]  any higher, provided that I do not give this object an initial speed when I stand here.
**[141.28s -> 144.36s]  the conservation of mechanical energy for 100%** : MISSING WORD
[145.42s -> 146.84s]  I may not trust myself.
[149.44s -> 151.16s]  I'm going to release this object,
[151.74s -> 154.54s]  and I hope I will be able to do it at zero speed,
[155.62s -> 158.34s]  so that when it comes back, it may touch my chin,
[159.02s -> 160.32s]  but it may not crush my chin.
[162.24s -> 165.44s]  I want you to be extremely quiet, because this is no joke.
[166.30s -> 169.40s]  If I don't succeed in giving it zero speed,
[170.42s -> 178.52s]  this will be my last lecture. I will close my eyes. I don't want to see this. So please
[178.52s -> 187.50s]  be very quiet. I almost didn't sleep all night. Three, two, one, zero.
[200.73s -> 202.51s]  Physics works, and I'm still alive.
Elapsed time: 5.2835

@MahmoudAshraf97
Author

Technically speaking, the words are not missing, because they were never generated, given the same encoder input:

result = self.model.generate(encoder_output, [prompt] * batch_size)
tokenizer.decode(result[0].sequences_ids[0]) # 62 tokens
>>> ' Now, I am such a strong believer of the conservation of mechanical energy that I am willing to put my life on the line. If I release that bulb from a certain height, then that bulb can never come back to a point where the height is any larger.'
result = self.model.generate(encoder_output, [prompt+[50364]] * batch_size) # no timestamps token added to the prompt
tokenizer.decode(result[0].sequences_ids[0]) # 61 tokens
>>> ' Now, I am such a strong believer of the conservation of mechanical energy that I am willing to put my life on the line. If I release that bulb from a certain height, then that bulb can never come back to a point where the height is any larger. If I release it from this height,'

Note that the length of prompt + generation is 65 tokens in both cases.
The strange thing is that when timestamps are enabled, appending the tokenizer.language or tokenizer.language token to the prompt seems to fix the issue in this specific segment, but not in all segments.

I guess this is a problem on the ctranslate2 side.
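If helpful, here is the token-budget reading of those two runs (prompt lengths inferred from the stated totals, not measured):

```python
# Both runs above end with prompt + generation = 65 tokens, which points
# at a shared decoding length budget rather than missing audio content.
total_tokens = 65
generated = {"plain prompt": 62, "prompt with timestamp token": 61}

inferred_prompt_len = {name: total_tokens - n for name, n in generated.items()}
# Every token added to the prompt appears to cost one generated token.
```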

@Jiltseb
Collaborator

Jiltseb commented Jun 28, 2024

I see the issue on the sequential side as well (it hallucinates in some places, even though it does not miss them). But the original faster-whisper version (sequential) does a perfect job; it does not matter whether without_timestamps is set or not. Interestingly, if I enable torchaudio feature extraction, it still hallucinates in sequential mode. The features have different dynamic ranges.

  1. I would suggest keeping the enable_ta_fe flag and the original feature extraction for the sequential version.
  2. For the batched version, the issue can be alleviated via VAD-based segmentation instead of using without_timestamps=True, or by still using the original feature extraction (with a speed trade-off).
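The VAD-based segmentation idea in point 2 can be sketched as greedy packing of detected speech spans into chunks no longer than 30 s (a hypothetical helper, not the library's actual VAD code):

```python
def chunk_by_vad(speech_spans, max_len=30.0):
    """Greedily pack VAD speech spans [(start, end), ...] into chunks whose
    total extent never exceeds max_len seconds, so each chunk can be decoded
    independently in a batch while keeping timestamps chunk-relative."""
    chunks, current = [], []
    for start, end in speech_spans:
        # start a new chunk if adding this span would exceed the budget
        if current and end - current[0][0] > max_len:
            chunks.append(current)
            current = []
        current.append((start, end))
    if current:
        chunks.append(current)
    return chunks


chunks = chunk_by_vad([(0.0, 10.0), (12.0, 25.0), (27.0, 40.0)])
```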

@MahmoudAshraf97
Author

Reimplementing the original feature extraction in torch instead of numpy can be faster than the kaldi implementation, but it results in a slightly higher WER on the included benchmark, so we keep the current transcription performance of faster-whisper without the speed trade-off. By the way, the kaldi implementation might perform worse on the included file, but it performed better on other files, so I guess the final choice between the two should be based on extensive benchmarking.
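For context on what "the original feature extraction" computes after the mel filterbank: to my understanding, whisper's reference implementation log-compresses, clamps the dynamic range, and rescales. A stdlib-only sketch of just that post-processing step (the STFT and mel projection are omitted; the different dynamic ranges mentioned above come from exactly this kind of normalization):

```python
import math


def log_mel_postprocess(mel_power):
    """Post-process mel power values the way whisper's reference FE does
    (a sketch over a flat list; the real code operates on 2-D tensors)."""
    # log10 with a floor to avoid log(0)
    log_spec = [math.log10(max(p, 1e-10)) for p in mel_power]
    # clamp the dynamic range to 8 (in log10 units) below the maximum
    floor = max(log_spec) - 8.0
    log_spec = [max(v, floor) for v in log_spec]
    # rescale to roughly [-1, 1]
    return [(v + 4.0) / 4.0 for v in log_spec]


features = log_mel_postprocess([1e-12, 1.0, 100.0])
```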

@Jiltseb
Collaborator

Jiltseb commented Jun 29, 2024

We already did internal benchmarking to compare them, and the results are the reason we proposed having both FEs in the package, based on the speed vs. WER trade-off:

Batched whisper, torchaudio: WER 6.5, speed 86.6x
Batched whisper, default: WER 6.14, speed 65.8x

I would still recommend keeping both versions to avoid new transcription errors popping up once the main PR is merged.

@trungkienbkhn what is your opinion?

@MahmoudAshraf97
Author

MahmoudAshraf97 commented Jun 29, 2024

I suggest rerunning the benchmarks (9f78b36 vs d95c7a6), as there should be no speed difference between torchaudio and the implementation in this PR, and going with the one with the lowest WER. I'd vote against keeping two feature extraction implementations: the difference in performance is negligible, and they would be a hassle to maintain (ignoring the fact that most users will not bother testing both and choosing the better one).

@Jiltseb
Collaborator

Jiltseb commented Jul 1, 2024

Makes sense. I am waiting for confirmation from @trungkienbkhn before a final review.

Collaborator

@Jiltseb Jiltseb left a comment

LGTM, some cosmetic changes are suggested in the comments.

feature_size=80,
sampling_rate=16000,
hop_length=160,
chunk_length=30,
n_fft=400,
):
if device == "auto":
Collaborator

What is the difference in speed if FE is performed on CPU vs. GPU? This needs to be evaluated before setting the default (for both short and long audio) in the batched and sequential cases.

Author

20s audio
5.51 ms ± 106 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) # CPU
1.1 ms ± 506 µs per loop (mean ± std. dev. of 7 runs, 1 loop each) # GPU

10min audio
76.3 ms ± 2.62 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) # CPU
8.06 ms ± 335 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) # GPU

around 5x speedup for short audio and 10x for long audio
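Working out the speedups from those timings (numbers copied from the timeit runs above):

```python
# mean ms per loop, from the timeit runs quoted above
cpu_ms = {"20s audio": 5.51, "10min audio": 76.3}
gpu_ms = {"20s audio": 1.10, "10min audio": 8.06}

speedup = {name: cpu_ms[name] / gpu_ms[name] for name in cpu_ms}
# roughly 5x for the short clip and closer to 9.5x for the long one
```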

faster_whisper/transcribe.py (review thread, resolved)
faster_whisper/feature_extractor.py (review thread, resolved)
@Jiltseb Jiltseb merged commit eff81f5 into mobiusml:master Jul 1, 2024
@trungkienbkhn

trungkienbkhn commented Jul 2, 2024

I suggest rerunning the benchmarks (9f78b36 vs d95c7a6), as there should be no speed difference between torchaudio and the implementation in this PR, and going with the one with the lowest WER. I'd vote against keeping two feature extraction implementations: the difference in performance is negligible, and they would be a hassle to maintain (ignoring the fact that most users will not bother testing both and choosing the better one).

I ran benchmarks for this with a GPU H100:

Sequential FW

  1. Speed
  • Current: 36.459s
  • Torchaudio kaldi: 35.749s
  2. WER
  • Current: 3.212
  • Torchaudio kaldi: 2.002

Batched FW

  1. Speed
  • Current: 7.084s
  • Torchaudio kaldi: 7.275s
  2. WER
  • Current: 1.658
  • Torchaudio kaldi: 1.773

@Jiltseb
Collaborator

Jiltseb commented Jul 2, 2024

And the WER with batched FW?
Note the typo: "speech" for total processing time.

@trungkienbkhn

And the WER with batched FW? Note the typo: "speech" for total processing time.

I just updated the WER benchmark. For torchaudio kaldi, I replaced the current logic with the logic here.

@MahmoudAshraf97
Author

MahmoudAshraf97 commented Jul 2, 2024

Are these numbers using librispeech_asr? If so, I guess we need to test on long-form audio too. So far the conclusion is that kaldi is superior in sequential, almost the same in batched, and both have the same speed.
I suggest running the whole benchmark suite here.

Edit: this is a fork with updated parameters, ready to run directly.

@Jiltseb
Collaborator

Jiltseb commented Jul 2, 2024

I think it is librispeech_asr, looking at the numbers. I have tested it on a long-form internal benchmark and the open-source benchmark I shared in the original PR.

The only change was for WER with torchaudio-kaldi, from 6.5 to 6.2. But I would consider that a negligible difference, as a seasoned speech researcher (many reasons behind this intuition).

open_asr_eval is NOT a long-form audio dataset; in fact, the average sample duration is under 10 seconds in most of its subsets. It is ONLY good for evaluating short-form audio. For long-form audio, we can use a subset of YouTube-Commons. It provides a realistic measure of speed, although the WER values need to be taken with a pinch of salt (owing to errors in the human annotation).

In my opinion, the performance is similar for both short and long-form audio, and we should be able to choose torchaudio.

@trungkienbkhn

Yes, I used librispeech_asr. I will try with mobiuslabsgmbh/youtube-commons-asr-eval.

@Jiltseb
Collaborator

Jiltseb commented Jul 3, 2024

I have done this comparison before the PR, and now reran the batched whisper on our YouTube-Commons version, just to make sure the speed and WER stay the same (in the original batched whisper PR, the default feature extraction was torchaudio):

# Before (torchaudio)
# average speed: 104x
# Test WER: 0.13164

# Before (systran FE)
# average speed: 63.5x
# Test WER: 0.13143

# After changes (torchaudio)
# average speed: 103.7x
# Test WER: 0.13125

I also did the comparison between sequential versions:

#with systran FE
# average speed: 20.1x
#WER: 14.6

#with torchaudio
#average speed: 28.5x
#WER: 14.4

Note: HF whisper also now uses torchaudio based FE whenever it is available

@trungkienbkhn

I ran the WER benchmark with the youtube-common-asr dataset:

  • WER: 13.287 (torch kaldi)
  • WER: 13.120 (current)

There is not much difference.

Note: HF whisper also now uses torchaudio based FE whenever it is available

Yes, HF also uses torchaudio kaldi. So I think we should keep both versions of FE: the current one and a new option enable_ta_fe (maybe defaulting to True). Do you have any other opinions?

@MahmoudAshraf97
Author

MahmoudAshraf97 commented Jul 3, 2024

I ran the WER benchmark with the youtube-common-asr dataset:

  • WER: 13.287 (torch kaldi)
  • WER: 13.120 (current)

There is not much difference.

Note: HF whisper also now uses torchaudio based FE whenever it is available

Yes, HF also uses torchaudio kaldi. So I think we should keep both versions of FE: the current one and a new option enable_ta_fe (maybe defaulting to True). Do you have any other opinions?

HF doesn't use that implementation for whisper; it's used for other models. These are the implementations used for whisper, which are numpy and torch implementations of the original whisper implementation.

Regardless of that, HF can maintain multiple implementations because it's a multi-framework library, and whisper can still be used without having PyTorch installed, hence the need for a numpy implementation. But this is not the case for faster-whisper, where torch must exist. Having two implementations is counterintuitive since they serve the same purpose with very close performance and no clear guidance on when to use one over the other; one can think of that as functionality duplication, which is not good practice.

Another point that caught my attention is that HF kaldi uses 16-bit signed integers while we use 32-bit floats; do you think that would make a difference?
To test, we should add this line in the kaldi branch just before the fbank call:
waveform = waveform * (2**15)

this is the result without scaling

Evaluating...: 1023it [09:30,  1.79it/s]
WER: 2.374

this is the result with scaling

Evaluating...: 1023it [09:27,  1.80it/s]
WER: 2.656
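The scaling experiment above boils down to mapping float32 samples in [-1.0, 1.0] to the signed 16-bit integer range that Kaldi-style frontends historically assumed; a minimal sketch of that one step:

```python
def scale_to_int16_range(samples):
    """Map float samples in [-1.0, 1.0] to the signed 16-bit range
    (the scale kaldi-compliance feature extractors historically expect).
    The values stay float; only the amplitude scale changes."""
    return [s * (2 ** 15) for s in samples]
```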

@Jiltseb

Jiltseb commented Jul 3, 2024

As @MahmoudAshraf97 mentioned, my opinion is to go with ONLY torchaudio since it has the same functionality at an increased speed.

I would say we can stick with fp32 for the time being, as int16 does not add any reasonable benefit and would require further benchmarking and visual inspection.

@trungkienbkhn

After several different benchmark runs, I noticed that torchaudio kaldi is faster, while WER doesn't change much (it increases slightly, but not significantly, compared with the current torch FE). So I think speed should be a priority; that's also why the project is named faster-whisper. In my opinion, I agree with just using torchaudio.
@Jiltseb, could you roll back the code to use torchaudio for FE? Thanks.

@MahmoudAshraf97

Speed is indeed a priority, but only to a certain extent IMO. faster-whisper is already the fastest implementation available after TensorRT-LLM, so squeezing out more speed while trading off WER isn't a wise decision if you ask me, given that the speed difference between the two is 104x vs. 103.7x, as you stated in an earlier comment.

@Jiltseb

Jiltseb commented Jul 4, 2024

@trungkienbkhn I think there is a misunderstanding here:

  1. We all agree original faster whisper feature extraction is slower and there is no great benefit in WER.
  2. Then there is torchaudio-kaldi based feature extraction that has faster speed and similar WER.
  3. The current FE implementation provided by @MahmoudAshraf97 uses torch.stft (like in whisperx), which is also faster than the original faster whisper implementation, with similar WER.

Now we have to select between torchaudio FE and torch FE; both have similar WER and speed (negligible differences).

I don't see any problem using the current implementation as it is. If there are no objections, let's finalize on this and move to the main PR for any further reviews/comments before merging.
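For reference, the torch.stft-style log-mel extraction discussed above can be sketched as follows. This is a simplified, numpy-only illustration, not the faster-whisper code: the real implementations use torch.stft / torchaudio on GPU, reflect padding, and normalized mel filters; only the parameter defaults (400-point FFT, 160-sample hop, 80 mel bins at 16 kHz) follow Whisper's published configuration.

```python
import numpy as np

def log_mel_spectrogram(audio, sr=16000, n_fft=400, hop=160, n_mels=80):
    """Simplified Whisper-style log-mel features (no reflect padding,
    un-normalized triangular mel filters)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2          # (frames, n_fft//2+1)

    # Triangular mel filterbank: FFT-bin centre frequencies on the mel scale
    bin_mels = 2595.0 * np.log10(1.0 + np.linspace(0, sr / 2, n_fft // 2 + 1) / 700.0)
    pts = np.linspace(bin_mels[0], bin_mels[-1], n_mels + 2)  # filter edge points
    rising = (bin_mels[None, :] - pts[:-2, None]) / (pts[1:-1] - pts[:-2])[:, None]
    falling = (pts[2:, None] - bin_mels[None, :]) / (pts[2:] - pts[1:-1])[:, None]
    fbank = np.maximum(0.0, np.minimum(rising, falling))      # (n_mels, bins)

    log_spec = np.log10(np.maximum(power @ fbank.T, 1e-10))
    log_spec = np.maximum(log_spec, log_spec.max() - 8.0)     # Whisper's dynamic-range clamp
    return (log_spec + 4.0) / 4.0
```

The torchaudio-kaldi path replaces the STFT-plus-filterbank part with `torchaudio.compliance.kaldi.fbank`, which is where the int16-vs-float scaling question above comes from.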

@Jiltseb

Jiltseb commented Jul 8, 2024

@trungkienbkhn I see that the main PR is still not merged. When can this be done?

@Jiltseb

Jiltseb commented Aug 14, 2024

@trungkienbkhn Discussing here on the further changes.

There is a set of different opinions regarding the latest PR we added. The reason was the apparent slowdown of the CPU versions when only a limited number of cores is available, along with dissatisfaction over the additional dependencies that were introduced.

As you already know, @MahmoudAshraf97 has already removed several of them, including transformers and pyannote (via an in-progress PR, not referenced here). @MahmoudAshraf97 and I also implemented versions without torch, so it finally has the same dependencies as the original FW. The slowness can easily be addressed by setting cpu_threads back to 0 as the default.

So a set of follow-up PRs should stabilize the big PR that got merged.
Can we discuss the feasibility here?
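As a sketch of what the cpu_threads=0 convention means (a hypothetical helper, not faster-whisper code): 0 defers the choice to the runtime rather than pinning an explicit thread count, which is why it avoids the slowdown on machines with few cores.

```python
import os

def resolve_cpu_threads(cpu_threads: int = 0) -> int:
    # Hypothetical illustration of the cpu_threads=0 convention:
    # 0 means "let the runtime decide"; any positive value pins
    # the thread count explicitly.
    if cpu_threads > 0:
        return cpu_threads
    return os.cpu_count() or 1
```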
