
[Distil-Whisper] Add support for Distil-Whisper #1423

Open
patrickvonplaten opened this issue Nov 3, 2023 · 9 comments
Labels: high priority (Very important issue)

Comments

@patrickvonplaten

Hey,

We've recently released two Distil-Whisper checkpoints:

  • Large-v2-32-2, a 32-encoder-layer, 2-decoder-layer distilled large-v2 checkpoint
  • Medium-24-2.en, a 24-encoder-layer, 2-decoder-layer distilled medium.en checkpoint

On GPU, we achieve speed-ups of up to 6x compared to the teacher models, with relatively minimal degradation in performance.
More information here: https://twitter.com/sanchitgandhi99/status/1719409022246220184

Using your conversion scripts, we've already converted the checkpoints to the ggml format; see:

We'd love to collaborate on supporting these checkpoints in this repository, as we're really excited about the potential speed-ups that can be achieved with optimized C++ code.

It looks like some changes to whisper.cpp will be necessary (e.g. we should probably define a new model type here?)

@ggerganov would you be interested in adding Distil-Whisper?

@patrickvonplaten (Author)

Linking for visibility: #1414

@bobqianic added the high priority (Very important issue) label Nov 3, 2023
@ggerganov (Owner)

Hi @patrickvonplaten - congrats on the release!

I believe I have successfully added initial support for the distilled models in the following PR: #1424

However, I'm worried about optimal quality: AFAICT these models require an alternative decoding strategy with overlapping chunks for long-form transcription. That could take more time to implement, and I'm not yet sure how to fit it into the existing implementation.

Could you point me to the reference implementation?

I will give it some thought and see if I can come up with a solution in the coming days.
For the moment, #1424 should hopefully work as an initial version.

@patrickvonplaten (Author)

Hey @ggerganov,

The implementation we're using in Transformers actually uses overlapping chunks: we overlap each chunk by 2.5 seconds. Essentially we follow the strategy described here: https://huggingface.co/blog/asr-chunking, using a chunk length of 15 seconds and a chunk stride of 2.5 seconds (the default).

It's all implemented here: https://github.com/huggingface/transformers/blob/ac5d4cf6de24b4f7fa92996e92d1d71dd5411a6a/src/transformers/pipelines/automatic_speech_recognition.py#L135, and the inference code to use for debugging should be this one: https://github.com/huggingface/distil-whisper/tree/main#long-form-transcription

The other option is to just use OpenAI's codebase: https://github.com/openai/whisper with Distil-Whisper checkpoints converted into the original format: https://huggingface.co/distil-whisper/distil-large-v2/blob/main/original-model.fp32.bin

Does this help? I'm also working on adding OAI's long-form algorithm natively to Transformers for easier debugging, but this might take until next week.

@ggerganov (Owner)

Thanks for the links. Will probably look into chunking after I make the v1.5.0 release of whisper.cpp.

@rawwerks

rawwerks commented Dec 13, 2023

I would like to weigh in from the "end user peanut gallery": I believe a full implementation of chunking for Distil-Whisper would be a major inflection point for the widespread adoption of whisper.cpp. Qualitatively, the recent speed improvements helped products like MacWhisper reach the point where consumer hardware (M1) can transcribe short audio faster than you can upload/transcribe/download via a cloud service like Otter or Happyscribe. If we can get the extra 5-6x from Distil-Whisper, then even hours-long recordings of meetings, podcasts, etc. could be transcribed in minutes to tens of minutes on consumer hardware, with respectable (medium or large) accuracy.

Of course, everyone would rather transcribe locally for privacy and cost reasons. You have the power to make this practical: everyone will have their own private transcriptionist. We don't need another 10x to make this a UX inflection point; just another 5x will seriously change the game.

Thank you for the important work that you do!

@PoignardAzur

I haven't managed to run the conversion scripts myself (see #1711).

Is there any chance you could release additional versions using the GGUF format with the recent quantization options?

@ciekawy

ciekawy commented Jan 10, 2024

Any chance this could also support https://huggingface.co/Aspik101/distil-whisper-large-v3-pl?

@johnmccombs1

I'd love to see this as well. The distil models run so much faster, but unfortunately for anything longer than 10-20 seconds they start cutting out words/phrases. I tested a distil model against regular Whisper here: https://huggingface.co/spaces/distil-whisper/whisper-vs-distil-whisper with the same audio file, and there it works nearly flawlessly. But for some reason running it through whisper.cpp produces a large number of errors and words that are cut off or misspelled (I'm assuming because it's chunking oddly). Would love to see this fixed.

@hlevring

hlevring commented Apr 2, 2024

@patrickvonplaten, with the latest release of distil-large-v3, my understanding is that the distilled model is no longer exclusively tied to the chunked algorithm:
https://huggingface.co/distil-whisper/distil-large-v3
https://huggingface.co/distil-whisper/distil-large-v3-ggml

So maybe this ticket could be closed? I suppose it mainly remained open to address the chunking?
