
Add option to carry initial_prompt with the sliding window #2343

Open

kittsil wants to merge 4 commits into main
Conversation

@kittsil commented Sep 18, 2024

Background
Whisper's `transcribe()` struggles with contextual proper nouns when they appear after the initial prompt has been consumed; see some experimental results here. This PR addresses that issue by allowing the initial "context" prompt to be carried along as the sliding window moves through the audio.

Changes
Add an option `carry_initial_prompt = False` to `whisper.transcribe()`.

When `carry_initial_prompt` is set to `True`, `initial_prompt` is prepended to each internal `decode()` call's prompt. If there is not enough context space at the start of the prompt, the prompt is left-sliced to make space.
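A minimal sketch of that prompt construction, assuming illustrative names (`initial_prompt_tokens`, `previous_tokens`, and `max_prompt_len` are stand-ins, not the PR's actual identifiers):

```python
# Sketch of the carried-prompt logic described above. `max_prompt_len`
# is the model's prompt budget (n_text_ctx // 2 - 1; see the discussion below).

def build_prompt(
    initial_prompt_tokens: list[int],
    previous_tokens: list[int],
    max_prompt_len: int,
) -> list[int]:
    if not initial_prompt_tokens:
        # Stock behavior: keep only the most recent rolling context.
        return previous_tokens[-max_prompt_len:]
    # Reserve space for the carried initial prompt, then left-slice the
    # rolling context so the combined prompt fits the budget.
    remaining = max_prompt_len - len(initial_prompt_tokens)
    carried = previous_tokens[-remaining:] if remaining > 0 else []
    return initial_prompt_tokens[:max_prompt_len] + carried
```

Left-slicing keeps the most recent rolling context, which matches what the stock sliding window already favors.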

@kittsil (Author) commented Sep 18, 2024

There are outstanding issues with this PR:

  1. I have not found the definition of the 224 context token length.
  2. It prepends the `initial_prompt` to itself before enough tokens have been generated, resulting in a predilection toward looping.
  3. I have not written tests.

Closing this PR since I can't find a way to move it to draft.

@kittsil closed this Sep 18, 2024
@ryanheise (Contributor) commented
Also a relevant discussion here: #1040 (comment)

> I have not found the definition of the 224 context token length.

It's part of the model dimensions itself: 448 tokens total, with half of that for the prompt. The logic is in decoding.py; look for `self.n_ctx: int = model.dims.n_text_ctx` and the references to it.
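For illustration, a condensed paraphrase of that slice (not a verbatim copy of decoding.py; the names below are stand-ins): the user prompt is clipped to `n_text_ctx // 2 - 1` tokens, and the `- 1` appears to leave room for the `<|startofprev|>` token that precedes it.

```python
# Condensed paraphrase of the prompt clipping in whisper/decoding.py
# (illustrative, not a verbatim copy of the source).

N_TEXT_CTX = 448                      # model.dims.n_text_ctx for released models
MAX_PROMPT_LEN = N_TEXT_CTX // 2 - 1  # 223

def clip_prompt(
    prompt_tokens: list[int], sot_prev: int, sot_sequence: list[int]
) -> list[int]:
    # Keep only the most recent MAX_PROMPT_LEN prompt tokens, prefixed by
    # the <|startofprev|> token and followed by the start-of-transcript tokens.
    return [sot_prev] + prompt_tokens[-MAX_PROMPT_LEN:] + sot_sequence
```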

Kittsil and others added 3 commits September 18, 2024 22:58
Add an option `carry_initial_prompt = False` to `whisper.transcribe()`.
When set to `True`, `initial_prompt` is prepended to each internal `decode()` call's `prompt`.
If there is not enough context space at the start of the prompt, the prompt is left-sliced to make space.
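For completeness, a hypothetical invocation of the proposed option (the parameter name comes from this PR; the file name and prompt string are made up):

```python
import whisper

model = whisper.load_model("medium.en")
# carry_initial_prompt is the option proposed in this PR: when enabled,
# initial_prompt is re-prepended to every window's decode() prompt.
result = model.transcribe(
    "meeting.wav",
    initial_prompt="Speakers: Kittsil, Ryan Heise.",
    carry_initial_prompt=True,
)
print(result["text"])
```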
@kittsil reopened this Sep 19, 2024
@kittsil (Author) commented Sep 19, 2024

@ryanheise Thank you for your input; it was helpful. Do you mind providing any additional feedback?


Aside: I did find the left-slice in the code, and it turns out the docs are wrong; the maximum prompt length is actually 223!

Confirming with the medium.en model:

```python
>>> medium = torch.load('/home/kittsil/.cache/whisper/medium.en.pt')
>>> medium['dims']
{'n_mels': 80, 'n_vocab': 51864, 'n_audio_ctx': 1500, 'n_audio_state': 1024, 'n_audio_head': 16, 'n_audio_layer': 24, 'n_text_ctx': 448, 'n_text_state': 1024, 'n_text_head': 16, 'n_text_layer': 24}
>>> medium['dims']['n_text_ctx'] // 2 - 1
223
```
