
Add option to carry initial_prompt with the sliding window #2343

Open

kittsil wants to merge 4 commits into main
Conversation

@kittsil commented Sep 18, 2024

Background
Whisper's `transcribe()` struggles with contextual proper nouns when they appear after the initial prompt has been consumed; see some experimental results here. This PR addresses that issue by allowing the initial "context" prompt to be carried along as the sliding window moves through the audio.

Changes
Add an option `carry_initial_prompt = False` to `whisper.transcribe()`.

When `carry_initial_prompt` is set to `True`, `initial_prompt` is prepended to each internal `decode()` call's prompt. If there is not enough context space at the start of the prompt, the prompt is left-sliced to make space.
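A minimal sketch of that prompt construction, assuming illustrative names (`initial_prompt_tokens`, `previous_tokens`, and `max_prompt_len` are stand-ins, not the PR's actual identifiers):

```python
# Sketch of the carried-prompt logic described above. `max_prompt_len`
# is the model's prompt budget (n_text_ctx // 2 - 1; see the discussion below).

def build_prompt(
    initial_prompt_tokens: list[int],
    previous_tokens: list[int],
    max_prompt_len: int,
) -> list[int]:
    if not initial_prompt_tokens:
        # Stock behavior: keep only the most recent rolling context.
        return previous_tokens[-max_prompt_len:]
    # Reserve space for the carried initial prompt, then left-slice the
    # rolling context so the combined prompt fits the budget.
    remaining = max_prompt_len - len(initial_prompt_tokens)
    carried = previous_tokens[-remaining:] if remaining > 0 else []
    return initial_prompt_tokens[:max_prompt_len] + carried
```

Left-slicing keeps the most recent rolling context, which matches what the stock sliding window already favors.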

@kittsil (Author) commented Sep 18, 2024

There are outstanding issues with this PR:

  1. I have not found the definition of the 224 context token length.
  2. It prepends the `initial_prompt` to itself before enough tokens have been generated, resulting in a predilection toward looping.
  3. I have not written tests.

Closing this PR since I can't find a way to move it to draft.

@kittsil closed this Sep 18, 2024
@ryanheise (Contributor) commented
Also a relevant discussion here: #1040 (comment)

> I have not found the definition of the 224 context token length.

It's part of the model dimensions itself: 448 tokens total, with half of that for the prompt. The logic is in decoding.py; look for `self.n_ctx: int = model.dims.n_text_ctx` and the references to it.
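For illustration, a condensed paraphrase of that slice (not a verbatim copy of decoding.py; the names below are stand-ins): the user prompt is clipped to `n_text_ctx // 2 - 1` tokens, and the `- 1` appears to leave room for the `<|startofprev|>` token that precedes it.

```python
# Condensed paraphrase of the prompt clipping in whisper/decoding.py
# (illustrative, not a verbatim copy of the source).

N_TEXT_CTX = 448                      # model.dims.n_text_ctx for released models
MAX_PROMPT_LEN = N_TEXT_CTX // 2 - 1  # 223

def clip_prompt(
    prompt_tokens: list[int], sot_prev: int, sot_sequence: list[int]
) -> list[int]:
    # Keep only the most recent MAX_PROMPT_LEN prompt tokens, prefixed by
    # the <|startofprev|> token and followed by the start-of-transcript tokens.
    return [sot_prev] + prompt_tokens[-MAX_PROMPT_LEN:] + sot_sequence
```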

Kittsil and others added 3 commits September 18, 2024 22:58
Add an option `carry_initial_prompt = False` to `whisper.transcribe()`.
When set to `True`, `initial_prompt` is prepended to each internal `decode()` call's `prompt`.
If there is not enough context space at the start of the prompt, the prompt is left-sliced to make space.
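For completeness, a hypothetical invocation of the proposed option (the parameter name comes from this PR; the file name and prompt string are made up):

```python
import whisper

model = whisper.load_model("medium.en")
# carry_initial_prompt is the option proposed in this PR: when enabled,
# initial_prompt is re-prepended to every window's decode() prompt.
result = model.transcribe(
    "meeting.wav",
    initial_prompt="Speakers: Kittsil, Ryan Heise.",
    carry_initial_prompt=True,
)
print(result["text"])
```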
@kittsil reopened this Sep 19, 2024
@kittsil (Author) commented Sep 19, 2024

@ryanheise Thank you for your input; it was helpful. Do you mind providing any additional feedback?


Aside: I did find the left-slice in the code, and it turns out the docs are wrong; the maximum prompt length is actually 223!

Confirming with the medium.en model:

```python
>>> medium = torch.load('/home/kittsil/.cache/whisper/medium.en.pt')
>>> medium['dims']
{'n_mels': 80, 'n_vocab': 51864, 'n_audio_ctx': 1500, 'n_audio_state': 1024, 'n_audio_head': 16, 'n_audio_layer': 24, 'n_text_ctx': 448, 'n_text_state': 1024, 'n_text_head': 16, 'n_text_layer': 24}
>>> medium['dims']['n_text_ctx'] // 2 - 1
223
```
