Finish short form / long form generation integration in Whisper #32263

Open
kamilakesbi opened this issue Jul 27, 2024 · 0 comments
Labels
Feature request Request for a new feature

Comments

@kamilakesbi
Contributor

kamilakesbi commented Jul 27, 2024

Feature request

Unify the output format of the Whisper generate method across short-form and long-form generation.

Motivation

In PR #30984, short-form and long-form generation in Whisper were unified so that both benefit from generation with fallback.

However, the output format of the generate method still varies depending on whether we're doing short-form or long-form generation, as we can see in this line.

  • For short-form generation, the output is either a torch tensor containing the sequence of token ids or, if return_dict_in_generate is set to True, an instance of ModelOutput with additional information (attention masks, hidden states, ...). return_segments can now also be used with short-form generation.

  • For long-form generation, the output is either a torch tensor with the sequence of token ids or, if return_segments is set to True, a dict containing the sequences of token ids and a list of all segments. If both return_dict_in_generate and return_segments are set to True, the additional information (attention masks, hidden states) is contained in segments. However, at the moment long-form generation cannot return an instance of ModelOutput.
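
To make the difference concrete, here is a minimal sketch of what callers currently have to do on their side to handle both cases. The `UnifiedWhisperOutput` container and the `normalize_generate_output` helper are hypothetical names used for illustration only, not part of transformers. Plain lists stand in for torch tensors so the snippet runs without any model, and it assumes that ModelOutput supports mapping-style access (it subclasses OrderedDict in current transformers releases).

```python
from collections.abc import Mapping
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class UnifiedWhisperOutput:
    """Hypothetical container for illustration; not a transformers class."""

    sequences: Any                      # token ids (a torch tensor in practice)
    segments: Optional[list] = None     # per-sample segment lists, if requested
    extras: dict = field(default_factory=dict)  # e.g. attentions, hidden states


def normalize_generate_output(output: Any) -> UnifiedWhisperOutput:
    """Map the output shapes described above onto a single structure."""
    # Dict-like outputs: the long-form dict with "sequences"/"segments",
    # or a ModelOutput-style mapping returned by short-form generation.
    if isinstance(output, Mapping) and "sequences" in output:
        extras = {
            key: value
            for key, value in output.items()
            if key not in ("sequences", "segments")
        }
        return UnifiedWhisperOutput(
            sequences=output["sequences"],
            segments=output.get("segments"),
            extras=extras,
        )
    # Otherwise assume a bare tensor of token ids.
    return UnifiedWhisperOutput(sequences=output)


# Stand-in values (assumptions): nested lists instead of torch tensors.
bare_ids = [[50258, 50259, 50359]]
long_form = {
    "sequences": bare_ids,
    "segments": [[{"start": 0.0, "end": 2.5, "tokens": [50259]}]],
}

from_tensor = normalize_generate_output(bare_ids)
from_dict = normalize_generate_output(long_form)
```

With a unified output format in generate itself, this kind of caller-side branching would no longer be needed.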

Should we work on this?

Ideally, we should also unify the output format of the Whisper generate method so that users don't have to distinguish between short-form and long-form audio. They should only have to specify whether they want to perform sequential (non-chunked) or parallel (chunked) generation with the pipeline.

The aim of PR #30984 was to implement all the modifications needed to allow generation with fallback for short-form audio without breaking backward compatibility on main. If we further unify the output format, we would break backward compatibility and have to adapt several tests.

cc @sanchit-gandhi @ArthurZucker Do you think we should complete the unification of Whisper Generation by unifying the output format?

kamilakesbi added the Feature request label on Jul 27, 2024