Feature request
Unify the output format of the Whisper generate method for short-form and long-form generation.
Motivation
In PR #30984, short-form and long-form generation in Whisper were unified so that both benefit from generation with fallback.
However, the output format of the generate method still varies depending on whether we're doing short-form or long-form generation, as we can see in this line.
For short-form generation, the output is either a torch tensor containing the sequence of token ids or, if return_dict_in_generate is set to True, an instance of ModelOutput with additional information (attention masks, hidden states, ...). return_segments can now also be used with short-form generation.
For long-form generation, the output is either a torch tensor with the sequence of token ids or, if return_segments is set to True, a dict containing the sequences of token ids and a list of all segments. Note that if both return_dict_in_generate and return_segments are set to True, the additional information (attention masks, hidden states) is contained in segments. However, at the moment we can't get an instance of ModelOutput as output with long-form generation.
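To illustrate the burden this puts on callers, here is a minimal sketch of the kind of branching needed today to pull the token ids out of whatever generate returns. The helper name extract_sequences is hypothetical and not part of transformers; it only assumes the three output shapes described above (a bare tensor, a dict-like object with a "sequences" key, or an object exposing a sequences attribute).

```python
from collections.abc import Mapping
from typing import Any


def extract_sequences(output: Any) -> Any:
    """Return the token-id sequences from any of the output shapes described above.

    Hypothetical helper for illustration only; it is not part of transformers.
    """
    # Dict-like outputs, e.g. the long-form result with return_segments=True,
    # which holds "sequences" and "segments".
    if isinstance(output, Mapping):
        return output["sequences"]
    # Objects that expose the sequences as an attribute.
    if hasattr(output, "sequences"):
        return output.sequences
    # Otherwise assume the output already is the tensor of token ids.
    return output
```

With a unified output format, callers would not need this kind of type dispatch at all.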
Should we work on this?
Ideally, we should also unify the output format of the Whisper generate method so that users don't have to distinguish between short-form and long-form audio. They should only have to specify whether they want to perform sequential generation (non-chunked) or parallel generation (chunked) with the pipeline.
The aim of PR #30984 was to implement all the modifications needed to allow generation with fallback for short-form audio without breaking backward compatibility on main. If we further unify the output format, we would break backward compatibility and have to adapt several tests.
cc @sanchit-gandhi @ArthurZucker Do you think we should complete the unification of Whisper generation by unifying the output format?