if output is tuple like facebook/hf-seamless-m4t-medium, waveform is … #29722
Conversation
test code
@amyeroberts please have a review.
Hmmm, this is funny, as it indicates either unexpected outputs from the model, that it shouldn't be used with this pipeline, or that it wasn't properly tested for the pipeline. @ylacombe could you please take a look into this?
…the first element
Signed-off-by: Wang, Yi <yi.a.wang@intel.com>
Could anyone help review the PR?
Thanks for the fix. Could you add a test?

What is weird is that this model has a vocoder (`(hifi_gan): SeamlessM4THifiGan`) but the pipeline does not. So it's not properly integrated (or incompatible with `SpeechT5Vocoder`?)

You are right, the output of `SeamlessM4TForTextToSpeech` is a wave and length. The fix should be enough.

We need a test to make sure the waveform generated is correct.
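For context, a minimal sketch of what that tuple output looks like when calling the model directly (model checkpoint and language codes are taken from this discussion; the exact processor arguments are an assumption):

```python
import torch
from transformers import AutoProcessor, SeamlessM4TForTextToSpeech

processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
model = SeamlessM4TForTextToSpeech.from_pretrained("facebook/hf-seamless-m4t-medium")

inputs = processor(text="Hello world", src_lang="eng", return_tensors="pt")
with torch.no_grad():
    # Without return_intermediate_token_ids, generate() returns a plain tuple
    # (waveform, waveform_lengths), which is what trips up the pipeline's postprocess.
    waveform, waveform_lengths = model.generate(**inputs, tgt_lang="eng")
```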
Hey @sywangyi, thanks for spotting this! Could you also add a test to test_pipeline_text_to_audio.py so that we make sure it works?
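Something along these lines could work (a sketch only, not the test that was merged; the checkpoint, forward parameters, and assertions are assumptions):

```python
import numpy as np
from transformers import pipeline

# Hypothetical check in the spirit of the existing text-to-audio pipeline tests.
speech_generator = pipeline(task="text-to-audio", model="facebook/hf-seamless-m4t-medium")
outputs = speech_generator("This is a test", forward_params={"tgt_lang": "eng"})

# The model returns a (waveform, waveform_lengths) tuple; postprocess should still
# hand back the usual {"audio": ..., "sampling_rate": ...} dict.
assert isinstance(outputs["audio"], np.ndarray)
assert isinstance(outputs["sampling_rate"], int)
```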
> What is weird is that this model has a vocoder (`(hifi_gan): SeamlessM4THifiGan`) but the pipeline does not. So it's not properly integrated (or incompatible with `SpeechT5Vocoder`?)

This would be the case if it was in `MODEL_FOR_TEXT_TO_SPECTROGRAM_MAPPING_NAMES` here!
Since the vocoder here is specific to M4T, I made the choice to add the full model (+ vocoder) to `MODEL_FOR_TEXT_TO_WAVEFORM_MAPPING_NAMES`.
Test added, and I also found an issue when testing with batch_size=2. @ylacombe
Signed-off-by: Wang, Yi <yi.a.wang@intel.com>
@@ -200,7 +200,8 @@ def _sanitize_parameters(
     def postprocess(self, waveform):
         output_dict = {}

+        if isinstance(waveform, tuple):
It would probably make more sense to use `return_dict=True` so that we rely on naming rather than indexing? The current approach assumes the first element is always a waveform.
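To illustrate the point (a sketch reusing `model` and `inputs` from the example above; the field names follow `SeamlessM4TGenerationOutput` as quoted further down):

```python
# Relying on the name: robust even if the output fields are reordered or extended.
output = model.generate(**inputs, tgt_lang="eng", return_intermediate_token_ids=True)
waveform = output.waveform

# Relying on the position: only correct as long as the waveform stays at index 0.
output = model.generate(**inputs, tgt_lang="eng")
waveform = output[0]
```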
There is no `return_dict` for the seamless_m4t case, see https://github.com/huggingface/transformers/blob/main/src/transformers/models/seamless_m4t/modeling_seamless_m4t.py#L3520. Also, even if the output class is returned, postprocess in text_to_audio.py does not handle that class input.
This is a shortcoming of the modeling code of M4T (and M4Tv2) that doesn't use `return_dict=True`:

transformers/src/transformers/models/seamless_m4t/modeling_seamless_m4t.py
Lines 3520 to 3528 in 416711c

        if return_intermediate_token_ids:
            return SeamlessM4TGenerationOutput(
                waveform=waveform,
                waveform_lengths=waveform_lengths,
                sequences=sequences,
                unit_sequences=output_unit_ids,
            )

        return waveform, waveform_lengths
Would you like to add `return_dict` to the M4T modeling code @sywangyi? It would look something like this where relevant:
transformers/src/transformers/models/roc_bert/modeling_roc_bert.py
Lines 1053 to 1063 in 33288ff

        if not return_dict:
            return (sequence_output, pooled_output) + encoder_outputs[1:]

        return BaseModelOutputWithPoolingAndCrossAttentions(
            last_hidden_state=sequence_output,
            pooler_output=pooled_output,
            past_key_values=encoder_outputs.past_key_values,
            hidden_states=encoder_outputs.hidden_states,
            attentions=encoder_outputs.attentions,
            cross_attentions=encoder_outputs.cross_attentions,
        )
If not, I can take care of this, but the current PR relies on this fix to have batch_size > 2 working
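Roughly, the end of `SeamlessM4TForTextToSpeech.generate` could then look like this (a hypothetical fragment, not code from this PR; the `return_dict` flag name and defaults are assumptions, and the discussion below suggests `return_dict_in_generate` may be the right name instead):

```python
# Hypothetical adaptation of the RoCBert pattern to M4T's generate() output.
if return_intermediate_token_ids:
    return SeamlessM4TGenerationOutput(
        waveform=waveform,
        waveform_lengths=waveform_lengths,
        sequences=sequences,
        unit_sequences=output_unit_ids,
    )
if not return_dict:  # hypothetical flag, not in the current signature
    return waveform, waveform_lengths
return SeamlessM4TGenerationOutput(waveform=waveform, waveform_lengths=waveform_lengths)
```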
Hi @ylacombe, actually `return_dict_in_generate` should be used, because this happens in the `generate()` function. SpeechT5 has the same issue; see its return value, where either a tuple or a tensor is returned: https://github.com/huggingface/transformers/blob/main/src/transformers/models/speecht5/modeling_speecht5.py#L2561-L2593
So I think the pipeline should support all output formats (dict, tuple, tensor). WDYT?
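A sketch of what that could look like on the pipeline side (the helper name is made up for illustration; the change that was actually merged is shown further down):

```python
def _extract_waveform(output):
    # ModelOutput subclasses behave like dicts, so this branch also covers
    # SeamlessM4TGenerationOutput when return_intermediate_token_ids=True.
    if isinstance(output, dict):
        return output["waveform"]
    # e.g. SeamlessM4T's default (waveform, waveform_lengths) tuple.
    if isinstance(output, tuple):
        return output[0]
    # Plain tensor of audio samples.
    return output
```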
Thanks for iterating @sywangyi and for spotting the `return_dict` shortcoming. Let me know if you want to take care of it, otherwise I can do it today to allow this PR to be merged!
I also added `return_intermediate_token_ids` in the test case and fixed the issue in postprocess.
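Continuing the test sketch from earlier (same `speech_generator`; the exact parameters and assertion are assumptions):

```python
# return_intermediate_token_ids=True makes generate() return a
# SeamlessM4TGenerationOutput instead of a tuple, exercising the dict branch below.
outputs = speech_generator(
    "This is a test",
    forward_params={"tgt_lang": "eng", "return_intermediate_token_ids": True},
)
assert isinstance(outputs["audio"], np.ndarray)
```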
Signed-off-by: Wang, Yi <yi.a.wang@intel.com>
A good catch, and thanks for adding a small test! 🤗
        if isinstance(waveform, dict):
            waveform = waveform["waveform"]
        elif isinstance(waveform, tuple):
            waveform = waveform[0]
Good, thanks!
…the first element