
if output is tuple like facebook/hf-seamless-m4t-medium, waveform is the first element #29722

Merged: 3 commits merged into huggingface:main on Apr 5, 2024

Conversation

@sywangyi (Contributor)

…the first element

@sywangyi (Contributor, Author)

Test code:

import torch
import scipy.io.wavfile  # import the submodule explicitly; a bare `import scipy` does not expose scipy.io.wavfile
from transformers import pipeline, set_seed


def main():
    set_seed(555)

    generator = pipeline(
        "text-to-speech",
        model="facebook/hf-seamless-m4t-medium",
        torch_dtype=torch.float32,
        device="cuda",
    )

    speech = generator("Hello, my dog is cute", forward_params={"tgt_lang": "rus"})
    print(f"speech = {speech}")
    # write the first waveform of the batch to a wav file
    scipy.io.wavfile.write("techno2.wav", rate=speech["sampling_rate"], data=speech["audio"][0])


if __name__ == "__main__":
    main()

@sywangyi (Contributor, Author)

Traceback (most recent call last):
  File "/mnt/disk1/wangyi/transformers/test.py", line 23, in <module>
    main()
  File "/mnt/disk1/wangyi/transformers/test.py", line 17, in main
    speech = generator("Hello, my dog is cute", forward_params={"tgt_lang": "rus"})
  File "/mnt/disk1/wangyi/transformers/src/transformers/pipelines/text_to_audio.py", line 182, in __call__
    return super().__call__(text_inputs, **forward_params)
  File "/mnt/disk1/wangyi/transformers/src/transformers/pipelines/base.py", line 1206, in __call__
    return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
  File "/mnt/disk1/wangyi/transformers/src/transformers/pipelines/base.py", line 1214, in run_single
    outputs = self.postprocess(model_outputs, **postprocess_params)
  File "/mnt/disk1/wangyi/transformers/src/transformers/pipelines/text_to_audio.py", line 204, in postprocess
    output_dict["audio"] = waveform.cpu().float().numpy()
AttributeError: 'tuple' object has no attribute 'cpu'

@sywangyi (Contributor, Author)

@amyeroberts, could you please review?

@amyeroberts (Collaborator)

Hmmm, this is odd: it indicates either that the model produces unexpected outputs, that it shouldn't be used with this pipeline, or that it wasn't properly tested with the pipeline. @ylacombe could you please take a look into this?

Commit: …the first element
Signed-off-by: Wang, Yi <yi.a.wang@intel.com>

@sywangyi (Contributor, Author)

Could anyone help review the PR?

@ArthurZucker (Collaborator) left a comment

Thanks for the fix. Could you add a test?
What is weird is that this model has a vocoder ((hifi_gan): SeamlessM4THifiGan) but the pipeline does not. So it's not properly integrated (or incompatible with SpeechT5Vocoder)? You are right, the output of `SeamlessM4TForTextToSpeech` is a waveform and its length, so the fix should be enough.
We need a test to make sure the waveform generated is correct.

@ArthurZucker requested a review from ylacombe on March 30, 2024, 18:14
@ylacombe (Contributor) left a comment

Hey @sywangyi, thanks for spotting this! Could you also add a test to test_pipeline_text_to_audio.py so that we make sure it works?

> What is weird is that this model has a vocoder ((hifi_gan): SeamlessM4THifiGan) but the pipeline does not. So it's not properly integrated (or incompatible with SpeechT5Vocoder)?

This would be the case if it was in MODEL_FOR_TEXT_TO_SPECTROGRAM_MAPPING_NAMES here!
Since the vocoder here is specific to M4T, I made the choice to add the full model (+ vocoder) to MODEL_FOR_TEXT_TO_WAVEFORM_MAPPING_NAMES

@sywangyi (Contributor, Author) commented Apr 1, 2024

Test added; I also found some issues when testing with batch_size=2 @ylacombe.

Signed-off-by: Wang, Yi <yi.a.wang@intel.com>

src/transformers/pipelines/text_to_audio.py
@@ -200,7 +200,8 @@ def _sanitize_parameters(

    def postprocess(self, waveform):
        output_dict = {}

        if isinstance(waveform, tuple):

Collaborator:

It would probably make more sense to use return_dict=True so that we rely on naming rather than indexing; the current fix assumes the first element is always a waveform.
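
(For illustration, a minimal sketch of what relying on naming instead of indexing could look like on the caller side, assuming the M4T generate call is made to return its named output class via return_intermediate_token_ids; this is a hypothetical sketch, not the merged change:)

# Hypothetical sketch: look the waveform up by field name instead of tuple position.
outputs = model.generate(**inputs, tgt_lang="rus", return_intermediate_token_ids=True)
waveform = outputs.waveform                  # named access, robust to extra fields
waveform_lengths = outputs.waveform_lengths  # lengths are also available by name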

@sywangyi (Contributor, Author)

return_dict is not used in the seamless_m4t case, see https://github.com/huggingface/transformers/blob/main/src/transformers/models/seamless_m4t/modeling_seamless_m4t.py#L3520. Also, if the output class is returned, postprocess in text_to_audio.py does not handle that class input.

@ylacombe (Contributor)

This is a shortcoming of the modeling code of M4T (and M4Tv2) that doesn't use return_dict=True:

if return_intermediate_token_ids:
    return SeamlessM4TGenerationOutput(
        waveform=waveform,
        waveform_lengths=waveform_lengths,
        sequences=sequences,
        unit_sequences=output_unit_ids,
    )
return waveform, waveform_lengths

Would you like to add return_dict to the M4T modeling code, @sywangyi? It would look something like this where relevant:

if not return_dict:
    return (sequence_output, pooled_output) + encoder_outputs[1:]
return BaseModelOutputWithPoolingAndCrossAttentions(
    last_hidden_state=sequence_output,
    pooler_output=pooled_output,
    past_key_values=encoder_outputs.past_key_values,
    hidden_states=encoder_outputs.hidden_states,
    attentions=encoder_outputs.attentions,
    cross_attentions=encoder_outputs.cross_attentions,
)

If not, I can take care of this, but the current PR relies on this fix to have batch_size > 2 working
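
(For illustration only, a rough sketch of how the tail of the M4T generation code could honor such a flag, following the pattern above; this is hypothetical, not the actual modeling code:)

# Hypothetical sketch: return the named output class whenever the caller asks
# for it, so downstream code can use field names instead of tuple positions.
if return_intermediate_token_ids or return_dict:
    return SeamlessM4TGenerationOutput(
        waveform=waveform,
        waveform_lengths=waveform_lengths,
        sequences=sequences if return_intermediate_token_ids else None,
        unit_sequences=output_unit_ids if return_intermediate_token_ids else None,
    )
# Legacy behavior: positional tuple.
return waveform, waveform_lengths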

@sywangyi (Contributor, Author) commented Apr 2, 2024

Hi @ylacombe, actually return_dict_in_generate should be used, because this is in the generate() function. SpeechT5 has the same issue; see the return of SpeechT5, where either a tuple or a tensor will be returned: https://github.com/huggingface/transformers/blob/main/src/transformers/models/speecht5/modeling_speecht5.py#L2561-L2593

@sywangyi (Contributor, Author)

So I think the pipeline should support all output formats, i.e. dict, tuple, and tensor. WDYT?
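
(For illustration, a minimal sketch of a postprocess that tolerates all three output shapes; the "waveform" field name and self.sampling_rate are taken from the existing pipeline and model output conventions, and this is a sketch rather than the merged code:)

def postprocess(self, waveform):
    output_dict = {}
    # Depending on the model and its generate() flags, the output can be a
    # ModelOutput/dict, a (waveform, lengths) tuple, or a bare tensor.
    if isinstance(waveform, dict):
        waveform = waveform["waveform"]
    elif isinstance(waveform, tuple):
        waveform = waveform[0]
    # From here on we expect a torch.Tensor of shape (batch, num_samples).
    output_dict["audio"] = waveform.cpu().float().numpy()
    output_dict["sampling_rate"] = self.sampling_rate
    return output_dict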

@ylacombe (Contributor) left a comment

Thanks for iterating @sywangyi, and for spotting the return_dict shortcoming. Let me know if you want to take care of it; otherwise I can do it today so that this PR can be merged!

@sywangyi (Contributor, Author) commented Apr 2, 2024

I also added "return_intermediate_token_ids" to the test case and fixed the issue in postprocess.
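
(For illustration, a rough sketch of what such a pipeline test could look like; the test name, target language, checkpoint size, and asserts here are assumptions and may differ from the merged test:)

import numpy as np
from transformers import pipeline


def test_text_to_speech_seamless_m4t_tuple_output():
    # Hypothetical test sketch covering the tuple/dataclass output paths.
    speech_generator = pipeline("text-to-speech", model="facebook/hf-seamless-m4t-medium")

    # Single input; ask generate() for the named output class as well.
    out = speech_generator(
        "Hello, my dog is cute",
        forward_params={"tgt_lang": "eng", "return_intermediate_token_ids": True},
    )
    assert isinstance(out["audio"], np.ndarray)
    assert isinstance(out["sampling_rate"], int)

    # Batched input to exercise the batch_size=2 path that was also fixed.
    outs = speech_generator(
        ["Hello, my dog is cute", "And so is my cat"],
        forward_params={"tgt_lang": "eng"},
        batch_size=2,
    )
    assert len(outs) == 2
    assert all(isinstance(o["audio"], np.ndarray) for o in outs)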

Signed-off-by: Wang, Yi <yi.a.wang@intel.com>

@ArthurZucker (Collaborator) left a comment

Good catch, and thanks for adding a small test! 🤗

Comment on lines +203 to +206:

if isinstance(waveform, dict):
    waveform = waveform["waveform"]
elif isinstance(waveform, tuple):
    waveform = waveform[0]

@ArthurZucker (Collaborator)

Good, thanks!

@ArthurZucker merged commit 79d62b2 into huggingface:main on Apr 5, 2024. 20 checks passed.