
if output is tuple like facebook/hf-seamless-m4t-medium, waveform is the first element #29722

Merged: 3 commits merged into huggingface:main on Apr 5, 2024

Conversation

@sywangyi (Contributor)

…the first element

@sywangyi (Contributor, Author)

Test code:

import torch
import scipy.io.wavfile  # import the submodule explicitly; a bare `import scipy` does not expose scipy.io.wavfile
from transformers import pipeline, set_seed


def main():
    set_seed(555)

    generator = pipeline(
        "text-to-speech",
        model="facebook/hf-seamless-m4t-medium",
        torch_dtype=torch.float32,
        device="cuda",
    )

    speech = generator("Hello, my dog is cute", forward_params={"tgt_lang": "rus"})
    print(f"speech = {speech}")
    # write the first waveform of the batch to a wav file
    scipy.io.wavfile.write("techno2.wav", rate=speech["sampling_rate"], data=speech["audio"][0])


if __name__ == "__main__":
    main()

@sywangyi (Contributor, Author)

Traceback (most recent call last):
  File "/mnt/disk1/wangyi/transformers/test.py", line 23, in <module>
    main()
  File "/mnt/disk1/wangyi/transformers/test.py", line 17, in main
    speech = generator("Hello, my dog is cute", forward_params={"tgt_lang": "rus"})
  File "/mnt/disk1/wangyi/transformers/src/transformers/pipelines/text_to_audio.py", line 182, in __call__
    return super().__call__(text_inputs, **forward_params)
  File "/mnt/disk1/wangyi/transformers/src/transformers/pipelines/base.py", line 1206, in __call__
    return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
  File "/mnt/disk1/wangyi/transformers/src/transformers/pipelines/base.py", line 1214, in run_single
    outputs = self.postprocess(model_outputs, **postprocess_params)
  File "/mnt/disk1/wangyi/transformers/src/transformers/pipelines/text_to_audio.py", line 204, in postprocess
    output_dict["audio"] = waveform.cpu().float().numpy()
AttributeError: 'tuple' object has no attribute 'cpu'

@sywangyi (Contributor, Author)

@amyeroberts, could you please review?

@amyeroberts (Collaborator)

Hmmm, this is odd: it indicates either that the model produces unexpected outputs, that it shouldn't be used with this pipeline, or that it wasn't properly tested with the pipeline. @ylacombe could you please take a look into this?

Commit: …the first element
Signed-off-by: Wang, Yi <yi.a.wang@intel.com>

@sywangyi (Contributor, Author)

Could anyone help review the PR?

@ArthurZucker (Collaborator) left a comment

Thanks for the fix. Could you add a test?
What is weird is that this model has a vocoder ((hifi_gan): SeamlessM4THifiGan) but the pipeline does not. So it's not properly integrated (or incompatible with SpeechT5Vocoder)? You are right, the output of `SeamlessM4TForTextToSpeech` is a waveform and its length, so the fix should be enough.
We need a test to make sure the waveform generated is correct.

@ArthurZucker requested a review from ylacombe on March 30, 2024, 18:14
@ylacombe (Contributor) left a comment

Hey @sywangyi, thanks for spotting this! Could you also add a test to test_pipeline_text_to_audio.py so that we make sure it works?

> What is weird is that this model has a vocoder ((hifi_gan): SeamlessM4THifiGan) but the pipeline does not. So it's not properly integrated (or incompatible with SpeechT5Vocoder)?

This would be the case if it was in MODEL_FOR_TEXT_TO_SPECTROGRAM_MAPPING_NAMES here!
Since the vocoder here is specific to M4T, I made the choice to add the full model (+ vocoder) to MODEL_FOR_TEXT_TO_WAVEFORM_MAPPING_NAMES

@sywangyi (Contributor, Author) commented Apr 1, 2024

Test added; I also found some issues when testing with batch_size=2 @ylacombe.

Signed-off-by: Wang, Yi <yi.a.wang@intel.com>

src/transformers/pipelines/text_to_audio.py
@@ -200,7 +200,8 @@ def _sanitize_parameters(

    def postprocess(self, waveform):
        output_dict = {}

        if isinstance(waveform, tuple):

Collaborator:

It would probably make more sense to use return_dict=True so that we rely on naming rather than indexing; the current fix assumes the first element is always a waveform.
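
(For illustration, a minimal sketch of what relying on naming instead of indexing could look like on the caller side, assuming the M4T generate call is made to return its named output class via return_intermediate_token_ids; this is a hypothetical sketch, not the merged change:)

# Hypothetical sketch: look the waveform up by field name instead of tuple position.
outputs = model.generate(**inputs, tgt_lang="rus", return_intermediate_token_ids=True)
waveform = outputs.waveform                  # named access, robust to extra fields
waveform_lengths = outputs.waveform_lengths  # lengths are also available by name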

@sywangyi (Contributor, Author)

return_dict is not used in the seamless_m4t case, see https://github.com/huggingface/transformers/blob/main/src/transformers/models/seamless_m4t/modeling_seamless_m4t.py#L3520. Also, if the output class is returned, postprocess in text_to_audio.py does not handle that class input.

@ylacombe (Contributor)

This is a shortcoming of the modeling code of M4T (and M4Tv2) that doesn't use return_dict=True:

if return_intermediate_token_ids:
    return SeamlessM4TGenerationOutput(
        waveform=waveform,
        waveform_lengths=waveform_lengths,
        sequences=sequences,
        unit_sequences=output_unit_ids,
    )
return waveform, waveform_lengths

Would you like to add return_dict to the M4T modeling code, @sywangyi? It would look something like this where relevant:

if not return_dict:
    return (sequence_output, pooled_output) + encoder_outputs[1:]
return BaseModelOutputWithPoolingAndCrossAttentions(
    last_hidden_state=sequence_output,
    pooler_output=pooled_output,
    past_key_values=encoder_outputs.past_key_values,
    hidden_states=encoder_outputs.hidden_states,
    attentions=encoder_outputs.attentions,
    cross_attentions=encoder_outputs.cross_attentions,
)

If not, I can take care of this, but the current PR relies on this fix to have batch_size > 2 working
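
(For illustration only, a rough sketch of how the tail of the M4T generation code could honor such a flag, following the pattern above; this is hypothetical, not the actual modeling code:)

# Hypothetical sketch: return the named output class whenever the caller asks
# for it, so downstream code can use field names instead of tuple positions.
if return_intermediate_token_ids or return_dict:
    return SeamlessM4TGenerationOutput(
        waveform=waveform,
        waveform_lengths=waveform_lengths,
        sequences=sequences if return_intermediate_token_ids else None,
        unit_sequences=output_unit_ids if return_intermediate_token_ids else None,
    )
# Legacy behavior: positional tuple.
return waveform, waveform_lengths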

@sywangyi (Contributor, Author) commented Apr 2, 2024

Hi @ylacombe, actually return_dict_in_generate should be used, because this is in the generate() function. SpeechT5 has the same issue; see the return of SpeechT5, where either a tuple or a tensor will be returned: https://github.com/huggingface/transformers/blob/main/src/transformers/models/speecht5/modeling_speecht5.py#L2561-L2593

@sywangyi (Contributor, Author)

So I think the pipeline should support all output formats, i.e. dict, tuple, and tensor. WDYT?
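
(For illustration, a minimal sketch of a postprocess that tolerates all three output shapes; the "waveform" field name and self.sampling_rate are taken from the existing pipeline and model output conventions, and this is a sketch rather than the merged code:)

def postprocess(self, waveform):
    output_dict = {}
    # Depending on the model and its generate() flags, the output can be a
    # ModelOutput/dict, a (waveform, lengths) tuple, or a bare tensor.
    if isinstance(waveform, dict):
        waveform = waveform["waveform"]
    elif isinstance(waveform, tuple):
        waveform = waveform[0]
    # From here on we expect a torch.Tensor of shape (batch, num_samples).
    output_dict["audio"] = waveform.cpu().float().numpy()
    output_dict["sampling_rate"] = self.sampling_rate
    return output_dict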

@ylacombe (Contributor) left a comment

Thanks for iterating @sywangyi, and for spotting the return_dict shortcoming. Let me know if you want to take care of it; otherwise I can do it today so that this PR can be merged!

@sywangyi (Contributor, Author) commented Apr 2, 2024

I also added "return_intermediate_token_ids" to the test case and fixed the issue in postprocess.
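
(For illustration, a rough sketch of what such a pipeline test could look like; the test name, target language, checkpoint size, and asserts here are assumptions and may differ from the merged test:)

import numpy as np
from transformers import pipeline


def test_text_to_speech_seamless_m4t_tuple_output():
    # Hypothetical test sketch covering the tuple/dataclass output paths.
    speech_generator = pipeline("text-to-speech", model="facebook/hf-seamless-m4t-medium")

    # Single input; ask generate() for the named output class as well.
    out = speech_generator(
        "Hello, my dog is cute",
        forward_params={"tgt_lang": "eng", "return_intermediate_token_ids": True},
    )
    assert isinstance(out["audio"], np.ndarray)
    assert isinstance(out["sampling_rate"], int)

    # Batched input to exercise the batch_size=2 path that was also fixed.
    outs = speech_generator(
        ["Hello, my dog is cute", "And so is my cat"],
        forward_params={"tgt_lang": "eng"},
        batch_size=2,
    )
    assert len(outs) == 2
    assert all(isinstance(o["audio"], np.ndarray) for o in outs)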

Signed-off-by: Wang, Yi <yi.a.wang@intel.com>

@ArthurZucker (Collaborator) left a comment

Good catch, and thanks for adding a small test! 🤗

Comment on lines +203 to +206:

if isinstance(waveform, dict):
    waveform = waveform["waveform"]
elif isinstance(waveform, tuple):
    waveform = waveform[0]

@ArthurZucker (Collaborator)

Good, thanks!

@ArthurZucker merged commit 79d62b2 into huggingface:main on Apr 5, 2024. 20 checks passed.