
Add FastSpeech2Conformer #23439

Merged: 56 commits into huggingface:main from add-FastSpeech2Conformer on Jan 3, 2024

Conversation

@connor-henderson (Contributor) commented May 17, 2023

What does this PR do?

Adds the TTS (text-to-speech) conformer version of the FastSpeech2 model. The closest related issue is #15166, though this implements ESPnet's conformer version rather than Fairseq's version, as suggested in #15773 (comment).

FastSpeech2 paper (Microsoft)
Conformer version paper (ESPnet)

Conformer version code implementation: https://github.com/espnet/espnet/tree/master/espnet2/tts/fastspeech2
Additional conformer version code implementation: https://github.com/DigitalPhonetics/IMS-Toucan/blob/ToucanTTS/TrainingInterfaces/Text_to_Spectrogram/FastSpeech2

The paper abstracts say most of this, but the main points of what makes this model an interesting addition are:

  • It's non-autoregressive, leading to faster inference since it doesn't have to make predictions sequentially (hence the name FastSpeech)
  • It uses a variance predictor between the encoder and decoder to explicitly predict duration, pitch, and energy, leading to more accurate results
  • Conformer architectures have been shown to improve performance on text-to-speech tasks, with the convolutions learning close-range speech patterns and the transformer attention helping to capture longer-range context (see the usage sketch after this list)
  • There is currently only one other text-to-speech model in transformers (SpeechT5ForTextToSpeech)
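
To make the text → spectrogram → waveform flow above concrete, here is a rough inference sketch. The class and checkpoint names are assumptions based on the naming used in this PR, not necessarily the final documented API:

```python
# Rough inference sketch -- class and checkpoint names are assumptions based on
# the naming in this PR, not necessarily the final documented API.
import torch
from transformers import (
    FastSpeech2ConformerHifiGan,
    FastSpeech2ConformerModel,
    FastSpeech2ConformerTokenizer,
)

tokenizer = FastSpeech2ConformerTokenizer.from_pretrained("espnet/fastspeech2_conformer")
model = FastSpeech2ConformerModel.from_pretrained("espnet/fastspeech2_conformer")
vocoder = FastSpeech2ConformerHifiGan.from_pretrained("espnet/fastspeech2_conformer_hifigan")

inputs = tokenizer("Hello, my dog is cute.", return_tensors="pt")

with torch.no_grad():
    # Non-autoregressive: the whole mel spectrogram is predicted in a single forward
    # pass, with duration, pitch, and energy predicted by the variance predictors.
    spectrogram = model(inputs["input_ids"], return_dict=True)["spectrogram"]
    waveform = vocoder(spectrogram)  # (batch_size, num_samples)
```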

To do

  • Prepared 🤗 Transformers dev environment
  • Set up debugging environment of the original repository
  • Created script that successfully runs the forward() pass using the original repository and checkpoint
  • Successfully added the model skeleton to 🤗 Transformers (+ vocoder)
  • Successfully converted original checkpoint to 🤗 Transformers checkpoint (+ vocoder)
  • Successfully ran forward() pass in 🤗 Transformers that gives identical output to the original checkpoint (+ vocoder) (see the parity-check sketch after this list)
  • Finished model tests in 🤗 Transformers
  • Successfully added tokenizer in 🤗 Transformers
  • Run end-to-end integration tests
  • Finished docs
  • Uploaded model weights to the Hub (will ask that they're moved to just fastspeech2_conformer when ready)
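
For context on the conversion and parity items above, the output comparison is typically a straightforward tensor check like the sketch below; the helper name and tolerance are illustrative assumptions, not the PR's actual test code:

```python
# Illustrative parity check between the original ESPnet output and the converted
# 🤗 Transformers output; the helper name and tolerance are assumptions.
import torch

def assert_spectrograms_match(original: torch.Tensor, converted: torch.Tensor, atol: float = 1e-4) -> None:
    """Both tensors are assumed to be mel spectrograms of shape (frames, mel_bins)."""
    assert original.shape == converted.shape, f"shape mismatch: {original.shape} vs {converted.shape}"
    max_diff = (original - converted).abs().max().item()
    assert max_diff < atol, f"max absolute difference {max_diff:.2e} exceeds tolerance {atol}"
```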

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@hollance @sanchit-gandhi

@connor-henderson force-pushed the add-FastSpeech2Conformer branch 2 times, most recently from 4a19759 to 9b052b4 on June 11, 2023 23:13
@connor-henderson changed the title from [WIP] Add FastSpeech2Conformer to Add FastSpeech2Conformer on Jun 12, 2023
@connor-henderson marked this pull request as ready for review on June 12, 2023 03:06
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@connor-henderson force-pushed the add-FastSpeech2Conformer branch from c17b488 to e48eb17 on June 12, 2023 14:03
@hollance (Contributor) left a comment:

Hey Connor, apologies for the delay in doing the review. Aside from the handful of small things below, I think this is an excellent PR already.

I can understand some of the design decisions you made, but the Transformers team is pretty strict about how the code should be structured, and so some of this code will have to change in order to be accepted. I don't think it's going to be too much work, though, since what you have here is pretty solid. 😄

@connor-henderson force-pushed the add-FastSpeech2Conformer branch from 30fe94b to 4bd5efa on June 26, 2023 01:43
@connor-henderson (Contributor, Author) commented:

Thanks for the review @hollance! Addressed the comments above; the only part that might need follow-up discussion is making the labels compatible with the Trainer.

Re labels, FastSpeech2 is somewhat unique in that it takes in many labels (spectrograms, pitch, energy, and duration) for training. I'm not sure exactly what this means for compatibility with Trainer since I haven't had time to do a deeper dive, but for now I've changed the "targets" to include _labels in their name, left the training test as skipped, and plan to look into it more when I do the demo notebook.
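
For illustration, a training-style call with the renamed targets would look roughly like the sketch below; the exact `*_labels` keyword names and the batch layout are assumptions based on the renaming described above:

```python
# Hypothetical training step -- the *_labels keyword names follow the renaming
# described above and are assumptions about the final signature.
import torch
from transformers import FastSpeech2ConformerModel

def training_step(model: FastSpeech2ConformerModel, batch: dict) -> torch.Tensor:
    """`batch` is assumed to come from a data collator that provides all four targets."""
    outputs = model(
        batch["input_ids"],
        spectrogram_labels=batch["spectrogram_labels"],  # target mel spectrogram
        duration_labels=batch["duration_labels"],        # per-token frame counts
        pitch_labels=batch["pitch_labels"],              # per-token pitch targets
        energy_labels=batch["energy_labels"],            # per-token energy targets
        return_dict=True,
    )
    loss = outputs["loss"]  # combined spectrogram / duration / pitch / energy loss
    loss.backward()
    return loss
```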

@hollance (Contributor) left a comment:

Hey Connor, just a few more comments on the code. It's looking very good already!

If you're happy with the code so far, feel free to request a review from @sanchit-gandhi and/or @ArthurZucker.

@connor-henderson force-pushed the add-FastSpeech2Conformer branch 2 times, most recently from 2b1f0f7 to 5f8ff7d on June 29, 2023 18:31
@connor-henderson (Contributor, Author) commented:

Appreciate the comments @hollance! @ArthurZucker, @sanchit-gandhi, this should be ready for your review now.

@sanchit-gandhi (Contributor) left a comment:

Looks great already @connor-henderson! Mainly just refactoring/styling suggestions on my part. It would be great to bring the attention/transformer layers into closer alignment with the rest of the library (see suggestions below), and I think we can simplify a lot of the __init__ logic by always passing the config as an attribute and avoiding nested configs.

Happy to clarify any of the points below or answer any questions! Also cc'ing @ylacombe

@sanchit-gandhi requested review from ArthurZucker and removed the request for ArthurZucker on July 10, 2023 18:14
@ylacombe mentioned this pull request on Jul 20, 2023
@connor-henderson force-pushed the add-FastSpeech2Conformer branch 2 times, most recently from 8349819 to 7222ad6 on July 26, 2023 21:20
@connor-henderson (Contributor, Author) commented:

Thank you for the review @sanchit-gandhi, comments should be addressed now.

Centralizing a note here on passing the config instead of args, since there were a few comments on that: the other modules mentioned are all instantiated twice with different arg values, so they can't solely be passed the config. Lmk if you think there's something I missed, or if you'd prefer something else like duplicating the modules in order to pass just the config.

def __init__(
    self,
    config: FastSpeech2ConformerConfig,
    attention_heads=4,
@sanchit-gandhi (Contributor) commented Jul 28, 2023:

Note to reviewer: we specify these arguments in addition to the config because the number of attention heads is configurable depending on whether this is the encoder (config.encoder_attention_heads) or the decoder (config.decoder_attention_heads), so it is passed as an additional argument depending on whether the module belongs to the encoder or the decoder.

This is in keeping with other models in the library, e.g. T5:

num_layers (`int`, *optional*, defaults to 6):
    Number of hidden layers in the Transformer encoder.
num_decoder_layers (`int`, *optional*):
    Number of hidden layers in the Transformer decoder. Will use the same value as `num_layers` if not set.

But we could simplify it and assume that the encoder and decoder will have the same number of heads and ffn dim, in which case this would collapse to just taking the config as input
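
To illustrate the pattern under discussion outside the actual diff, a shared module can take the config plus only the values that differ between its encoder and decoder instantiations. The class and attribute names below are hypothetical, not this PR's code:

```python
# Hypothetical sketch of the "config + explicit override" pattern discussed above;
# class and attribute names are illustrative, not the code in this PR.
import torch.nn as nn

class ExampleAttention(nn.Module):
    def __init__(self, config, attention_heads: int):
        super().__init__()
        self.config = config
        self.attention_heads = attention_heads
        self.head_dim = config.hidden_size // attention_heads
        self.qkv = nn.Linear(config.hidden_size, 3 * config.hidden_size)

class ExampleEncoderDecoder(nn.Module):
    def __init__(self, config):
        super().__init__()
        # The same class is instantiated twice with different argument values,
        # so it cannot be constructed from the config alone.
        self.encoder_attention = ExampleAttention(config, config.encoder_attention_heads)
        self.decoder_attention = ExampleAttention(config, config.decoder_attention_heads)
```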

@sanchit-gandhi (Contributor) left a comment:

Thanks for iterating here @connor-henderson, the PR is looking really good! Just a few small suggestions, then I think we can get a final review here and move the checkpoints to the official org

@connor-henderson force-pushed the add-FastSpeech2Conformer branch from 0cf0059 to 858d7a9 on August 1, 2023 00:41
@sanchit-gandhi (Contributor) left a comment:

This is looking super clean - thanks for iterating @connor-henderson! I especially like the new "WithHiFiGanHead" class and the overall FS2 API. I think we've got as many modules as possible initialised from the config; the remainder need more flexibility (as mentioned in the comments above).

Would love a second opinion from @ArthurZucker on the tokenizer, and more generally on the PR integration before we get this merged!
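
For reference, the combined model-plus-vocoder head mentioned above could be used roughly like this; the class and checkpoint names are assumptions inferred from the "WithHiFiGan" naming, not confirmed by this thread:

```python
# Hypothetical end-to-end usage of the combined model + HiFi-GAN vocoder head.
# Class and checkpoint names are assumptions inferred from the naming above.
import torch
from transformers import FastSpeech2ConformerTokenizer, FastSpeech2ConformerWithHifiGan

tokenizer = FastSpeech2ConformerTokenizer.from_pretrained("espnet/fastspeech2_conformer")
model = FastSpeech2ConformerWithHifiGan.from_pretrained("espnet/fastspeech2_conformer_with_hifigan")

inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    # Text goes straight to a waveform; the spectrogram is produced internally
    # and fed to the HiFi-GAN head in a single call.
    waveform = model(inputs["input_ids"], return_dict=True)["waveform"]
```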

@ArthurZucker (Collaborator) commented:

Reviewing now!

@ArthurZucker (Collaborator) left a comment:

It already looks nice, great work on a hard model to add! I'll do another pass once you address the current comments. Mostly nits on the modeling for readability, and more comments on the tokenizer to remove the hard dependency.
Very nice tests 🚀 🤗
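
For context on the hard-dependency point, a common way to make a tokenizer's extra requirement optional is a guarded import that only fails when the tokenizer is actually instantiated. The sketch below assumes the dependency in question is the g2p-en phonemizer and is a generic pattern, not the PR's tokenizer code:

```python
# Generic soft-dependency sketch; assumes the extra requirement is g2p-en.
# This is not the tokenizer code from this PR.
class ExamplePhonemeTokenizer:
    def __init__(self):
        try:
            import g2p_en  # imported lazily so the package is only needed at use time
        except ImportError as err:
            raise ImportError(
                "ExamplePhonemeTokenizer requires the g2p-en package: `pip install g2p-en`"
            ) from err
        self.g2p = g2p_en.G2p()

    def text_to_phonemes(self, text: str) -> list:
        # G2p maps graphemes to a list of phoneme strings (including spaces).
        return self.g2p(text)
```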

@connor-henderson force-pushed the add-FastSpeech2Conformer branch from 0c2148f to 91cf792 on January 3, 2024 16:46
@ylacombe (Contributor) commented Jan 3, 2024:

It's been a long ride, but merging now!

Thanks again for the great work and your patience!

@ylacombe merged commit d83ff5e into huggingface:main on Jan 3, 2024
@connor-henderson deleted the add-FastSpeech2Conformer branch on January 3, 2024 18:01