
Add FastSpeech2Conformer #23439

Merged: 56 commits into huggingface:main from add-FastSpeech2Conformer on Jan 3, 2024

Conversation

@connor-henderson (Contributor) commented May 17, 2023

What does this PR do?

Adds the TTS (text-to-speech) conformer version of the FastSpeech2 model. The closest related issue is #15166, though this implements ESPnet's conformer version rather than Fairseq's version, as suggested in #15773 (comment).

FastSpeech2 paper (Microsoft)
Conformer version paper (ESPnet)

Conformer version code implementation: https://github.com/espnet/espnet/tree/master/espnet2/tts/fastspeech2
Additional conformer version code implementation: https://github.com/DigitalPhonetics/IMS-Toucan/blob/ToucanTTS/TrainingInterfaces/Text_to_Spectrogram/FastSpeech2

The paper abstracts say most of this, but the main points of what makes this model an interesting addition are:

  • It's non-autoregressive, leading to faster inference since it doesn't have to make predictions sequentially (hence the name FastSpeech)
  • It uses a variance predictor between the encoder and decoder to explicitly predict duration, pitch, and energy, leading to more accurate results
  • Conformer architectures have been shown to improve performance on text-to-speech tasks, with the convolutions learning close-range speech patterns and the transformer attention helping to capture longer-range context (see the usage sketch after this list)
  • There is currently only one other text-to-speech model in transformers (SpeechT5ForTextToSpeech)
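
To make the text → spectrogram → waveform flow above concrete, here is a rough inference sketch. The class and checkpoint names are assumptions based on the naming used in this PR, not necessarily the final documented API:

```python
# Rough inference sketch -- class and checkpoint names are assumptions based on
# the naming in this PR, not necessarily the final documented API.
import torch
from transformers import (
    FastSpeech2ConformerHifiGan,
    FastSpeech2ConformerModel,
    FastSpeech2ConformerTokenizer,
)

tokenizer = FastSpeech2ConformerTokenizer.from_pretrained("espnet/fastspeech2_conformer")
model = FastSpeech2ConformerModel.from_pretrained("espnet/fastspeech2_conformer")
vocoder = FastSpeech2ConformerHifiGan.from_pretrained("espnet/fastspeech2_conformer_hifigan")

inputs = tokenizer("Hello, my dog is cute.", return_tensors="pt")

with torch.no_grad():
    # Non-autoregressive: the whole mel spectrogram is predicted in a single forward
    # pass, with duration, pitch, and energy predicted by the variance predictors.
    spectrogram = model(inputs["input_ids"], return_dict=True)["spectrogram"]
    waveform = vocoder(spectrogram)  # (batch_size, num_samples)
```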

To do

  • Prepared 🤗 Transformers dev environment
  • Set up debugging environment of the original repository
  • Created script that successfully runs the forward() pass using the original repository and checkpoint
  • Successfully added the model skeleton to 🤗 Transformers (+ vocoder)
  • Successfully converted original checkpoint to 🤗 Transformers checkpoint (+ vocoder)
  • Successfully ran forward() pass in 🤗 Transformers that gives identical output to the original checkpoint (+ vocoder) (see the parity-check sketch after this list)
  • Finished model tests in 🤗 Transformers
  • Successfully added tokenizer in 🤗 Transformers
  • Run end-to-end integration tests
  • Finished docs
  • Uploaded model weights to the Hub (will ask that they're moved to just fastspeech2_conformer when ready)
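
For context on the conversion and parity items above, the output comparison is typically a straightforward tensor check like the sketch below; the helper name and tolerance are illustrative assumptions, not the PR's actual test code:

```python
# Illustrative parity check between the original ESPnet output and the converted
# 🤗 Transformers output; the helper name and tolerance are assumptions.
import torch

def assert_spectrograms_match(original: torch.Tensor, converted: torch.Tensor, atol: float = 1e-4) -> None:
    """Both tensors are assumed to be mel spectrograms of shape (frames, mel_bins)."""
    assert original.shape == converted.shape, f"shape mismatch: {original.shape} vs {converted.shape}"
    max_diff = (original - converted).abs().max().item()
    assert max_diff < atol, f"max absolute difference {max_diff:.2e} exceeds tolerance {atol}"
```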

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@hollance @sanchit-gandhi

@connor-henderson force-pushed the add-FastSpeech2Conformer branch 2 times, most recently from 4a19759 to 9b052b4 on June 11, 2023 23:13
@connor-henderson changed the title from [WIP] Add FastSpeech2Conformer to Add FastSpeech2Conformer on Jun 12, 2023
@connor-henderson marked this pull request as ready for review on June 12, 2023 03:06
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@connor-henderson force-pushed the add-FastSpeech2Conformer branch from c17b488 to e48eb17 on June 12, 2023 14:03
@hollance (Contributor) left a comment:

Hey Connor, apologies for the delay in doing the review. Aside from the handful of small things below, I think this is an excellent PR already.

I can understand some of the design decisions you made, but the Transformers team is pretty strict about how the code should be structured, and so some of this code will have to change in order to be accepted. I don't think it's going to be too much work, though, since what you have here is pretty solid. 😄

@connor-henderson force-pushed the add-FastSpeech2Conformer branch from 30fe94b to 4bd5efa on June 26, 2023 01:43
@connor-henderson (Contributor, Author) commented:

Thanks for the review @hollance! Addressed the comments above; the only part that might need follow-up discussion is making the labels compatible with the Trainer.

Re labels, FastSpeech2 is somewhat unique in that it takes in many labels (spectrograms, pitch, energy, and duration) for training. I'm not sure exactly what this means for compatibility with Trainer since I haven't had time to do a deeper dive, but for now I've changed the "targets" to include _labels in their name, left the training test as skipped, and plan to look into it more when I do the demo notebook.
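
For illustration, a training-style call with the renamed targets would look roughly like the sketch below; the exact `*_labels` keyword names and the batch layout are assumptions based on the renaming described above:

```python
# Hypothetical training step -- the *_labels keyword names follow the renaming
# described above and are assumptions about the final signature.
import torch
from transformers import FastSpeech2ConformerModel

def training_step(model: FastSpeech2ConformerModel, batch: dict) -> torch.Tensor:
    """`batch` is assumed to come from a data collator that provides all four targets."""
    outputs = model(
        batch["input_ids"],
        spectrogram_labels=batch["spectrogram_labels"],  # target mel spectrogram
        duration_labels=batch["duration_labels"],        # per-token frame counts
        pitch_labels=batch["pitch_labels"],              # per-token pitch targets
        energy_labels=batch["energy_labels"],            # per-token energy targets
        return_dict=True,
    )
    loss = outputs["loss"]  # combined spectrogram / duration / pitch / energy loss
    loss.backward()
    return loss
```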

@hollance (Contributor) left a comment:

Hey Connor, just a few more comments on the code. It's looking very good already!

If you're happy with the code so far, feel free to request a review from @sanchit-gandhi and/or @ArthurZucker.

@connor-henderson force-pushed the add-FastSpeech2Conformer branch 2 times, most recently from 2b1f0f7 to 5f8ff7d on June 29, 2023 18:31
@connor-henderson (Contributor, Author) commented:

Appreciate the comments @hollance! @ArthurZucker, @sanchit-gandhi, this should be ready for your review now.

@sanchit-gandhi (Contributor) left a comment:

Looks great already @connor-henderson! Mainly just refactoring/styling suggestions on my part. It would be great to bring the attention/transformer layers into closer alignment with the rest of the library (see suggestions below), and I think we can simplify a lot of the __init__ logic by always passing the config as an attribute and avoiding nested configs.

Happy to clarify any of the points below or answer any questions! Also cc'ing @ylacombe

@sanchit-gandhi requested review from ArthurZucker and removed the request for ArthurZucker on July 10, 2023 18:14
@ylacombe mentioned this pull request on Jul 20, 2023
@connor-henderson force-pushed the add-FastSpeech2Conformer branch 2 times, most recently from 8349819 to 7222ad6 on July 26, 2023 21:20
@connor-henderson (Contributor, Author) commented:

Thank you for the review @sanchit-gandhi, comments should be addressed now.

Centralizing a note here on passing the config instead of args, since there were a few comments on that: the other modules mentioned are all instantiated twice with different arg values, so they can't solely be passed the config. Lmk if you think there's something I missed, or if you'd prefer something else like duplicating the modules in order to pass just the config.

def __init__(
    self,
    config: FastSpeech2ConformerConfig,
    attention_heads=4,
@sanchit-gandhi (Contributor) commented Jul 28, 2023:

Note to reviewer: we specify these arguments in addition to the config because the number of attention heads is configurable depending on whether this is the encoder (config.encoder_attention_heads) or the decoder (config.decoder_attention_heads), so it is passed as an additional argument depending on whether the module belongs to the encoder or the decoder.

This is in keeping with other models in the library, e.g. T5:

num_layers (`int`, *optional*, defaults to 6):
    Number of hidden layers in the Transformer encoder.
num_decoder_layers (`int`, *optional*):
    Number of hidden layers in the Transformer decoder. Will use the same value as `num_layers` if not set.

But we could simplify it and assume that the encoder and decoder will have the same number of heads and ffn dim, in which case this would collapse to just taking the config as input
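
To illustrate the pattern under discussion outside the actual diff, a shared module can take the config plus only the values that differ between its encoder and decoder instantiations. The class and attribute names below are hypothetical, not this PR's code:

```python
# Hypothetical sketch of the "config + explicit override" pattern discussed above;
# class and attribute names are illustrative, not the code in this PR.
import torch.nn as nn

class ExampleAttention(nn.Module):
    def __init__(self, config, attention_heads: int):
        super().__init__()
        self.config = config
        self.attention_heads = attention_heads
        self.head_dim = config.hidden_size // attention_heads
        self.qkv = nn.Linear(config.hidden_size, 3 * config.hidden_size)

class ExampleEncoderDecoder(nn.Module):
    def __init__(self, config):
        super().__init__()
        # The same class is instantiated twice with different argument values,
        # so it cannot be constructed from the config alone.
        self.encoder_attention = ExampleAttention(config, config.encoder_attention_heads)
        self.decoder_attention = ExampleAttention(config, config.decoder_attention_heads)
```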

@sanchit-gandhi (Contributor) left a comment:

Thanks for iterating here @connor-henderson, the PR is looking really good! Just a few small suggestions, then I think we can get a final review here and move the checkpoints to the official org

@connor-henderson force-pushed the add-FastSpeech2Conformer branch from 0cf0059 to 858d7a9 on August 1, 2023 00:41
@sanchit-gandhi (Contributor) left a comment:

This is looking super clean - thanks for iterating @connor-henderson! I especially like the new "WithHiFiGanHead" class and the overall FS2 API. I think we've got as many modules as possible initialised from the config; the remainder need more flexibility (as mentioned in the comments above).

Would love a second opinion from @ArthurZucker on the tokenizer, and more generally on the PR integration before we get this merged!
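
For reference, the combined model-plus-vocoder head mentioned above could be used roughly like this; the class and checkpoint names are assumptions inferred from the "WithHiFiGan" naming, not confirmed by this thread:

```python
# Hypothetical end-to-end usage of the combined model + HiFi-GAN vocoder head.
# Class and checkpoint names are assumptions inferred from the naming above.
import torch
from transformers import FastSpeech2ConformerTokenizer, FastSpeech2ConformerWithHifiGan

tokenizer = FastSpeech2ConformerTokenizer.from_pretrained("espnet/fastspeech2_conformer")
model = FastSpeech2ConformerWithHifiGan.from_pretrained("espnet/fastspeech2_conformer_with_hifigan")

inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    # Text goes straight to a waveform; the spectrogram is produced internally
    # and fed to the HiFi-GAN head in a single call.
    waveform = model(inputs["input_ids"], return_dict=True)["waveform"]
```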

@ArthurZucker (Collaborator) commented:

Reviewing now!

@ArthurZucker (Collaborator) left a comment:

It already looks nice, great work on a hard model to add! I'll do another pass once you address the current comments. Mostly nits on the modeling for readability, and more comments on the tokenizer to remove the hard dependency.
Very nice tests 🚀 🤗
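
For context on the hard-dependency point, a common way to make a tokenizer's extra requirement optional is a guarded import that only fails when the tokenizer is actually instantiated. The sketch below assumes the dependency in question is the g2p-en phonemizer and is a generic pattern, not the PR's tokenizer code:

```python
# Generic soft-dependency sketch; assumes the extra requirement is g2p-en.
# This is not the tokenizer code from this PR.
class ExamplePhonemeTokenizer:
    def __init__(self):
        try:
            import g2p_en  # imported lazily so the package is only needed at use time
        except ImportError as err:
            raise ImportError(
                "ExamplePhonemeTokenizer requires the g2p-en package: `pip install g2p-en`"
            ) from err
        self.g2p = g2p_en.G2p()

    def text_to_phonemes(self, text: str) -> list:
        # G2p maps graphemes to a list of phoneme strings (including spaces).
        return self.g2p(text)
```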

@connor-henderson force-pushed the add-FastSpeech2Conformer branch from 0c2148f to 91cf792 on January 3, 2024 16:46
@ylacombe (Contributor) commented Jan 3, 2024:

It's been a long ride, but merging now!

Thanks again for the great work and your patience!

@ylacombe merged commit d83ff5e into huggingface:main on Jan 3, 2024
@connor-henderson deleted the add-FastSpeech2Conformer branch on January 3, 2024 18:01