Update models (Rebased) #1078

Closed
wants to merge 67 commits
67 commits
c971e06
Refactor Synthesizer class for TTSTokenizer
erogol Nov 16, 2021
c9142eb
Refactor TTSDataset to use TTSTokenizer
erogol Nov 16, 2021
da13f46
Refactor synthesis.py for TTSTokenizer
erogol Nov 16, 2021
2588e82
Refactor GlowTTS model and recipe for TTSTokenizer
erogol Nov 16, 2021
e1db180
Update imports for symbols -> characters
erogol Nov 17, 2021
66cad5b
Update for tokenizer API
erogol Nov 24, 2021
4884169
Refactor TTSDataset ⚡️
erogol Nov 30, 2021
580b99e
Refactorin VITS for the tokenizer API
erogol Nov 30, 2021
7c46d5e
Update data loader tests
erogol Dec 1, 2021
033dedf
Add init_from_config
erogol Dec 7, 2021
29ff0f6
Make lint
erogol Dec 7, 2021
49fef8d
Allow None pad and blank tokens
erogol Dec 7, 2021
7deadfe
Use the same phonemizer for `en` to `en-us`
erogol Dec 7, 2021
83b9fda
Pass samples to init_from_config in SpeakerManager
erogol Dec 7, 2021
ae96243
Update VITS for the new API
erogol Dec 7, 2021
160115b
Update Tacotron models
erogol Dec 7, 2021
0ff11d4
Update ForwardTTS
erogol Dec 7, 2021
f46ad54
Update AlignTTS
erogol Dec 7, 2021
a8a8365
Update GlowTTS
erogol Dec 7, 2021
4640d59
Update setup_model
erogol Dec 7, 2021
f8fbbd4
Update BaseTTS config
erogol Dec 7, 2021
ab413fd
Update train_tts.py
erogol Dec 7, 2021
cee01a6
Update ljspeech recipes
erogol Dec 7, 2021
e9448ca
Update loader tests
erogol Dec 7, 2021
c974633
Update tests
erogol Dec 7, 2021
848fd73
Update spec extractor
erogol Dec 7, 2021
3a15e2f
Update ljspeech download
erogol Dec 7, 2021
9338c7b
Update pylintrc
erogol Dec 7, 2021
672d766
Update VCTK formatter
erogol Dec 8, 2021
cecce06
Add file_ext args to resample.py
erogol Dec 8, 2021
95df38c
Update VCTK recipes
erogol Dec 8, 2021
b4cbf2e
Fix `too many open files`
erogol Dec 8, 2021
bbad03e
Update recipes README.md
erogol Dec 8, 2021
13a8f71
Delete `use_espeak_phonemes` from tests
erogol Jan 7, 2022
90fe858
Fix synthesis.py 🔧
erogol Jan 7, 2022
bddcc9d
Fixes small compat. issues
erogol Jan 7, 2022
c35b0c9
Update Vits for the new model API
erogol Jan 7, 2022
b2e1420
Update train_tts for the new API
erogol Jan 7, 2022
83b6cf5
Extend glow_tts model tests
erogol Jan 12, 2022
5a1d2de
Add verbose option to AudioProcessor
erogol Jan 12, 2022
3c9e518
Fix tokenizer init_from_config
erogol Jan 12, 2022
9d9a5b3
Fix glow_tts_config missing field
erogol Jan 12, 2022
79a5400
Add get_tests_data_path
erogol Jan 12, 2022
0919578
Make lint
erogol Jan 12, 2022
26be609
Extend unittests
erogol Jan 13, 2022
4b612d7
Make lint
erogol Jan 13, 2022
2433626
Fix tests
erogol Jan 14, 2022
911b2db
Fix docstring
erogol Jan 14, 2022
2472d43
Allow padding for shorter segments
erogol Jan 21, 2022
8c555d3
Implement `start_by_longest` option for TTSDatase
erogol Jan 21, 2022
c3ae114
Refactor VITS model
erogol Jan 21, 2022
c94112f
Update GAN model
erogol Jan 25, 2022
2386d80
Take file extension as an argument
erogol Jan 25, 2022
269f8c6
Update synthesizer to use iinit_from_config
erogol Jan 25, 2022
a27133d
Add pitch_fmin pitch_fmax args to the audio
erogol Jan 25, 2022
2303c91
Plot pitch over input characters
erogol Jan 25, 2022
a8352d9
Update language manager
erogol Jan 25, 2022
8397502
Update forwardtts
erogol Jan 25, 2022
c2d5be5
Fix dataset preprocessing
erogol Jan 25, 2022
ad98306
Update FastPitchConfig
erogol Jan 25, 2022
f966a45
Make style
erogol Jan 25, 2022
153c875
Update AnalyzeDataset notebook
erogol Jan 25, 2022
f912206
Load right char class dynamically
erogol Jan 28, 2022
0b8acaf
Add new speakers to the vits model
erogol Jan 28, 2022
a164485
Fix up
erogol Jan 28, 2022
6c55245
Fix VCTK VITS recipe
erogol Jan 28, 2022
a68fb76
Set `drop_last`
erogol Jan 28, 2022
3 changes: 2 additions & 1 deletion .pylintrc
@@ -168,7 +168,8 @@ disable=missing-docstring,
exception-escape,
comprehension-escape,
duplicate-code,
not-callable
not-callable,
import-outside-toplevel

# Enable the message, report, category or checker with the given id(s). You can
# either give multiple identifier separated by comma (,) or put this option
26 changes: 13 additions & 13 deletions TTS/bin/extract_tts_spectrograms.py
@@ -13,28 +13,28 @@
from TTS.tts.datasets import TTSDataset, load_tts_samples
from TTS.tts.models import setup_model
from TTS.tts.utils.speakers import SpeakerManager
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor
from TTS.utils.generic_utils import count_parameters

use_cuda = torch.cuda.is_available()


def setup_loader(ap, r, verbose=False):
tokenizer, _ = TTSTokenizer.init_from_config(c)
dataset = TTSDataset(
r,
c.text_cleaner,
outputs_per_step=r,
compute_linear_spec=False,
meta_data=meta_data,
samples=meta_data,
tokenizer=tokenizer,
ap=ap,
characters=c.characters if "characters" in c.keys() else None,
add_blank=c["add_blank"] if "add_blank" in c.keys() else False,
batch_group_size=0,
min_seq_len=c.min_seq_len,
max_seq_len=c.max_seq_len,
min_text_len=c.min_text_len,
max_text_len=c.max_text_len,
min_audio_len=c.min_audio_len,
max_audio_len=c.max_audio_len,
phoneme_cache_path=c.phoneme_cache_path,
use_phonemes=c.use_phonemes,
phoneme_language=c.phoneme_language,
enable_eos_bos=c.enable_eos_bos_chars,
precompute_num_workers=0,
use_noise_augment=False,
verbose=verbose,
speaker_id_mapping=speaker_manager.speaker_ids if c.use_speaker_embedding else None,
@@ -44,7 +44,7 @@ def setup_loader(ap, r, verbose=False):
if c.use_phonemes and c.compute_input_seq_cache:
# precompute phonemes to have a better estimate of sequence lengths.
dataset.compute_input_seq(c.num_loader_workers)
dataset.sort_and_filter_items(c.get("sort_by_audio_len", default=False))
dataset.preprocess_samples()

loader = DataLoader(
dataset,
@@ -75,8 +75,8 @@ def set_filename(wav_path, out_path):

def format_data(data):
# setup input data
text_input = data["text"]
text_lengths = data["text_lengths"]
text_input = data["token_id"]
text_lengths = data["token_id_lengths"]
mel_input = data["mel"]
mel_lengths = data["mel_lengths"]
item_idx = data["item_idxs"]
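The key renames in `format_data` above can be sketched library-free. `format_batch` is a hypothetical stand-in for the script's `format_data`, assuming the collated batch is a plain dict; the PR renames `"text"` → `"token_id"` and `"text_lengths"` → `"token_id_lengths"`:

```python
def format_batch(batch):
    # After this PR the collated batch exposes token ids under "token_id" /
    # "token_id_lengths" (previously "text" / "text_lengths").
    text_input = batch["token_id"]
    text_lengths = batch["token_id_lengths"]
    return text_input, text_lengths
```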
7 changes: 4 additions & 3 deletions TTS/bin/find_unique_phonemes.py
@@ -7,14 +7,15 @@

from TTS.config import load_config
from TTS.tts.datasets import load_tts_samples
from TTS.tts.utils.text import text2phone
from TTS.tts.utils.text.phonemizers.gruut_wrapper import Gruut

phonemizer = Gruut(language="en-us")


def compute_phonemes(item):
try:
text = item[0]
language = item[-1]
ph = text2phone(text, language, use_espeak_phonemes=c.use_espeak_phonemes).split("|")
ph = phonemizer.phonemize(text).split("|")
except:
return []
return list(set(ph))
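The `compute_phonemes` change above swaps `text2phone` for a module-level Gruut phonemizer. A minimal sketch of the collection loop, with `phonemize` standing in for Gruut's phonemizer (any callable that returns phonemes joined by `|`):

```python
def unique_phonemes(phonemize, texts):
    # Collect the distinct phoneme symbols across a corpus, skipping items
    # that fail to phonemize (mirrors the bare except in the script).
    found = set()
    for text in texts:
        try:
            found.update(phonemize(text).split("|"))
        except Exception:
            continue
    return found
```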
11 changes: 10 additions & 1 deletion TTS/bin/resample.py
@@ -26,6 +26,7 @@ def resample_file(func_args):
--input_dir /root/LJSpeech-1.1/
--output_sr 22050
--output_dir /root/resampled_LJSpeech-1.1/
--file_ext wav
--n_jobs 24
""",
formatter_class=RawTextHelpFormatter,
@@ -55,6 +56,14 @@ def resample_file(func_args):
help="Path of the destination folder. If not defined, the operation is done in place",
)

parser.add_argument(
"--file_ext",
type=str,
default="wav",
required=False,
help="Extension of the audio files to resample",
)

parser.add_argument(
"--n_jobs", type=int, default=None, help="Number of threads to use, by default it uses all cores"
)
@@ -67,7 +76,7 @@ def resample_file(func_args):
args.input_dir = args.output_dir

print("Resampling the audio files...")
audio_files = glob.glob(os.path.join(args.input_dir, "**/*.wav"), recursive=True)
audio_files = glob.glob(os.path.join(args.input_dir, f"**/*.{args.file_ext}"), recursive=True)
print(f"Found {len(audio_files)} files...")
audio_files = list(zip(audio_files, len(audio_files) * [args.output_sr]))
with Pool(processes=args.n_jobs) as p:
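The glob change above is the core of the `--file_ext` feature. A minimal sketch of the new pattern construction (`build_search_pattern` is a hypothetical helper; in `resample.py` the expression is inlined):

```python
import os

def build_search_pattern(input_dir: str, file_ext: str = "wav") -> str:
    # The recursive glob is now built from --file_ext instead of
    # hard-coding "wav", so e.g. VCTK's flac files can be resampled too.
    return os.path.join(input_dir, f"**/*.{file_ext}")
```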
36 changes: 2 additions & 34 deletions TTS/bin/train_tts.py
@@ -1,12 +1,9 @@
import os

from TTS.config import check_config_and_model_args, get_from_config_or_model_args, load_config, register_config
from TTS.config import load_config, register_config
from TTS.trainer import Trainer, TrainingArgs
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models import setup_model
from TTS.tts.utils.languages import LanguageManager
from TTS.tts.utils.speakers import SpeakerManager
from TTS.utils.audio import AudioProcessor


def main():
@@ -42,36 +39,8 @@ def main():
# load training samples
train_samples, eval_samples = load_tts_samples(config.datasets, eval_split=True)

# setup audio processor
ap = AudioProcessor(**config.audio)

# init speaker manager
if check_config_and_model_args(config, "use_speaker_embedding", True):
speaker_manager = SpeakerManager(data_items=train_samples + eval_samples)
if hasattr(config, "model_args"):
config.model_args.num_speakers = speaker_manager.num_speakers
else:
config.num_speakers = speaker_manager.num_speakers
elif check_config_and_model_args(config, "use_d_vector_file", True):
speaker_manager = SpeakerManager(d_vectors_file_path=get_from_config_or_model_args(config, "d_vector_file"))
if hasattr(config, "model_args"):
config.model_args.num_speakers = speaker_manager.num_speakers
else:
config.num_speakers = speaker_manager.num_speakers
else:
speaker_manager = None

if hasattr(config, "use_language_embedding") and config.use_language_embedding:
language_manager = LanguageManager(config=config)
if hasattr(config, "model_args"):
config.model_args.num_languages = language_manager.num_languages
else:
config.num_languages = language_manager.num_languages
else:
language_manager = None

# init the model from config
model = setup_model(config, speaker_manager, language_manager)
model = setup_model(config, train_samples + eval_samples)

# init the trainer and 🚀
trainer = Trainer(
@@ -81,7 +50,6 @@
model=model,
train_samples=train_samples,
eval_samples=eval_samples,
training_assets={"audio_processor": ap},
parse_command_line_args=False,
)
trainer.fit()
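The `train_tts.py` diff above replaces hand-built speaker/language managers with `setup_model(config, train_samples + eval_samples)`, which discovers speakers from the samples themselves. `num_speakers_from_samples` below is a hypothetical helper sketching that idea, assuming the post-PR sample layout `[text, audio_path, speaker_id]`; the real discovery logic lives inside `setup_model`:

```python
def num_speakers_from_samples(samples):
    # Speaker ids are read off the samples rather than from a separately
    # configured SpeakerManager in train_tts.py.
    return len({speaker_id for _text, _audio_path, speaker_id in samples})
```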
9 changes: 9 additions & 0 deletions TTS/config/shared_configs.py
@@ -57,6 +57,12 @@ class BaseAudioConfig(Coqpit):
do_amp_to_db_mel (bool, optional):
enable/disable amplitude to dB conversion of mel spectrograms. Defaults to True.

pitch_fmax (float, optional):
Maximum frequency of the F0 frames. Defaults to ```640```.

pitch_fmin (float, optional):
Minimum frequency of the F0 frames. Defaults to ```0```.

trim_db (int):
Silence threshold used for silence trimming. Defaults to 45.

@@ -135,6 +141,9 @@ class BaseAudioConfig(Coqpit):
spec_gain: int = 20
do_amp_to_db_linear: bool = True
do_amp_to_db_mel: bool = True
# f0 params
pitch_fmax: float = 640.0
pitch_fmin: float = 0.0
# normalization params
signal_norm: bool = True
min_level_db: int = -100
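The two new F0 fields above can be sketched as a plain dataclass. `AudioConfigSketch` is a hypothetical stand-in for `BaseAudioConfig` (which is a Coqpit class), with defaults taken from the diff:

```python
from dataclasses import dataclass

@dataclass
class AudioConfigSketch:
    # F0 extraction bounds added to BaseAudioConfig in this PR.
    pitch_fmax: float = 640.0  # maximum F0 considered for the pitch frames
    pitch_fmin: float = 0.0    # minimum F0 considered for the pitch frames
```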
13 changes: 5 additions & 8 deletions TTS/tts/configs/fast_pitch_config.py
@@ -89,12 +89,9 @@ class FastPitchConfig(BaseTTSConfig):
pitch_loss_alpha (float):
Weight for the pitch predictor's loss. If set 0, disables the pitch predictor. Defaults to 1.0.

binary_loss_alpha (float):
binary_align_loss_alpha (float):
Weight for the binary loss. If set 0, disables the binary loss. Defaults to 1.0.

binary_align_loss_start_step (int):
Start binary alignment loss after this many steps. Defaults to 20000.

min_seq_len (int):
Minimum input sequence length to be used at training.

@@ -129,12 +126,12 @@ class FastPitchConfig(BaseTTSConfig):
duration_loss_type: str = "mse"
use_ssim_loss: bool = True
ssim_loss_alpha: float = 1.0
dur_loss_alpha: float = 1.0
spec_loss_alpha: float = 1.0
pitch_loss_alpha: float = 1.0
aligner_loss_alpha: float = 1.0
binary_align_loss_alpha: float = 1.0
binary_align_loss_start_step: int = 20000
pitch_loss_alpha: float = 0.1
dur_loss_alpha: float = 0.1
binary_align_loss_alpha: float = 0.1
binary_loss_warmup_epochs: int = 150

# overrides
min_seq_len: int = 13
1 change: 1 addition & 0 deletions TTS/tts/configs/glow_tts_config.py
@@ -153,6 +153,7 @@ class GlowTTSConfig(BaseTTSConfig):

# multi-speaker settings
use_speaker_embedding: bool = False
speakers_file: str = None
use_d_vector_file: bool = False
d_vector_file: str = False

18 changes: 16 additions & 2 deletions TTS/tts/configs/shared_configs.py
@@ -53,6 +53,10 @@ class CharactersConfig(Coqpit):
"""Defines arguments for the `BaseCharacters` and its subclasses.

Args:
characters_class (str):
Defines the class of the characters used. If None, we pick ```Phonemes``` or ```Graphemes``` based on
the configuration. Defaults to None.

pad (str):
characters in place of empty padding. Defaults to None.

@@ -78,12 +82,13 @@

is_unique (bool):
remove any duplicate characters in the character lists. It is a bandaid for compatibility with the old
models trained with character lists with duplicates.
models trained with character lists with duplicates. Defaults to True.

is_sorted (bool):
Sort the characters in alphabetical order. Defaults to True.
"""

characters_class: str = None
pad: str = None
eos: str = None
bos: str = None
@@ -166,9 +171,16 @@ class BaseTTSConfig(BaseTrainingConfig):
compute_linear_spec (bool):
If True data loader computes and returns linear spectrograms alongside the other data.

precompute_num_workers (int):
Number of workers to precompute features. Defaults to 0.

use_noise_augment (bool):
Augment the input audio with random noise.

start_by_longest (bool):
If True, the data loader will start loading the longest batch first. It is useful for checking OOM issues.
Defaults to False.

add_blank (bool):
Add blank characters between each other two characters. It improves performance for some models at expense
of slower run-time due to the longer input sequence.
@@ -207,6 +219,7 @@ class BaseTTSConfig(BaseTrainingConfig):
phoneme_cache_path: str = None
# vocabulary parameters
characters: CharactersConfig = None
add_blank: bool = False
# training params
batch_group_size: int = 0
loss_masking: bool = None
@@ -218,8 +231,9 @@
max_text_len: int = float("inf")
compute_f0: bool = False
compute_linear_spec: bool = False
precompute_num_workers: int = 0
use_noise_augment: bool = False
add_blank: bool = False
start_by_longest: bool = False
# dataset
datasets: List[BaseDatasetConfig] = field(default_factory=lambda: [BaseDatasetConfig()])
# optimizer
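The new `start_by_longest` option documented above ("start loading the longest batch first ... useful for checking OOM issues") can be illustrated with a toy ordering function. This is a sketch of the intent only, not the actual `TTSDataset` sampler logic; `order_samples` and its arguments are hypothetical:

```python
def order_samples(samples, lengths, start_by_longest=False):
    # Serve samples shortest-first (the usual sorted order); with
    # start_by_longest=True the single longest sample is moved to the front,
    # so an out-of-memory failure surfaces on the first step, not mid-run.
    order = sorted(range(len(samples)), key=lambda i: lengths[i])
    if start_by_longest:
        longest = order.pop()  # index of the longest sample
        order.insert(0, longest)
    return [samples[i] for i in order]
```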
13 changes: 1 addition & 12 deletions TTS/tts/configs/vits_config.py
@@ -67,15 +67,6 @@ class VitsConfig(BaseTTSConfig):
compute_linear_spec (bool):
If true, the linear spectrogram is computed and returned alongside the mel output. Do not change. Defaults to `True`.

sort_by_audio_len (bool):
If true, dataloder sorts the data by audio length else sorts by the input text length. Defaults to `True`.

min_seq_len (int):
Minimum sequnce length to be considered for training. Defaults to `0`.

max_seq_len (int):
Maximum sequnce length to be considered for training. Defaults to `500000`.

r (int):
Number of spectrogram frames to be generated at a time. Do not change. Defaults to `1`.

@@ -123,16 +114,14 @@ class VitsConfig(BaseTTSConfig):
feat_loss_alpha: float = 1.0
mel_loss_alpha: float = 45.0
dur_loss_alpha: float = 1.0
aligner_loss_alpha = 1.0
speaker_encoder_loss_alpha: float = 1.0

# data loader params
return_wav: bool = True
compute_linear_spec: bool = True

# overrides
sort_by_audio_len: bool = True
min_seq_len: int = 0
max_seq_len: int = 500000
r: int = 1 # DO NOT CHANGE
add_blank: bool = True

4 changes: 2 additions & 2 deletions TTS/tts/datasets/__init__.py
@@ -13,7 +13,7 @@ def split_dataset(items):
"""Split a dataset into train and eval. Consider speaker distribution in multi-speaker training.

Args:
items (List[List]): A list of samples. Each sample is a list of `[audio_path, text, speaker_id]`.
items (List[List]): A list of samples. Each sample is a list of `[text, audio_path, speaker_id]`.
"""
speakers = [item[-1] for item in items]
is_multi_speaker = len(set(speakers)) > 1
@@ -52,7 +52,7 @@ def load_tts_samples(

formatter (Callable, optional): The preprocessing function to be applied to create the list of samples. It
must take the root_path and the meta_file name and return a list of samples in the format of
`[[audio_path, text, speaker_id], ...]]`. See the available formatters in `TTS.tts.dataset.formatter` as
`[[text, audio_path, speaker_id], ...]]`. See the available formatters in `TTS.tts.dataset.formatter` as
example. Defaults to None.

Returns:
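The docstring fixes above record the new sample layout: `[text, audio_path, speaker_id]` instead of `[audio_path, text, speaker_id]`. A minimal sketch, with a hypothetical file path for illustration; note the speaker id stays last, which is what `split_dataset` relies on via `item[-1]`:

```python
# Post-PR sample layout: text first, then audio path, then speaker id.
sample = ["Hello world.", "/data/wavs/utt1.wav", "ljspeech"]
text, audio_path, speaker_id = sample

# split_dataset reads the speaker from the last position, so it is
# unaffected by the swap of the first two fields.
assert sample[-1] == speaker_id
```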