HierSpeech

  • This code is an unofficial implementation of HierSpeech.
  • The algorithm is based on the following paper:
Lee, S. H., Kim, S. B., Lee, J. H., Song, E., Hwang, M. J., & Lee, S. W. HierSpeech: Bridging the Gap between Text and Speech by Hierarchical Variational Inference using Self-supervised Representations for Speech Synthesis. In Advances in Neural Information Processing Systems.

Structure

  • The structure is derived from HierSpeech, with several modifications.
  • The multi-head attention in the FFT block has been replaced with linearized attention.
  • Discriminator
    • Following the advice of the paper's author, a multi-STFT discriminator is applied.
    • To keep the discriminator from overpowering the generator, a gradient penalty is applied through R1 regularization.
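
The R1 penalty mentioned above can be sketched as follows. This is a minimal illustration, not the code used in this repository: the function name `r1_penalty`, the `gamma` weight, and the assumption that the discriminator returns a single score tensor are all hypothetical (a multi-STFT discriminator would typically return several outputs, each of which would need the same treatment).

```python
import torch


def r1_penalty(discriminator, real_audio, gamma=10.0):
    """R1 regularization: penalize the squared gradient norm of the
    discriminator output with respect to REAL inputs only.

    Assumption: `discriminator` maps a [batch, ...] tensor to a score tensor.
    """
    # Gradients are taken with respect to the input, so it must require grad.
    real_audio = real_audio.detach().requires_grad_(True)
    scores = discriminator(real_audio)

    # create_graph=True keeps the graph so the penalty itself can be
    # backpropagated into the discriminator weights.
    (grads,) = torch.autograd.grad(
        outputs=scores.sum(),
        inputs=real_audio,
        create_graph=True,
    )

    # Squared L2 norm per sample, averaged over the batch.
    squared_norm = grads.pow(2).flatten(start_dim=1).sum(dim=1)
    return 0.5 * gamma * squared_norm.mean()
```

In a training step the returned value would be added to the discriminator loss before calling `backward()`. The penalty discourages sharp discriminator gradients around real data, which is one common way to keep adversarial training stable.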

Supported dataset

Hyper parameters

Before proceeding, please set the pattern, inference, and checkpoint paths in Hyper_Parameters.yaml according to your environment.

  • Sound

    • Sets the basic sound parameters.
  • Tokens

    • The number of tokens.
  • Discriminator

    • If Use_STFT is true, the model uses the period and STFT discriminators, but not the scale discriminator.
    • If Use_STFT is false, the model uses the period and scale discriminators, but not the STFT discriminator.
  • Train

    • Sets the training parameters.
  • Inference_Batch_Size

    • Sets the batch size used during inference.
  • Inference_Path

    • Sets the inference path.
  • Checkpoint_Path

    • Sets the checkpoint path.
  • Log_Path

    • Sets the TensorBoard log path.
  • Use_Mixed_Precision

    • Sets whether mixed precision is used.
  • Use_Multi_GPU

    • Sets whether multiple GPUs are used.
    • Because of an nvcc issue, only Linux supports this option.
    • If this is True, the Device parameter must also list multiple devices, like '0,1,2,3'.
    • You also have to change the training command; please check multi_gpu.sh.
  • Device

    • Sets which GPU devices are used in a multi-GPU environment.
    • If using only the CPU, set this to '-1'. (This is not recommended for training.)
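
As a rough illustration of how the Discriminator and Device options above could be interpreted, here is a small sketch. The helper names `select_discriminators` and `parse_device` are hypothetical and are not taken from this repository; only the behavior described in the list above is assumed.

```python
from __future__ import annotations


def select_discriminators(use_stft: bool) -> list[str]:
    """The period discriminator is always used; Use_STFT chooses between
    the STFT and the scale discriminator."""
    return ["period", "stft"] if use_stft else ["period", "scale"]


def parse_device(device: str) -> list[int]:
    """Turn a Device string into GPU indices.

    '0,1,2,3' -> [0, 1, 2, 3]; '-1' -> [] (CPU only).
    """
    ids = [int(token) for token in device.split(",") if token.strip()]
    return [] if ids == [-1] else ids


print(select_discriminators(use_stft=True))
print(parse_device("0,1,2,3"))
print(parse_device("-1"))
```

An empty list from `parse_device` stands for CPU-only execution, and more than one index corresponds to the case where Use_Multi_GPU is True.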

Generate pattern

python Pattern_Generate.py [parameters]

Parameters

  • -lj
    • The path of the LJSpeech dataset.
  • -hp
    • The path of the hyperparameter file.

About phonemizer

  • To generate phoneme strings, this repository uses the phonemizer library.
  • Please refer here to install phonemizer and its backend.
  • On Windows, additional setup is needed to use phonemizer.
    • Please refer here
    • In a conda environment, the following commands are useful.
      conda env config vars set PHONEMIZER_ESPEAK_PATH='C:\Program Files\eSpeak NG'
      conda env config vars set PHONEMIZER_ESPEAK_LIBRARY='C:\Program Files\eSpeak NG\libespeak-ng.dll'
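
Before running pattern generation on Windows, it may help to confirm that the two variables set above are visible to Python and point at existing paths. The check below is a minimal sketch using only the standard library; the function name `find_espeak_problems` is hypothetical and is not part of phonemizer or this repository.

```python
import os
from pathlib import Path

# The two environment variables set by the conda commands above.
ESPEAK_VARIABLES = ("PHONEMIZER_ESPEAK_PATH", "PHONEMIZER_ESPEAK_LIBRARY")


def find_espeak_problems(environ=os.environ):
    """Return a list of human-readable problems; an empty list means the
    variables are set and the paths they point to exist."""
    problems = []
    for name in ESPEAK_VARIABLES:
        value = environ.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not Path(value).exists():
            problems.append(f"{name} points to a missing path: {value}")
    return problems


if __name__ == "__main__":
    for problem in find_espeak_problems():
        print(problem)
```

Note that variables set with `conda env config vars set` only take effect after the environment is reactivated.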

Run

Command

Single GPU

python Train.py -hp <path> -s <int>
  • -hp <path>

    • The hyperparameter file path.
    • This is required.
  • -s <int>

    • The resume step parameter.
    • Default is 0.
    • If the value is 0, the model tries to find the latest checkpoint.
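
The resume behavior described above can be sketched like this. The checkpoint naming scheme `S_<step>.pt` and the function name `find_resume_checkpoint` are assumptions made for illustration; the actual file names written by Train.py may differ.

```python
import re
from pathlib import Path


def find_resume_checkpoint(checkpoint_dir, step=0):
    """Return the checkpoint to resume from, or None if there is none.

    Assumption: checkpoints are stored as 'S_<step>.pt' in checkpoint_dir.
    step == 0 means "use the latest checkpoint found".
    """
    directory = Path(checkpoint_dir)

    if step > 0:
        path = directory / f"S_{step}.pt"
        return path if path.is_file() else None

    # Collect every checkpoint and index it by its numeric step, so that
    # 2000 is treated as later than 300 (plain string sorting would not be).
    found = {}
    for path in directory.glob("S_*.pt"):
        match = re.fullmatch(r"S_(\d+)\.pt", path.name)
        if match:
            found[int(match.group(1))] = path

    return found[max(found)] if found else None
```

Comparing steps as integers rather than as file-name strings matters once step counts have different numbers of digits.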

Multi GPU

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 OMP_NUM_THREADS=32 python -m torch.distributed.launch --nproc_per_node=8 Train.py --hyper_parameters Hyper_Parameters.yaml --port 54322