The pretrained model does not perform well compared to the youtube video #162

Closed
IMLHF opened this issue Oct 12, 2019 · 8 comments

IMLHF commented Oct 12, 2019

I wrote a script for model evaluation instead of using your toolbox. The script loads the pretrained models you provided and evaluates the whole pipeline on new reference audio, but the synthesis quality is not as good as what you showed in the YouTube demo. I have also posted the code below in this issue. Could you kindly point out my problem or give me some guidance to reproduce your results? I really appreciate your help!


IMLHF commented Oct 12, 2019

import argparse
import os
import re
import numpy as np
import soundfile as sf
from encoder import inference as encoder_infer
from synthesizer import inference as syn_infer
from encoder import audio as encoder_audio
from synthesizer import audio
from functools import partial
import pypinyin
from synthesizer.hparams import hparams


def run_eval_part1(args):
  speaker_enc_ckpt = args.speaker_encoder_checkpoint
  syn_ckpt = args.syn_checkpoint
  speaker_name = args.speaker_name
  eval_results_dir = os.path.join(args.eval_results_dir,
                                  speaker_name)
  if not os.path.exists(eval_results_dir):
    os.makedirs(eval_results_dir)
  speaker_audio_dirs = {
      "speaker_name": ["speaker_audio_1.wav", "speaker_audio_2.wav"],
      "vctk_p225": ["/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p225/p225_001.wav",
                    "/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p225/p225_002.wav",
                    "/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p225/p225_003.wav",
                    "/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p225/p225_004.wav",
                    "/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p225/p225_005.wav",
                    ],
      "vctk_p226": ["/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p226/p226_001.wav",
                    "/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p226/p226_002.wav",
                    "/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p226/p226_003.wav",
                    "/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p226/p226_004.wav",
                    "/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p226/p226_005.wav",
                    ],
      "vctk_p227": ["/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p227/p227_001.wav",
                    "/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p227/p227_002.wav",
                    "/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p227/p227_003.wav",
                    "/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p227/p227_004.wav",
                    "/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p227/p227_005.wav",
                    ],
      "vctk_p228": ["/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p228/p228_001.wav",
                    "/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p228/p228_002.wav",
                    "/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p228/p228_003.wav",
                    "/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p228/p228_004.wav",
                    "/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p228/p228_005.wav",
                    ],
      "biaobei_speaker": ["/home/zhangwenbo5/lihongfeng/corpus/BZNSYP/wavs/000001.wav",
                          "/home/zhangwenbo5/lihongfeng/corpus/BZNSYP/wavs/000002.wav",
                          "/home/zhangwenbo5/lihongfeng/corpus/BZNSYP/wavs/000003.wav",
                          "/home/zhangwenbo5/lihongfeng/corpus/BZNSYP/wavs/000004.wav",
                          "/home/zhangwenbo5/lihongfeng/corpus/BZNSYP/wavs/000005.wav",
                          "/home/zhangwenbo5/lihongfeng/corpus/BZNSYP/wavs/000006.wav",
                          "/home/zhangwenbo5/lihongfeng/corpus/BZNSYP/wavs/000007.wav",
                          ],
      "aishell_C0002": ["/home/zhangwenbo5/lihongfeng/corpus/aishell2/data/wav/C0002/IC0002W0001.wav",
                        "/home/zhangwenbo5/lihongfeng/corpus/aishell2/data/wav/C0002/IC0002W0002.wav",
                        "/home/zhangwenbo5/lihongfeng/corpus/aishell2/data/wav/C0002/IC0002W0003.wav",
                        "/home/zhangwenbo5/lihongfeng/corpus/aishell2/data/wav/C0002/IC0002W0004.wav", ],
      "aishell_C0896": ["/home/zhangwenbo5/lihongfeng/corpus/aishell2/data/wav/C0896/IC0896W0001.wav",
                        "/home/zhangwenbo5/lihongfeng/corpus/aishell2/data/wav/C0896/IC0896W0002.wav",
                        "/home/zhangwenbo5/lihongfeng/corpus/aishell2/data/wav/C0896/IC0896W0003.wav",
                        "/home/zhangwenbo5/lihongfeng/corpus/aishell2/data/wav/C0896/IC0896W0004.wav", ],
  }[speaker_name]
  sentences = [
    "THAT MATTER OF TROY AND ACHILLES WRATH ONE TWO THREE RATS",
    "ENDED THE QUEST OF THE HOLY GRAAL JERUSALEM A HANDFUL OF ASHES BLOWN BY THE WIND EXTINCT",
    "She can scoop these things into three red bags",
    "and we will go meet her Wednesday at the train station",
    "This was demonstrated in a laboratory experiment with rats."
  ]

  sentences = [sen.upper() for sen in sentences]

  sentences.append("This was demonstrated in a laboratory experiment with rats")

  print('eval part1> model: %s.' % syn_ckpt)
  syner = syn_infer.Synthesizer(syn_ckpt)
  encoder_infer.load_model(speaker_enc_ckpt)

  ckpt_step = "pretrained"

  # Preprocess and concatenate all reference clips for this speaker, then
  # save the concatenated reference audio for later listening comparison.
  speaker_audio_wav_list = [encoder_audio.preprocess_wav(wav_path) for wav_path in speaker_audio_dirs]
  speaker_audio_wav = np.concatenate(speaker_audio_wav_list)
  refer_path = os.path.join(eval_results_dir, '%s-000_refer_speaker_audio.wav' % speaker_name)
  print(refer_path)
  audio.save_wav(speaker_audio_wav, refer_path, hparams.sample_rate)

  # One speaker embedding computed from the concatenated reference audio.
  speaker_embed = encoder_infer.embed_utterance(speaker_audio_wav)
  for i, text in enumerate(sentences):
    path = os.path.join(eval_results_dir,
                        "%s-%s-eval-%03d.wav" % (speaker_name, ckpt_step, i))
    print('[{:<10}]: {}'.format('processing', path))
    # Batch interface; one sentence is synthesized at a time here.
    mel_spec = syner.synthesize_spectrograms([text], [speaker_embed])[0]
    print('[{:<10}]:'.format('text:'), text)
    # Invert the mel spectrogram with Griffin-Lim (no neural vocoder).
    wav = syner.griffin_lim(mel_spec)
    audio.save_wav(wav, path, hparams.sample_rate)


def main():
  os.environ['CUDA_VISIBLE_DEVICES']= '2'
  parser = argparse.ArgumentParser()
  parser.add_argument('syn_checkpoint',
                      # required=True,
                      help='Path to synthesizer model checkpoint.')
  parser.add_argument('speaker_name',
                      help='Name of the target speaker (key in speaker_audio_dirs).')
  parser.add_argument('--speaker_encoder_checkpoint', default='encoder/saved_models/pretrained.pt',
                      help='Path to speaker encoder model checkpoint.')
  parser.add_argument('--eval_results_dir', default='overall_eval_results',
                      help='Overall evaluation results will be saved here.')
  args = parser.parse_args()
  hparams.set_hparam("tacotron_num_gpus", 1)  # set tacotron_num_gpus=1 to synthesize a single wav.
  run_eval_part1(args)


if __name__ == '__main__':
  os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
  main()




rfwatkins commented Oct 16, 2019

Deprecating this response, since it's replicated below, in the hope of keeping up the signal-to-noise ratio.


IMLHF commented Oct 17, 2019

> Some of it seems to depend on the quality of the audio file you're using. If it's been compressed at any point for web distribution (YouTube videos, MP3s, etc.) then you often get a choppy and messy-sounding voice. Results are far better if you use the closest thing you have to an original, uncompressed wav file. YMMV, but this has been my experience.
>
> Added an example here with the same text, trained from identical 48 kHz MP3 and then FLAC sources of the same dialog segment:
>
> https://clyp.it/0msznbtd
>
> You can hear considerably more wobble at the start of the first iteration. Restarted between takes. It's a bit subtle here, but in the interest of science I'm using raw actual results. In practice I get consistently better results from high-quality sound sources. The characteristics of the voice and the speech used in the training samples matter a lot too.

Thanks for your notes, but I can't open this link. Could you provide another link, such as Google Drive? By the way, do you mean you trained the entire model from this GitHub repo on your own high-quality English dataset without any model modifications?


rfwatkins commented Oct 17, 2019

I wasn't satisfied with the example I gave (new link here https://drive.google.com/open?id=1qZAYTfYe0sUobaOVaYkHDz075FcNWJgy) so I spent the evening running tests with paired high and low quality samples. Since I try to follow the data, I now want to retract my response above. I do still think it's important for getting the best quality results, but it's not the defining factor. Frankly I'm still not sure what that factor is yet - I can't identify it by any spectral features or differences by ear, but the practical upshot is that some subsamples of a voice clone better than others. The key to getting a good voice synthesis seems to me to be testing a range of different samples from the same speaker and discovering the one which clones the best. Some samples which sound fine give garbage results, while others are much better.

When I mentioned training, I meant the voice you're trying to clone - the step of training the vocoder (via the alternative WaveRNN model, I believe, but I'm no ML specialist) from the sample you're using, which is the bit that interests me right now.

For reference, here's about a minute of audio of three different voices reading three different quotes - this is about as good as I'm getting at the moment: https://drive.google.com/open?id=1vqWj1XPJ2BcWTNKkAbwGji2344sNxLOd

IMLHF changed the title from "I load the pretained model not perform well compared to the youtube vedio you provided" to "The pretained model not perform well compared to the youtube vedio you provided" on Oct 18, 2019

IMLHF commented Oct 18, 2019

> I wasn't satisfied with the example I gave (new link here https://drive.google.com/open?id=1qZAYTfYe0sUobaOVaYkHDz075FcNWJgy) so I spent the evening running tests with paired high and low quality samples. Since I try to follow the data, I now want to retract my response above. I do still think it's important for getting the best quality results, but it's not the defining factor. Frankly I'm still not sure what that factor is yet - I can't identify it by any spectral features or differences by ear, but the practical upshot is that some subsamples of a voice clone better than others. The key to getting a good voice synthesis seems to me to be testing a range of different samples from the same speaker and discovering the one which clones the best. Some samples which sound fine give garbage results, while others are much better.
>
> When I mentioned training, I meant the voice you're trying to clone - the step of training the vocoder (via the alternative WaveRNN model, I believe, but I'm no ML specialist) from the sample you're using, which is the bit that interests me right now.
>
> For reference, here's about a minute of audio of three different voices reading three different quotes - this is about as good as I'm getting at the moment: https://drive.google.com/open?id=1vqWj1XPJ2BcWTNKkAbwGji2344sNxLOd

Thanks for your reply. I think our main problem is how to clone the tone of the reference speech (I mean the few seconds of audio outside the training data) as closely as possible. I don't think audio quality is the main factor behind the poor tone cloning. My first step is to check whether the evaluation script above has problems, or whether the pretrained model the author provided leads to the poor performance.

ghost changed the title from "The pretained model not perform well compared to the youtube vedio you provided" to "The pretrained model does not perform well compared to the youtube video" on Jul 5, 2020

ghost commented Jul 5, 2020

@CorentinJ Can you explain if there are any differences between the models used for your YouTube demo and the pretrained models released on the wiki page? I would like to put any speculation to rest.

In #197, it was noted that you used a different vocoder called "gen_s_mel_raw" for the video, but I don't think that is it.

CorentinJ (Owner) commented

No differences, and the vocoder is the same but with a different name


ghost commented Oct 6, 2020

Not every reference audio will clone well. The quality depends on whether it is similar to other utterances seen by the encoder and synthesizer during training.

This issue was closed.