The pretrained model does not perform well compared to the youtube video #162

Closed
IMLHF opened this issue Oct 12, 2019 · 8 comments

IMLHF commented Oct 12, 2019

I wrote a script for model evaluation instead of using your toolbox. The script loads the pretrained models you provided and evaluates the whole pipeline on new reference audio, but the synthesis quality is not as good as what you showed in the YouTube demo. I have also posted the code below in this issue. Could you kindly point out my problem or give me some guidance to reproduce your results? I really appreciate your help!


IMLHF commented Oct 12, 2019

import argparse
import os
import re
import numpy as np
import soundfile as sf
from encoder import inference as encoder_infer
from synthesizer import inference as syn_infer
from encoder import audio as encoder_audio
from synthesizer import audio
from functools import partial
import pypinyin
from synthesizer.hparams import hparams


def run_eval_part1(args):
  speaker_enc_ckpt = args.speaker_encoder_checkpoint
  syn_ckpt = args.syn_checkpoint
  speaker_name = args.speaker_name
  eval_results_dir = os.path.join(args.eval_results_dir,
                                  speaker_name)
  if not os.path.exists(eval_results_dir):
    os.makedirs(eval_results_dir)
  speaker_audio_dirs = {
      "speaker_name": ["speaker_audio_1.wav", "speaker_audio_2.wav"],
      "vctk_p225": ["/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p225/p225_001.wav",
                    "/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p225/p225_002.wav",
                    "/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p225/p225_003.wav",
                    "/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p225/p225_004.wav",
                    "/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p225/p225_005.wav",
                    ],
      "vctk_p226": ["/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p226/p226_001.wav",
                    "/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p226/p226_002.wav",
                    "/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p226/p226_003.wav",
                    "/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p226/p226_004.wav",
                    "/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p226/p226_005.wav",
                    ],
      "vctk_p227": ["/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p227/p227_001.wav",
                    "/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p227/p227_002.wav",
                    "/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p227/p227_003.wav",
                    "/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p227/p227_004.wav",
                    "/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p227/p227_005.wav",
                    ],
      "vctk_p228": ["/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p228/p228_001.wav",
                    "/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p228/p228_002.wav",
                    "/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p228/p228_003.wav",
                    "/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p228/p228_004.wav",
                    "/home/zhangwenbo5/lihongfeng/corpus/vctk_dataset/wav16/p228/p228_005.wav",
                    ],
      "biaobei_speaker": ["/home/zhangwenbo5/lihongfeng/corpus/BZNSYP/wavs/000001.wav",
                          "/home/zhangwenbo5/lihongfeng/corpus/BZNSYP/wavs/000002.wav",
                          "/home/zhangwenbo5/lihongfeng/corpus/BZNSYP/wavs/000003.wav",
                          "/home/zhangwenbo5/lihongfeng/corpus/BZNSYP/wavs/000004.wav",
                          "/home/zhangwenbo5/lihongfeng/corpus/BZNSYP/wavs/000005.wav",
                          "/home/zhangwenbo5/lihongfeng/corpus/BZNSYP/wavs/000006.wav",
                          "/home/zhangwenbo5/lihongfeng/corpus/BZNSYP/wavs/000007.wav",
                          ],
      "aishell_C0002": ["/home/zhangwenbo5/lihongfeng/corpus/aishell2/data/wav/C0002/IC0002W0001.wav",
                        "/home/zhangwenbo5/lihongfeng/corpus/aishell2/data/wav/C0002/IC0002W0002.wav",
                        "/home/zhangwenbo5/lihongfeng/corpus/aishell2/data/wav/C0002/IC0002W0003.wav",
                        "/home/zhangwenbo5/lihongfeng/corpus/aishell2/data/wav/C0002/IC0002W0004.wav", ],
      "aishell_C0896": ["/home/zhangwenbo5/lihongfeng/corpus/aishell2/data/wav/C0896/IC0896W0001.wav",
                        "/home/zhangwenbo5/lihongfeng/corpus/aishell2/data/wav/C0896/IC0896W0002.wav",
                        "/home/zhangwenbo5/lihongfeng/corpus/aishell2/data/wav/C0896/IC0896W0003.wav",
                        "/home/zhangwenbo5/lihongfeng/corpus/aishell2/data/wav/C0896/IC0896W0004.wav", ],
  }[speaker_name]
  sentences = [
    "THAT MATTER OF TROY AND ACHILLES WRATH ONE TWO THREE RATS",
    "ENDED THE QUEST OF THE HOLY GRAAL JERUSALEM A HANDFUL OF ASHES BLOWN BY THE WIND EXTINCT",
    "She can scoop these things into three red bags",
    "and we will go meet her Wednesday at the train station",
    "This was demonstrated in a laboratory experiment with rats."
  ]

  sentences = [sen.upper() for sen in sentences]

  sentences.append("This was demonstrated in a laboratory experiment with rats")

  print('eval part1> model: %s.' % syn_ckpt)
  syner = syn_infer.Synthesizer(syn_ckpt)
  encoder_infer.load_model(speaker_enc_ckpt)

  ckpt_step = "pretrained"

  # Preprocess and concatenate all reference clips for this speaker, then
  # save the concatenated reference audio for later listening comparison.
  speaker_audio_wav_list = [encoder_audio.preprocess_wav(wav_path) for wav_path in speaker_audio_dirs]
  speaker_audio_wav = np.concatenate(speaker_audio_wav_list)
  refer_path = os.path.join(eval_results_dir, '%s-000_refer_speaker_audio.wav' % speaker_name)
  print(refer_path)
  audio.save_wav(speaker_audio_wav, refer_path, hparams.sample_rate)

  # One speaker embedding computed from the concatenated reference audio.
  speaker_embed = encoder_infer.embed_utterance(speaker_audio_wav)
  for i, text in enumerate(sentences):
    path = os.path.join(eval_results_dir,
                        "%s-%s-eval-%03d.wav" % (speaker_name, ckpt_step, i))
    print('[{:<10}]: {}'.format('processing', path))
    # Batch interface; one sentence is synthesized at a time here.
    mel_spec = syner.synthesize_spectrograms([text], [speaker_embed])[0]
    print('[{:<10}]:'.format('text:'), text)
    # Invert the mel spectrogram with Griffin-Lim (no neural vocoder).
    wav = syner.griffin_lim(mel_spec)
    audio.save_wav(wav, path, hparams.sample_rate)


def main():
  os.environ['CUDA_VISIBLE_DEVICES']= '2'
  parser = argparse.ArgumentParser()
  parser.add_argument('syn_checkpoint',
                      # required=True,
                      help='Path to synthesizer model checkpoint.')
  parser.add_argument('speaker_name',
                      help='Name of the target speaker (key in speaker_audio_dirs).')
  parser.add_argument('--speaker_encoder_checkpoint', default='encoder/saved_models/pretrained.pt',
                      help='Path to speaker encoder model checkpoint.')
  parser.add_argument('--eval_results_dir', default='overall_eval_results',
                      help='Overall evaluation results will be saved here.')
  args = parser.parse_args()
  hparams.set_hparam("tacotron_num_gpus", 1)  # set tacotron_num_gpus=1 to synthesize a single wav.
  run_eval_part1(args)


if __name__ == '__main__':
  os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
  main()




rfwatkins commented Oct 16, 2019

Deprecating this response, since it's replicated below, in the hope of keeping up the signal-to-noise ratio.


IMLHF commented Oct 17, 2019

> Some of it seems to depend on the quality of the audio file you're using. If it's been compressed at any point for web distribution (YouTube videos, MP3s, etc.) then you often get a choppy and messy-sounding voice. Results are far better if you use the closest thing you have to an original, uncompressed wav file. YMMV, but this has been my experience.
>
> Added an example here with the same text, trained from identical 48 kHz MP3 and then FLAC sources of the same dialog segment:
>
> https://clyp.it/0msznbtd
>
> You can hear considerably more wobble at the start of the first iteration. Restarted between takes. It's a bit subtle here, but in the interest of science I'm using raw actual results. In practice I get consistently better results from high-quality sound sources. The characteristics of the voice and the speech used in the training samples matter a lot too.

Thanks for your notes, but I can't open this link. Could you provide another link, such as Google Drive? By the way, do you mean you trained the entire model from this GitHub repo on your own high-quality English dataset without any model modifications?


rfwatkins commented Oct 17, 2019

I wasn't satisfied with the example I gave (new link here https://drive.google.com/open?id=1qZAYTfYe0sUobaOVaYkHDz075FcNWJgy) so I spent the evening running tests with paired high and low quality samples. Since I try to follow the data, I now want to retract my response above. I do still think it's important for getting the best quality results, but it's not the defining factor. Frankly I'm still not sure what that factor is yet - I can't identify it by any spectral features or differences by ear, but the practical upshot is that some subsamples of a voice clone better than others. The key to getting a good voice synthesis seems to me to be testing a range of different samples from the same speaker and discovering the one which clones the best. Some samples which sound fine give garbage results, while others are much better.

When I mentioned training, I meant the voice you're trying to clone - the step of training the vocoder (via the alternative WaveRNN model, I believe, but I'm no ML specialist) from the sample you're using, which is the bit that interests me right now.

For reference, here's about a minute of audio of three different voices reading three different quotes - this is about as good as I'm getting at the moment: https://drive.google.com/open?id=1vqWj1XPJ2BcWTNKkAbwGji2344sNxLOd

IMLHF changed the title from "I load the pretained model not perform well compared to the youtube vedio you provided" to "The pretained model not perform well compared to the youtube vedio you provided" on Oct 18, 2019

IMLHF commented Oct 18, 2019

> I wasn't satisfied with the example I gave (new link here https://drive.google.com/open?id=1qZAYTfYe0sUobaOVaYkHDz075FcNWJgy) so I spent the evening running tests with paired high and low quality samples. Since I try to follow the data, I now want to retract my response above. I do still think it's important for getting the best quality results, but it's not the defining factor. Frankly I'm still not sure what that factor is yet - I can't identify it by any spectral features or differences by ear, but the practical upshot is that some subsamples of a voice clone better than others. The key to getting a good voice synthesis seems to me to be testing a range of different samples from the same speaker and discovering the one which clones the best. Some samples which sound fine give garbage results, while others are much better.
>
> When I mentioned training, I meant the voice you're trying to clone - the step of training the vocoder (via the alternative WaveRNN model, I believe, but I'm no ML specialist) from the sample you're using, which is the bit that interests me right now.
>
> For reference, here's about a minute of audio of three different voices reading three different quotes - this is about as good as I'm getting at the moment: https://drive.google.com/open?id=1vqWj1XPJ2BcWTNKkAbwGji2344sNxLOd

Thanks for your reply. I think our main problem is how to clone the tone of the reference speech (I mean the few seconds of audio outside the training data) as closely as possible. I don't think audio quality is the main factor behind the poor tone cloning. My first step is to check whether the evaluation script above has problems, or whether the pretrained model the author provided leads to the poor performance.

ghost changed the title from "The pretained model not perform well compared to the youtube vedio you provided" to "The pretrained model does not perform well compared to the youtube video" on Jul 5, 2020

ghost commented Jul 5, 2020

@CorentinJ Can you explain if there are any differences between the models used for your YouTube demo and the pretrained models released on the wiki page? I would like to put any speculation to rest.

In #197, it was noted that you used a different vocoder called "gen_s_mel_raw" for the video, but I don't think that is it.

CorentinJ (Owner) commented

No differences, and the vocoder is the same but with a different name


ghost commented Oct 6, 2020

Not every reference audio will clone well. The quality depends on whether it is similar to other utterances seen by the encoder and synthesizer during training.

This issue was closed.