
Questions about the toolbox from @mbdash #433

Closed
ghost opened this issue Jul 20, 2020 · 18 comments
Comments


ghost commented Jul 20, 2020

In #432 , @mbdash wrote:

I would not have dared to ask for anything, but since you mentioned it...

If I may ask for your opinion on 2 questions I have been thinking about:
(and I hope these are not stupid questions)

Q1

Do you see a way in the future to reduce or tweak the minimum output audio length below the current 5-second minimum?

For example,
Something that would allow input text lengths as low as single words such as:

  • Hi
  • Hi your-name-here
  • How are you
  • I'm fine thank you
  • yes
  • no
  • thank you

My understanding is that the minimum audio output length is around 5 seconds.
I have experimented with 90, 70, 60, 50 and 40 characters of input text.
The minimum workable input seems to be 60-70 characters to fill those 5 seconds of audio;
below that, the audio output is just weird / creepy.
The sweet spot seems to be a minimum of 80-90 characters to fill the 5-second minimum nicely.

Q2
this one is a weird one and might go against the design itself...

Would using a dataset generated purely by a single actor result in better audio output when reproducing solely that actor's voice?

and if so,

Do you have any guess of how big of a dataset would be required to reproduce the voice of a single voice actor?
1 to 1.
Essentially removing the capacity to reproduce any other voices properly when using that specific model,
for the purpose of achieving better cloning accuracy for a single voice.

i.e.:
a single voice actor reads 12 hours of transcript (or more),
then we can generate higher-quality TTS for that single actor.

Thank you for any feedback.

Originally posted by @mbdash in #432 (comment)


ghost commented Jul 20, 2020

@mbdash

Q1

Do you see a way in the future to reduce or tweak the minimum output audio length below the current 5-second minimum?

It can be worked around by padding the input with extra words, and then post-processing to remove the padding. For example @plummet555 found this workaround (#360 (comment)):

  1. I found that often there would be a harsh pop or other artifact at the start of the audio. I did a lot of experimenting with that. In the end, I added the word 'clip' to the start of every input sentence, then removed it from the output with silence detection (find the first gap in the output audio)
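
If you just want to prototype that workaround, here is a minimal sketch of the pad-and-trim idea using pydub's silence detection. The padding word, thresholds and file names are my own assumptions, not values from @plummet555's code (his full script appears later in this thread):

from pydub import AudioSegment
from pydub.silence import detect_silence

# Assumes the synthesizer was given text prefixed with a padding word (e.g. "clip ")
# and the resulting audio was written to padded_output.wav (hypothetical file name).
audio = AudioSegment.from_wav("padded_output.wav")

# Find silent stretches of at least 100 ms below -40 dBFS; the first one should fall
# just after the spoken padding word. These thresholds are guesses and may need tuning.
silences = detect_silence(audio, min_silence_len=100, silence_thresh=-40)
if silences:
    cut_ms = (silences[0][0] + silences[0][1]) // 2  # midpoint of the first silence
    audio = audio[cut_ms:]

audio.export("trimmed_output.wav", format="wav")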

Q2

Would using a dataset generated purely by a single actor result in better audio output when reproducing solely that actor's voice?

Yes, the toolbox should perform much better on speakers that it is trained on.

Do you have any guess of how big of a dataset would be required to reproduce the voice of a single voice actor?
1 to 1.
Essentially removing the capacity to reproduce any other voices properly when using that specific model,
for the purpose of achieving better cloning accuracy for a single voice.

If I were attempting this, I would extract a single embedding for the desired speaker and then fine-tune the synthesizer and vocoder models using that hardcoded embedding, following this training process: #429 (comment) (substituting your single-speaker dataset for the accent datasets).

The amount of data would depend on how well the existing models work on the target speaker.

It would be an interesting project to attempt an open-source version of the resemble.ai voice cloner, where we define a set of utterances to be recorded and fine-tune a single-speaker model using the above process. I would guess 5-10 minutes of data should be sufficient for most voices.
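
As a starting point, here is a rough sketch of the "hardcoded embedding" step, using the encoder's inference interface (the same calls demo_cli.py uses). The dataset path, output file, and the idea of averaging several utterance embeddings are my own assumptions, not a prescribed procedure:

from pathlib import Path
import numpy as np
from encoder import inference as encoder

encoder.load_model(Path("encoder/saved_models/pretrained.pt"))

# Hypothetical folder of clean recordings from the target speaker
wav_fpaths = sorted(Path("datasets/my_speaker").glob("*.wav"))
embeds = []
for wav_fpath in wav_fpaths:
    wav = encoder.preprocess_wav(wav_fpath)
    embeds.append(encoder.embed_utterance(wav))

# Average the utterance embeddings and re-normalize (utterance embeddings are L2-normalized)
speaker_embed = np.mean(embeds, axis=0)
speaker_embed /= np.linalg.norm(speaker_embed)
np.save("my_speaker_embed.npy", speaker_embed)

# During fine-tuning, this saved embedding would replace the per-utterance embeddings
# normally fed to the synthesizer, following the training process linked above.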

ghost changed the title from "Questions about the toolbox" to "Questions about the toolbox from @mbdash" on Jul 20, 2020

ghost commented Jul 20, 2020

Also see here for more info on what it would take to properly improve the models to fix your Q1: #364 (comment). The problem has been reported previously in #53 and #227.


mbdash commented Jul 20, 2020

Re: Q1:
I had a similar idea (pad the input and cut it out afterwards), but not as good as @plummet555's.
Cutting after the first silence looks like a better direction to take.

Re: Q2:
Yes, an open-source version of the resemble.ai voice cloner was what I had in mind.

I will look into both your suggestions.
Thanks again.

plummet555 commented:

I'd be happy to share the change I made to add and then cut out a leading word. Just need to find some time to tidy up the repo.


mbdash commented Jul 21, 2020

@plummet555, that would be great!

This is my first time interacting with a GitHub community and you guys are awesome. I originally really just wanted to say thanks to @blue-fish and wasn't expecting this kind of exchange.

I will try to learn how to use GitHub properly to follow your example, and also share back any changes I make to the libraries I am currently experimenting with.


plummet555 commented Jul 24, 2020

Looks like the project has moved forward (which is great!), but I think it will be a while before I get a chance to rebase and try it out. So, if it helps, for now I'll just copy here the code I wrote to add the word 'skip' to the start of each line and then find the silence following it so it can be trimmed back out of the output. It's a copy of demo_cli.py (which I called sv2tts_cli.py).

You can run it as e.g.:
python3 sv2tts_cli.py input_sample.wav input.txt exported.mp3 --cpu

where input.txt contains one or more lines of text. --cpu is optional.

Hope this helps

import warnings
warnings.filterwarnings('ignore',category=FutureWarning)
warnings.filterwarnings('ignore',category=DeprecationWarning)
warnings.filterwarnings('ignore',message="The name tf.nn.rnn_cell.RNNCell is deprecated. Please use tf.compat.v1.nn.rnn_cell.RNNCell instead.")
#import tensorflow.python.util.deprecation as deprecation
#deprecation._PRINT_DEPRECATION_WARNINGS = False

import traceback
from encoder.params_model import model_embedding_size as speaker_embedding_size
from utils.argutils import print_args
from synthesizer.inference import Synthesizer
from encoder import inference as encoder
from vocoder import inference as vocoder
from pathlib import Path
import numpy as np
import librosa
import argparse
import torch
import sys
from pydub import effects
from scipy.io.wavfile import read
from pydub import AudioSegment
from pydub.silence import detect_silence
from pydub.playback import play
import ffmpeg

##Expects a numpy array of floats in the range -1.0 to 1.0.
##Returns the midpoints (in milliseconds) of every low-amplitude (silent) stretch found.
def find_silence(samples, rate):
    tenms = int(rate / 100)
    ms = int(rate / 1000)

    threshold  = 0.04 #lower means more is considered to be noise, higher means more is considered to be silence
    run_threshold = 5
    silence_count = 0
    mids=[]
    last_block = int(samples.size / tenms) * tenms

    for outer in range(0,last_block,tenms):
        average = 0

        # Average the absolute amplitude over this 10 ms block
        for inner in range(outer, outer + tenms):
            average += abs(samples[inner])

        #print ("%d,%f" %(outer/ms, average/tenms))
        if ((average/tenms >= threshold) or (outer == last_block - tenms)):
            if (silence_count >= run_threshold):
                start_sample = outer - (silence_count * tenms)
                end_sample = outer -1
                mid_sample = (start_sample + end_sample) /2
                #print ("silence found %d, %d, mid %d" %(start_sample / ms, end_sample / ms, mid_sample/ms))
                mids.append(mid_sample/ms)
            silence_count = 0

        else:
            silence_count = silence_count+ 1

    return mids

if __name__ == '__main__':
    ## Info & args
    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )
    parser.add_argument("reference", type=Path,
                        help="Path to a reference file")
    parser.add_argument("input", type=Path,
                        help="Path to an input file")
    parser.add_argument("output", type=Path,
                        help="Path to an output file")
    parser.add_argument("-e", "--enc_model_fpath", type=Path,
                        default="encoder/saved_models/pretrained.pt",
                        help="Path to a saved encoder")
    parser.add_argument("-s", "--syn_model_dir", type=Path,
                        default="synthesizer/saved_models/logs-pretrained/",
                        help="Directory containing the synthesizer model")
    parser.add_argument("-v", "--voc_model_fpath", type=Path,
                        default="vocoder/saved_models/pretrained/pretrained.pt",
                        help="Path to a saved vocoder")
    parser.add_argument("--low_mem", action="store_true", help=\
        "If True, the memory used by the synthesizer will be freed after each use. Adds large "
        "overhead but allows to save some GPU memory for lower-end GPUs.")
    parser.add_argument("--no_sound", action="store_true", help=\
        "If True, audio won't be played.")
    parser.add_argument(
        '--cpu', help='Use CPU.', action='store_true')
    args = parser.parse_args()
    print_args(args, parser)
    if not args.no_sound:
        import sounddevice as sd


    ## Print some environment information (for debugging purposes)
    print("Running a test of your configuration...\n")
    if args.cpu:
        encoder.load_model(args.enc_model_fpath)
    elif torch.cuda.is_available():
        device_id = torch.cuda.current_device()
        gpu_properties = torch.cuda.get_device_properties(device_id)
        print("Found %d GPUs available. Using GPU %d (%s) of compute capability %d.%d with "
            "%.1fGb total memory.\n" %
            (torch.cuda.device_count(),
            device_id,
            gpu_properties.name,
            gpu_properties.major,
            gpu_properties.minor,
            gpu_properties.total_memory / 1e9))
    else:
        print("Your PyTorch installation is not configured. If you have a GPU ready "
              "for deep learning, ensure that the drivers are properly installed, and that your "
              "CUDA version matches your PyTorch installation.", file=sys.stderr)
        quit(-1)

    ## Load the models one by one.
    print("Preparing the encoder, the synthesizer and the vocoder...")
    encoder.load_model(args.enc_model_fpath)
    synthesizer = Synthesizer(args.syn_model_dir.joinpath("taco_pretrained"), low_mem=args.low_mem)
    vocoder.load_model(args.voc_model_fpath)


    ## Run a test
    #print("Testing your configuration with small inputs.")
    # Forward an audio waveform of zeroes that lasts 1 second. Notice how we can get the encoder's
    # sampling rate, which may differ.
    # If you're unfamiliar with digital audio, know that it is encoded as an array of floats
    # (or sometimes integers, but mostly floats in this project) ranging from -1 to 1.
    # The sampling rate is the number of values (samples) recorded per second, it is set to
    # 16000 for the encoder. Creating an array of length <sampling_rate> will always correspond
    # to an audio of 1 second.
    #print("\tTesting the encoder...")
    #encoder.embed_utterance(np.zeros(encoder.sampling_rate))

    # Create a dummy embedding. You would normally use the embedding that encoder.embed_utterance
    # returns, but here we're going to make one ourselves just for the sake of showing that it's
    # possible.
    #embed = np.random.rand(speaker_embedding_size)
    # Embeddings are L2-normalized (this isn't important here, but if you want to make your own
    # embeddings it will be).
    #embed /= np.linalg.norm(embed)
    # The synthesizer can handle multiple inputs with batching. Let's create another embedding to
    # illustrate that
    #embeds = [embed, np.zeros(speaker_embedding_size)]
    #texts = ["test 1", "test 2"]
    #print("\tTesting the synthesizer... (loading the model will output a lot of text)")
    #mels = synthesizer.synthesize_spectrograms(texts, embeds)

    # The vocoder synthesizes one waveform at a time, but it's more efficient for long ones. We
    # can concatenate the mel spectrograms to a single one.
    #mel = np.concatenate(mels, axis=1)
    # The vocoder can take a callback function to display the generation. More on that later. For
    # now we'll simply hide it like this:
    #no_action = lambda *args: None
    #print("\tTesting the vocoder...")
    # For the sake of making this test short, we'll pass a short target length. The target length
    # is the length of the wav segments that are processed in parallel. E.g. for audio sampled
    # at 16000 Hertz, a target length of 8000 means that the target audio will be cut in chunks of
    # 0.5 seconds which will all be generated together. The parameters here are absurdly short, and
    # that has a detrimental effect on the quality of the audio. The default parameters are
    # recommended in general.
    #vocoder.infer_waveform(mel, target=200, overlap=50, progress_callback=no_action)

    #print("All test passed! You can now synthesize speech.\n\n")


    ## Interactive speech generation

    try:
        # Get the reference audio filepath
        message = "Reference voice: enter an audio filepath of a voice to be cloned (mp3, " \
                  "wav, m4a, flac, ...):\n"
        in_fpath = args.reference

        ## Computing the embedding
        # First, we load the wav using the function that the speaker encoder provides. This is
        # important: there is preprocessing that must be applied.

        # The following two methods are equivalent:
        # - Directly load from the filepath:
        preprocessed_wav = encoder.preprocess_wav(in_fpath)
        # - If the wav is already loaded:
        original_wav, sampling_rate = librosa.load(in_fpath)
        preprocessed_wav = encoder.preprocess_wav(original_wav, sampling_rate)
        print("Loaded file successfully")

        # Then we derive the embedding. There are many functions and parameters that the
        # speaker encoder interfaces. These are mostly for in-depth research. You will typically
        # only use this function (with its default parameters):
        embed = encoder.embed_utterance(preprocessed_wav)
        print("Created the embedding")


        ## Generating the spectrogram
        with open(args.input, "r") as f:
            text = f.read()
        print (text)

        # The synthesizer works in batch, so you need to put your data in a list or numpy array
        input_texts = text.splitlines()
        audio_np_full = None

        for line in input_texts:

            # Prepend the word 'skip' so any leading artifact lands on a word we trim out below
            text = "skip " + line
            texts = [text]

            embeds = [embed] * len(texts)
            # If you know what the attention layer alignments are, you can retrieve them here by
            # passing return_alignments=True
            specs = synthesizer.synthesize_spectrograms(texts, embeds)
            print("Created the mel spectrogram")

            spec=specs[0]

            ## Generating the waveform
            print("Synthesizing the waveform:")
            # Synthesizing the waveform is fairly straightforward. Remember that the longer the
            # spectrogram, the more time-efficient the vocoder.

            ## Post-generation
            # There's a bug with sounddevice that makes the audio cut one second earlier, so we
            # pad it.
            generated_wav = np.pad(vocoder.infer_waveform(spec, overlap=800, target=8000, normalize=True, batched=False), (0, synthesizer.sample_rate), mode="constant")
            #at this point we have floats in the range -1.0 to 1.0
            #sd.play(generated_wav, synthesizer.sample_rate)
            #sd.wait()
            #librosa.output.write_wav("pretrim.wav", generated_wav, synthesizer.sample_rate)

            mids = find_silence(generated_wav, synthesizer.sample_rate)
            print ("Silences:")
            print (mids)

            # Default to trimming 400 ms; otherwise use the first detected silence midpoint
            # that falls where the spoken "skip " prefix should end (roughly 350-650 ms in)
            skip_ms = 400
            for mid in mids:
                if ((mid > 350) and (mid < 650)):
                    skip_ms = mid
                    break

            skip_pos = int(skip_ms * synthesizer.sample_rate / 1000)
            print ("Skip mid %d, pos %d" %(skip_ms, skip_pos))
            if (audio_np_full is not None):
                audio_np_full = np.append(audio_np_full, generated_wav[skip_pos:])
            else:
                audio_np_full = generated_wav[skip_pos:]

        librosa.output.write_wav("output.wav", audio_np_full.astype(np.float32), synthesizer.sample_rate)
        #sd.play(audio_np_full, synthesizer.sample_rate)
        #sd.wait()
        #if not args.no_sound:
    #        play(audio_segment_full)
        #audio_segment = AudioSegment.from_file(args.output, format="wav")
        #audio_segment.export("output.mp3", format="mp3")

        print (args.output)
        ffmpeg.input("output.wav").filter("loudnorm",I=-14, TP=-3, LRA=11).output(str(args.output)).overwrite_output().run()

        # ffmpeg -i test.mp3 -af loudnorm=I=-14:TP=-3:LRA=11:print_format=json -f null -

        print ("Done")
        exit(0)

    except Exception as e:
        print("Caught exception: %s" % repr(e))
        print (e)
        traceback.print_exc()
        exit(1)



ghost commented Jul 26, 2020

Thank you for sharing your code with us @plummet555 .

Please ask any follow-up questions as needed @mbdash and close the issue when you are satisfied.


ghost commented Jul 26, 2020

@mbdash wrote this in #449 but I am moving it here just to keep the issues organized:

Also, as a side note,
the test you did using a dataset of 1 voice had great results! After training on LibriTTS the result should be even more amazing.
I was just lying in bed last night thinking about the current RTVC potential and was wondering what you would think about this:
AzamRabiee/Emotional-TTS
see this: https://youtu.be/bh2HP0n2ik8
Have you seen that one before?

In general it takes a lot of effort to make a practical implementation of whatever is demonstrated in research papers. This project is one example, and Corentin made a master's thesis out of it, which is on the order of 1,000 hours of work. So my reaction to most new research tends to be "cool, but I'll wait for someone else to build it." Just because you can do it doesn't mean you should. Life is too short.


mbdash commented Jul 26, 2020

Haha, great answer.
But again,
I did not ask you to implement it, just for your opinion.
And I understand what you meant. It is not that simple and would require lots of work.

Thanks again.


ghost commented Jul 26, 2020

I don't understand Korean so the demo didn't make much of an impact. And I am also new to TTS and ML in general so I can't claim to understand the paper either. The general concept is promising though. I wonder if others have attempted something similar.

From the paper:

The experiments were performed on our internal Korean dataset, containing seven emotions (the neutral and six basic emotions) uttered by a male and female speaker. Every style category has 3000 sentences, recorded in 16kHz sampling rate.

Training a multispeaker TTS requires a lot of input data, which could also be used to train an "emotion encoder" to automatically assign (P,A,D) values based on clues from text and the recorded speech. (Section 2.2 says that the actual model uses 32 dimensions for emotion so the emotion encoder could output in 32-D.) Then use that in synthesizer training. I think it should generalize well because how the emotion manifests itself in an utterance should be independent of the voice of the person speaking it. Furthermore you could also use the info to correct or normalize the utterance embeddings generated by the speaker encoder.
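
To make the conditioning idea concrete, here is a minimal PyTorch sketch of feeding a 32-D emotion embedding alongside a 256-D speaker embedding. This only illustrates the concept, not the paper's architecture; all dimensions and module names here are assumptions:

import torch
import torch.nn as nn

class EmotionConditioning(nn.Module):
    # Add a projected (speaker + emotion) conditioning vector to the text encodings
    def __init__(self, text_dim=512, speaker_dim=256, emotion_dim=32):
        super().__init__()
        self.cond_proj = nn.Linear(speaker_dim + emotion_dim, text_dim)

    def forward(self, text_encodings, speaker_embed, emotion_embed):
        # text_encodings: (batch, time, text_dim)
        # speaker_embed: (batch, speaker_dim), emotion_embed: (batch, emotion_dim)
        cond = torch.cat([speaker_embed, emotion_embed], dim=-1)
        cond = self.cond_proj(cond).unsqueeze(1)  # (batch, 1, text_dim)
        return text_encodings + cond              # broadcast over the time axis

# Usage:
# cond = EmotionConditioning()
# out = cond(torch.randn(2, 80, 512), torch.randn(2, 256), torch.randn(2, 32))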


ghost commented Jul 27, 2020

@mbdash If you don't have plans for your GPU after the LibriTTS model finishes, would you be willing to help train a new encoder for better voice cloning quality?

You would use the same process as wiki/Training, but change the params for a hidden layer size of 768 instead of the current 256. There is a lot of info on this in #126, but the model in that issue was trained with an output size of 768, which makes it incompatible with everything else we have. According to wiki/Pretrained-models, the current encoder was trained to 1.56M steps in 20 days on a 1080 Ti.
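
For reference, the change should amount to editing one value in encoder/params_model.py before training from scratch. A sketch of what I mean (the exact parameter names should be double-checked against the file):

# encoder/params_model.py (sketch of the proposed change)
model_hidden_size = 768     # was 256: wider LSTM hidden layers
model_embedding_size = 256  # keep the output size at 256 so the new encoder stays
                            # compatible with the rest of the toolbox
model_num_layers = 3        # unchanged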


mbdash commented Jul 27, 2020

I will gladly put my GPU to work whenever I can.

We could go in milestones,
i.e.:
reach 250k on the synth model (92k as we speak),
then bring the encoder model on par,
then switch back to the synth and bring it to 500k,
back and forth to 750k, 1M, 1.25M, etc.

Eventually I will need it for other things,
but until then, I can put it to good use.


ghost commented Jul 27, 2020

Thank you for contributing your time and hardware, @mbdash.

If you've had a chance to look at the figure in #30 (comment) you'll notice that:

  1. The encoder makes training embeds for the synthesizer,
  2. The encoder and synthesizer are used to make training mels for the vocoder.

If an upstream element is changed, the downstream elements need to be retrained in most cases. Therefore if we are changing the encoder we should also retrain or at least finetune the synthesizer. If the synthesizer changes, then similarly update the vocoder.

So we should make our best effort to train the encoder and do any follow-on work with the synth and vocoder. If it turns out that the outputs are similar enough, we should be able to jump back and forth and finetune as you are proposing. Though it may still be good to proceed serially if there is any desire to make the training process repeatable for those who want to improve on the models in the future.
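
To illustrate the dependency, the same chain shows up at inference time. A minimal sketch using the toolbox's inference API (the paths and reference file are placeholders):

from pathlib import Path
from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from vocoder import inference as vocoder

encoder.load_model(Path("encoder/saved_models/pretrained.pt"))
synthesizer = Synthesizer(Path("synthesizer/saved_models/logs-pretrained/taco_pretrained"))
vocoder.load_model(Path("vocoder/saved_models/pretrained/pretrained.pt"))

wav = encoder.preprocess_wav("reference.wav")                   # placeholder reference audio
embed = encoder.embed_utterance(wav)                            # 1. encoder -> embedding
mels = synthesizer.synthesize_spectrograms(["Hello"], [embed])  # 2. embedding -> mel
audio = vocoder.infer_waveform(mels[0])                         # 3. mel -> waveform

# If the encoder changes, the embeddings change, so the synthesizer (and in turn the
# vocoder) start seeing inputs they were not trained on -- hence the retraining order.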


mbdash commented Jul 27, 2020

Alright then, let's switch training to the encoder.
We just passed 103k on the synth.

Provide me with the instructions and I'll do it.


ghost commented Jul 27, 2020

Please start by downloading the following datasets. These datasets are huge!

  • LibriSpeech: train-other-500 (extract as LibriSpeech/train-other-500)
  • VoxCeleb1: Dev A - D as well as the metadata file (extract as VoxCeleb1/wav and VoxCeleb1/vox1_meta.csv)
  • VoxCeleb2: Dev A - H (extract as VoxCeleb2/dev)

Later I'll open a new issue for the encoder training and post instructions there.


mbdash commented Jul 28, 2020

I already got train-other-500.
VoxCeleb 1 and 2 are password protected... so I have to wait for an email with a password.
I think we should eventually get a Slack to be more efficient with comms.
I will wait for instructions and will let you know when I have all the files.
Until then, I will let the synth train and reach 125k.
Cheers!


ghost commented Jul 28, 2020

Just to double-check, you have the LibriSpeech (not LibriTTS) version of train-other-500? I know it's not going to make a big difference but I'd prefer the LibriSpeech version so we can precisely replicate Corentin's setup with only one change (hidden model size).

Let's also plan on training the synth to 278k before switching to encoder training. In the meantime I am trying to work out the PyTorch synthesizer (#447).


ghost commented Aug 7, 2020

Questions have been answered, and this issue is inactive. @mbdash Thanks again for inspiring #437 and for your ongoing contributions towards better models. Feel free to open another issue anytime if you have anything to discuss.

ghost closed this as completed on Aug 7, 2020