
Implementation Status and planned TODOs #4

Closed · 27 tasks done
Rayhane-mamah opened this issue Feb 25, 2018 · 152 comments

Comments

Rayhane-mamah (Owner) commented Feb 25, 2018

This umbrella issue tracks my current progress and discusses the priority of planned TODOs. It has been closed since all objectives have been met.

Goal

  • achieve a high-quality, human-like text-to-speech synthesizer based on DeepMind's paper
  • provide a pre-trained Tacotron-2 model (training; still verifying this)

Model

Feature Prediction Model (Done)

  • Convolutional-RNN encoder block
  • Autoregressive decoder
  • Location Sensitive Attention (+ smoothing option)
  • Dynamic stop token prediction
  • LSTM + Zoneout
  • reduction factor (not used in the T2 paper)

Wavenet vocoder conditioned on Mel-Spectrogram (Done)

  • 1D dilated convolution
  • Local conditioning
  • Global conditioning
  • Upsampling network (by transposed convolutions)
  • Mixture of logistic distributions
  • Gaussian distribution for waveforms modeling
  • Exponential Moving Average (train + synthesis)

Scripts

  • Feature prediction model: training
  • Feature prediction model: natural synthesis
  • Feature prediction model: ground-truth aligned synthesis
  • Wavenet vocoder model: training (ground truth Mel-Spectrograms)
  • Wavenet vocoder model: training (ground truth aligned Mel-Spectrograms)
  • Wavenet vocoder model: waveforms synthesis
  • Global model: synthesis (from text to waveforms)

Extra (optional):

  • Griffin-Lim (as an alternative vocoder)
  • Reduction factor (speed up training, reduce model complexity + better alignment)
  • Curriculum-Learning for RNN Natural synthesis. paper
  • Post processing network for Linear Spectrogram mapping
  • Wavenet with Gaussian distribution (reference)

Notes:

All models in this repository will be implemented in TensorFlow in a first stage, so if you want to use a WaveNet vocoder implemented in PyTorch, you can refer to this repository, which shows very promising results.

Rayhane-mamah (Owner Author) commented Mar 4, 2018

Just putting down some notes about the last commit (7e67d8b) to explain the motivation behind such major changes and to verify with the rest of you that I didn't make any silly mistakes (as usual..).

This commit mainly had 3 goals: (other changes are minor)

  • Clean the code: added some comments and changed the code architecture to use tensorflow's attention wrapper with the objective of reducing the number of files. Even though I tried getting rid of the "custom_decoder" and the custom "dynamic_decode" I'm currently using, after diving deep into tensorflow's implementation I found it was impossible to adapt my dynamic <stop_token> prediction to tensorflow's ready-to-use "BasicDecoder" and "dynamic_decode" with my custom helpers.
  • Correct the attention: even though they call it "location sensitive attention" in the paper, they didn't mean the "location-based attention" we know; instead, they were referring to the "hybrid" attention. For this hypothesis I'm relying on this part of the paper: "We use the location sensitive attention from [21], which extends the additive attention mechanism [22] to use cumulative attention weights from previous decoder time steps as an additional feature.", which suggests they took the original Bahdanau attention and added location features to it.
  • Added a "map" (log) file at synthesis: mainly, this maps each input sequence to the corresponding real Mel-spectrograms and the generated ones.

I also want to bring attention to these few points (in case someone wants to discuss them):

  • I impute finished sequences at decoding time to ensure the model doesn't have to learn to predict padding (which would probably result in extra noise in the generated waveforms later).
  • If I'm not mistaken, the paper's authors used the projection to a scalar + sigmoid to explicitly predict a "<stop_token>" probability, since our feature prediction model isn't performing a classification task where it could choose to output a literal <stop_token>. I like to think of it as a small binary classifier that chooses when to stop decoding, since a vanilla decoder can't output a frame of all zeros (see the short sketch after this list).
  • I am only using 512 LSTM units per decoder layer, as I assumed "The pre-net output and attention context vector are concatenated and passed through a stack of 2 uni-directional LSTM layers with 1024 units." means that the 1024 units are distributed across the 2 layers.
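For illustration, here is a minimal numpy sketch of that stop-token idea (shapes and names are illustrative, not the repository's actual code): a linear projection of the decoder output (concatenated with the attention context) to a scalar, followed by a sigmoid, gives the probability that decoding should stop.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# [batch, decoder_lstm_output + attention_context] -- illustrative sizes
decoder_output = np.random.randn(1, 512 + 512)
W_stop = np.random.randn(decoder_output.shape[1], 1)   # projection to a scalar
b_stop = np.zeros(1)

stop_prob = sigmoid(decoder_output @ W_stop + b_stop)  # P(current frame is the last one)
finished = stop_prob > 0.5                             # dynamic <stop_token> decision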

@imdatceleste

Hi @Rayhane-mamah, using 7e67d8b I got an error (in the end): you changed the call parameter name from previous_alignments to state in attention.py:108.

Was that on purpose? AttentionWrapper from TF requires the parameter to be named previous_alignments. (Using TF 1.4)

Changing that back to previous_alignments results in other errors:

ValueError: Shapes must be equal rank, but are 2 and 1 for 'model/inference/decoder/while/BasicDecoderStep/decoder/output_projection_wrapper/output_projection_wrapper/concat_lstm_output_and_attention_wrapper/concat_lstm_output_and_attention_wrapper/multi_rnn_cell/cell_0/cell_0/concat_prenet_and_attention_wrapper/concat_prenet_and_attention_wrapper/attention_cell/MatMul' (op: 'BatchMatMul') with input shapes: [2,1,?,?], [?,?,512].

Any ideas?

Rayhane-mamah (Owner Author) commented Mar 5, 2018

Hi @imdatsolak, thanks for reaching out.

I encountered this problem on one of my machines; updating TensorFlow to the latest version solved it. (I changed the parameter to state according to the latest TensorFlow attention wrapper source code. I also want to point out that I am using TF 1.5 and can confirm that the attention wrapper works with "state" for this version and later.)

Try updating tensorflow and keep me notified, I'll look into it if the problem persists.

imdatceleste commented Mar 6, 2018

@Rayhane-mamah, I tried with TF 1.5, which didn't work. Looking into TF 1.5, the parameter was still called previous_alignments; its name changed to state in TF 1.6, so I installed TF 1.6 and it works now. Thanks!
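For anyone else hitting this, a small hedged guard along those lines (illustrative, not part of the repo):

import tensorflow as tf
from distutils.version import LooseVersion

# The keyword for the previous alignments in AttentionMechanism.__call__ was
# renamed from `previous_alignments` (TF <= 1.5) to `state` (TF >= 1.6).
if LooseVersion(tf.__version__) < LooseVersion('1.6.0'):
    raise RuntimeError("This code expects TF >= 1.6 (attention __call__ uses `state=`).")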

@danshirron

Upgrading to TF 1.6 (was 1.5) solved the issue (TypeError: __call__() got an unexpected keyword argument 'previous_alignments') for me.

@Rayhane-mamah (Owner Author)

@imdatsolak, yes, my bad. @danshirron is perfectly right. I checked and my version is 1.6 too (I don't remember updating it Oo).

Rayhane-mamah (Owner Author) commented Mar 10, 2018

Quick notes about the latest commit (7393fd5):

  • Corrected parameter initialization, which was causing gradient explosion in some cases (now using a Xavier initializer)
  • Added gradient norm visualization
  • Changed the learning rate decay to start from step 0 (instead of 50000) and added a visualization of the learning rate
  • Corrected typos in "hparams.py"
  • Changed alignment plot directories and added real + predicted Mel-spectrogram plots (every 100 training steps)
  • Added a small Jupyter notebook where you can use Griffin-Lim to reconstruct phase and listen to the audio reconstructed from generated Mel-spectrograms (just to check the model's learning state without paying much attention to audio quality, as we will use Wavenet as a vocoder)
  • Started using a reduction factor (despite it not being used in Tacotron-2), as it speeds up the training process (faster computation) and allows for faster alignment learning (current: r=5, feel free to change it; see the short sketch after this list)
  • Corrected typos in preprocessing (make sure to rerun the preprocessing before training your next model)
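As a rough sketch of what the reduction factor does to the targets (illustrative numpy, not the repository's feeder code): with r frames emitted per decoder step, mel targets of shape [T, n_mels] are viewed as [T/r, r*n_mels] after padding T to a multiple of r.

import numpy as np

r, n_mels = 5, 80
mels = np.random.randn(123, n_mels)                    # [T, n_mels] target frames
pad = (-len(mels)) % r                                 # pad T up to a multiple of r
mels = np.pad(mels, [(0, pad), (0, 0)], mode='constant')
grouped = mels.reshape(-1, r * n_mels)                 # [T/r, r*n_mels]: r frames per decoder step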

Side notes:

  • Alignment should appear at step 15k; audio becomes quite audible at 4~5k steps (using batch size 32) but fully understandable around 8~10k steps.
  • Mel-spectrograms seem very blurry at the beginning, and despite the loss not decreasing much (you may even feel it's constant after 1k steps), the model will still learn to improve speech quality, so be patient.

If there are any problems, please feel free to report them, I'll get to it as fast as possible

Rayhane-mamah (Owner Author) commented Mar 15, 2018

Quick review of the latest changes (919c96a):

  • Global code reorganization (for easier modifications, and it's just cleaner now)
  • Network architecture review: since there are some unclear points in the paper, I am doing my best to collect enough information from all related works and put it all together to get reasonable results. The current architecture is the closest I got to the described T2 (I think.. ^^')
  • Pulled the <stop_token> prediction out of the decoder and got rid of the custom "dynamic_decode".
  • Reduced the model size and added new targets (stop token targets are now prepared in the feeder)
  • Adapted <stop_token> prediction to work properly with the reduction factor (multiple <stop_token> predictions at each decoding step)
  • Doubled the number of LSTM units in the decoder and the number of neurons in the prenet. On the other hand, I removed the separate attention LSTM and started using the first decoder LSTM hidden state as the query for the attention.

Side Notes:

  • Despite slightly reducing the memory usage of the model, the impact on training speed is still not clear: forward propagation got slightly faster and backpropagation slightly slower, but the overall speed seems about the same.

If anyone tries to train the model, please consider providing us with some feedback (especially if the model needs improvement).

ohleo commented Mar 15, 2018

Hi @Rayhane-mamah, thanks for sharing your work.

I cannot get a proper mel-spectrogram prediction or audible waveform from evaluation or natural synthesis (no GTA) at step 50k.
All hparams are the same as in your code (with the LJSpeech DB), and the waveforms are generated by mel prediction, mel_to_linear, and Griffin-Lim reconstruction.
GTA synthesis generates audible results.

Does it work in your experiments?

I attached some Mel-spectrogram plot samples with following sentences.

1 : “In Dallas, one of the nine agents was assigned to assist in security measures at Love Field, and four had protective assignments at the Trade Mart."

Ground Truth
image

GTA
image

Natural(Eval)
image

2 : ”The remaining four had key responsibilities as members of the complement of the follow-up car in the motorcade."

Ground Truth
image

GTA
image

Natural(Eval)
image

3 : “Three of these agents occupied positions on the running boards of the car, and the fourth was seated in the car."

Ground Truth
image

GTA
image

Natural(Eval)
image

Rayhane-mamah (Owner Author) commented Mar 15, 2018

Hello @ohleo, thank you for trying our work and especially for sharing your results with us.

The problem you're reporting seems to be the same as the one @imdatsolak mentioned here.

There are two possible reasons I can think of right now:

  • Your model after 50k steps still has an ugly alignment (hopefully this commit takes care of that). That's the most probable reason, I think.
  • I am unknowingly and indefinitely passing the first frame to the decoder in my code. I will triple check this today (in case TacotestHelper is the cause).
  • It can't possibly be doing a massive overfit on the first generated frame, can it? Oo The output looks the same for the three sentences!

The fact that GTA is working fine strongly suggests the problem is in the helper.. I will report back to you later tonight.
If your setup is powerful enough, you could try retraining the model using the latest commit, or wait for me to test it myself a bit later this week.

In all cases, thanks a lot for your contribution, and hopefully we get around this issue soon.

unwritten commented Mar 19, 2018

Hello, @Rayhane-mamah ,

do you get any further information running using the latest code?

Rayhane-mamah (Owner Author) commented Mar 19, 2018

Hello @unwritten, thanks for reaching out.
I believe you asked about GTA as well? I'm just gonna answer it anyway in case anyone gets the same question.

GTA stands for Ground Truth Aligned. Synthesizing audio with GTA basically uses teacher forcing to help the model predict Mel-spectrograms. If you aim to use the generated spectrograms to train a vocoder like Wavenet, then this is probably how you want to generate your spectrograms for now. It is important to note, however, that in a fully end-to-end test case you won't be given the ground truth, so you will have to use "natural" synthesis, where the model simply looks at its last predicted frame to output the next one (i.e. no teacher forcing).
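In pseudo-code form, the difference between the two modes looks roughly like this (a toy sketch; decoder_step stands in for the real prenet + LSTM + attention stack):

import numpy as np

def decoder_step(prev_frame):
    return prev_frame  # placeholder for the actual network

def decode(n_frames, ground_truth=None):
    # ground_truth given  -> GTA / teacher-forced synthesis
    # ground_truth absent -> natural, fully autoregressive synthesis
    outputs, prev = [], np.zeros(80)                   # <GO> frame
    for t in range(n_frames):
        pred = decoder_step(prev)
        outputs.append(pred)
        prev = ground_truth[t] if ground_truth is not None else pred
    return np.stack(outputs)

mels = np.random.randn(100, 80)                        # ground-truth mel frames
gta_mels = decode(len(mels), ground_truth=mels)        # e.g. for training a vocoder
natural_mels = decode(100)                             # end-to-end inference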

Until my last commit, the model wasn't able to use natural synthesis properly, and I mainly suspected the attention mechanism because, well, how is the model supposed to generate correct frames if it doesn't attend to the input sequence correctly.. Which brings us to your question.

So after a long weekend of debugging, it turned out that the attention mechanism is just fine, and that the problem might have been with some Tensorflow scopes or whatever (I'm not really sure what the problem was). Anyway, after going back through the entire architecture, trying some different preprocessing steps and replacing zoneout LSTMs with vanilla LSTMs, the problem seems to be solved. (I'm not entirely 100% sure, as I have not yet trained the model very far, but things seem as they should be in the early stages of training.)

I will update the repository in a bit (right after doing some cleaning), and there will be several references to the papers the implementation was based on. These papers will be in PDF format in the "papers" folder, so it's easier to find them if you want an in-depth look at the model.

I will post some results (plots and griffin lim reconstructed audio) as soon as possible. Until then, if there is anything else I can assist you with, please let me know.

Notes:

  • It is possible for now to use the Griffin-Lim algorithm (using the provided notebook) to do a basic inversion of the mel spectrogram to a waveform. The quality won't be as good as Wavenet's, but it's mainly for test and debug purposes for now.
  • Generated spectrograms from "synthesize.py" will be stored under the "output" folder. Depending on the synthesis mode you used, there will be several possible sub-folders.
  • I have not yet added the Wavenet vocoder to this repository, as there are more important things at the moment, like ensuring good spectrogram generation. There are good Wavenet implementations out there that are conditioned on Mel-spectrograms, like r9y9/wavenet.

@Rayhane-mamah (Owner Author)

Hello again @unwritten.

As promised I pushed the commit that contains the rectifications (c5e48a0).

Results, samples and pretrained model will be coming shortly.

PetrochukM commented Mar 20, 2018

@Rayhane-mamah

Results, samples and pretrained model will be coming shortly.

Trying to understand "shortly", do you think they'll be out today, next week or next month?

@Rayhane-mamah (Owner Author)

@PetrochukM, I was thinking more like next year.. that still counts as "shortly" I guess..

Enough messing around, let's say it will take a couple of days.. or a couple of weeks :p But what's important, it will be here eventually.

@imdatceleste

Hi everybody, here is a new dataset that you can use to train Speech Recognition and Speech Synthesis: M-AILABS Speech Dataset. Have fun...

@unwritten

@Rayhane-mamah thanks for the work;
I have tried to train the latest commit (maybe before 81b657d); I pulled the code about 2 days ago. Currently it has run to about 4k, but the alignment doesn't look like it's there. I will try the newest code though:
step-45000-pred-mel-spectrogram
step-45000-real-mel-spectrogram

step-45000-align

@Rayhane-mamah (Owner Author)

Hi @imdatsolak, thank you very much for the notification. I will make sure to try it out as soon as possible.

@unwritten, I experienced the same issue with the commit you're reporting.

If you really don't want to waste your time and computation power on failed tests, you could wait a couple of days (at best) or a couple of weeks (at worst) until I post a model that is 100% sure to work, semi-pretrained, which you can train further for better quality (I don't have the luxury of training for many steps at the moment, unfortunately).

Thank you very much for your contribution. If there is anything I can help you with or if you notice any problems, feel free to report back.

@maozhiqiang

@Rayhane-mamah thanks for the work;
why does the loss decrease much more quickly than in Tacotron-1?

@Rayhane-mamah (Owner Author)

Hello @maozhiqiang, thank you for reaching out.

In comparison to Tacotron-1, which uses a simple summed L1 loss function (or MAE), in Tacotron-2 we use a summed L2 loss function (or MSE). (In both cases the sum is over predictions before and after the postnet.) I won't pay much attention to the averaging over the batch here, for simplicity.

Let's take a look at both losses (h(x_i) stands for the model estimation):

L1 = ∑_i |y_i − h(x_i)|
L2 = ∑_i (y_i − h(x_i))²

The L1 loss computes the residual error between your model's predictions and the ground truth and returns its absolute value as is. The L2 loss, however, squares this error for each sample instead of simply returning the absolute difference.
Now consider that your model starts from an initial state t0 where weights and biases are initialized randomly. Naturally, the first model outputs will be totally random, which results in a high L1 loss that is amplified even further by the square operation in L2 (assuming the initial loss is greater than 1).
After a few steps of training, the model should start emitting outputs that are in the range of the correct predictions (especially if your data is [0, 1] normalized, as in our case; the model doesn't take long to start producing outputs in that range). This can be seen in the blurry, yet close-to-real, spectrograms the model provides every 100 steps.
At this stage, the L1 and L2 loss functions start showing very different values. Take a difference (y_i − h(x_i)) smaller than 1 and compute its square: naturally you get an even smaller value. So once the model starts giving outputs in the correct range, the L2 loss is already very low compared to the L1 loss, which does not square the error.
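A tiny numeric illustration of that point (made-up values):

import numpy as np

y = np.array([0.8, 0.2, 0.5])          # ground truth
h = np.array([0.5, 0.4, 0.1])          # model prediction, already in range
err = y - h
print(np.abs(err).sum())               # L1 = 0.9
print((err ** 2).sum())                # L2 = 0.29, much smaller once |err| < 1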

Note: After that, the model only has to improve the vocal patterns, which consist of small adjustments, which explains why the loss then starts decreasing very slowly.

So mainly, what I'm trying to point out here is that we are not using the same loss function as in Tacotron-1, which I believe is the main reason for such a difference. However, there are other factors, like the difference in model architecture, or even the difference in the target itself (in Tacotron-1, we predict both Mel-spectrograms and linear spectrograms using the post-processing net).

I believe this answers your question? Thanks again for reaching out, if there is anything else I can assist you with, please let me know.

@maozhiqiang

Hello @Rayhane-mamah, thanks for your detailed reply.
I started training with your code these days.
Here's my training figure:
step-27000-align
step-27000-pred-mel-spectrogram
step-27000-real-mel-spectrogram
When I run for more than one hundred thousand steps, the difference between pred-mel and real-mel is still large, but the loss is around 0.03 or smaller.
Is there any problem with this?
Looking forward to your reply, thank you.

a3626a commented Mar 23, 2018

Here is empirical evidence for @Rayhane-mamah 's reasoning.

default

The yellow line uses the Tacotron-1 loss function; the brown line uses the Tacotron-2 loss function. The brown loss is roughly the square of the yellow loss (and they intersect at 1.0!).

a3626a commented Mar 23, 2018

Hello.
I'm working on Tacotron-2 and have worked based on Keithito's implementation. Recently, I have been trying to move to your implementation for a few reasons.

There is one fundamental difference between @Rayhane-mamah's TacotronDecoderCell and tensorflow.contrib.seq2seq.AttentionWrapper, which Keithito used: AttentionWrapper uses the previous output (mel spectrogram) AND the previous attention (= context vector), but yours only uses the previous outputs.

My modified version of Keithito's impl can make a proper alignment, but yours cannot (or your impl just requires more steps to reach a good alignment). I suspect the above-mentioned difference is responsible for this result.

(One strange behavior of your implementation is that the quality of the synthesized samples on the test set is quite good, even though their alignments are poor. With Keithito's implementation, without proper alignment, the test loss is really huge.)

Do you have any idea about this? (Which one is right, concatenating the previous attention or not?)

Rayhane-mamah (Owner Author) commented Mar 23, 2018

Hello @maozhiqiang and @a3626a, thank you for your contributions.

@maozhiqiang, the loss you're reporting is perfectly normal; actually, the smaller the loss the better, which explains why the further you train your model, the better the predicted Mel-spectrograms become.

The only apparent problem, which is also reported by @a3626a, is that the current state of the repository (the current model) isn't able to capture a good alignment.

@maozhiqiang, alignments are supposed to look something like this:
step-25000-align

Now, @a3626a, about that repository comparison, I made these few charts to make sure we're on the same page, and to make it easier to explain (I'm bad with words T_T).

Please note that, for simplicity, the encoder outputs call, the <stop_token> prediction part and the recurrent call of previous alignments are not represented.
If you notice any mistakes, please feel free to correct me:

Here's my understanding of how keithito's decoder works:
tacotron-1-decoder

The way I see it, he is using an extra stateful RNN cell to generate the query vector at each decoding step (I'm assuming this is based on T1, where a 256-unit GRU is used for this purpose). He's using a 128-unit LSTM for this RNN.

As you stated, the last decoder step outputs are indeed concatenated with the previous context vector before being fed to the prenet (this is done automatically inside Tensorflow's attention_wrapper).
Please also note that in the "hybrid" implementation keithito is using, he does not concatenate the current context vector with the decoder RNN output before doing the linear projection (just pointing out another difference in the architecture).

Now, here's what my decoder looks like:
tacotron-2-decoder

In this chart, the blue and red arrows (and terms in equations) represent two different implementations I tried separately for the context vector computation. Functions with the same name in both graphs represent the same layers (look at the end of the comment for a brief explanation about each symbol).

The current state of the repository is the one represented in blue, i.e. I use the last decoder RNN output as the query vector for the context vector computation. I also concatenate the decoder RNN output and the computed context vector to form the projection layer input.

Now, after reading your comments (and thank you for your loss plot, by the way), two possible versions came to mind when thinking about your modified version of keithito's Tacotron:

First and most likely one:
In case you used Tensorflow's attention_wrapper to wrap the entire decoder cell, then this chart should probably explain how your decoder is working:
tacotron-hypothesis-1-decoder

Here I am supposing that you are using the previous context vector in the concatenation operations (c_{i-1}) and then updating your context vector at the end of the decoding step. This is what naturally happens if you wrap the entire TacotronDecoderCell (without the alignments and attention part) with Tensorflow's attention_wrapper.

Second but less likely one:
If however you did not make use of the attention_wrapper, and do the context vector computation right after the prenet, this is probably what your decoder is doing:
tacotron-hypothesis-2-decoder

This actually seems weird to me because we're using the prenet output as a query vector.. Let's just say I'm used to providing RNN outputs as query vectors for attention computation.

Are any of these assumptions right? Or are you doing something I didn't think of? Please feel free to share your approach with us! (Words will do, no need for charts x) )

So, to wrap things up (so much wrapping..), I am aware that generating the query vector with an additional LSTM gives a proper alignment; I am, however, trying to figure out a way that doesn't necessarily use an "extra" recurrent layer, since one wasn't explicitly mentioned in the T2 paper (and let's be honest, I don't want my hardware to come back and haunt me when it gets tired of all this computation).

Sorry for the long comment, below are the symbols explained:

  • p() is a multi-layered non-linear function (prenet)
  • e_rec() stands for Extra Recurrency (attention LSTM)
  • Attend() is typically the attention network (refer to (content+location) attention paper for developed formulas)
  • rec() is the decoder Recurrency (decoder LSTM)
  • f() is a linear transformation
  • p_y_{i}, s_{i}, es_{i}, y_{i}, a_{i} and c_{i} are the prenet output, decoder RNN hidden state, attention RNN hidden state, decoder output, alignments and context vector respectively (all at the i-th step).
  • h is the encoder hidden states (encoder outputs)

Note:
About the quality of synthesized samples on the test set: I am guessing you're referring to the GTA synthesis? It's somewhat predictable, since GTA is basically 100% teacher-forced synthesis (we provide the true frame instead of the last predicted frame at each decoding step). Otherwise (for natural synthesis), the quality is very poor without alignment.
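For readers without the charts, here is a toy numpy sketch of the "blue" variant described above (illustrative weights and a toy dot-product attention, not the repository's code): the decoder RNN output s_i is used as the attention query, and [s_i; c_i] feeds the linear projection.

import numpy as np

rng = np.random.RandomState(0)
enc_dim, dec_dim, n_mels, text_len = 512, 512, 80, 60

encoder_outputs = rng.randn(text_len, enc_dim)       # h
W_in = rng.randn(dec_dim, n_mels)                    # toy "decoder LSTM" input weights
W_out = rng.randn(dec_dim + enc_dim, n_mels)         # projection f()

def prenet(y):                                       # p()
    return np.maximum(0.0, y)

def attend(query, memory):                           # Attend(), toy dot-product scoring
    scores = memory @ query
    a = np.exp(scores - scores.max()); a /= a.sum()
    return a @ memory, a                             # context vector, alignments

s = np.zeros(dec_dim)                                # decoder RNN state s_{i-1}
y = np.zeros(n_mels)                                 # <GO> frame
for _ in range(5):                                   # a few decoding steps
    s = np.tanh(W_in @ prenet(y) + s)                # s_i = rec(s_{i-1}, p(y_{i-1}))
    c, alignments = attend(s, encoder_outputs)       # c_i computed with query s_i (blue arrow)
    y = np.concatenate([s, c]) @ W_out               # y_i = f([s_i; c_i])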

a3626a commented Mar 24, 2018

Most of all, thank you for your reply with nice diagrams.

  1. About quality of samples on test set.
    Though I have not tested, you are probably right. Teacher forcing was enabled in my system.

  2. About my implementation
    My implementation's structure is almost identical to Keithito's. I mean 'modified' in the sense of adding more regularization methods, speaker embedding, and a different language with a different dataset.

  3. My future approach
    I will follow your direction, getting rid of the extra recurrent layer for the attention mechanism. In my opinion, the 2-layer decoder LSTMs can do the job of the extra recurrent layer. I think what to feed into _compute_attention is the key, which is not clear in the paper (like you did, with the red arrow and blue arrow).
    For a start, I will feed the 'previous cell state of the first decoder LSTM cell'. There are 2 reasons for this choice. First, I expect the first LSTM cell to work as an attention RNN. Second, it seems better to feed the cell state, not the hidden state (output), because that does not require unnecessary transformations of information. In other words, the hidden state (output) of an LSTM cell would be more like a spectrogram, not phonemes, so it would have to be converted back into phoneme-like data to calculate the energy (or score). In contrast, the cell state can hold phoneme-like data which can easily be compared to the encoder outputs (phonemes). (The small sketch below illustrates this choice.)
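A small self-contained illustration of that choice (toy attention, illustrative names, not the repo's API): an LSTM step carries both an output/hidden state h and a cell state c, and the proposal is to use the previous c as the query.

from collections import namedtuple
import numpy as np

LSTMStateTuple = namedtuple("LSTMStateTuple", ["c", "h"])

def attend(query, memory):                  # toy dot-product attention
    scores = memory @ query
    a = np.exp(scores - scores.max()); a /= a.sum()
    return a @ memory, a

prev_state = LSTMStateTuple(c=np.random.randn(256), h=np.random.randn(256))
encoder_outputs = np.random.randn(50, 256)  # phoneme-like memory
context, alignments = attend(prev_state.c, encoder_outputs)   # query = cell state, not h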

Rayhane-mamah (Owner Author) commented Mar 24, 2018

Hello again and thank you for your answers.

"Speaker embedding" sounds exciting. I'm looking forward to hearing some samples once you're done making it!

About the attention, this is actually a nice interpretation! I can't test it out right now but I will definitely do! If you do try it out please feel free to share your results with us.

Thanks again for your contributions!

a3626a commented Mar 24, 2018

I'm testing feeding 'previous cell state of first decoder LSTM cell', I will share the result after 1-2 days.

Thank you.

r9y9 (Contributor) commented Mar 25, 2018

Wow, nice thread;) I will follow the discussion here and would like to look into your code. Thank you for sharing your work!

@atreyas313

Hi @Rayhane-mamah,
what is the reason for removing preemphasis in Tacotron-2?

@ishansan38

Hey @Rayhane-mamah, amazing work on the repo!
I was wondering if you could provide the pretrained model so I could quickly evaluate the results on my end.

Thanks!

@atreyas313

hi,
after training and synthesizing with the Tacotron model, an OOM error occurred when training the Wavenet model.
Information about my GPU:
GTX 1080 with 8 GB memory. Is it possible for you to say which GPU (and how much RAM) the Wavenet training was done on?

ghost commented Jun 13, 2018

hi, @Rayhane-mamah
I do not know why the WAV should be rescaled during the preprocessing procedure, in this way:

if hparams.rescale:
    wav = wav / np.abs(wav).max() * hparams.rescaling_max

Could you tell me why, or is there any keyword I could search via Google? Thank you so much.

DanRuta commented Jun 13, 2018

Hi @Rayhane-mamah. I'm having an issue similar to @osungv's, above.

I ran python train.py --model='Both' after pre-processing the LJ dataset. The Tacotron model trained fine and gta/map.txt was generated, but the InvalidArgumentError: Conv2DCustomBackpropFilterOp only supports NHWC error arises when it reaches the Wavenet training stage, and on subsequent Wavenet-only runs.

I've made no changes to the (latest) code, and I've ensured that the GPU is used.

I did change outputs_per_step to 5 in hparams.py, as my 8 GB of GPU memory wasn't enough, and I saw this suggestion somewhere.

GPU: GTX 1080
CPU: 7700k
RAM: 32GB
tensorflow-gpu version: 1.8.0
Running on Windows

Any ideas?

@jgarciadominguez

We have trained for 200,000 steps with a small corpus. Not a good result, but the weird thing is that every time we synthesize, the result is slightly different.

python3 synthesize.py --model='Tacotron' --mode='eval' --hparams='symmetric_mels=False,max_abs_value=4.0,power=1.1,outputs_per_step=1' --text_to_speak='this is a test'

and hparams
Hyperparameters: allow_clipping_in_normalization: True attention_dim: 128 attention_filters: 32 attention_kernel: (31,) cleaners: english_cleaners cumulative_weights: True decoder_layers: 2 decoder_lstm_units: 1024 embedding_dim: 512 enc_conv_channels: 512 enc_conv_kernel_size: (5,) enc_conv_num_layers: 3 encoder_lstm_units: 256 fft_size: 1024 fmax: 7600 fmin: 125 frame_shift_ms: None griffin_lim_iters: 60 hop_size: 256 impute_finished: False input_type: raw log_scale_min: -32.23619130191664 mask_encoder: False mask_finished: False max_abs_value: 4.0 max_iters: 2500 min_level_db: -100 num_freq: 513 num_mels: 80 outputs_per_step: 1 postnet_channels: 512 postnet_kernel_size: (5,) postnet_num_layers: 5 power: 1.1 predict_linear: False prenet_layers: [256, 256] quantize_channels: 65536 ref_level_db: 20 rescale: True rescaling_max: 0.999 sample_rate: 22050 signal_normalization: True silence_threshold: 2 smoothing: False stop_at_any: True symmetric_mels: False tacotron_adam_beta1: 0.9 tacotron_adam_beta2: 0.999 tacotron_adam_epsilon: 1e-06 tacotron_batch_size: 2 tacotron_decay_learning_rate: True tacotron_decay_rate: 0.4 tacotron_decay_steps: 50000 tacotron_dropout_rate: 0.5 tacotron_final_learning_rate: 1e-05 tacotron_initial_learning_rate: 0.001 tacotron_reg_weight: 1e-06 tacotron_scale_regularization: True tacotron_start_decay: 50000 tacotron_teacher_forcing_ratio: 1.0 tacotron_zoneout_rate: 0.1 trim_silence: True use_lws: True Constructing model: Tacotron

Any ideas why this could be happening?
TEST 1:
speech-mel-00001
speech-alignment-00001
TEST 2:
speech-mel-00001_1
speech-alignment-00001_1

@atreyas313

Hi,
I trained the network using "python train.py --model='Both'", then tried to synthesize from checkpoints using "python synthesize.py --model='Tacotron-2'", and this error occurred:
DataLossError (see above for traceback): file is too short to be an sstable
[[Node: model_1/save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_model_1/save/Const_0_0, model_1/save/RestoreV2/tensor_names, model_1/save/RestoreV2/shape_and_slices)]]
[[Node: model_1/save/RestoreV2/_493 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_498_model_1/save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]

Can anyone guide me on how to fix this error?

hadaev8 commented Jul 13, 2018

ben-8878 commented Jul 23, 2018

@atreyas313 I met the same error, did you solve it?
I trained the model with 'Tacotron-2' and tested with 'Both' as the author said, but still got the error.

@Rayhane-mamah (Owner Author)

@jgarciadominguez results are different every time because we keep the decoder prenet dropout active even during synthesis. As for the quality, your batch size is very small during training, which prevents the model from learning to align, hence the bad quality. Don't use a batch size smaller than 32; it's okay to use outputs_per_step=3 for that purpose.
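A quick illustration of why keeping prenet dropout active at synthesis makes every run differ slightly (toy numpy, not the repo's prenet):

import numpy as np

def prenet_with_dropout(x, rate=0.5, rng=np.random):
    mask = rng.binomial(1, 1.0 - rate, size=x.shape) / (1.0 - rate)
    return np.maximum(0.0, x) * mask

x = np.ones(256)
print(np.allclose(prenet_with_dropout(x), prenet_with_dropout(x)))  # False: outputs differ per call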

@atreyas313 and @v-yunbin, this "Both" option seems to be causing everyone problems, so I took it off; please make sure to train with "Tacotron-2" instead (it will train Tacotron + Wavenet).

@hadaev8 I thought about using that for faster computation but didn't really spend much time trying to apply zoneout to it; if you have any success with it, let me know :)

This is a sample of the wavenet from last commit on M-AILABS mary_ann:
wavenet-northandsouth_01_f000005.wav.tar.gz

Since all objectives of this repo are now met, I will close this issue and go through all open issues to answer most of them this evening. The pretrained models and samples will be added to the README.md soon. If any problems persist, feel free to open new issues.

puneet-kr commented Aug 14, 2018

Hi there, first of all, I'm thankful for this code.

I'm a beginner and trying to run it. With the parallelization implemented in datasets/preprocessor.py, I'm getting this error:
BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

Can somebody please convert this code to a serial implementation:

executor = ProcessPoolExecutor(max_workers=n_jobs)
futures = []
index = 1
for input_dir in input_dirs:
    with open(os.path.join(input_dir, 'metadata.csv'), encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split('|')
            basename = parts[0]
            wav_path = os.path.join(input_dir, 'wavs', '{}.wav'.format(basename))
            text = parts[2]
            futures.append(executor.submit(partial(_process_utterance, mel_dir, linear_dir, wav_dir, basename, wav_path, text, hparams)))
            index += 1
return [future.result() for future in tqdm(futures) if future.result() is not None]

I understood that _process_utterance(out_dir, index, wav_path, text) needs to be called for every input, but I couldn't yet understand how to modify this statement:

return [future.result() for future in tqdm(futures) if future.result() is not None]
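A minimal serial rewrite of that excerpt (a sketch, assuming the same _process_utterance signature used in the parallel version quoted above): drop the executor/futures, call _process_utterance directly, and keep only non-None results.

import os
from tqdm import tqdm

# _process_utterance is the existing function in datasets/preprocessor.py
def build_from_path_serial(input_dirs, mel_dir, linear_dir, wav_dir, hparams):
    metadata = []
    for input_dir in input_dirs:
        with open(os.path.join(input_dir, 'metadata.csv'), encoding='utf-8') as f:
            for line in tqdm(f):
                parts = line.strip().split('|')
                basename = parts[0]
                wav_path = os.path.join(input_dir, 'wavs', '{}.wav'.format(basename))
                text = parts[2]
                result = _process_utterance(mel_dir, linear_dir, wav_dir, basename, wav_path, text, hparams)
                if result is not None:
                    metadata.append(result)
    return metadata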

@anushaprakash90

Hi Rayhane,

I am running this code for the first time. I am training the Tacotron-2 model using the LJSpeech dataset. The model is training without any issues, but on the CPU and not the GPUs (checked with nvidia-smi). Is there anything that needs to be specified explicitly so that training runs on the GPUs?

Rayhane-mamah (Owner Author) commented Oct 1, 2018 via email

@ishandutta2007

@Rayhane-mamah Is the pretrained model ready?

@lucasjinreal

@Rayhane-mamah Are there any sample audios inferred with the trained model? Do they sound good or not?

hxs7709 commented Oct 18, 2018

@Rayhane-mamah Thank you very much for the great repository, I like it. I see the ClariNet paper and some code changes were committed to your repository on 10.7. Does that mean it supports ClariNet now? Or do you have plans to support ClariNet?

hxs7709 commented Oct 19, 2018

@Rayhane-mamah
In README.md, we should add --mode='synthesis' to the following commands, otherwise we still run in eval mode because mode's default value is 'eval'.
Please double-check it.
python synthesize.py --model='Tacotron' --GTA=True # synthesized mel spectrograms at tacotron_output/gta
python synthesize.py --model='Tacotron' --GTA=False # synthesized mel spectrograms at tacotron_output/natural
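For reference, with that flag added the commands would read:
python synthesize.py --model='Tacotron' --mode='synthesis' --GTA=True   # synthesized mel spectrograms at tacotron_output/gta
python synthesize.py --model='Tacotron' --mode='synthesis' --GTA=False  # synthesized mel spectrograms at tacotron_output/natural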

In addition, wavenet_preprocess.py directly gets the mel spectrograms at tacotron_output/gta for training data. What is the difference between the mel spectrograms produced by the following two commands?
They can both be used separately to train the wavenet.
python synthesize.py --model='Tacotron' --GTA=True
python wavenet_preprocess.py

Thank you.

@anushaprakash90

@Rayhane-mamah Thanks. After installing only tensorflow-gpu, I am able to run the code on the GPUs. I am now using the updated Tacotron scripts. When I run the code, it runs on all available GPUs, but gives a segmentation fault just before training. This is perhaps a memory issue. I am trying to run the code on a single GPU. As mentioned in the hparams.py file, I have set num_gpus=0 and tacotron_gpu_start_idx appropriately. However, I am getting the following error:

ValueError: Attr 'num_split' of 'Split' Op passed 0 less than minimum 1.

Traceback:
File "/speech/anusha/Tacotron-2/tacotron/models/tacotron.py", line 61, in initialize
tower_input_lengths = tf.split(input_lengths, num_or_size_splits=hp.tacotron_num_gpus, axis=0)

This requires tacotron_num_gpus to be set to at least 1 so that it can recognize that a GPU is available. Should I modify any other parameters, etc.?

Thanks

ishandutta2007 commented Dec 3, 2018

6k4 steps

@Rayhane-mamah: Do you mean 60000 steps or 6000 steps ?

mrgloom commented Apr 1, 2019

What is the last model compatible with master?

Arafat4341 commented Jan 10, 2020

Hi @Rayhane-mamah! Can we train using JSUT data?
I found no support for pre-processing JSUT (a Japanese corpus), so I tried to use DeepVoice3's preprocess module to prepare the data, then copied that folder inside the cloned Tacotron-2 repo and named it 'training_data'. Then I ran the train.py module and I am getting this error:

Traceback (most recent call last):

File "train.py", line 138, in
main()
File "train.py", line 132, in main
train(args, log_dir, hparams)
File "train.py", line 52, in train
checkpoint = tacotron_train(args, log_dir, hparams)
File "/content/drive/My Drive/Tacotron-2/tacotron/train.py", line 399, in tacotron_train
return train(log_dir, args, hparams)
File "/content/drive/My Drive/Tacotron-2/tacotron/train.py", line 152, in train
feeder = Feeder(coord, input_path, hparams)
File "/content/drive/My Drive/Tacotron-2/tacotron/feeder.py", line 33, in init
hours = sum([int(x[4]) for x in self._metadata]) * frame_shift_ms / (3600)
File "/content/drive/My Drive/Tacotron-2/tacotron/feeder.py", line 33, in
hours = sum([int(x[4]) for x in self._metadata]) * frame_shift_ms / (3600)
IndexError: list index out of range

Can you help!!

@NiklasHoltmeyer

is this project dead? are there any pretrained models?

@1-800-BAD-CODE

is this project dead? are there any pretrained models?

I'm not an author of this repository, but I would agree that this repo is inactive and outdated. TTS is a very active area of research and production, and there are modern alternatives that are very actively maintained, e.g., NeMo.
