Planned TODOs #1

Closed
24 of 28 tasks
r9y9 opened this issue Dec 31, 2017 · 98 comments

Comments

@r9y9
Owner

r9y9 commented Dec 31, 2017

This is an umbrella issue to track progress for my planned TODOs. Comments and requests are welcome.

Goal

  • achieve higher speech quality than conventional vocoders (WORLD, Griffin-Lim, etc.)
  • provide pre-trained model of WaveNet-based mel-spectrogram vocoder

Model

  • 1D dilated convolution
  • batch forward
  • incremental inference
  • local conditioning
  • global conditioning
  • upsampling network (by transposed convolutions)

Training script

  • Local conditioning
  • Global conditioning
  • Configurable maximum number of time steps (to avoid out of memory error). 58ad07f
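
One way a cap on time steps could work is to crop each training example to a random window whose length is a multiple of the hop size, so that audio samples and conditioning frames stay aligned. The sketch below is only an illustration under that assumption; `crop_to_max_steps` and its arguments are hypothetical names, not taken from the repository.

```python
import numpy as np

def crop_to_max_steps(x, c, max_time_steps, hop_size, rng):
    """Crop audio x (T,) and frame-level features c (T // hop_size, D)
    to at most max_time_steps samples, keeping both aligned.
    Hypothetical helper, not the repository's implementation."""
    max_frames = max_time_steps // hop_size
    n_frames = c.shape[0]
    if n_frames <= max_frames:
        # Short utterance: keep everything that the frames cover.
        return x[: n_frames * hop_size], c
    # Pick a random start frame so that the window fits inside the utterance.
    start = int(rng.integers(0, n_frames - max_frames + 1))
    end = start + max_frames
    return x[start * hop_size : end * hop_size], c[start:end]

rng = np.random.default_rng(0)
x = np.arange(256 * 100)   # 100 frames of audio at hop_size=256
c = np.zeros((100, 80))    # 100 frames of 80-dim conditioning features
x_crop, c_crop = crop_to_max_steps(x, c, max_time_steps=8000, hop_size=256, rng=rng)
```

Cropping in units of frames rather than samples means the cropped audio always starts on a frame boundary.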

Experiments

  • unconditioned WaveNet trained with CMU Arctic
  • conditioning model on mel-spectrogram (local conditioning) with CMU Arctic
  • conditioning model on mel-spectrogram and speaker id with CMU Arctic
  • conditioning model on mel-spectrogram (local conditioning) with LJSpeech
  • DeepVoice3 + WaveNet vocoder (WIP: Support for Wavenet vocoder, deepvoice3_pytorch#21)

Misc

  • [ ] Time sliced data generator?
  • Travis CI
  • Train/val split
  • README

Sampling frequency

  • 4kHz
  • 16kHz
  • 22.5kHz
  • 44.1kHz
  • 48kHz

Advanced (lower priority)

@r9y9
Owner Author

r9y9 commented Dec 31, 2017

At this point, I think I have finished implementing the basic features (batch/incremental inference, local/global conditioning) and confirmed that an unconditioned WaveNet trained on CMU Arctic (~1200 utterances, 16kHz) can generate speech-like sounds. Audio samples are attached.

step80000.zip

Top: real speech; bottom: generated speech. Only the first sample of the real speech was fed to the WaveNet decoder as the initial input.

step000080000_waveplots

step90000.zip

step000090000_waveplots

@geneing

geneing commented Dec 31, 2017

For reference, these are other wavenet projects I know of:
https://github.com/ibab/tensorflow-wavenet
https://github.com/tomlepaine/fast-wavenet - faster version of the original wavenet paper.

@r9y9
Owner Author

r9y9 commented Jan 1, 2018

The quality is still not very high, but the vocoder conditioned on mel-spectrograms has started to work. Audio samples from a model trained for 10 hours are attached.

step90000.zip
step000090000_waveplots

step95000.zip
step000095000_waveplots

@r9y9
Owner Author

r9y9 commented Jan 2, 2018

Finished transposed convolution support at 8c0b5a9. Started training again.

@jamestang0219

Hi, I've already tried using linguistic features as local features, but I ran into a possible problem: linguistic features are defined per phoneme and mel-spectrograms per frame, whereas the local features of the WaveNet inputs are defined per sample.

For example, if a phoneme's duration is 0.25s and the sample rate is 16k, then to create the WaveNet inputs I have to duplicate that phoneme's linguistic feature int(0.25 * 16000) times so that each sample has a local feature. Do you think this approach is right? How do you handle the mel-spectrogram features, given that they are per frame?

Thanks for answering me.
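
The duplication described above can be sketched with `np.repeat` and a per-phoneme repeat count. This is a minimal illustration: the feature values and the second and third durations are made up, and rounding to the nearest sample is used here in place of the `int(...)` truncation mentioned in the comment.

```python
import numpy as np

sample_rate = 16000

# Hypothetical phoneme-level features: 3 phonemes, 4 dimensions each.
phoneme_features = np.array(
    [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]], dtype=np.float32
)
# Hypothetical durations in seconds; the first matches the 0.25s example.
durations_sec = np.array([0.25, 0.10, 0.05])

# Number of samples covered by each phoneme.
repeats = np.round(durations_sec * sample_rate).astype(int)

# Repeat each phoneme's feature row once per sample it covers.
sample_features = np.repeat(phoneme_features, repeats, axis=0)
```

After this step there is one feature row per audio sample, so the features can be fed alongside the waveform.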

@jamestang0219

Can WaveNet capture the differences even if many samples share the same local features, as long as its receptive field is wide enough?
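
To get a feel for how wide the receptive field is relative to the span over which the local features stay constant, the receptive field of a stack of dilated causal convolutions can be computed directly. The configuration below (kernel size 2, two stacks of ten layers with dilations 1 to 512) is a hypothetical example, not the configuration used in this repository.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field, in samples, of stacked dilated causal convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Hypothetical configuration: 2 stacks of 10 layers, dilations 1, 2, ..., 512.
dilations = [2 ** i for i in range(10)] * 2
rf = receptive_field(kernel_size=2, dilations=dilations)

# Compare against a span of 256 samples over which features are duplicated.
span = 256
spans_covered = rf / span
```

With these assumed numbers the receptive field is 2047 samples, i.e. it covers roughly eight spans of 256 duplicated samples.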

@r9y9
Owner Author

r9y9 commented Jan 3, 2018

@jamestang0219 I think you are right. In the paper http://www.isca-speech.org/archive/Interspeech_2017/pdfs/0314.PDF, they use log-f0 and mel-cepstrum as conditional features and duplicate them to adjust the time resolution. I also tried this idea and got reasonable results.

@r9y9
Owner Author

r9y9 commented Jan 3, 2018

Latest audio sample attached. Mel-spectrograms are repeated to adjust the time resolution. See wavenet_vocoder/audio.py, lines 39 to 40 in b8ee2ce:

upsample_factor = quantized.size // mel.shape[0]
mel = np.repeat(mel, upsample_factor, axis=0)

In this case upsample_factor was always 256.

step70000.zip
step000070000_waveplots

@jamestang0219

@r9y9 In your source code, do you use transposed convolution to implement the upsampling? Have you checked which method works better for upsampling?

@r9y9
Owner Author

r9y9 commented Jan 3, 2018

@jamestang0219 I implemented transposed convolution but haven't had success with it yet. I suspect 256x upsampling is hard to train, especially on the small dataset I'm experimenting with now. The WaveNet authors reported that transposed convolution works better, though.

@r9y9
Owner Author

r9y9 commented Jan 3, 2018

# If True, use transposed convolutions to upsample conditional features,
# otherwise repeat features to adjust time resolution
upsample_conditional_features=False,
# should np.prod(upsample_scales) == hop_size
upsample_scales=[16, 16],

For now I am not using transposed convolution.
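
The constraint stated in the comment above, that the product of `upsample_scales` should equal `hop_size`, can be checked before training. A minimal sketch; `check_upsample_scales` is a hypothetical helper name, not a function from the repository.

```python
import numpy as np

def check_upsample_scales(upsample_scales, hop_size):
    """Return True if the per-layer upsampling factors multiply to hop_size."""
    return int(np.prod(upsample_scales)) == hop_size

ok = check_upsample_scales([16, 16], hop_size=256)        # 16 * 16 = 256
mismatch = check_upsample_scales([16, 16], hop_size=200)  # 256 != 200
```

If the check fails, the upsampled conditioning features and the waveform end up with different lengths.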

@jamestang0219

@r9y9 May I know your hyperparameters for extracting the mel-spectrogram? Is the frame shift 0.0125s and the frame width 0.05s? If so, why do you use 256 as the upsample factor instead of sr (16000) * frame_shift (0.0125) = 200? Are there any tricks here? Forgive me for the many questions :( I also want to reproduce the Tacotron 2 results.

@r9y9
Owner Author

r9y9 commented Jan 3, 2018

@jamestang0219 Hyper parameters for audio parameter extraction:

# Audio:
sample_rate=16000,
silence_threshold=2,
num_mels=80,
fft_size=1024,
# shift can be specified by either hop_size or frame_shift_ms
hop_size=256,
frame_shift_ms=None,
min_level_db=-100,
ref_level_db=20,

I use a frame shift of 256 samples (16 ms).
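
The two conventions discussed here (a hop size in samples versus a frame shift in milliseconds) are related by the sample rate. A small sketch of the conversion in both directions; the function names are illustrative only.

```python
def hop_to_shift_ms(hop_size, sample_rate):
    """Frame shift in milliseconds for a hop given in samples."""
    return hop_size * 1000 / sample_rate

def shift_ms_to_hop(frame_shift_ms, sample_rate):
    """Hop size in samples for a frame shift given in milliseconds."""
    return int(round(frame_shift_ms * sample_rate / 1000))

shift_ms = hop_to_shift_ms(256, 16000)   # hop_size=256 at 16 kHz
hop = shift_ms_to_hop(12.5, 16000)       # 12.5 ms frame shift at 16 kHz
```

So hop_size=256 at 16 kHz corresponds to 16 ms, while a 12.5 ms shift corresponds to 200 samples, which is why the upsampling factors in the two setups differ.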

@jamestang0219

@r9y9 Thanks:)

@npuichigo

@r9y9 I notice that Tacotron2 uses two upsampling layers with transposed convolution. But in my WaveNet implementation, this still doesn't work.

@r9y9
Owner Author

r9y9 commented Jan 3, 2018

@npuichigo Could you share what parameters (padding, kernel_size, etc.) you are using? I tried applying 1D transposed convolution with stride=16, kernel_size=16, padding=0 twice to upsample the inputs by 256x.

if upsample_conditional_features:
    self.upsample_conv = nn.ModuleList()
    for s in upsample_scales:
        self.upsample_conv.append(ConvTranspose1d(
            cin_channels, cin_channels, kernel_size=s, padding=0,
            dilation=1, stride=s, std_mul=1.0))
        # Is this non-linearity necessary?
        self.upsample_conv.append(nn.ReLU(inplace=True))
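
As a sanity check on these parameters, the output length of a 1D transposed convolution follows the standard formula L_out = (L_in - 1) * stride - 2 * padding + dilation * (kernel_size - 1) + output_padding + 1. The sketch below applies it twice with stride = kernel_size = 16; the frame count of 50 is an arbitrary example value.

```python
def conv_transpose1d_out_len(l_in, kernel_size, stride,
                             padding=0, dilation=1, output_padding=0):
    """Output length of a 1D transposed convolution."""
    return ((l_in - 1) * stride - 2 * padding
            + dilation * (kernel_size - 1) + output_padding + 1)

n_frames = 50  # arbitrary number of mel frames
length = n_frames
for s in [16, 16]:
    # stride == kernel_size and padding == 0, as in the snippet above
    length = conv_transpose1d_out_len(length, kernel_size=s, stride=s)
```

With stride equal to kernel size and no padding, each layer multiplies the length by exactly 16, so two layers give 256 samples per frame.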

@npuichigo

@r9y9 Parameters of mine are listed below. Because I use frame shift which is 12.5ms, upsampling factor is 200.

# Audio
num_mels=80,
num_freq=1025,
sample_rate=16000,
frame_length_ms=50,
frame_shift_ms=12.5,
min_level_db=-100,
ref_level_db=20

# Transposed convolution 10*20=200 (tensorflow)
up_lc_batch = tf.expand_dims(lc_batch, 1)
up_lc_batch = tf.layers.conv2d_transpose(
       up_lc_batch, self.out_channels, (1, 10),
       strides=(1, 10), padding='SAME',
       kernel_initializer=tf.constant_initializer(1.0 / self.out_channels))
up_lc_batch = tf.layers.conv2d_transpose(
       up_lc_batch, self.out_channels, (1, 20),
       strides=(1, 20), padding='SAME',
       kernel_initializer=tf.constant_initializer(1.0 / self.out_channels))
up_lc_batch = tf.squeeze(up_lc_batch, 1)

@r9y9
Owner Author

r9y9 commented Jan 4, 2018

https://r9y9.github.io/wavenet_vocoder/

Created a simple project page and uploaded audio samples for speaker-dependent WaveNet vocoder. I'm working on global conditioning (speaker embedding) now.

@npuichigo

@r9y9 Regarding the upsampling network, I found that 2D transposed convolution works well, while the 1D version generates speech with unnatural prosody, maybe because the 2D transposed convolution only considers local information in the frequency domain.

height_width = 3  # kernel width along frequency axis
up_lc_batch = tf.expand_dims(lc_batch, 3)
up_lc_batch = tf.layers.conv2d_transpose(
       up_lc_batch, 1, (10, height_width),
       strides=(10, 1), padding='SAME',
       kernel_initializer=tf.constant_initializer(1.0 / height_width))
up_lc_batch = tf.layers.conv2d_transpose(
       up_lc_batch, 1, (20, height_width),
       strides=(20, 1), padding='SAME',
       kernel_initializer=tf.constant_initializer(1.0 / height_width))
up_lc_batch = tf.squeeze(up_lc_batch, 3)

@r9y9
Owner Author

r9y9 commented Jan 5, 2018

@npuichigo Thank you for sharing that! Did you check the output of the upsampling network? Could the upsampling network actually learn to upsample? I mean, did you get a high-resolution mel-spectrogram? I was wondering whether I need to add a loss term for upsampling (e.g., MSE between the coarse mel-spectrogram and a 1-shift high-resolution mel-spectrogram), and I'm curious whether it can be learned without an upsampling-specific loss.

@npuichigo

@r9y9 I think transposed convolution with same stride and kernel size is similar to duplicating. Like the following picture, if the kernel is one everywhere, then it's just duplicating. So maybe I need to check the values of kernel after training.
padding_no_strides_transposed_test_28
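
The claim that a transposed convolution with equal stride and kernel size and an all-ones kernel reduces to duplication can be verified numerically. Below is a minimal single-channel NumPy sketch written for illustration; `conv_transpose1d_single` is a hypothetical helper, not a library function.

```python
import numpy as np

def conv_transpose1d_single(x, w, stride):
    """Single-channel 1D transposed convolution without padding.
    x: (T,), w: (K,); returns an array of length (T - 1) * stride + K."""
    T, K = x.shape[0], w.shape[0]
    y = np.zeros((T - 1) * stride + K, dtype=x.dtype)
    for t in range(T):
        # Each input value scatters a scaled copy of the kernel into the output.
        y[t * stride : t * stride + K] += x[t] * w
    return y

x = np.array([1.0, 2.0, 3.0])
y = conv_transpose1d_single(x, np.ones(4), stride=4)
same_as_repeat = np.array_equal(y, np.repeat(x, 4))
```

Because the scattered kernel copies do not overlap when stride equals kernel size, any deviation of the trained kernel from all ones shows up directly as a deviation from plain duplication.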

@r9y9
Owner Author

r9y9 commented Jan 6, 2018

https://r9y9.github.io/wavenet_vocoder/

Added audio samples for multi-speaker version of WaveNet vocoder.

@rishabh135

Hello @r9y9, great work and awesome samples. Would you mind sharing the weights of the wavenet_vocoder network trained on mel-spectrograms with the CMU Arctic dataset, without speaker embedding? I would like to compare it with Griffin-Lim reconstruction to see which works better.

@r9y9
Owner Author

r9y9 commented Jan 8, 2018

@rishabh135 Not at all. Here it is: https://www.dropbox.com/sh/b1p32sxywo6xdnb/AAB2TU2DGhPDJgUzNc38Cz75a?dl=0

Note that you have to use exactly the same mel-spectrogram extraction (wavenet_vocoder/audio.py, lines 66 to 69 in f05e520):

def melspectrogram(y):
    D = _lws_processor().stft(y).T
    S = _amp_to_db(_linear_to_mel(np.abs(D)))
    return _normalize(S)

and the same hyperparameters:

sample_rate=16000,
silence_threshold=2,
num_mels=80,
fft_size=1024,
# shift can be specified by either hop_size or frame_shift_ms
hop_size=256,
frame_shift_ms=None,
min_level_db=-100,
ref_level_db=20,

@r9y9
Owner Author

r9y9 commented Jan 8, 2018

Using the transposed convolution below, I can get good initialization for the upsampling network. Very nice, thanks @npuichigo !

kernel_size = 3
padding = (kernel_size - 1) // 2
upsample_factor = 16

conv = nn.ConvTranspose2d(1,1,kernel_size=(kernel_size,upsample_factor),
                          stride=(1,upsample_factor), padding=(padding,0))
conv.bias.data.zero_()
conv.weight.data.fill_(1/kernel_size);

Mel-spectrogram (hop_size = 256)

download

16x upsampled mel-spectrogram

download 1
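
As far as I can tell, with the bias zeroed and every weight set to 1/kernel_size, the layer above initially computes a zero-padded moving average over kernel_size neighbouring frequency bins followed by a 16x repetition along time. The NumPy sketch below reproduces that behaviour under this reading, assuming the input is laid out as (frequency bins, frames); `upsample_init` is a hypothetical name.

```python
import numpy as np

def upsample_init(mel, kernel_size=3, upsample_factor=16):
    """mel: (num_freq_bins, num_frames).
    Returns (num_freq_bins, num_frames * upsample_factor)."""
    pad = (kernel_size - 1) // 2
    # Zero-pad along the frequency axis, as the transposed conv padding implies.
    padded = np.pad(mel, ((pad, pad), (0, 0)), mode="constant")
    n_bins = mel.shape[0]
    # Average kernel_size neighbouring frequency bins.
    smoothed = sum(padded[k : k + n_bins] for k in range(kernel_size)) / kernel_size
    # Repeat each frame upsample_factor times along the time axis.
    return np.repeat(smoothed, upsample_factor, axis=1)

mel = np.arange(12, dtype=float).reshape(4, 3)  # toy 4-bin, 3-frame input
up = upsample_init(mel)
```

This would explain why the upsampled spectrogram looks like a slightly blurred, stretched copy of the input before any training.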

@r9y9
Owner Author

r9y9 commented Feb 12, 2018

https://r9y9.github.io/wavenet_vocoder/

Updated samples of the multi-speaker WN. I used a mixture of logistic distributions. It was quite costly to train. Also added ground truth audio samples for ease of comparison.

@rafaelvalle

@r9y9 What do you mean by costly to train? What are the biggest challenges?

@r9y9
Owner Author

r9y9 commented Feb 15, 2018

I meant it's very time consuming. It took a week or more to get sufficiently good quality for LJSpeech and CMU ARCTIC.

@rafaelvalle

Can you share the loss curve?

@r9y9
Owner Author

r9y9 commented Feb 15, 2018

I’m on a short business trip and do not have access to my GPU PC right now. I can share it when I come back home in a week.

@rafaelvalle

That's great, Ryuichi! Thank you!

@bliep

bliep commented Feb 19, 2018

In the original Salimans pixel-cnn++ code, the loss is converted to bits per output dimension, which is quite handy for comparison with other implementations and experiments. To do this, just divide the loss by (the dimensionality of the output * ln(2)). How many bits is the model able to predict?
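
The conversion described here can be written in a couple of lines. This sketch assumes the loss is a total negative log-likelihood in nats, summed over all output dimensions; the numbers in the example are made up and are not measurements from this project.

```python
import math

def nats_to_bits_per_dim(nll_nats, num_dims):
    """Convert a summed negative log-likelihood in nats to bits per dimension."""
    return nll_nats / (num_dims * math.log(2))

# Made-up example: a model that spends exactly 8 bits on each of 100 dimensions.
example_nll = 100 * 8 * math.log(2)
bits = nats_to_bits_per_dim(example_nll, num_dims=100)
```

If the reported loss is already averaged over some axes, only the remaining dimensions should go into `num_dims`.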

@rafaelvalle

rafaelvalle commented Feb 19, 2018 via email

@bliep

bliep commented Feb 19, 2018

The loss is the negative log probability, and averaged over the output dimension it is an estimate of the entropy in a sample. In the original paper (predicting pixels in an image) the residual entropy was around 3 bits (out of 8, so predicting 5 bits). Since it is not easy for me to figure out the output dimension of this wavenet implementation, a loss of 56-57 doesn't tell me much.
(see https://github.com/openai/pixel-cnn/blob/master/train.py#L148)

@rafaelvalle

rafaelvalle commented Feb 20, 2018

I see now, it's just the loss normalized to bits, which makes comparison easier, as you mentioned!

From what I understand, the model uses a mixture of 10 logistics with 3 parameters each (pi, mean, log-scale), producing a total of 30 channels.

This is what I understand from what @r9y9 has on the hparams.py file https://github.com/r9y9/wavenet_vocoder/blob/master/hparams.py
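
To make the channel count concrete, here is a sketch of how a 30-channel output could be split into the three parameter groups of a 10-component mixture. The channel ordering (mixture logits, then means, then log-scales) and the (batch, channels, time) layout are assumptions made for illustration; check the repository's loss code for the actual layout.

```python
import numpy as np

num_mixtures = 10

# Random stand-in for a network output of shape (batch, 3 * num_mixtures, time).
y_hat = np.random.default_rng(0).normal(size=(2, 3 * num_mixtures, 5))

# Assumed ordering: mixture logits, then means, then log-scales.
logit_probs = y_hat[:, :num_mixtures]
means = y_hat[:, num_mixtures : 2 * num_mixtures]
log_scales = y_hat[:, 2 * num_mixtures :]

# Softmax over the mixture axis turns the logits into mixture weights (pi).
shifted = logit_probs - logit_probs.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
```

Each time step thus gets 10 weights, 10 means and 10 log-scales, which together make up the 30 channels.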

@et1234et

@r9y9
I tested LJSpeech with the latest code (~1000K steps), but the output is slightly noisy.
Does the latest code use the same settings as the updated samples? (https://r9y9.github.io/wavenet_vocoder/)
checkpoint_step001000000.zip

@r9y9
Owner Author

r9y9 commented Feb 21, 2018

Yes, the current master is the latest code, and it is what I have locally. Maybe the training procedure I described in #1 (comment) is important for quality.

@dyelax
Contributor

dyelax commented Feb 21, 2018

@r9y9 would you mind re-sharing your weights for the mel-conditioned wavenet? The link you shared earlier is broken. Thanks!

@r9y9
Owner Author

r9y9 commented Feb 22, 2018

@dyelax Can you check the links in #19 instead?

@azraelkuan
Contributor

azraelkuan commented Feb 26, 2018

@r9y9 For multi-GPU training, I tested that we only need to change

y_hat = model(x, c=c, g=g, softmax=False)

to

y_hat = torch.nn.parallel.data_parallel(model, (x, c, g, False))

and increase num_workers and batch_size. We can also set device_ids and output_device from command-line arguments.

@bliep

bliep commented Feb 27, 2018

Efficient Neural Audio Synthesis https://arxiv.org/abs/1802.08435
Lots of interesting tricks and the claim is real-time on a mobile cpu due to weight pruning.

@r9y9
Owner Author

r9y9 commented Apr 6, 2018

https://github.com/r9y9/wavenet_vocoder#pre-trained-models Added link to pre-trained models.

@twidddj

twidddj commented Apr 11, 2018

Hi @r9y9, Thank you so much for sharing your work.

We followed your work and got some results in TensorFlow. We haven't run many tests yet, but it works with the same parameters as yours, except without the Dropout and WeightNorm techniques. You can find some results here. If I learn anything more during testing, I'll let you know. Thanks!

@r9y9
Owner Author

r9y9 commented Apr 11, 2018

@twidddj Nice! I'm looking forward to your results.

@r9y9
Owner Author

r9y9 commented May 12, 2018

I think I can close this now. Discussion of the remaining issues (e.g., DeepVoice + WaveNet) can continue in the specific issues.
