How to get the same mel feature in "metadata.pkl"? #84
No, I didn't.
So why is the first dimension not the same? I used the mel feature whose shape is (385, 80), together with your model and the speaker embeddings from "metadata.pkl", to generate the audio "p225xp228", but it only generates 6 s of strange voice; I cannot hear the words "please call stella". So how did you reduce the dimension from 385 to 90?
The first dim is the number of frames. There is no dimension reduction. It should be around 90 for that utterance. Please double-check your code.
I used your code and the parameters from issue #4 to generate the mel features; the hop size is 256 and the resulting shape is (385, 80). The code is below (the function bodies are as in your `make_spect.py` and are elided here). If there is a bug, please point it out, thanks!

```python
import os
from librosa.filters import mel

def butter_highpass(cutoff, fs, order=5):
    ...  # high-pass Butterworth filter design, as in make_spect.py

def pySTFT(x, fft_length=1024, hop_length=256):
    ...  # magnitude STFT, as in make_spect.py

mel_basis = mel(16000, 1024, fmin=90, fmax=7600, n_mels=80).T

dirName = '../dataset/VCTK-Corpus/wav48'
...  # loop over the wav files and compute the mel-spectrogram S
print(S.shape)
```
The sampling rate should be 16k instead of 48k |
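Since the downloaded VCTK wavs are 48 kHz, they need to be brought down to 16 kHz before computing the spectrogram. A minimal sketch of the downsampling step (illustrative only, not the author's exact procedure; the thread later notes the procedure should not matter much):

```python
import numpy as np
from scipy.signal import resample_poly

# One second of a 440 Hz tone at 48 kHz stands in for a VCTK wav here.
x48 = np.sin(2 * np.pi * 440 * np.arange(48000) / 48000)

# 48 kHz -> 16 kHz is an exact 3:1 ratio, so polyphase resampling fits well.
x16 = resample_poly(x48, up=1, down=3)
print(len(x16))  # 16000 samples: one second at the target rate
```

Equivalently, `librosa.load(path, sr=16000)` resamples on load, which is what is done later in this thread.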
Thank you! |
I have another question. I used the following code to replace soundfile for reading the data: `x, fs = librosa.load(os.path.join(dirName, subdir, fileName), sr=16000)`. However, the final dimension is (129, 80), still not (90, 80).
This is because I used the VCTK corpus downloaded from https://datashare.ed.ac.uk/handle/10283/2950. I did not find any 16 kHz audio there, so I used the 48 kHz audio and that code to read it.
Do you mean that you used the VCTK dataset whose sr is 48 kHz, then downsampled it to 16 kHz, by which you got the (90, 80) mel-spectrogram to convert? That's very strange: the author's provided training sample (in "\wav\p225\p225_003.wav") has a 16 kHz sr but gives (376, 80). Also, have you ever tried converting voice with your own "metadata.pkl"? If you have, could you please give me some advice? I'm new to voice conversion and don't know much about how to build my model. Thank you again!
I do not get the shape (90, 80); I get (129, 80) instead. The complete change is replacing `x, fs = sf.read(os.path.join(dirName, subdir, fileName))` with `x, fs = librosa.load(os.path.join(dirName, subdir, fileName), sr=16000)`. You will get my result if you use the same dataset downloaded from the link I gave.
I get shape (129, 80) as well. Any update on this? |
The length does not have to be 90. As long as the sampling frequency is correct, it should be fine. |
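The shapes reported in this thread are in fact mutually consistent. A quick sketch of the arithmetic (my own, under the assumption that the repo's STFT uses fft_length=1024, hop_length=256, and reflect-pads by fft_length//2 on each side, giving n_frames = num_samples // hop_length + 1):

```python
# Assumed framing: reflect pad by fft_length//2 per side, hop of 256 samples,
# so the padded length minus overlap yields num_samples // hop + 1 frames.
def n_frames(num_samples, hop_length=256):
    return num_samples // hop_length + 1

# 385 frames at 48 kHz correspond to about 384 * 256 = 98304 samples.
# Downsampling 3x to 16 kHz leaves 32768 samples:
print(n_frames(98304 // 3))  # -> 129, the shape reported above

# 90 frames would mean ~89 * 256 = 22784 samples, i.e. ~1.42 s at 16 kHz;
# the difference from 129 frames is extra audio at the head and tail.
```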
Many thanks for your prompt reply. Unfortunately, I noticed that the audio quality is not as good. Is there any chance you used a particular procedure for downsampling to 16kHz? Or maybe you performed some preprocessing while downsampling? Thanks |
No, and the downsampling procedure should not make a big difference.
The reason I thought about some additional preprocessing is that, by analysing the spectrograms, I noticed some differences between the original dataset and your version:

- Spectrogram computed starting from the original dataset, downsampled to 16 kHz, with make_spect.py applied: shape (119, 80). (spectrogram image omitted)
- Spectrogram for p225_001 that you included in metadata.pkl: shape (90, 80). (spectrogram image omitted)
- Spectrogram computed starting from the file you host on the demo page (https://auspicious3000.github.io/autovc-demo/audios/ground_truth1/p225_001.wav), downsampled to 16 kHz (originally at 22050 Hz), with make_spect.py applied: shape (90, 80). (spectrogram image omitted)

I don't understand why your files produce almost identical spectrograms, while if we start from the original dataset we get significantly different results. The audio quality is affected as well: "p225xp225 (8).wav" is the audio generated from the original dataset. Do you have any idea what the difference could be between your files and the files in the original dataset?
I finally found that the difference is the trimming at the head and tail of the audio. I reproduced an almost identical file by "trimming it by hand", but I couldn't find the exact silence trimming procedure that you used. |
OK. That explains it. I trimmed the silence off by hand. |
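Since the silence was trimmed by hand, there is no reference implementation to reproduce. A minimal energy-based sketch of automatic head/tail trimming (the frame length and dB threshold are my own guesses, not the author's settings):

```python
import numpy as np

def trim_silence(x, frame_len=512, threshold_db=-40.0):
    """Drop leading/trailing frames whose mean energy is below threshold_db."""
    n = len(x) // frame_len
    frames = x[:n * frame_len].reshape(n, frame_len)
    db = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-10)
    keep = np.where(db > threshold_db)[0]
    if keep.size == 0:
        return x  # nothing above threshold: return unchanged
    return x[keep[0] * frame_len : (keep[-1] + 1) * frame_len]

# Toy example: one second of noise padded by one second of silence each side.
sr = 16000
rng = np.random.default_rng(0)
sig = np.concatenate([np.zeros(sr), 0.5 * rng.standard_normal(sr), np.zeros(sr)])
trimmed = trim_silence(sig)
print(len(trimmed) / sr)  # roughly 1.0 s survives
```

`librosa.effects.trim` offers a similar threshold-based trim, but either way the result will only approximate the hand-trimmed files in metadata.pkl.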
@auspicious3000 You mean you trimmed the silence from the whole VCTK dataset by hand to generate your training data, and trained the model on that?
I used your default parameters and code to compute the mel features of "p225_001.wav" in the VCTK corpus. However, I get mel features of shape (385, 80), not the (90, 80) found in "metadata.pkl". Do you have extra processing steps?