Problems with Kaldi MFCCs #328
If I understand correctly, you are comparing four versions of MFCCs, and you are saying that they all give different results.
Do you have minimal code I could take a look at?
For reference, a similar question was discussed in #263 (comment).
Here is my example:
import torchaudio
import torch,numpy,random
random.seed(0)
numpy.random.seed(0)
torch.manual_seed(0)
torch.cuda.manual_seed(0)
torch.cuda.manual_seed_all(0)
# compute-mfcc-feats --verbose=2 --sample-frequency=8000 scp:data/wav.scp ark:- | copy-feats ark:- ark,scp:data/feats.ark,data/feats.scp
d = { u:d for u,d in torchaudio.kaldi_io.read_mat_scp('data/feats.scp') }
kaldi_feats = d['iaaa']
print(kaldi_feats, kaldi_feats.shape)
wav, rate = torchaudio.load('data/wav/iaaa.wav')
print(wav.shape, rate)
torch_feats = torchaudio.compliance.kaldi.mfcc(wav, sample_frequency=rate)
print(torch_feats, torch_feats.shape)
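# the same call repeated, to show that the output changes between runs (dithering is random)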
torch_feats = torchaudio.compliance.kaldi.mfcc(wav, sample_frequency=rate)
print(torch_feats, torch_feats.shape)
The Kaldi tensor and the two torchaudio tensors are all different.
Have you tried setting dither to 0?
Hi, I also have problems with the MFCCs. When I compare the MFCCs generated by Kaldi with the ones generated by PyTorch, I get very similar results for all the coefficients except the first one. For some reason, there is a difference of about 100 between the Kaldi and PyTorch first coefficients. Still, the pattern is the same, so I suspect that some kind of energy normalization is going on. For reproducibility, in Kaldi I use the s5 recipe of the librispeech example and take the MFCCs generated at stage 6. The mfcc.conf file only has the use-energy flag set to false; all the other parameters of compute-mfcc-feats are defaults. The audio in the comparison is 1089-134686-0000.flac from the test_clean set. To compare the features, I use kaldiio to convert the Kaldi features into numpy:
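Roughly like this (the feats.scp path and utterance id below are just illustrative):
import kaldiio
# lazily load all matrices listed in the scp written by compute-mfcc-feats
feats = kaldiio.load_scp('data/test_clean/feats.scp')
kaldi_mfcc = feats['1089-134686-0000']  # numpy array, shape (num_frames, num_ceps)
print(kaldi_mfcc.shape)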
For PyTorch's features:
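Something along these lines (the file path is illustrative; num_ceps and num_mel_bins are left at their defaults):
import torchaudio
wav, rate = torchaudio.load('1089-134686-0000.flac')
torch_mfcc = torchaudio.compliance.kaldi.mfcc(
    wav,
    sample_frequency=rate,
    use_energy=False,  # mirrors --use-energy=false in mfcc.conf
    dither=0.0,        # see #157
)
print(torch_mfcc.shape)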
I set dither to 0 according to #157. If dither is a low number (0.1), the features are still very similar, but if dither is 1 they are completely different; that is a separate problem, though. Why do I get this difference in the first coefficient? Is there some normalization step done inside PyTorch's compliance library that is not done in Kaldi?
Update: I have gone back to the spectrogram level trying to find the bug. If I set the flag subtract_mean to true in both Kaldi and PyTorch, the resulting spectrograms are (almost) the same. But if I set it to false (which is the default), the results differ: they have the same pattern but a different mean. I generated spectrograms with Kaldi and with PyTorch, both with and without mean subtraction. I suspect that something in the FFT computation is normalizing in one but not in the other. Any thoughts?
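On the PyTorch side the comparison is roughly the following (file path illustrative; on the Kaldi side I toggle the corresponding mean-subtraction option of compute-spectrogram-feats):
import torchaudio
wav, rate = torchaudio.load('1089-134686-0000.flac')
# with mean subtraction
spec_sub = torchaudio.compliance.kaldi.spectrogram(
    wav, sample_frequency=rate, dither=0.0, subtract_mean=True)
# without mean subtraction (the default)
spec_raw = torchaudio.compliance.kaldi.spectrogram(
    wav, sample_frequency=rate, dither=0.0, subtract_mean=False)
print(spec_sub.shape, spec_raw.shape)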
Also, torchaudio.compliance.kaldi.mfcc doesn't produce exactly the same output as compute-mfcc-feats when htk_compat is False. With htk_compat set to true on both sides the outputs match, but with htk_compat set to false they differ.
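Roughly, the two runs I compare look like this (paths illustrative):
# Kaldi side:
#   compute-mfcc-feats --use-energy=false --dither=0 --htk-compat=true  scp:wav.scp ark:-
#   compute-mfcc-feats --use-energy=false --dither=0 --htk-compat=false scp:wav.scp ark:-
import torchaudio
wav, rate = torchaudio.load('some_utterance.wav')
mfcc_htk = torchaudio.compliance.kaldi.mfcc(
    wav, sample_frequency=rate, dither=0.0, htk_compat=True)   # matches Kaldi
mfcc_plain = torchaudio.compliance.kaldi.mfcc(
    wav, sample_frequency=rate, dither=0.0, htk_compat=False)  # does not match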
Also I assume the default
Hi,
thank you very much for this very useful project.
I started doing some speech recognition experiments with the MFCC features implemented in torchaudio. In particular, I tried the librosa-style ones implemented in torchaudio/transforms.py and the Kaldi ones implemented in torchaudio/compliance/kaldi.py.
The librosa-style features are computed very efficiently, and I can achieve results similar to those of the original Kaldi features when changing some hyperparameters (i.e., n_mfcc=13, hop_length=160, n_mels=23, f_min=80, f_max=7900).
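Concretely, the setup I mean is roughly the following (the 16 kHz sample rate and file path are assumptions):
import torchaudio
mfcc_transform = torchaudio.transforms.MFCC(
    sample_rate=16000,
    n_mfcc=13,
    melkwargs={'hop_length': 160, 'n_mels': 23, 'f_min': 80., 'f_max': 7900.},
)
wav, rate = torchaudio.load('some_utterance.wav')
feats = mfcc_transform(wav)  # (channel, n_mfcc, time)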
When switching to the Kaldi-compliance features, however, my neural network doesn't even converge, so I suspect there is a bug somewhere. I tried to compare the original Kaldi MFCCs with the ones implemented in torchaudio and they look very different (dithering alone cannot explain such a big difference):
The other issue is that the current version doesn't support CUDA and can only process up to two channels at a time. It is also significantly slower than the librosa implementation (there could be a bottleneck somewhere).
Any idea?
I hope my feedback is helpful.
Thank you
Mirco