Problems with Kaldi MFCCs #328

Open
mravanelli opened this issue Nov 4, 2019 · 7 comments

Comments

@mravanelli

Hi,
thank you very much for this very useful project.

I started running some speech recognition experiments with the MFCC features implemented in torchaudio. In particular, I tried the librosa-style ones implemented in torchaudio/transforms.py and the Kaldi-style ones implemented in torchaudio/compliance/kaldi.py.

  • The librosa features are computed very efficiently, and I can achieve results similar to those of the original Kaldi features after changing some hyperparameters (i.e., n_mfcc=13, hop_length=160, n_mels=23, f_min=80, f_max=7900).

  • When switching to the Kaldi-style features, however, my neural network doesn't even converge. I suspect there is a bug somewhere. I compared the original Kaldi MFCCs with the ones computed by torchaudio, and they look very different (dithering alone cannot explain such a big difference):

mfcc_original
array([35.84189 , 39.748493, 35.40782 , 33.237488, 34.53969 , 35.40782 ,
       34.973755, 35.40782 , 35.40782 , 35.84189 ], dtype=float32)

mfcc_torch
tensor([29.3794, 29.1657, 28.7020, 27.4892, 29.1944, 27.8915, 29.3321, 28.8958, 28.4197,
29.0967])

The other issue is that the current version doesn't support CUDA and can only process up to two channels at a time. It is also significantly slower than the librosa implementation (there could be a bottleneck somewhere).

Any ideas?
I hope my feedback is helpful.

Thank you

Mirco

@vincentqb
Contributor

vincentqb commented Nov 6, 2019

If I understand, you are comparing four versions of mfcc:

  1. mfcc from torchaudio/transforms.py
  2. mfcc from librosa
  3. mfcc from torchaudio/compliance/kaldi.py
  4. mfcc from kaldi

You are saying that

  • 1 and 4 perform well and agree,
  • 3 does not converge and is very different from 4.

Is that correct?

Do you have a minimal code example I could take a look at?

@vincentqb vincentqb self-assigned this Nov 18, 2019
@HsunGong

HsunGong commented Dec 9, 2019

See the similar question in #263 (comment).

@HsunGong

HsunGong commented Dec 9, 2019

Quoting: "code I could take a look at?"

Here is my example of 3 and 4:

import random

import numpy
import torch
import torchaudio

# Seed every RNG.
random.seed(0)
numpy.random.seed(0)
torch.manual_seed(0)
torch.cuda.manual_seed(0)
torch.cuda.manual_seed_all(0)

# Kaldi features, generated beforehand with:
# compute-mfcc-feats --verbose=2 --sample-frequency=8000  scp:data/wav.scp ark:- | copy-feats ark:- ark,scp:data/feats.ark,data/feats.scp
feats_by_utt = {utt: mat for utt, mat in torchaudio.kaldi_io.read_mat_scp('data/feats.scp')}
kaldi_feats = feats_by_utt['iaaa']
print(kaldi_feats, kaldi_feats.shape)

wav, rate = torchaudio.load('data/wav/iaaa.wav')
print(wav.shape, rate)

# First torchaudio call.
torch_feats = torchaudio.compliance.kaldi.mfcc(wav, sample_frequency=rate)
print(torch_feats, torch_feats.shape)
# Identical second call, to check whether the output is reproducible.
torch_feats = torchaudio.compliance.kaldi.mfcc(wav, sample_frequency=rate)
print(torch_feats, torch_feats.shape)

Output:
tensor([[ 18.3451, -14.9193, -18.3694,  ...,  -6.3691,   1.8752,  -8.8333],
        [ 20.3241, -11.0107, -16.5517,  ..., -10.2303,  -2.2465, -13.0228],
        [ 22.7282,   7.1452, -32.8558,  ..., -14.6897, -22.6369,  -8.8484],
        ...,
        [ 15.3191, -18.4647,   3.6274,  ..., -20.1052,   7.0780,  -6.4834],
        [ 15.3900, -19.9616,  -5.4611,  ..., -12.0642,   4.8870, -15.5243],
        [ 14.6114, -23.1458,  -7.1615,  ..., -31.9867,  -8.1553,  -8.3250]]) torch.Size([7249, 13])
torch.Size([1, 580080]) 8000
tensor([[ 25.4531, -28.9004,  -9.2195,  ...,   9.3991,   6.6678,  -0.3100],
        [ 23.8064, -26.9535,  -8.3300,  ...,  -8.8944,  -4.4637,   8.1744],
        [ 25.2671, -25.2465,  -9.5173,  ...,   1.7179,   5.4729,  -7.5934],
        ...,
        [ 23.7336, -30.1332,  -8.1190,  ..., -13.0223,  -1.7747,   5.4382],
        [ 24.7677, -29.2519,  -9.6620,  ...,  -0.6424,  -4.6334,  -8.3185],
        [ 23.9241, -31.0664,  -8.8748,  ...,  -3.3450,   2.4832,   3.8635]]) torch.Size([7249, 13])
tensor([[ 24.8688, -29.7277, -10.0829,  ...,   0.2335,  -5.1891, -10.5182],
        [ 23.7165, -29.3053, -11.0154,  ...,   8.8459,   4.9695,   1.7033],
        [ 24.3918, -30.2426, -17.2043,  ...,  -1.0753,  -3.7638,   1.8900],
        ...,
        [ 23.8795, -28.2105, -13.3643,  ...,  -0.4222,  -6.8063,   3.2779],
        [ 25.2789, -27.4087,  -4.5631,  ...,  -3.4745,   8.7959,   4.0152],
        [ 25.1426, -30.7162, -10.8394,  ..., -19.6604,  -1.2420,   2.3714]]) torch.Size([7249, 13])

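Tensor 1 is the Kaldi output.
Tensor 2 is the torchaudio.compliance.kaldi output.
Tensor 3 is the torchaudio.compliance.kaldi output from the second, identical call.

All three are different.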

@vincentqb
Contributor

torch_feats = torchaudio.compliance.kaldi.mfcc(wav, sample_frequency=rate)
print(torch_feats, torch_feats.shape)
torch_feats = torchaudio.compliance.kaldi.mfcc(wav, sample_frequency=rate)
print(torch_feats, torch_feats.shape)

Have you tried setting dither=0. in mfcc's call? See #371.
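
To illustrate why two identical calls can print different tensors: dithering adds fresh random noise to the waveform on every call, so anything computed from it changes from run to run unless the dither amount is zero. The sketch below uses only the Python standard library. `add_dither` and `log_energy` are hypothetical stand-ins, not torchaudio functions, and the Gaussian noise model is an assumption made for illustration.

```python
import math
import random


def add_dither(samples, dither):
    """Hypothetical stand-in: add zero-mean Gaussian noise scaled by `dither`."""
    if dither == 0.0:
        return list(samples)
    return [s + dither * random.gauss(0.0, 1.0) for s in samples]


def log_energy(samples):
    """Toy feature: log of the frame energy (stands in for a full MFCC pipeline)."""
    return math.log(sum(s * s for s in samples))


# A short synthetic frame: a 440 Hz tone sampled at 8 kHz.
frame = [math.sin(2.0 * math.pi * 440.0 * n / 8000.0) for n in range(200)]

# With dither enabled, two calls on the same input disagree.
noisy_a = log_energy(add_dither(frame, dither=1.0))
noisy_b = log_energy(add_dither(frame, dither=1.0))

# With dither disabled, the result is reproducible.
clean_a = log_energy(add_dither(frame, dither=0.0))
clean_b = log_energy(add_dither(frame, dither=0.0))

print("dither=1.0:", noisy_a, noisy_b)
print("dither=0.0:", clean_a, clean_b)
```

If the real pipeline behaves the same way, passing dither=0. to both calls should make tensors 2 and 3 identical, which separates run-to-run noise from any systematic difference with Kaldi.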

@pablomainar

Hi, I also have problems with the MFCCs. When I compare the MFCCs generated by Kaldi with the ones generated by PyTorch, I get very similar results for all the coefficients except the first one. For some reason, the first coefficient differs between Kaldi and PyTorch by about 100. Still, the pattern is the same, so I suspect that some kind of energy normalization is going on.

For reproducibility: in Kaldi I use the s5 recipe of the LibriSpeech example and take the MFCCs generated at stage 6. The mfcc.conf file only sets the use-energy flag to false; all other compute-mfcc-feats parameters are left at their defaults. The audio in the images below is 1089-134686-0000.flac from the test_clean set.

To compare the features I use kaldiio to convert kaldi features into numpy:

import kaldiio
import numpy as np
path_feat = 'ark:raw_mfcc_test_clean.1.ark'
with kaldiio.ReadHelper(path_feat) as reader:
    for key,kaldi_feats in reader:
        break
    kaldi_feats = np.transpose(kaldi_feats)

For pytorch's features:

import torchaudio
path_audio = 'LibriSpeech/test-clean/1089/134686/1089-134686-0000.flac'
audio_tensor,sr = torchaudio.load(path_audio)
torch_feats = torchaudio.compliance.kaldi.mfcc(waveform=audio_tensor,dither=0)
torch_feats = np.transpose(torch_feats.numpy())

I set dither to 0 according to #157. If dither is a low number (0.1), the features are still very similar, but if dither is 1 they are completely different. That is a separate problem, though.

[Images: kaldi_feats, pytorch]

Why do I get this difference in the first coefficients? Is there some normalization step done inside pytorch's compliance library that is not done in kaldi?
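
One way to test the normalization hypothesis is to check what a pure amplitude rescaling of the waveform does to the cepstra. Multiplying the signal by a constant s multiplies every mel-band energy by s², which adds 2·ln(s) to every log energy, and the DCT of a constant vector is non-zero only in coefficient 0. The sketch below shows this with the standard library only. Everything specific in it is an assumption: the orthonormal DCT-II, the 23 mel bins, the made-up band energies, and the idea that one tool sees samples in the 16-bit integer range while the other sees floats in [-1, 1]. None of this is confirmed in this thread.

```python
import math


def dct_ortho(x):
    """Orthonormal DCT-II of a list of floats."""
    n = len(x)
    out = []
    for k in range(n):
        norm = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(norm * sum(
            x[i] * math.cos(math.pi * k * (i + 0.5) / n) for i in range(n)
        ))
    return out


num_mel_bins = 23   # assumed number of mel bins
scale = 32768.0     # assumed amplitude ratio: 16-bit integers vs. [-1, 1] floats

# Made-up mel-band energies for one frame of the float-scaled signal.
mel_energy = [1e-4 * (1.0 + 0.5 * math.sin(i)) for i in range(num_mel_bins)]

# Scaling the waveform by `scale` multiplies every band energy by scale**2.
cepstra_float = dct_ortho([math.log(e) for e in mel_energy])
cepstra_int16 = dct_ortho([math.log(e * scale ** 2) for e in mel_energy])

shift = [a - b for a, b in zip(cepstra_int16, cepstra_float)]
predicted_c0_shift = 2.0 * math.log(scale) * math.sqrt(num_mel_bins)

print("C0 shift:", shift[0], "predicted:", predicted_c0_shift)
print("largest shift in other coefficients:", max(abs(s) for s in shift[1:]))
```

Under these assumptions the predicted shift in the first coefficient is about 99.7 while all other coefficients are unchanged, which would be consistent with an offset of roughly 100. A quick check on the real data would be to multiply audio_tensor by 32768 before calling mfcc and compare again.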

@pablomainar

pablomainar commented Feb 13, 2020

Update: I have gone back to the spectrogram level to try to find the bug. If I set the subtract_mean flag to true in both Kaldi and PyTorch, the resulting spectrograms are (almost) the same. But if I set it to false (the default), the results differ: they have the same pattern, but the mean is different.

Kaldi code to generate spectrograms with mean subtraction:
~/kaldi/src/featbin/compute-spectrogram-feats --subtract-mean=true --dither=0.0 --energy-floor=1.0 scp,p:wav.scp ark:generated_feats/spec.ark

Pytorch code to generate spectrograms with mean subtraction:
torch_feats = torchaudio_local.spectrogram(waveform=audio_tensor,dither=0,energy_floor=1.0,subtract_mean=True)

Result:
[Images: kaldi_feats, torch_feats]

Kaldi code to generate spectrograms without mean subtraction:
~/kaldi/src/featbin/compute-spectrogram-feats --subtract-mean=false --dither=0.0 --energy-floor=1.0 scp,p:wav.scp ark:generated_feats/spec.ark

Pytorch code to generate spectrograms without mean subtraction:
torch_feats = torchaudio_local.spectrogram(waveform=audio_tensor,dither=0,energy_floor=1.0,subtract_mean=False)

Result:
[Images: kaldi_feats_nonsub, torch_feats_nonsub]

I suspect that something in the FFT computation is normalizing in one implementation but not in the other. Any thoughts?
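
The observation that mean subtraction makes the two outputs agree is what one would expect if they differ only by a constant per frequency bin. A minimal sketch of that effect follows, with made-up numbers and hypothetical helper names (`column_means`, `subtract_column_means`, `max_abs_diff`). It assumes that mean subtraction removes the per-bin mean over frames and that the offset is the same in every frame.

```python
def column_means(rows):
    """Mean of each column of a list-of-rows matrix."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def subtract_column_means(rows):
    """Remove the per-column (per-bin) mean over time."""
    means = column_means(rows)
    return [[v - m for v, m in zip(row, means)] for row in rows]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Made-up log-spectrogram frames (rows = frames, columns = frequency bins).
feats_a = [[1.0, 2.5, 0.3], [1.4, 2.1, 0.9], [0.8, 2.8, 0.5]]
# The same pattern with a constant added to every value (hypothetical offset).
offset = 20.79
feats_b = [[v + offset for v in row] for row in feats_a]

raw_diff = max_abs_diff(feats_a, feats_b)
centered_diff = max_abs_diff(
    subtract_column_means(feats_a), subtract_column_means(feats_b)
)
print("difference before mean subtraction:", raw_diff)
print("difference after mean subtraction:", centered_diff)
```

Applying the same two measurements to the real Kaldi and PyTorch spectrograms would show whether the gap is a constant offset (for example from a different input scaling) or something that varies from frame to frame.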

@nmfisher

nmfisher commented Apr 1, 2021

VAD_demo.zip

Also, torchaudio.compliance.kaldi.mfcc doesn't produce the same output as compute-mfcc-feats when htk_compat is False.

When htk_compat is True:

import torchaudio
import torch
import numpy as np

import librosa
wave_file = 'VAD_demo.wav'

audio, sample_rate = librosa.load(wave_file, sr=16000)

torchaudio.compliance.kaldi.mfcc(
    waveform=torch.from_numpy(audio).unsqueeze(0),  # shape (1, num_samples)
    frame_length=100,
    frame_shift=10,
    num_ceps=64,
    num_mel_bins=64,
    htk_compat=True,
    snip_edges=False)

tensor([[ 1.5239e+01,  4.0710e+00, -9.1076e+00,  ..., -1.3214e+00,
         -3.4604e+00, -7.5218e+01],
        [ 1.5456e+01,  3.7524e+00, -9.5476e+00,  ..., -7.0259e-01,
         -2.7650e+00, -7.4910e+01],
        [ 1.5704e+01,  3.8294e+00, -9.3671e+00,  ..., -1.4487e-01,
         -1.7066e+00, -7.4704e+01],
        ...,
        [ 8.0927e+00, -7.5448e+00, -8.8725e+00,  ...,  7.1561e-01,
         -1.3807e+00, -6.7251e+01],
        [ 8.0175e+00, -8.2607e+00, -1.1727e+01,  ...,  2.2736e-02,
         -2.0510e+00, -6.9318e+01],
        [ 7.5882e+00, -8.9392e+00, -1.3061e+01,  ..., -9.1568e-01,
         -2.5486e+00, -7.0623e+01]])

and

compute-mfcc-feats --frame-length=100 --frame-shift=10 --num-ceps=64 --num-mel-bins=64 --snip-edges=false --htk-compat=true --dither=0 --energy-floor=1 scp:wav.scp ark,t:-
WARNING (compute-mfcc-feats[5.5.689~5-2f3d]:Read():wave-reader.cc:260) Expected 95516 bytes in RIFF chunk, but after first data block there will be 36 + 95280 bytes (we do not support reading multiple data chunks).
wav  [
  15.23852 4.070992 -9.107697 9.433752 -9.323326 15.62674 -9.407998 0.7214716 0.9599227 -5.44089 -11.26766 -6.221908 1.175821 -0.5180629 -2.197447 -8.726407 -1.848569 3.453872 4.1832 -2.637658 -0.2607065 -0.1641882 -0.0722182 -1.466432 -0.9918076 -0.1467309 4.546407 4.537716 -1.037911 6.192455 3.814458 3.485815 13.67851 7.93261 0.2159225 -2.56819 -12.5645 -7.210853 1.150025 0.793225 -0.577466 -1.334647 -0.0593026 0.081699 -0.9379031 3.676176 1.271123 2.297236 5.066513 0.2240368 -0.7410111 -2.629601 3.790086 9.951916 4.806147 3.494368 1.054438 6.064842 1.139308 -4.126108 0.1679086 -1.321496 -3.460518 17.38356
  15.45577 3.752458 -9.547602 10.18087 -9.88015 15.14798 -10.30561 0.4286905 -1.160449 -5.356637 -13.52929 -6.287968 -0.5804175 0.5506579 -1.780558 -5.887392 -1.011132 3.633945 3.231638 -3.511573 -0.1174743 -0.03293979 -0.09964671 -1.642807 -0.7558402 -0.1005801 3.603968 4.117401 -1.777455 4.002991 2.848278 4.0697 13.47057 5.628294 -0.7281701 -2.973053 -10.5222 -5.910959 1.049334 -0.4542561 -1.303142 -0.8375537 -0.03831673 0.02188638 -0.7784259 2.942248 0.9490441 2.425136 4.28632 -0.1520254 -0.2199809 -2.220483 1.599042 6.40525 2.825276 4.396042 3.431517 6.499466 1.625714 -3.128053 1.206802 -0.7026786 -2.765082 17.25393
...

These match, but with htk_compat set to False:

torchaudio.compliance.kaldi.mfcc(
    waveform=torch.from_numpy(audio).unsqueeze(0),  # shape (1, num_samples)
    frame_length=100,
    frame_shift=10,
    num_ceps=64,
    num_mel_bins=64,
    htk_compat=False,
    dither=0,
    energy_floor=1,
    snip_edges=False)
tensor([[-5.3187e+01,  1.5239e+01,  4.0710e+00,  ...,  1.6768e-01,
         -1.3214e+00, -3.4604e+00],
        [-5.2969e+01,  1.5456e+01,  3.7524e+00,  ...,  1.2066e+00,
         -7.0259e-01, -2.7650e+00],
        [-5.2823e+01,  1.5704e+01,  3.8294e+00,  ...,  2.3663e+00,
         -1.4487e-01, -1.7066e+00],
        ...,
        [-4.7554e+01,  8.0927e+00, -7.5448e+00,  ..., -6.1891e+00,
          7.1561e-01, -1.3807e+00],
        [-4.9015e+01,  8.0175e+00, -8.2607e+00,  ..., -6.8149e+00,
          2.2736e-02, -2.0510e+00],
        [-4.9938e+01,  7.5882e+00, -8.9392e+00,  ..., -7.7555e+00,
         -9.1568e-01, -2.5486e+00]])

and

compute-mfcc-feats --frame-length=100 --frame-shift=10 --num-ceps=64 --num-mel-bins=64 --snip-edges=false --htk-compat=false --dither=0 --energy-floor=1 scp:wav.scp ark,t:-
WARNING (compute-mfcc-feats[5.5.689~5-2f3d]:Read():wave-reader.cc:260) Expected 95516 bytes in RIFF chunk, but after first data block there will be 36 + 95280 bytes (we do not support reading multiple data chunks).
wav  [
  17.38356 15.23852 4.070992 -9.107697 9.433752 -9.323326 15.62674 -9.407998 0.7214716 0.9599227 -5.44089 -11.26766 -6.221908 1.175821 -0.5180629 -2.197447 -8.726407 -1.848569 3.453872 4.1832 -2.637658 -0.2607065 -0.1641882 -0.0722182 -1.466432 -0.9918076 -0.1467309 4.546407 4.537716 -1.037911 6.192455 3.814458 3.485815 13.67851 7.93261 0.2159225 -2.56819 -12.5645 -7.210853 1.150025 0.793225 -0.577466 -1.334647 -0.0593026 0.081699 -0.9379031 3.676176 1.271123 2.297236 5.066513 0.2240368 -0.7410111 -2.629601 3.790086 9.951916 4.806147 3.494368 1.054438 6.064842 1.139308 -4.126108 0.1679086 -1.321496 -3.460518
  17.25393 15.45577 3.752458 -9.547602 10.18087 -9.88015 15.14798 -10.30561 0.4286905 -1.160449 -5.356637 -13.52929 -6.287968 -0.5804175 0.5506579 -1.780558 -5.887392 -1.011132 3.633945 3.231638 -3.511573 -0.1174743 -0.03293979 -0.09964671 -1.642807 -0.7558402 -0.1005801 3.603968 4.117401 -1.777455 4.002991 2.848278 4.0697 13.47057 5.628294 -0.7281701 -2.973053 -10.5222 -5.910959 1.049334 -0.4542561 -1.303142 -0.8375537 -0.03831673 0.02188638 -0.7784259 2.942248 0.9490441 2.425136 4.28632 -0.1520254 -0.2199809 -2.220483 1.599042 6.40525 2.825276 4.396042 3.431517 6.499466 1.625714 -3.128053 1.206802 -0.7026786 -2.765082
...

I also assume the default dither and energy_floor parameters don't match either: if they're not explicitly set to 0 and 1 respectively, the results also differ.
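
To narrow down where the htk_compat=False outputs diverge, it can help to compare them column by column. The sketch below does that for the first frame, using only the values visible in the printouts above (the first three and last three columns, since the middle ones are elided in the tensor output). `mismatched_columns` is a hypothetical helper and the tolerance is an arbitrary choice.

```python
import math

# First three and last three columns of the first frame, copied from the
# htk_compat=False outputs above (the middle columns are elided there).
torch_row = [-53.187, 15.239, 4.0710, 0.16768, -1.3214, -3.4604]
kaldi_row = [17.38356, 15.23852, 4.070992, 0.1679086, -1.321496, -3.460518]


def mismatched_columns(a, b, tol=1e-3):
    """Return indices where two rows differ by more than `tol`."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if abs(x - y) > tol]


bad = mismatched_columns(torch_row, kaldi_row)
print("columns that differ:", bad)

# The torchaudio values for that column as printed above
# for htk_compat=True (last column) and htk_compat=False (first column).
c0_htk_true = -75.218
c0_htk_false = -53.187
print("ratio:", c0_htk_true / c0_htk_false, "sqrt(2):", math.sqrt(2.0))
```

In this subset only the first column differs. The torchaudio value in that column also changes by a factor close to sqrt(2) between the two htk_compat settings, while the Kaldi value is 17.38356 in both printouts. One possibility worth checking, which is an assumption and not confirmed in this thread, is that the two tools are using different use-energy settings, so that the column holds a log energy in one output and a cepstral coefficient in the other.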

@vincentqb vincentqb removed their assignment Aug 23, 2021