Fbank features are different from Kaldi Fbank #400

jooan84 · 2020-01-10T15:28:48Z

🐛 Bug

The output of the fbank feature calculations differs from that of kaldi.

To Reproduce

Steps to reproduce the behavior:

Using the following parameters, or even the defaults:

 torchaudio.compliance.kaldi.fbank(waveform, blackman_coeff=0.42, channel=-1, dither=1.0, energy_floor=0.0, frame_length=25.0, frame_shift=10.0, high_freq=0.0, htk_compat=True, low_freq=20.0, min_duration=0.0, num_mel_bins=40, preemphasis_coefficient=0.97, raw_energy=True, remove_dc_offset=True, round_to_power_of_two=True, sample_frequency=16000.0, snip_edges=True, subtract_mean=False, use_energy=False, use_log_fbank=True,use_power=True, vtln_high=-500.0, vtln_low=100.0, vtln_warp=1.0, window_type='hamming')[0]

produces this output:

tensor([-0.7616, -0.4791,  0.2155,  0.7661,  2.0723,  1.4565,  2.9888,  3.2548,
         1.8460,  3.5807,  3.8290,  4.1785,  4.6776,  4.5801,  5.3610,  4.4910,
         5.1519,  5.3534,  5.2783,  5.6159,  6.0689,  5.5961,  5.8068,  5.0957,
         6.5200,  6.9314,  6.1741,  7.0430,  7.9394,  8.2380,  8.7115,  8.4105,
         8.3154,  8.2186,  7.9444,  8.4468,  8.4293,  8.9476,  9.1008,  9.2495])

whereas Kaldi's compute-fbank-feats gives:

tensor([12.9911, 12.9795, 12.9127, 13.6171, 13.7416, 15.1579, 15.1996, 14.9468,
        14.1368, 14.8717, 14.8265, 13.8715, 15.2716, 15.0743, 15.2439, 15.3904,
        13.9460, 13.5932, 14.0038, 14.8721, 13.9944, 15.8337, 14.8682, 13.8247,
        15.0769, 15.1141, 15.1482, 14.7864, 13.6259, 14.4092, 14.1771, 13.6139,
        13.8014, 12.5796,  9.1051,  8.3382,  8.3738,  8.7829,  9.2973,  9.4913])


vincentqb · 2020-01-13T16:58:01Z

Can you provide the kaldi command you used?
Can you provide a sample file so we can reproduce?
Note that you are using dither=1.0, which adds random noise to the input.
See also #332, "The spectrogram computed by "torchaudio.compliance.kaldi.spectrogram" and "compute-spectrogram-feats" are different".
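
To illustrate why the dither setting matters in a comparison like this, here is a minimal, dependency-free sketch. It assumes dither is applied as additive Gaussian noise scaled by the dither value; `add_dither` and `log_energy` are hypothetical helpers written for this illustration, not torchaudio or Kaldi functions.

```python
import math
import random


def add_dither(samples, dither, rng):
    """Add Gaussian noise scaled by `dither` to each sample (assumed dither model)."""
    if dither == 0.0:
        return list(samples)
    return [s + rng.gauss(0.0, 1.0) * dither for s in samples]


def log_energy(samples):
    """Natural log of the summed squared samples of one frame."""
    return math.log(sum(s * s for s in samples))


# One 25 ms frame of a 440 Hz tone at 16 kHz, in the float [-1, 1] range.
frame = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(400)]
# The same frame on the 16-bit integer scale.
frame_int16_scale = [s * 32768.0 for s in frame]

rng = random.Random(0)
clean = log_energy(frame)
dithered_unit = log_energy(add_dither(frame, 1.0, rng))
clean_i16 = log_energy(frame_int16_scale)
dithered_i16 = log_energy(add_dither(frame_int16_scale, 1.0, rng))

# Shift in log energy caused by dither=1.0 at each input scale.
shift_unit_scale = abs(dithered_unit - clean)
shift_int16_scale = abs(dithered_i16 - clean_i16)
print(shift_unit_scale, shift_int16_scale)
```

Under this assumption, dither=1.0 is negligible for samples on the 16-bit scale but swamps a signal in the [-1, 1] range, so setting dither=0.0 on both sides is the safer choice when comparing outputs.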

mthrok · 2020-04-17T23:16:20Z

I looked into this, and it took a while to figure out why.

When you use the fbank function, you need to normalize the audio, and for that you need to load it with torchaudio.load_wav instead of torchaudio.load.

See my test or existing test.

This is extremely subtle.

cpuhrsch · 2020-04-29T04:01:06Z

@mthrok - should we add documentation about this or otherwise try to prevent this issue coming up again in the future? I'm surprised we have a need for a separate load_wav to begin with.

vincentqb · 2020-04-29T05:22:06Z

I second @cpuhrsch: I'm also surprised that torchaudio.load does not work here.

vincentqb · 2020-05-01T21:59:26Z

I don't believe we should rely on load_wav to fix this issue.

RuABraun · 2021-01-04T21:37:49Z

edit: After some testing, it seems that to get the closest match one has to skip normalisation and instead multiply by 2**15?

@mthrok normalising the audio does not help for me. Code:

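    # Imports assumed from the aliases used below (not part of the original post).
    import kaldi_io
    import soundfile as sf
    import torch as to
    from torchaudio.compliance.kaldi import fbank

    data, fs = sf.read('/idiap/resource/database/LibriSpeech/train-clean-360/100/121669/100-121669-0000.flac')
    data = to.from_numpy(data).float()
    data /= data.max()  # peak normalisation
    f = fbank(data.unsqueeze(0), num_mel_bins=40, low_freq=40, high_freq=7600)

    # Read the Kaldi features; the last utterance in the scp is kept.
    kaldi_feats = None
    for uttid, m in kaldi_io.read_mat_scp('scp:feats.scp'):
        kaldi_feats = m
    print(uttid)
    print(kaldi_feats[:2])
    print(f[:2])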

output

    100-121669-0000-1
    [[ 8.129056   7.732553   7.6204824  6.776312   7.437045   8.823427
   8.736998   8.304144   8.411314   8.19662    6.130655   8.646175
   9.119083   9.085771   8.314858   9.277414   9.7172785  9.830122
   9.228786   9.078177   9.063866   9.667826   8.975353   9.46149
   9.655378   9.932469   9.935007  10.056624   9.357061  10.264997
  10.36901   10.563572  10.689384  11.149243  11.518983  10.866757
  10.359279  10.542366  11.021458  10.561819 ]
 [ 8.081877   7.8777122  6.87261    8.406      9.237014   8.542725
   7.0748315  7.555811   8.742043   9.1879     7.651375   7.56339
   8.07299    9.343008   9.155113   9.235215   9.285145   9.729772
   9.2692585  9.870285  10.123455   9.58822    9.321457   9.46149
   9.285657   9.631441  11.042232  10.012186   9.731838   9.504875
  10.895826  10.652676  10.899666  10.996901  10.666897  11.006931
  10.998066  11.225334  11.071218  10.741457 ]]
tensor([[-10.5861, -10.9795, -11.1278, -11.9309, -11.2997,  -9.8805,  -9.8985,
         -10.3205, -10.3428, -10.5305, -12.9941, -10.0486,  -9.5567,  -9.6991,
         -10.3325,  -9.3442,  -8.9814,  -8.8237,  -9.3472,  -9.6113,  -9.7424,
          -8.9508,  -9.7846,  -9.3923,  -8.8430,  -8.8997,  -8.7163,  -8.5314,
          -9.2710,  -8.6714,  -8.3952,  -8.3978,  -8.0870,  -7.5590,  -7.4100,
          -7.9227,  -8.4362,  -8.7195,  -8.0624,  -8.5884],
        [-10.5894, -10.7786, -11.8293, -10.2971,  -9.4618, -10.1934, -11.7973,
         -11.3098, -10.0636,  -9.5083, -10.8814, -11.2168, -10.6213,  -9.4451,
          -9.5788,  -9.5073,  -9.5189,  -8.9797,  -9.5143,  -8.6416,  -8.4359,
          -9.1466,  -9.2892,  -9.3173,  -9.4014,  -9.2642,  -7.6490,  -8.6838,
          -9.0432,  -9.5034,  -7.9339,  -7.9784,  -7.9248,  -7.8987,  -8.2526,
          -8.2896,  -8.0052,  -7.9586,  -8.1519,  -8.1042]])

~~Also in my opinion if this is an important requirement then the function should check that the max is equal to 1 and warn otherwise.~~

Btw, I don't think it's good to assume the audio can be normalised, as you can't do this in a realtime setting.
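
A short sketch of why the input scale matters for log power features: multiplying the waveform by a constant c adds 2 ln c to every log power value, so a [-1, 1] input and the same input multiplied by 2**15 differ by about 20.79, provided no dither or energy floor interferes. The `log_power` helper is hypothetical and only stands in for a single log filterbank bin; it is not the torchaudio implementation.

```python
import math


def log_power(samples):
    """Natural log of the summed squared samples (a stand-in for one log power bin)."""
    return math.log(sum(s * s for s in samples))


# A short test signal in the float [-1, 1] range.
unit_scale = [0.3 * math.sin(0.1 * n) for n in range(400)]
# The same signal on the 16-bit integer scale.
int16_scale = [s * 2 ** 15 for s in unit_scale]

offset = log_power(int16_scale) - log_power(unit_scale)
expected = 2 * math.log(2 ** 15)
print(offset, expected)  # both are approximately 20.79
```

Peak normalisation only changes the constant c, so it shifts the features by a different amount rather than removing the offset.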

mthrok · 2021-01-05T15:23:17Z

Hi @RuABraun

As you figured out, normalization here means dtype conversion, that is, from float (with value range [-1, 1]) to int16 (with value range [-32,768, 32,767]).

According to my recent talk with @cpuhrsch, this fbank feature is not intended to match Kaldi's implementation precisely.

I found that our test suite for this function, which I thought covered this, was not sufficient, and that the function does not match Kaldi's result.

I personally think it is confusing to have a module named compliance that is implicitly not meant to match. Also, we are getting rid of the load_wav function, so we do need to change things around the compliance.kaldi module.

To lower the maintenance cost, I am in favor of building Kaldi and binding to it, which would guarantee that all the Kaldi-related features match Kaldi's results exactly, but that opinion is not getting support from anyone.

A similar issue is raised at #328.

RuABraun · 2021-01-05T17:40:56Z

Thank you for the explanation! :)

njusq · 2021-12-16T02:56:10Z

We also had the same problem two days ago under the setting subtract_mean=False.
We compared the results of torchaudio's fbank and Kaldi's compute-fbank-feats line by line. The differences came from the input values.
It is really confusing that the input to torchaudio's fbank should be float numbers in the range [-32,768., 32,767.] (not float [-1., 1.] or int16 [-32,768, 32,767]).
We fixed the problem by loading a 16-bit .wav file with dtype='int16' and converting the sample values to float directly, without any normalization, e.g. converting the value -3 to -3.0.
After the fix, the results are:

err between kaldi and torchaudio res (1.2798111e-07, 1.177518e-05, 7.390976e-05)
kaldi res: tensor([[ 7.7390,  6.6414,  5.9847,  ..., 11.2153, 10.8115, 10.6624],
        [ 8.3844,  7.3069,  5.6935,  ..., 11.3059, 11.9750, 11.1324],
        [ 5.9230,  4.5791,  6.4441,  ..., 11.5842, 12.4497, 11.9442],
        ...,
        [ 8.3075,  7.4419,  6.3531,  ...,  8.7440,  9.0616,  8.9001],
        [ 7.9940,  7.1240,  4.0873,  ...,  8.4048,  8.6729,  9.1240],
        [ 8.7946,  6.5140,  6.0803,  ...,  8.8812,  8.5578,  8.0560]])

torchaudio res tensor([[ 7.7390,  6.6414,  5.9847,  ..., 11.2153, 10.8115, 10.6624],
        [ 8.3844,  7.3069,  5.6935,  ..., 11.3059, 11.9750, 11.1324],
        [ 5.9230,  4.5791,  6.4441,  ..., 11.5842, 12.4497, 11.9442],
        ...,
        [ 8.3075,  7.4419,  6.3531,  ...,  8.7440,  9.0616,  8.9001],
        [ 7.9940,  7.1240,  4.0873,  ...,  8.4048,  8.6729,  9.1240],
        [ 8.7946,  6.5140,  6.0803,  ...,  8.8812,  8.5578,  8.0560]])
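
The loading step described above can be sketched with only the standard library. This is an illustration under assumptions, not njusq's code: it writes a tiny 16-bit mono WAV with made-up sample values to a temporary file, reads the samples back as int16, and converts them to float without rescaling, so -3 becomes -3.0.

```python
import array
import os
import sys
import tempfile
import wave

# Hypothetical sample values; any 16-bit PCM audio would do.
original = array.array("h", [-3, 0, 1200, -32768, 32767])

path = os.path.join(tempfile.mkdtemp(), "example.wav")
with wave.open(path, "wb") as writer:
    writer.setnchannels(1)      # mono
    writer.setsampwidth(2)      # 2 bytes per sample = 16-bit PCM
    writer.setframerate(16000)
    data = array.array("h", original)
    if sys.byteorder == "big":
        data.byteswap()         # WAV stores samples little-endian
    writer.writeframes(data.tobytes())

with wave.open(path, "rb") as reader:
    raw = reader.readframes(reader.getnframes())

samples = array.array("h")
samples.frombytes(raw)
if sys.byteorder == "big":
    samples.byteswap()

# Convert int16 to float directly, with no division by 32768.
float_samples = [float(s) for s in samples]
print(float_samples)  # [-3.0, 0.0, 1200.0, -32768.0, 32767.0]
```

The resulting values would then be wrapped in a float tensor of shape (channel, time) before being passed to torchaudio's fbank.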

Wonder1905 · 2022-03-29T07:45:32Z

Can you please share your code? it will be very useful!

mthrok · 2022-03-29T17:03:29Z

Can you please share your code? it will be very useful!

@BattashB

Something like this

waveform, sample_rate = torchaudio.load(<file>)  # waveform is float32, value range [-1, 1]
waveform = waveform * (1 << 15)  # scale to the 16-bit range, roughly [-32,768., 32,767.]

vincentqb self-assigned this Jan 13, 2020

mthrok closed this as completed Apr 17, 2020

mthrok mentioned this issue May 1, 2020: Add compatibility test for compute-fbank-feats #602 (Merged)

vincentqb reopened this May 1, 2020

mthrok mentioned this issue Jun 18, 2020: Add TorchScript-based SoX I/O backend #726 (Merged)

mthrok mentioned this issue Feb 15, 2021: RFC: The future of Kaldi compliance module #1269 (Open)

axuan731 mentioned this issue Mar 18, 2024: Problem of different embedding extractors BUTSpeechFIT/VBx#65 (Closed)
