Fbank features are different from Kaldi Fbank #400

Open
jooan84 opened this issue Jan 10, 2020 · 11 comments

@jooan84

jooan84 commented Jan 10, 2020

🐛 Bug

The output of the fbank feature calculations differs from that of kaldi.

To Reproduce

Steps to reproduce the behavior:

Using the following parameters (or even the defaults):

 torchaudio.compliance.kaldi.fbank(waveform, blackman_coeff=0.42, channel=-1, dither=1.0, energy_floor=0.0, frame_length=25.0, frame_shift=10.0, high_freq=0.0, htk_compat=True, low_freq=20.0, min_duration=0.0, num_mel_bins=40, preemphasis_coefficient=0.97, raw_energy=True, remove_dc_offset=True, round_to_power_of_two=True, sample_frequency=16000.0, snip_edges=True, subtract_mean=False, use_energy=False, use_log_fbank=True, use_power=True, vtln_high=-500.0, vtln_low=100.0, vtln_warp=1.0, window_type='hamming')[0]

produces this output:

tensor([-0.7616, -0.4791,  0.2155,  0.7661,  2.0723,  1.4565,  2.9888,  3.2548,
         1.8460,  3.5807,  3.8290,  4.1785,  4.6776,  4.5801,  5.3610,  4.4910,
         5.1519,  5.3534,  5.2783,  5.6159,  6.0689,  5.5961,  5.8068,  5.0957,
         6.5200,  6.9314,  6.1741,  7.0430,  7.9394,  8.2380,  8.7115,  8.4105,
         8.3154,  8.2186,  7.9444,  8.4468,  8.4293,  8.9476,  9.1008,  9.2495])

whereas Kaldi's compute-fbank-feats produces:

tensor([12.9911, 12.9795, 12.9127, 13.6171, 13.7416, 15.1579, 15.1996, 14.9468,
        14.1368, 14.8717, 14.8265, 13.8715, 15.2716, 15.0743, 15.2439, 15.3904,
        13.9460, 13.5932, 14.0038, 14.8721, 13.9944, 15.8337, 14.8682, 13.8247,
        15.0769, 15.1141, 15.1482, 14.7864, 13.6259, 14.4092, 14.1771, 13.6139,
        13.8014, 12.5796,  9.1051,  8.3382,  8.3738,  8.7829,  9.2973,  9.4913])
@vincentqb
Contributor

@vincentqb vincentqb self-assigned this Jan 13, 2020
@mthrok
Collaborator

mthrok commented Apr 17, 2020

I looked into this, and it took me a while to figure out why.

When you use the fbank function, you need to normalize the audio, and for that you need to use the torchaudio.load_wav function instead of torchaudio.load.

See my test or existing test.

This is extremely subtle.

@mthrok mthrok closed this as completed Apr 17, 2020
@cpuhrsch
Contributor

@mthrok - should we add documentation about this or otherwise try to prevent this issue coming up again in the future? I'm surprised we have a need for a separate load_wav to begin with.

@vincentqb
Contributor

I second @cpuhrsch: I'm also surprised that torchaudio.load does not work here.

@vincentqb
Contributor

I don't believe we should rely on load_wav to fix this issue.

@RuABraun

RuABraun commented Jan 4, 2021

Edit: After some testing, it seems that to get the closest match one has to skip normalisation and instead multiply by 2**15?

@mthrok normalising the audio does not help for me. Code:

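    # Imports assumed from the aliases used below (not shown in the original post).
    import kaldi_io
    import soundfile as sf
    import torch as to
    from torchaudio.compliance.kaldi import fbank

    # Read the audio as float samples, then divide by the maximum sample value.
    data, fs = sf.read('/idiap/resource/database/LibriSpeech/train-clean-360/100/121669/100-121669-0000.flac')
    data = to.from_numpy(data).float()
    data /= data.max()
    # fbank expects a (channel, time) tensor, hence the unsqueeze.
    f = fbank(data.unsqueeze(0), num_mel_bins=40, low_freq=40, high_freq=7600)

    # Keep the last feature matrix (and its utterance id) read from the Kaldi scp.
    kaldi_feats = None
    for uttid, m in kaldi_io.read_mat_scp('scp:feats.scp'):
        kaldi_feats = m
    print(uttid)
    print(kaldi_feats[:2])
    print(f[:2])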

Output:

    100-121669-0000-1
    [[ 8.129056   7.732553   7.6204824  6.776312   7.437045   8.823427
   8.736998   8.304144   8.411314   8.19662    6.130655   8.646175
   9.119083   9.085771   8.314858   9.277414   9.7172785  9.830122
   9.228786   9.078177   9.063866   9.667826   8.975353   9.46149
   9.655378   9.932469   9.935007  10.056624   9.357061  10.264997
  10.36901   10.563572  10.689384  11.149243  11.518983  10.866757
  10.359279  10.542366  11.021458  10.561819 ]
 [ 8.081877   7.8777122  6.87261    8.406      9.237014   8.542725
   7.0748315  7.555811   8.742043   9.1879     7.651375   7.56339
   8.07299    9.343008   9.155113   9.235215   9.285145   9.729772
   9.2692585  9.870285  10.123455   9.58822    9.321457   9.46149
   9.285657   9.631441  11.042232  10.012186   9.731838   9.504875
  10.895826  10.652676  10.899666  10.996901  10.666897  11.006931
  10.998066  11.225334  11.071218  10.741457 ]]
tensor([[-10.5861, -10.9795, -11.1278, -11.9309, -11.2997,  -9.8805,  -9.8985,
         -10.3205, -10.3428, -10.5305, -12.9941, -10.0486,  -9.5567,  -9.6991,
         -10.3325,  -9.3442,  -8.9814,  -8.8237,  -9.3472,  -9.6113,  -9.7424,
          -8.9508,  -9.7846,  -9.3923,  -8.8430,  -8.8997,  -8.7163,  -8.5314,
          -9.2710,  -8.6714,  -8.3952,  -8.3978,  -8.0870,  -7.5590,  -7.4100,
          -7.9227,  -8.4362,  -8.7195,  -8.0624,  -8.5884],
        [-10.5894, -10.7786, -11.8293, -10.2971,  -9.4618, -10.1934, -11.7973,
         -11.3098, -10.0636,  -9.5083, -10.8814, -11.2168, -10.6213,  -9.4451,
          -9.5788,  -9.5073,  -9.5189,  -8.9797,  -9.5143,  -8.6416,  -8.4359,
          -9.1466,  -9.2892,  -9.3173,  -9.4014,  -9.2642,  -7.6490,  -8.6838,
          -9.0432,  -9.5034,  -7.9339,  -7.9784,  -7.9248,  -7.8987,  -8.2526,
          -8.2896,  -8.0052,  -7.9586,  -8.1519,  -8.1042]])

Also, in my opinion, if this is an important requirement, then the function should check that the max is equal to 1 and warn otherwise.
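A minimal sketch of what such a check could look like, written as a hypothetical helper (`warn_if_normalized` is not part of torchaudio). Since later comments in this thread conclude that fbank matches Kaldi when the input is in the int16 value range, the sketch warns in the opposite case, when the peak suggests audio scaled to [-1, 1]. It works on any iterable of numbers, so a tensor would first need something like `.flatten().tolist()`.

```python
import warnings


def warn_if_normalized(samples, threshold=1.0):
    """Return True (and warn) when the peak amplitude is within (0, threshold].

    Hypothetical helper, not a torchaudio API. A peak at or below 1.0
    suggests float audio in [-1, 1] rather than the int16 value range.
    All-zero input is treated as silence and does not trigger a warning.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    looks_normalized = 0.0 < peak <= threshold
    if looks_normalized:
        warnings.warn(
            f"peak amplitude is {peak:.4f}; input looks normalized to [-1, 1], "
            "so the features may not match Kaldi's",
            stacklevel=2,
        )
    return looks_normalized
```

Calling `warn_if_normalized([0.25, -0.5, 0.1])` warns and returns True, while `warn_if_normalized([8192.0, -16384.0])` stays silent and returns False. A very quiet int16-range recording would never have a peak below 1, so the heuristic is cheap, but it is only a heuristic.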

Btw, I don't think it's good to assume normalised audio, as you can't normalise in a real-time setting.

@mthrok
Collaborator

mthrok commented Jan 5, 2021

Hi @RuABraun

As you figured out, normalization here means dtype conversion, that is, float (with value range [-1, 1]) to int16 (with value range [-32768, 32767]).
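To see why the value range matters so much, note that scaling a signal by a constant s scales its power spectrum by s squared, so every log power value shifts by 2·ln(s). The sketch below checks this with a naive DFT in pure Python; the frame values are arbitrary illustration data, and the same offset carries over to log-mel energies only under the assumption that dither and energy flooring play no role, because the mel filterbank is a linear combination of power bins.

```python
import cmath
import math


def log_power_spectrum(frame):
    """Natural log of |X[k]|^2 for DFT bins 0..N/2 (naive O(N^2) DFT)."""
    n = len(frame)
    result = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)
        )
        result.append(math.log(abs(coeff) ** 2))
    return result


# Arbitrary frame in [-1, 1], and the same frame scaled to the int16 range.
frame = [0.5, -0.25, 0.125, 0.75, -0.6, 0.3, 0.05, -0.1]
scaled = [x * 32768.0 for x in frame]

# Per-bin difference between the two log power spectra.
offsets = [
    b - a for a, b in zip(log_power_spectrum(frame), log_power_spectrum(scaled))
]
expected = 2.0 * math.log(32768.0)  # roughly 20.79

print(offsets)
print(expected)
```

Every entry of `offsets` equals `expected` up to floating-point error, which is the kind of near-constant gap one would expect between features computed from [-1, 1] input and from int16-range input.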

According to my recent talk with @cpuhrsch, this fbank feature is not intended to precisely match Kaldi's implementation.

I found that our test suite for this function, which I thought was covering it, was not sufficient, and the function does not match Kaldi's result.

I personally think that it is more confusing to have a module named compliance, which is implicitly not meant to match. Also we are getting rid of load_wav function, so we do need to change things around compliance.kaldi module.

To lower the maintenance cost, I am in favor of building Kaldi and binding to it, which would guarantee that all Kaldi-related features match Kaldi's results exactly, but that opinion is not getting support from anyone.

A similar issue is raised in #328.

@RuABraun

RuABraun commented Jan 5, 2021

Thank you for the explanation! :)

@njusq

njusq commented Dec 16, 2021

We also had the same problem two days ago under the setting subtract_mean = False.
We compared the results of torchaudio's fbank and Kaldi's compute-fbank-feats line by line. The differences originated in the input values.
It is really confusing that the input to torchaudio's fbank should be floating-point numbers in the range [-32768.0, 32767.0] (not float in [-1.0, 1.0] or int16 in [-32768, 32767]).
We fixed the problem by loading a 16-bit .wav file with dtype='int16' and converting the signal values to float directly, without any normalization. For example, we converted the value -3 to -3.0.
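A minimal, standard-library sketch of that loading step, under the assumption of 16-bit PCM WAV input. It is not the code used above: `read_int16_wav_as_float` and `write_demo_wav` are hypothetical helper names, and the demo file is synthetic.

```python
import array
import os
import sys
import tempfile
import wave


def read_int16_wav_as_float(path):
    """Read a 16-bit PCM WAV file and return (samples, sample_rate).

    Samples are floats in the int16 value range (-3 becomes -3.0); nothing
    is divided by 32768. Multi-channel data stays interleaved.
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM audio")
        sample_rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    pcm = array.array("h", raw)
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian
    return [float(s) for s in pcm], sample_rate


def write_demo_wav(path, values, sample_rate=16000):
    """Write a mono 16-bit WAV file so the reader can be tried out."""
    pcm = array.array("h", values)
    if sys.byteorder == "big":
        pcm.byteswap()
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm.tobytes())


demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
write_demo_wav(demo_path, [-3, 0, 1200, -32768, 32767])
samples, rate = read_int16_wav_as_float(demo_path)
print(samples, rate)
```

For a mono file, the resulting list could then be wrapped with something like `torch.tensor(samples).unsqueeze(0)` to get the (channel, time) tensor that fbank takes.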
After fixing, the result shows that:

err between kaldi and torchaudio res (1.2798111e-07, 1.177518e-05, 7.390976e-05)
kaldi res: tensor([[ 7.7390,  6.6414,  5.9847,  ..., 11.2153, 10.8115, 10.6624],
        [ 8.3844,  7.3069,  5.6935,  ..., 11.3059, 11.9750, 11.1324],
        [ 5.9230,  4.5791,  6.4441,  ..., 11.5842, 12.4497, 11.9442],
        ...,
        [ 8.3075,  7.4419,  6.3531,  ...,  8.7440,  9.0616,  8.9001],
        [ 7.9940,  7.1240,  4.0873,  ...,  8.4048,  8.6729,  9.1240],
        [ 8.7946,  6.5140,  6.0803,  ...,  8.8812,  8.5578,  8.0560]])

torchaudio res tensor([[ 7.7390,  6.6414,  5.9847,  ..., 11.2153, 10.8115, 10.6624],
        [ 8.3844,  7.3069,  5.6935,  ..., 11.3059, 11.9750, 11.1324],
        [ 5.9230,  4.5791,  6.4441,  ..., 11.5842, 12.4497, 11.9442],
        ...,
        [ 8.3075,  7.4419,  6.3531,  ...,  8.7440,  9.0616,  8.9001],
        [ 7.9940,  7.1240,  4.0873,  ...,  8.4048,  8.6729,  9.1240],
        [ 8.7946,  6.5140,  6.0803,  ...,  8.8812,  8.5578,  8.0560]])

@Wonder1905

Can you please share your code? It would be very useful!

@mthrok
Collaborator

mthrok commented Mar 29, 2022

Can you please share your code? It would be very useful!

@BattashB

Something like this

import torchaudio
waveform, sample_rate = torchaudio.load(<file>)  # waveform is float32, value range [-1, 1]
waveform = waveform * (1 << 15)  # scale by 2**15 = 32768 to the int16 value range [-32768, 32767], kept as float
