Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why n_fft=400 by default in Transforms? #384

Closed
keunwoochoi opened this issue Dec 29, 2019 · 7 comments
Closed

Why n_fft=400 by default in Transforms? #384

keunwoochoi opened this issue Dec 29, 2019 · 7 comments
Assignees

Comments

@keunwoochoi
Copy link

In MelSpectrogram, Spectrogram, GriffinLim, n_fft defaults to 400. Is there a reason for not setting it with a power of 2?

@faroit
Copy link
Contributor

faroit commented Dec 29, 2019

I guess this is coming from old days of (time domain) speech processing where the sample rate is 8kHz and 50ms are considered to be a good window size. Since 512 is also common, changing the default could indeed make sense...

@vincentqb
Copy link
Contributor

The current value has been around for a while, and hasn't changed since this would be BC-breaking. I'd be ok changing it, if there is a strong reason to do so. Thoughts?

@vincentqb vincentqb self-assigned this Jan 2, 2020
@keunwoochoi
Copy link
Author

I see. I have no benchmark data, but thought the backed fft operation could be more efficient with `n_fft=2**N. But I don't have a strong opinion now - after googling I realized non-power-of-two fft could be efficient too :)

@vincentqb
Copy link
Contributor

Quick run, and I don't see significant differences :)

In [1]: f = "steam-train-whistle-daniel_simon.wav"                                                                                                                                                          

In [2]: import torchaudio                                                                                                                                                                                   

In [3]: w, s = torchaudio.load(f)                                                                                                                                                                           

In [4]: %timeit torchaudio.transforms.Spectrogram(n_fft=400)(w)                                                                                                                                             
10.5 ms ± 376 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: %timeit torchaudio.transforms.Spectrogram(n_fft=512)(w)                                                                                                                                             
11.1 ms ± 207 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@vincentqb
Copy link
Contributor

I will close this issue for now. Please feel free to re-open if there are more elements you would like to add to the discussion :)

@PetrochukM
Copy link

@faroit What would you consider a better window size today? In the recent Tacotron paper, they also used a 50-millisecond frame size; however, the Kaldi spectrogram recommends a 25-millisecond frame size. From my online readings, it sounds like 20 - 30 milliseconds is recommended for a text-to-speech application with a 50% hop length.

@faroit
Copy link
Contributor

faroit commented Mar 10, 2020

@PetrochukM Yes, maybe 32ms (fft = 512) would be a better fit with respect to performance as pointed out by @keunwoochoi

mthrok pushed a commit to mthrok/audio that referenced this issue Feb 26, 2021
mthrok pushed a commit to mthrok/audio that referenced this issue Feb 26, 2021
some improvements to the make download(pytorch#384)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants