Why n_fft=400 by default in Transforms? #384
Comments
I guess this comes from the old days of (time-domain) speech processing, where the sample rate was 8 kHz and 50 ms was considered a good window size (400 samples at 8 kHz is 50 ms). Since 512 is also common, changing the default could indeed make sense...
The current value has been around for a while, and it hasn't been changed because doing so would be BC-breaking. I'd be OK with changing it if there is a strong reason to do so. Thoughts?
I see. I have no benchmark data, but I thought the backend FFT operation could be more efficient with `n_fft=2**N`. I don't have a strong opinion now, though - after googling I realized a non-power-of-two FFT can be efficient too :)
Quick run, and I don't see significant differences :)
I will close this issue for now. Please feel free to re-open if there are more elements you would like to add to the discussion :)
@faroit What would you consider a better window size today? The recent Tacotron paper also used a 50-millisecond frame size; however, the Kaldi spectrogram recommends a 25-millisecond frame size. From my online reading, it sounds like 20-30 milliseconds with a 50% hop length is recommended for a text-to-speech application.
@PetrochukM Yes, maybe 32 ms (`n_fft = 512` at 16 kHz) would be a better fit with respect to performance, as pointed out by @keunwoochoi
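The window sizes quoted in this thread all follow from the same conversion between samples and milliseconds. Here is a minimal sketch of that arithmetic; the helper names `window_ms` and `samples_for` are made up for illustration and are not part of torchaudio, and the 16 kHz rate is an assumption chosen because it makes 512 samples come out at 32 ms.

```python
def window_ms(n_fft: int, sample_rate: int) -> float:
    """Duration in milliseconds of a window that is n_fft samples long."""
    return 1000.0 * n_fft / sample_rate


def samples_for(duration_ms: float, sample_rate: int) -> int:
    """Number of samples (rounded) covering duration_ms at sample_rate."""
    return round(duration_ms * sample_rate / 1000.0)


# The default n_fft=400 at the two sample rates discussed here.
print(window_ms(400, 8000))    # 50.0 ms at 8 kHz
print(window_ms(400, 16000))   # 25.0 ms at 16 kHz (assumed rate)

# The proposed power-of-two alternative.
print(window_ms(512, 16000))   # 32.0 ms at 16 kHz (assumed rate)

# A 50% hop length is simply half the window.
n_fft = samples_for(25, 16000)  # 400 samples
hop_length = n_fft // 2         # 200 samples
```

So the same `n_fft=400` is a 50 ms window at 8 kHz but a 25 ms window at 16 kHz, which is why the "right" default depends on the sample rate one has in mind.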
some improvements to the make download(pytorch#384)
In `MelSpectrogram`, `Spectrogram`, `GriffinLim`, `n_fft` defaults to 400. Is there a reason for not setting it to a power of 2?
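For reference, a small sketch of how one might check whether a given `n_fft` is a power of two and find the nearest one above it. The helper names `is_power_of_two` and `next_power_of_two` are hypothetical and not part of torchaudio.

```python
def is_power_of_two(n: int) -> bool:
    """True if n is a positive integer with exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is greater than or equal to n."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return 1 << (n - 1).bit_length()


print(is_power_of_two(400))    # False
print(next_power_of_two(400))  # 512
print(is_power_of_two(512))    # True
```

For the default of 400, the next power of two is 512, which is the alternative value discussed in the comments.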