Increase mfcc step size instead of throwing away feature frames #1744
Comments
I'd recommend increasing winlen to 0.032 to match the 512-sample FFT. This avoids creating the step discontinuity when using the traditional Hamming window. That is, a 480-sample length adds 32 zeroes to create a 512-sample FFT, and the discontinuity of the raised cosine window at sample 480 produces some spectral splatter. HOWEVER, it appears to me that line 226 of deepspeech.cc invokes feature generation with NO window function at all, and that's a serious problem. Is that indeed the usual path for audio feature generation? Also note that any of these changes will make existing models incompatible with new audio, to various degrees. I don't know what your compatibility policy is for this.
Without a window function, the current implementation with 0.025s length has two discontinuities in the FFT input: one at sample 400 and one at sample 512/0. Not good.
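To make the sample counts in the two comments above concrete, here is a minimal sketch of the arithmetic. It assumes a 16 kHz sample rate (implied by 0.032 s matching 512 samples) and the textbook Hamming formula; the helper names are hypothetical and none of this is taken from the DeepSpeech code.

```python
import math

RATE = 16000  # assumed sample rate in Hz (0.032 s * 16000 = 512 samples)
NFFT = 512    # FFT size mentioned in the thread


def frame_samples(winlen):
    """Number of audio samples in a window of `winlen` seconds."""
    return round(winlen * RATE)


def zero_padding(winlen):
    """Zeroes appended to a frame to fill the NFFT-sample FFT input."""
    return NFFT - frame_samples(winlen)


def hamming(n, length):
    """Textbook Hamming window value at index n for a window of `length` samples."""
    return 0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))


for winlen in (0.025, 0.030, 0.032):
    print(winlen, frame_samples(winlen), zero_padding(winlen))

# The Hamming window does not taper to zero at its edges, so a 480-sample
# frame followed by zero padding still has a small step at sample 480.
edge = hamming(479, 480)
print(edge)
```

With no window function at all, the step at the end of the frame is the full signal amplitude rather than the scaled-down Hamming edge value, which is why the unwindowed 400-sample case is the worse one.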
Currently, as we have not yet reached 1.0, we are free to break backwards compatibility when the engine benefits.
@khsinclair thanks for the tips! I've created PR #1773 fixing this.
DeepSpeech/util/audio.py
Lines 17 to 21 in a3a96cf
We could instead pass winlen=0.03s and winstep=0.02s to mfcc to get the same rate of feature windows over time, but without discarding any data.
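As a rough check of the "same rate of feature windows" claim, the sketch below compares frame start positions for the two approaches. It assumes 16 kHz audio, a current setup of winlen=0.025 and winstep=0.01 with every second frame kept (the referenced lines are not shown here, so this is a guess), and a simplified framing rule with no padding of the last frame. `frame_starts` is a hypothetical helper, not part of any library.

```python
RATE = 16000          # assumed sample rate in Hz
N_SAMPLES = RATE      # one second of audio, for illustration


def frame_starts(n_samples, rate, winlen, winstep):
    """Start indices of all full frames (simplified: the tail is not padded)."""
    frame_len = round(winlen * rate)
    frame_step = round(winstep * rate)
    if n_samples < frame_len:
        return []
    return list(range(0, n_samples - frame_len + 1, frame_step))


# Assumed current approach: dense frames, then drop every other one.
dense = frame_starts(N_SAMPLES, RATE, winlen=0.025, winstep=0.01)
kept = dense[::2]

# Proposed approach: longer window, larger step, nothing dropped.
direct = frame_starts(N_SAMPLES, RATE, winlen=0.03, winstep=0.02)

print(len(dense), len(kept), len(direct))
print(kept == direct)
```

Under these assumptions both approaches yield frames at the same positions for this one-second signal, but the second computes only half as many frames and each one covers 480 samples instead of 400.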