-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat!: when preprocessing, everyvoice forces equal length time and f… #421
Conversation
f756ce9
to
145b98e
Compare
|
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #421 +/- ##
==========================================
+ Coverage 73.45% 73.49% +0.03%
==========================================
Files 43 43
Lines 2837 2852 +15
Branches 467 468 +1
==========================================
+ Hits 2084 2096 +12
- Misses 668 671 +3
Partials 85 85 ☔ View full report in Codecov by Sentry. |
…ature reps audio must be divisible by the declared hop size the number of frames in the spectrogram must exactly equal the number of samples in the audio when multiplied by the hop size
145b98e
to
a96307d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we put the calculation of the number of frame in a function to make the code DRY and to help ourselves in future if we have to change how we calculate the end of the wav file.
@@ -741,6 +759,14 @@ def process_text( | |||
return (character_tokens, phone_tokens, pfs) | |||
|
|||
def process_spec(self, item): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should annotate the return type
…eature reps
audio must be divisible by the declared hop size
the number of frames in the spectrogram must exactly equal the number of samples in the audio when multiplied by the hop size
PR Goal?
We had very tiny mismatches between length of the audio and Mel spectrograms when the number of samples in the audio was not evenly divisible by the hop size. This PR truncates the audio when preprocessing to be evenly divisible by the hop size.
This is implemented for both the input and (potentially) upsampled audio in the case of vocoder that applies super-resolution to input spectrograms.
This means that we effectively discard up to
hop_size - 1
samples in the audio. For the default (22.05 kHz audio with 256 hop size) this means that we discard up to 255/22050 = 11.5ms of audio, which I think is OK :)Fixes?
Hopefully makes it easier to some downstream models that depend on exact alignment between
Feedback sought?
Sanity checking
Priority?
low/medium
Tests added?
Added some unittests
How to test?
Just having a look at the code is sufficient. If you want, you could try preprocessing some data and ensure that the resulting audio is divisible by the hop size (256 by default). Similarly you should be able to read in the preprocessed spectral features and multiply the 2nd dimension by the hop size to obtain the exact number of samples in the audio
Confidence?
high
Version change?
If we weren't in pre-alpha yes.