Improve alignment accuracy by normalizing audio features #625
Audio data should be pre-processed with the Wav2Vec2Processor (Wav2Vec2FeatureExtractor). I have noticed a considerable improvement in alignment (lower mean absolute error) when the audio is normalized to zero mean and unit variance by the processor before the forward pass.

In addition, each Hugging Face Wav2Vec2 feature extractor configuration should match the settings that were used when the model was fine-tuned (e.g. normalization, attention_mask usage).
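For readers unfamiliar with the normalization step, here is a minimal sketch of zero-mean, unit-variance scaling on a single waveform. It uses only the standard library so it runs anywhere; the function name `zero_mean_unit_var` and the `eps` value are illustrative assumptions, not code from this PR, and in practice the feature extractor applies the equivalent operation to arrays.

```python
import math


def zero_mean_unit_var(samples, eps=1e-7):
    """Scale a 1-D waveform to zero mean and (approximately) unit variance.

    `eps` is an assumed small constant that guards against division by
    zero on silent (constant) input.
    """
    n = len(samples)
    if n == 0:
        return []
    mean = sum(samples) / n
    # Population variance over the whole utterance.
    var = sum((s - mean) ** 2 for s in samples) / n
    scale = math.sqrt(var + eps)
    return [(s - mean) / scale for s in samples]


# Toy waveform standing in for real audio samples.
waveform = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
normalized = zero_mean_unit_var(waveform)
```

The point of the sketch is that the model sees inputs on the same scale it saw during fine-tuning, regardless of the recording level of the original audio.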
A typical Hugging Face Wav2Vec2 feature extractor config file is as follows:

To maintain backwards compatibility, I have opted to let the user decide whether pre-processing should be applied, but chose to enable pre-processing by default because it improves alignment considerably.
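As a rough illustration of the opt-out behaviour described above, the sketch below gates normalization behind a flag that defaults to on. Everything named here is hypothetical: `prepare_input`, the `preprocess` parameter, and the `EXTRACTOR_CONFIG` dictionary are stand-ins for illustration and are not the names used in this PR. The config keys mirror fields commonly seen in Wav2Vec2 feature extractor configs, with illustrative values.

```python
import math

# Hypothetical stand-in for a feature extractor config; values are
# illustrative and should match whatever the model was fine-tuned with.
EXTRACTOR_CONFIG = {
    "do_normalize": True,
    "return_attention_mask": False,
    "sampling_rate": 16000,
    "padding_value": 0.0,
}


def prepare_input(samples, preprocess=True, config=EXTRACTOR_CONFIG, eps=1e-7):
    """Return the waveform to feed to the model.

    preprocess=True  (default): apply the config's normalization setting.
    preprocess=False: return the samples unchanged (previous behaviour).
    """
    if not preprocess or not config.get("do_normalize", False):
        return list(samples)
    n = len(samples)
    if n == 0:
        return []
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    scale = math.sqrt(var + eps)
    return [(s - mean) / scale for s in samples]


raw = [0.2, 0.4, 0.6, 0.4]
default_out = prepare_input(raw)                   # normalized
legacy_out = prepare_input(raw, preprocess=False)  # untouched
```

Callers that relied on the old behaviour would pass `preprocess=False`; everyone else gets normalized input without changing their code.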