Questions about HuBERT/Wav2Vec2 pre-training #2947
Hi @arlofaria, thanks for sharing these questions, which I think are very inspiring.

To question 1: the advantage of using group normalization instead of layer normalization is faster training, but at the same time the gradients can be unstable, as you observed in the loss curve. I think you can use layer normalization in the Base model to stabilize the training. NOTE: you also need to enable normalizing the waveform before feeding it to the HuBERT model (see #2873).

To question 2: I actually tried a 0.001 weight on the unmasked frames and it did help increase the unmasked accuracy.

To question 3: This is true. Since the HuBERT paper mentioned that a 0 weight for unmasked frames is optimal, I hardcoded the sample_size to be the number of masked frames. I can make it more flexible by letting the loss function return two sample sizes and setting the value to 0 if the weight is 0.

To question 4: I think the feature penalty loss is there to make the feature extraction layers more sparse or to avoid overflow, so both masked and unmasked frames can be helpful for that purpose.

Regarding normalization of the masked and unmasked losses by their respective lengths, I think it's a good idea. Would you like to run another experiment to see if that is beneficial to the training? If so, we can add it to the current recipe. Thanks!
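For reference, a minimal sketch of what "normalizing the waveform" typically means in this context: per-utterance zero-mean, unit-variance scaling applied before the audio reaches the model. The helper name and epsilon are illustrative, not the actual recipe code:

```python
import torch

def normalize_waveform(waveform: torch.Tensor) -> torch.Tensor:
    """Zero-mean, unit-variance normalization of a mono waveform (shape: [time])."""
    mean = waveform.mean()
    std = waveform.std()
    return (waveform - mean) / (std + 1e-5)

# Example: normalize each utterance before passing the batch to the pre-training model.
waveform = torch.randn(16000)          # stand-in for one second of 16 kHz audio
normalized = normalize_waveform(waveform)
```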
Thanks for the helpful replies!
And one more question, which seems like it may be related to layer normalization:
Just following up on a few of these questions:

1. Regarding the training instability with NaN loss, and the "hiccup" in the loss curve: I don't have a clear explanation for the "hiccup" shape in the curves, but I think it might be an artifact of Lightning's logging: perhaps it appropriately zeros the gradients or skips the batches when encountering a NaN loss, but does not correctly update the batch size when logging the accumulated metric?
2. Regarding the drop in unmasked accuracy
4. Regarding the weighted combination of masked and unmasked losses
5. Regarding the bias vectors for the CNN feature extractor

And lastly another question:

6. Regarding
Yeah. The plan for the
I would certainly be very interested in a Data2Vec(2) recipe! One of the major practical drawbacks of the HuBERT approach is the bottleneck of dumping large hidden activations to disk before k-means clustering; the multi-iteration training is also slightly inconvenient. AFAICT, Data2Vec avoids those problems by keeping everything in memory and doing a single training run; Data2Vec2 also seems to incorporate some very significant speedups and incremental accuracy improvements. That's particularly helpful in comparison to HuBERT, where I've had to explore various shortcuts to scale experiments. I think the main concern is whether the approach is still being refined; if so, it might be best to wait a bit. For example, is there currently a Data2Vec3 being researched?

A Wav2Vec2 recipe would also be nice. A specific use case to consider is further pre-training, where a well-trained large model is used as an initial checkpoint. For HuBERT, this required a bit of hacking on my part to find the original k-means model and to patch in a few components of the Fairseq models that were missing from their repackaged bundles in TorchAudio. However, I largely chose to explore HuBERT instead of Wav2Vec2 because of the availability of the recipe in TorchAudio, which seemed a bit easier to follow and perhaps more actively maintained than what's published in Fairseq. Also, I have been a very big fan of its use of Lightning, and would encourage structuring future recipes to continue using it, ideally with automatic optimization enabled.
Data2vec 2.0 is indeed promising, as it outperforms both HuBERT and WavLM. Let me check with them to see if there is a new iteration. I will look into the details when I get more bandwidth. It seems the key component is the teacher-student training and the loss functions; the model architecture is pretty much the same as Wav2Vec2.
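For context, the teacher-student component mentioned above boils down to maintaining an exponential-moving-average (EMA) copy of the student's weights to produce regression targets. A minimal sketch of the EMA update follows; the decay value and function names are illustrative and not the fairseq implementation:

```python
import copy
import torch

def build_ema_teacher(student: torch.nn.Module) -> torch.nn.Module:
    """Create a frozen copy of the student to serve as the EMA teacher."""
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, decay: float = 0.999):
    """Blend student weights into the teacher: theta_t <- decay * theta_t + (1 - decay) * theta_s."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(decay).add_(p_s, alpha=1.0 - decay)
```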
🐛 Describe the bug
[This is not really a bug report, more a request for clarification/discussion...]
I'm training a HuBERT BASE-sized model, first iteration targeting 100 clusters of MFCC, on a custom 1000-hr dataset (i.e. similar size to Librispeech). See the intriguing plots below:
Note that the triangles in the Tensorboard plots indicate there were NaN values. I'm perplexed as to why that leads the loss curve and unit prediction accuracies to have a "hiccup" around step 180K, but then seem to recover by step 200K. My hypothesis is that it's related to the special treatment of feature penalty and layer normalization in the BASE-sized models.
As noted in the Wav2Vec2 paper:
Inspecting the HuBERT code, it's worth clarifying that the L2 "feature penalty" is in fact always included in the loss function, and it is scaled up by a hardcoded factor of 10x -- regardless of dataset or model size, and irrespective of any masking -- whereas the 10x downscaling of the feature encoder gradients and the specialized layer normalization are only enabled in the configurations for BASE-sized models. So I think a perhaps more straightforward interpretation is that the feature penalty is effectively unscaled for BASE models, and 10x upscaled for LARGE/XLARGE models. Am I reading that correctly?
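To make that interaction concrete, here is a toy sketch of the scaling just described; the names (`base_config`, `feature_grad_mult`) and the exact structure are illustrative rather than the actual TorchAudio identifiers:

```python
import torch

def toy_loss(features: torch.Tensor,
             masked_ce_loss: torch.Tensor,
             base_config: bool) -> torch.Tensor:
    """Illustrates the scaling described above: the L2 feature penalty is always
    weighted by 10, while BASE configs additionally downscale the feature
    extractor gradients by 10x via a gradient multiplier."""
    feature_grad_mult = 0.1 if base_config else 1.0
    # Gradient-scaling trick: forward value unchanged, backward gradient scaled.
    features = feature_grad_mult * features + (1 - feature_grad_mult) * features.detach()

    feature_penalty = features.pow(2).mean()   # averaged over all frames, masked or not
    return masked_ce_loss + 10.0 * feature_penalty
```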
So my first question: what changes might be suggested to avoid the loss "hiccup" that I've observed? Should I try adjusting the feature penalty scale and/or enabling standard layer normalization in the BASE model configuration?
My second concern is the drop in unmasked accuracy, which seems to begin somewhat before the learning rate reaches its warmed-up peak at step 20K. I suspect this is because the implementation of the HuBERT loss function does not give any weight to the unmasked logits. The HuBERT paper explored weightings of 0.0, 0.5, and 1.0, and found that it was generally best to give zero weight to the unmasked loss component, especially when the targets are relatively low quality in terms of phonemic correlation. However, I wonder: might it be worthwhile to consider some small but non-zero weighting, say 0.1, for the unmasked loss, to prevent the under-fitting dip seen in these plots?
I also wonder about the length normalization when combining the masked and unmasked losses. The current TorchAudio implementation first combines the weighted masked and unmasked losses (each summed over masked and unmasked logits of different lengths, depending on the masking parameterization), adds the feature penalty (averaged over all frames irrespective of masking, and later scaled by the length of the masked logits), and then normalizes the overall sum of losses by the length of the masked logits. By contrast, the fairseq implementation would normalize by the sum of the lengths of the masked plus unmasked logits (i.e. the full sequence length) if the weight of the unmasked loss is non-zero. Should the TorchAudio implementation be updated to match the fairseq implementation?
Moreover, I wonder: would it be a sensible improvement to instead normalize the masked and unmasked losses by their respective lengths prior to their weighted summation, and to also compute the feature penalty with respect to the weighting of the masked and unmasked losses (e.g., unmasked feature frames should not contribute to the penalty if the unmasked weight is zero)? The advantage is that the weighting becomes decoupled from the effect of the masking parameterization, making this hyperparameter easier to tune independently.
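A rough sketch of that proposed combination, assuming the logits are shaped `[frames, classes]`, the targets `[frames]`, and the features `[frames, dim]`; the function, argument names, and default weights are illustrative, not the actual TorchAudio API:

```python
import torch
import torch.nn.functional as F

def combined_loss(masked_logits, masked_targets,
                  unmasked_logits, unmasked_targets,
                  masked_features, unmasked_features,
                  masked_weight: float = 1.0,
                  unmasked_weight: float = 0.1,
                  feature_pen_weight: float = 10.0) -> torch.Tensor:
    """Normalize each term by its own frame count before the weighted sum,
    so the weights are decoupled from the masking parameterization."""
    masked_loss = F.cross_entropy(masked_logits, masked_targets, reduction="mean")
    unmasked_loss = F.cross_entropy(unmasked_logits, unmasked_targets, reduction="mean")

    # The feature penalty follows the same weighting, so unmasked frames
    # contribute nothing to it when unmasked_weight == 0.
    feature_pen = (masked_weight * masked_features.pow(2).mean()
                   + unmasked_weight * unmasked_features.pow(2).mean())

    return (masked_weight * masked_loss
            + unmasked_weight * unmasked_loss
            + feature_pen_weight * feature_pen)
```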
Versions
I'm using a slightly modified local fork of the main branch. The principal change is to refactor the `training_step` to (re-)enable `automatic_optimization=True` in Lightning (specifically for the gradient accumulation functionality, see #2918), rather than having a manual backward step.
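For anyone curious what that refactor looks like in practice, here is a minimal Lightning sketch with automatic optimization and built-in gradient accumulation; the module, loss, and hyperparameters are placeholders, not the actual TorchAudio recipe code:

```python
import torch
import pytorch_lightning as pl

class HuBERTPreTrainModule(pl.LightningModule):
    """Placeholder LightningModule relying on automatic optimization (the default),
    so Lightning handles backward(), optimizer.step(), and zero_grad()."""

    def __init__(self, model: torch.nn.Module, lr: float = 5e-4):
        super().__init__()
        self.model = model
        self.lr = lr
        # automatic_optimization is True by default; no manual backward needed.

    def training_step(self, batch, batch_idx):
        loss = self.model(**batch)   # assume the model returns a scalar loss
        self.log("train_loss", loss, prog_bar=True)
        return loss                  # Lightning performs the backward pass

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.lr)

# Gradient accumulation is then configured on the Trainer, not in the module:
trainer = pl.Trainer(max_steps=400_000, accumulate_grad_batches=4)
```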