[Speech Examples] Add pytorch speech pretraining #13877
Conversation
if accelerator.state.num_processes > 1:
    dist.all_reduce(num_losses)
    gradient_multiplier = accelerator.state.num_processes / num_losses
    multiply_grads(model.module.parameters(), gradient_multiplier)
This was quite important to stabilize Wav2Vec2 pre-training. It's due to the fact that vanilla PyTorch DDP simply averages all gradients over all devices, even though the gradients on each device can have very different importance. E.g. if the gradients on device 1 were computed from 200 losses but on device 2 from just 20, then the gradients on device 1 should have 10 times more weight than the gradients on device 2. In this example script it is solved as follows:

- sum the losses on each device instead of averaging, and compute the gradients on each device
- sum the loss counts over all devices
- multiply the DDP-averaged gradients afterwards by num_devices / num_total_losses

=> not doing this led to unstable / breaking training. Fairseq also has this "correct" gradient averaging implemented here: https://github.com/pytorch/fairseq/blob/dd3bd3c0497ae9a7ae7364404a6b0a4c501780b3/fairseq/trainer.py#L844
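The rescaling step can be sketched in plain PyTorch. This is a minimal single-process sketch: `multiply_grads` is re-implemented here for illustration, and the device/loss counts are made-up constants standing in for the values DDP would produce.

```python
import torch

def multiply_grads(params, c):
    """Multiply every existing gradient in-place by the constant c."""
    for p in params:
        if p.grad is not None:
            p.grad.detach().mul_(c)

# Toy model with a known gradient: d(sum(w @ x))/dw = x = [1, 1]
model = torch.nn.Linear(2, 1, bias=False)
loss = model(torch.ones(1, 2)).sum()
loss.backward()
before = model.weight.grad.clone()

# Pretend DDP averaged the gradients over 2 devices that together saw
# 50 losses: rescale by num_devices / num_total_losses so the result
# is a true per-loss average instead of a per-device average.
num_devices, num_total_losses = 2, 50
multiply_grads(model.parameters(), num_devices / num_total_losses)
```

After the call, every gradient is scaled by 2/50, which undoes DDP's uniform averaging and weights each device by how many losses it actually contributed.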
As far as I'm aware, we don't have this functionality in Trainer, and I don't think it's always needed, since usually the number of losses per device is more or less the same. However, in the case of speech pretraining this can lead to huge differences, i.e.
- We can't blockify the speech inputs, and audio files have very different noise backgrounds -> we therefore end up with some samples as long as 20 seconds and others of only 2 seconds, with neither 2 seconds nor 20 seconds being exceptions.
- The large model can only fit a batch size of 1-4 per device, so the samples can't really "average each other out" on each device
=> therefore it's quite important to correctly scale the gradients afterwards.

I could see similar problems in long-range summarization and other NLP tasks... @sgugger, @stas00 do you think we should/could support this natively in Trainer somehow, or is it even desired? Maybe we should at least leave a note in the docs?
@stas00 - would it be easy to implement this gradient averaging in DeepSpeed as well? How should it be handled there?
Also, would group_by_length solve this problem for distributed training? @sgugger
Note that this code only works for distributed multi-GPU. You should use accelerator.gather(num_losses) to gather all the loss counts, then sum or mean (not sure which one you want), to get an implementation that also works on TPUs for free.
For support in the Trainer, let's wait until after the course, please. Last time there was a much tinier modification to the loss computation, we had plenty of things breaking after the release and a lot of work to fix/patch that I don't have time to do right now :-)
group_by_length might hide the problem a bit, yes, since the samples will have similar lengths. Do you still see the instability with it?
yeah I still had it, but I should probably test again (there were also some other problems before...)

For DDP I wasn't sure whether group_by_length also makes sure that a batch distributed over multiple devices has examples of the same length, or if it's just "grouped" per device - it should be "grouped" over all devices in DDP, no?
> would it be easy to implement the gradient averaging in deepspeed as well? How should it be handled there?
I have never tried. Perhaps let's cross this bridge once this is being worked on in the Trainer?
I can meanwhile ask how to best approach that, but it'd help to know how this is going to be approached in the Trainer.
> it should be "grouped" over all devices in DDP no?
Yes, it will be grouped over all devices :-)
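Conceptually, length grouping can be sketched like this. This is a toy stand-alone sketch, not the actual LengthGroupedSampler that backs group_by_length in Trainer (which shuffles within mega-batches rather than globally sorting):

```python
import random

def length_grouped_batches(lengths, batch_size, seed=0):
    """Sort sample indices by length so each batch holds samples of
    similar length, then shuffle the batch order for stochasticity.
    In DDP each such batch is sharded across the devices, so all
    devices see similar-length samples in the same step."""
    idx = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = [idx[i:i + batch_size] for i in range(0, len(idx), batch_size)]
    random.Random(seed).shuffle(batches)
    return batches

# Two ~2s clips and two ~20s clips end up paired with their own kind.
batches = length_grouped_batches([20.0, 2.1, 19.5, 2.0], batch_size=2)
```

Because grouping happens before the batch is sharded, the short and long clips never share a step, which also keeps the per-device loss counts closer together.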
examples/pytorch/speech-pretraining/run_wav2vec2_pretraining_no_trainer.py
@@ -115,19 +120,21 @@ class Wav2Vec2ForPreTrainingOutput(ModelOutput):
    codevector_perplexity: torch.FloatTensor = None
    hidden_states: Optional[Tuple[torch.FloatTensor]] = None
    attentions: Optional[Tuple[torch.FloatTensor]] = None
    contrastive_loss: Optional[torch.FloatTensor] = None
    diversity_loss: Optional[torch.FloatTensor] = None


def _compute_mask_indices(
This function is changed in a breaking way, but it was always private and never used outside of the model for non-pretraining tasks, so that should be fine.

We still have to verify that all the ASR downstream fine-tuning still gives the same good results. Do we have to check audio-classification as well - does it make use of mask_time_prob? @anton-l
If mask_time_prob is enabled then yes, classification also uses it, as it's a feature of the base model.
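For readers unfamiliar with the masking under discussion: span masking of time steps can be sketched roughly as below. This is a simplified illustration, not the actual _compute_mask_indices from transformers (which additionally handles attention masks, minimum mask counts, and overlap edge cases):

```python
import numpy as np

def compute_span_mask(shape, mask_prob, mask_length, seed=0):
    """Mark roughly mask_prob of the time axis as masked, in spans of
    mask_length consecutive frames (spans may overlap)."""
    batch_size, seq_len = shape
    rng = np.random.default_rng(seed)
    mask = np.zeros(shape, dtype=bool)
    # number of span starts needed to cover ~mask_prob of the sequence
    num_spans = int(mask_prob * seq_len / mask_length)
    for b in range(batch_size):
        # distinct start positions; spans themselves may still overlap
        starts = rng.choice(seq_len - mask_length, size=num_spans, replace=False)
        for s in starts:
            mask[b, s:s + mask_length] = True
    return mask

mask = compute_span_mask((2, 100), mask_prob=0.5, mask_length=10)
```

During pre-training the masked positions are what the contrastive loss is computed over; during fine-tuning the same mechanism acts as SpecAugment-style regularization.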
ASR downstream fine-tuning gives the same good results.
Nice new example! Thanks a lot for writing it!
examples/pytorch/speech-pretraining/run_wav2vec2_pretraining_no_trainer.py
Merge after: #13961
Question: why were the other examples not ported over? The deepspeed tests use
* adapt wav2vec2
* add example
* add files
* adapt
* remove bogus file
* Apply suggestions from code review
* adapt files more
* upload changes
* del old files
* up
* up
* up
* up
* up
* correct gradient checkpointing
* add readme
* finish
* finish
* up
* more fixes
* up
* up
* add demo run to readme
* up
What does this PR do?
This PR adds a Wav2Vec2 pretraining example for PyTorch. Some experiments are still running, but it would be great if the PR could already be reviewed.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.