
[Speech Examples] Add pytorch speech pretraining #13877

Conversation

@patrickvonplaten (Contributor) commented Oct 5, 2021

What does this PR do?

This PR adds a Wav2Vec2 pretraining example for PyTorch. Some experiments are still running, but it would be great if the PR could already be reviewed.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@patrickvonplaten patrickvonplaten changed the title [Speech] Add pytorch speech pretraining [WIP][Speech] Add pytorch speech pretraining Oct 5, 2021
if accelerator.state.num_processes > 1:
    dist.all_reduce(num_losses)
    gradient_multiplier = accelerator.state.num_processes / num_losses
    multiply_grads(model.module.parameters(), gradient_multiplier)
@patrickvonplaten (Contributor, Author) commented Oct 6, 2021

This was quite important for stabilizing Wav2Vec2 pre-training. Vanilla PyTorch DDP simply averages the gradients over all devices, even though the gradients on each device can have very different importance. E.g. if the gradients on device 1 were computed from 200 losses but those on device 2 from just 20, then the gradients on device 1 should carry 10 times more weight than those on device 2. In this example script it is solved as follows:

    1. Sum the losses on each device (instead of averaging) and compute the gradients on each device.
    2. Sum the number of losses over all devices (all_reduce).
    3. Multiply the DDP-averaged gradients afterwards by num_devices / num_total_losses.

=> Not doing this led to unstable / broken training. Fairseq also has this "correct" gradient averaging implemented here: https://github.com/pytorch/fairseq/blob/dd3bd3c0497ae9a7ae7364404a6b0a4c501780b3/fairseq/trainer.py#L844
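
To make the effect concrete, here is a toy numeric sketch (not code from this PR) mirroring the 200-vs-20-losses example above:

```python
import torch

# Toy illustration of the rescaling above (not the PR code): two "devices" contribute
# gradients for the same parameter, computed from very different numbers of loss terms.

n1, n2 = 200, 20                            # loss terms per device, as in the example above
g1 = torch.tensor(n1 * 0.5)                 # gradient of the *summed* loss on device 1 (mean grad 0.5)
g2 = torch.tensor(n2 * 2.0)                 # gradient of the *summed* loss on device 2 (mean grad 2.0)
num_devices = 2
num_losses = torch.tensor(float(n1 + n2))   # what dist.all_reduce(num_losses) would produce

# If each device averaged its own losses and DDP averaged those gradients uniformly,
# device 2's 20 losses would weigh as much as device 1's 200:
naive = (g1 / n1 + g2 / n2) / num_devices           # 1.25

# The example script instead sums losses per device, lets DDP average the gradients of
# those sums, and then rescales by num_devices / num_losses:
ddp_avg = (g1 + g2) / num_devices                   # what DDP hands back after backward()
corrected = ddp_avg * (num_devices / num_losses)    # == (g1 + g2) / num_losses

print(naive.item())      # 1.25   (device 2 over-weighted by 10x)
print(corrected.item())  # ~0.636 = (100 + 40) / 220, the per-loss mean gradient
```

The naive average over-weights the device with fewer losses by a factor of 10, while the sum-then-rescale scheme recovers the gradient of the mean over all 220 losses.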

We don't have this functionality in Trainer as far as I'm aware, and I don't think it's always needed, since the number of losses per device is usually more or less the same. In the case of speech pre-training, however, this can lead to huge differences:

  • We can't blockify the speech inputs (the audio files have very different lengths and noise backgrounds), so we end up with some samples as long as 20 seconds and others of only 2 seconds, with neither extreme being an exception.
  • The large model can only fit a batch size of 1-4 per device, so the samples can't really "average each other out" on each device.
    => Therefore it's quite important to correctly scale the gradients afterwards.

I could see similar problems in long-range summarization and other NLP tasks... @sgugger, @stas00, do you think we should/could support this natively in Trainer somehow, or is it even desired? Maybe we should at least leave a note in the docs?

@stas00 - would it be easy to implement the gradient averaging in deepspeed as well? How should it be handled there?

@patrickvonplaten (Contributor, Author) commented Oct 6, 2021

Also, would group_by_length solve this problem for distributed training? @sgugger

Collaborator commented:

Note that this code only works for distributed multi-GPU. You should use accelerator.gather(num_losses) to gather all the losses, then sum or mean them (not sure which one you want), to get an implementation that also works on TPUs for free.

For support in the Trainer, let's please wait until after the course. Last time there was a much tinier modification in the loss computation, we had plenty of things breaking after the release and a lot of work to fix/patch, which I don't have time to do right now :-)
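
For reference, here is a minimal sketch of the gather-based reduction suggested above; the tiny model, the num_losses value, and the multiply_grads helper are illustrative stand-ins rather than code from this PR:

```python
import torch
from accelerate import Accelerator

def multiply_grads(params, c):
    """Scale already-computed gradients in place by the constant c."""
    for p in params:
        if p.grad is not None:
            p.grad.detach().mul_(c)

accelerator = Accelerator()
model = accelerator.prepare(torch.nn.Linear(4, 1))

# ... forward/backward with a *summed* loss would go here; num_losses is the number of
# loss terms computed on this process (kept 1-dim so gather can concatenate it).
num_losses = torch.tensor([20.0], device=accelerator.device)

# accelerator.gather concatenates the per-process values, so summing gives the total
# number of losses across all devices, and, unlike dist.all_reduce, it also works on TPUs.
total_losses = accelerator.gather(num_losses).sum()
gradient_multiplier = accelerator.state.num_processes / total_losses

params = model.module.parameters() if accelerator.state.num_processes > 1 else model.parameters()
multiply_grads(params, gradient_multiplier)
```

In the actual training loop, num_losses would be the per-device count of loss terms (e.g. the number of masked positions in the batch).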

Collaborator commented:

group_by_length might hide the problem a bit, yes, since the samples will have similar lengths. Do you still get the instability with it?

@patrickvonplaten (Contributor, Author) commented:

Yeah, I still had it, but I should probably test again (there were also some other problems before...).

For DDP I wasn't sure whether group_by_length also makes sure that a batch distributed over multiple devices has examples of similar length, or whether it's just "grouped" per device. It should be "grouped" over all devices in DDP, no?

Contributor commented:

> should it be easy to implement the gradient averaging in deepspeed as well? How should it be handled there?

I have never tried. Perhaps let's cross this bridge once this is being worked on in the Trainer?

I can meanwhile ask how to best approach that, but it'd help to know how this is going to be approached in the Trainer.

Collaborator commented:

> it should be "grouped" over all devices in DDP no?

Yes, it will be grouped over all devices :-)
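
For reference, group_by_length is a TrainingArguments flag for Trainer-based scripts; a minimal sketch with illustrative values (not from this PR):

```python
from transformers import TrainingArguments

# group_by_length makes the (distributed) length-grouped sampler build batches from
# samples of similar length, across all devices under DDP, which keeps padding and the
# per-device number of loss terms roughly comparable. Values below are illustrative.
training_args = TrainingArguments(
    output_dir="./wav2vec2-pretraining",  # hypothetical output path
    per_device_train_batch_size=4,
    group_by_length=True,
    length_column_name="input_length",    # hypothetical dataset column with precomputed lengths
)
```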

@@ -115,19 +120,21 @@ class Wav2Vec2ForPreTrainingOutput(ModelOutput):
    codevector_perplexity: torch.FloatTensor = None
    hidden_states: Optional[Tuple[torch.FloatTensor]] = None
    attentions: Optional[Tuple[torch.FloatTensor]] = None
    contrastive_loss: Optional[torch.FloatTensor] = None
    diversity_loss: Optional[torch.FloatTensor] = None


def _compute_mask_indices(

@patrickvonplaten (Contributor, Author) commented:

This function is changed in a breaking way, but it was always private and never used outside of the model for non-pretraining tasks, so that should be fine.

I still have to verify that all the ASR downstream fine-tuning still gives the same good results. Do we have to check audio-classification as well, i.e. does it make use of mask_time_prob? @anton-l

Member commented:

If mask_time_prob is enabled then yes, classification also uses it, as it's a feature of the base model.
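
For reference, a minimal sketch of the point above (illustrative values, not code from this PR): mask_time_prob sits on the base Wav2Vec2 config, so it also takes effect for classification heads built on top of it.

```python
from transformers import Wav2Vec2Config, Wav2Vec2ForSequenceClassification

# mask_time_prob / mask_time_length are base-model config options, so SpecAugment-style
# time masking is also applied during audio-classification training when mask_time_prob > 0.
config = Wav2Vec2Config(mask_time_prob=0.05, mask_time_length=10)
model = Wav2Vec2ForSequenceClassification(config)
```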

@patrickvonplaten (Contributor, Author) commented:

ASR downstream fine-tuning gives the same good results.

@patrickvonplaten patrickvonplaten changed the title [WIP][Speech] Add pytorch speech pretraining [Speech Examples] Add pytorch speech pretraining Oct 6, 2021

@sgugger (Collaborator) left a comment:

Nice new example! Thanks a lot for writing it!

@patrickvonplaten (Contributor, Author) commented:

Merge after: #13961

tests/test_modeling_wav2vec2.py (review thread resolved)

@patrickvonplaten patrickvonplaten merged commit d45fc7d into huggingface:master Oct 11, 2021
@patrickvonplaten patrickvonplaten deleted the add_pytorch_speech_pretraining branch October 11, 2021 22:46
@stas00 (Contributor) commented Oct 17, 2021

Question: why were the other examples not ported over?

The deepspeed tests use wav2vec2/run_asr.py, so I'm not quite sure how to do the transition. This is one of the most non-standard models, so it'd be great to be able to continue running the deepspeed tests.

Albertobegue pushed a commit to Albertobegue/transformers that referenced this pull request Jan 27, 2022
* adapt wav2vec2

* add example

* add files

* adapt

* remove bogus file

* Apply suggestions from code review

* adapt files more

* upload changes

* del old files

* up

* up

* up

* up

* up

* correct gradient checkpointing

* add readme

* finish

* finish

* up

* more fixes

* up

* up

* add demo run to readme

* up