
[Speech Examples] Add pytorch speech pretraining #13877

Conversation

@patrickvonplaten (Contributor) commented Oct 5, 2021

What does this PR do?

This PR adds a Wav2Vec2 pretraining example for PyTorch. Some experiments are still running, but it would be great if the PR could already be reviewed.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@patrickvonplaten patrickvonplaten changed the title [Speech] Add pytorch speech pretraining [WIP][Speech] Add pytorch speech pretraining Oct 5, 2021
if accelerator.state.num_processes > 1:
    dist.all_reduce(num_losses)
    gradient_multiplier = accelerator.state.num_processes / num_losses
    multiply_grads(model.module.parameters(), gradient_multiplier)
@patrickvonplaten (Contributor, Author) commented Oct 6, 2021

This was quite important for stabilizing Wav2Vec2 pre-training. Vanilla PyTorch DDP simply averages the gradients over all devices, even though the gradients on each device can have very different importance. E.g. if the gradients on device 1 were computed from 200 losses but those on device 2 from just 20, then the gradients on device 1 should carry 10 times more weight than those on device 2. In this example script it is solved as follows:

    1. Sum the losses on each device (instead of averaging) and compute the gradients on each device.
    2. Sum the number of losses over all devices (all_reduce).
    3. Multiply the DDP-averaged gradients afterwards by num_devices / num_total_losses.

=> Not doing this led to unstable / broken training. Fairseq also has this "correct" gradient averaging implemented here: https://github.com/pytorch/fairseq/blob/dd3bd3c0497ae9a7ae7364404a6b0a4c501780b3/fairseq/trainer.py#L844
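
To make the effect concrete, here is a toy numeric sketch (not code from this PR) mirroring the 200-vs-20-losses example above:

```python
import torch

# Toy illustration of the rescaling above (not the PR code): two "devices" contribute
# gradients for the same parameter, computed from very different numbers of loss terms.

n1, n2 = 200, 20                            # loss terms per device, as in the example above
g1 = torch.tensor(n1 * 0.5)                 # gradient of the *summed* loss on device 1 (mean grad 0.5)
g2 = torch.tensor(n2 * 2.0)                 # gradient of the *summed* loss on device 2 (mean grad 2.0)
num_devices = 2
num_losses = torch.tensor(float(n1 + n2))   # what dist.all_reduce(num_losses) would produce

# If each device averaged its own losses and DDP averaged those gradients uniformly,
# device 2's 20 losses would weigh as much as device 1's 200:
naive = (g1 / n1 + g2 / n2) / num_devices           # 1.25

# The example script instead sums losses per device, lets DDP average the gradients of
# those sums, and then rescales by num_devices / num_losses:
ddp_avg = (g1 + g2) / num_devices                   # what DDP hands back after backward()
corrected = ddp_avg * (num_devices / num_losses)    # == (g1 + g2) / num_losses

print(naive.item())      # 1.25   (device 2 over-weighted by 10x)
print(corrected.item())  # ~0.636 = (100 + 40) / 220, the per-loss mean gradient
```

The naive average over-weights the device with fewer losses by a factor of 10, while the sum-then-rescale scheme recovers the gradient of the mean over all 220 losses.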

We don't have this functionality in Trainer as far as I'm aware, and I don't think it's always needed, since the number of losses per device is usually more or less the same. In the case of speech pre-training, however, this can lead to huge differences:

  • We can't blockify the speech inputs (the audio files have very different lengths and noise backgrounds), so we end up with some samples as long as 20 seconds and others of only 2 seconds, with neither extreme being an exception.
  • The large model can only fit a batch size of 1-4 per device, so the samples can't really "average each other out" on each device.
    => Therefore it's quite important to correctly scale the gradients afterwards.

I could see similar problems in long-range summarization and other NLP tasks... @sgugger, @stas00, do you think we should/could support this natively in Trainer somehow, or is it even desired? Maybe we should at least leave a note in the docs?

@stas00 - would it be easy to implement the gradient averaging in deepspeed as well? How should it be handled there?

@patrickvonplaten (Contributor, Author) commented Oct 6, 2021

Also, would group_by_length solve this problem for distributed training? @sgugger

Collaborator commented:

Note that this code only works for distributed multi-GPU. You should use accelerator.gather(num_losses) to gather all the losses, then sum or mean them (not sure which one you want), to get an implementation that also works on TPUs for free.

For support in the Trainer, let's please wait until after the course. Last time there was a much tinier modification in the loss computation, we had plenty of things breaking after the release and a lot of work to fix/patch, which I don't have time to do right now :-)
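
For reference, here is a minimal sketch of the gather-based reduction suggested above; the tiny model, the num_losses value, and the multiply_grads helper are illustrative stand-ins rather than code from this PR:

```python
import torch
from accelerate import Accelerator

def multiply_grads(params, c):
    """Scale already-computed gradients in place by the constant c."""
    for p in params:
        if p.grad is not None:
            p.grad.detach().mul_(c)

accelerator = Accelerator()
model = accelerator.prepare(torch.nn.Linear(4, 1))

# ... forward/backward with a *summed* loss would go here; num_losses is the number of
# loss terms computed on this process (kept 1-dim so gather can concatenate it).
num_losses = torch.tensor([20.0], device=accelerator.device)

# accelerator.gather concatenates the per-process values, so summing gives the total
# number of losses across all devices, and, unlike dist.all_reduce, it also works on TPUs.
total_losses = accelerator.gather(num_losses).sum()
gradient_multiplier = accelerator.state.num_processes / total_losses

params = model.module.parameters() if accelerator.state.num_processes > 1 else model.parameters()
multiply_grads(params, gradient_multiplier)
```

In the actual training loop, num_losses would be the per-device count of loss terms (e.g. the number of masked positions in the batch).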

Collaborator commented:

group_by_length might hide the problem a bit, yes, since the samples will have similar lengths. Do you still get the instability with it?

@patrickvonplaten (Contributor, Author) commented:

Yeah, I still had it, but I should probably test again (there were also some other problems before...).

For DDP I wasn't sure whether group_by_length also makes sure that a batch distributed over multiple devices has examples of similar length, or whether it's just "grouped" per device. It should be "grouped" over all devices in DDP, no?

Contributor commented:

> should it be easy to implement the gradient averaging in deepspeed as well? How should it be handled there?

I have never tried. Perhaps let's cross this bridge once this is being worked on in the Trainer?

I can meanwhile ask how to best approach that, but it'd help to know how this is going to be approached in the Trainer.

Collaborator commented:

> it should be "grouped" over all devices in DDP no?

Yes, it will be grouped over all devices :-)
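
For reference, group_by_length is a TrainingArguments flag for Trainer-based scripts; a minimal sketch with illustrative values (not from this PR):

```python
from transformers import TrainingArguments

# group_by_length makes the (distributed) length-grouped sampler build batches from
# samples of similar length, across all devices under DDP, which keeps padding and the
# per-device number of loss terms roughly comparable. Values below are illustrative.
training_args = TrainingArguments(
    output_dir="./wav2vec2-pretraining",  # hypothetical output path
    per_device_train_batch_size=4,
    group_by_length=True,
    length_column_name="input_length",    # hypothetical dataset column with precomputed lengths
)
```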

@@ -115,19 +120,21 @@ class Wav2Vec2ForPreTrainingOutput(ModelOutput):
    codevector_perplexity: torch.FloatTensor = None
    hidden_states: Optional[Tuple[torch.FloatTensor]] = None
    attentions: Optional[Tuple[torch.FloatTensor]] = None
    contrastive_loss: Optional[torch.FloatTensor] = None
    diversity_loss: Optional[torch.FloatTensor] = None


def _compute_mask_indices(

@patrickvonplaten (Contributor, Author) commented:

This function is changed in a breaking way, but it was always private and never used outside of the model for non-pretraining tasks, so that should be fine.

I still have to verify that all the ASR downstream fine-tuning still gives the same good results. Do we have to check audio-classification as well, i.e. does it make use of mask_time_prob? @anton-l

Member commented:

If mask_time_prob is enabled then yes, classification also uses it, as it's a feature of the base model.
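
For reference, a minimal sketch of the point above (illustrative values, not code from this PR): mask_time_prob sits on the base Wav2Vec2 config, so it also takes effect for classification heads built on top of it.

```python
from transformers import Wav2Vec2Config, Wav2Vec2ForSequenceClassification

# mask_time_prob / mask_time_length are base-model config options, so SpecAugment-style
# time masking is also applied during audio-classification training when mask_time_prob > 0.
config = Wav2Vec2Config(mask_time_prob=0.05, mask_time_length=10)
model = Wav2Vec2ForSequenceClassification(config)
```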

@patrickvonplaten (Contributor, Author) commented:

ASR downstream fine-tuning gives the same good results.

@patrickvonplaten patrickvonplaten changed the title [WIP][Speech] Add pytorch speech pretraining [Speech Examples] Add pytorch speech pretraining Oct 6, 2021

@sgugger (Collaborator) left a comment:

Nice new example! Thanks a lot for writing it!

@patrickvonplaten (Contributor, Author) commented:

Merge after: #13961

tests/test_modeling_wav2vec2.py (review thread resolved)

@patrickvonplaten patrickvonplaten merged commit d45fc7d into huggingface:master Oct 11, 2021
@patrickvonplaten patrickvonplaten deleted the add_pytorch_speech_pretraining branch October 11, 2021 22:46
@stas00 (Contributor) commented Oct 17, 2021

Question: why were the other examples not ported over?

The deepspeed tests use wav2vec2/run_asr.py, so I'm not quite sure how to do the transition. This is one of the most non-standard models, so it'd be great to be able to continue running the deepspeed tests.

Albertobegue pushed a commit to Albertobegue/transformers that referenced this pull request Jan 27, 2022
* adapt wav2vec2

* add example

* add files

* adapt

* remove bogus file

* Apply suggestions from code review

* adapt files more

* upload changes

* del old files

* up

* up

* up

* up

* up

* correct gradient checkpointing

* add readme

* finish

* finish

* up

* more fixes

* up

* up

* add demo run to readme

* up