
[feat] PL mvp0: training #748

Closed

Conversation

@hackgoofer (Contributor) commented Jan 23, 2021

  • MVP 0. Training: Goal - Train a model from scratch and reach similar accuracy as with mmf_trainer (a hedged sketch of how these features map onto Lightning Trainer flags follows this list)
    • Set up the training pipeline: done
    • Training on the right device: done
    • Clip gradients: done
    • Optimizer: done
    • FP16 support: done
    • LR scheduler (incl. warmup, etc.): done
    • Test case: train visual_bert on vqa from scratch for 10 iterations and compare the value: done
  • Tests included in this PR (tests cover only the pytorch lightning integration):
    • Vanilla training without gradient accumulation; make sure the loss for 5 iterations is the same between mmf and pl
      • Verifies the optimizer is working as intended as part of this PR.
    • max_updates and max_epochs calculation
    • Training with gradient accumulation
    • Training with an LR schedule achieves a different value compared to training without an LR schedule
    • Training with an LR schedule for PL is the same as training with an LR schedule for mmf_trainer
    • Training with gradient clipping: make sure all grads are within the grad_clipping threshold
    • Training with gradient clipping is the same as training with gradient clipping for mmf_trainer
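For readers unfamiliar with the flags involved, here is a hedged sketch (not the code in this PR) of how the features above map onto pytorch-lightning Trainer options circa PL 1.1.x; the model and hyperparameter values are illustrative assumptions:

```python
# Hedged sketch only: illustrates the Lightning features listed above,
# not mmf's actual LightningTrainer implementation.
import torch
import pytorch_lightning as pl


class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(16, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        # Optimizer plus a per-step LR scheduler (a warmup schedule would go
        # here in the real integration; StepLR is just a placeholder).
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-4)
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000)
        return [optimizer], [{"scheduler": scheduler, "interval": "step"}]


trainer = pl.Trainer(
    gpus=1,                     # training on the right device
    precision=16,               # FP16 support
    gradient_clip_val=1.0,      # clip gradients
    accumulate_grad_batches=2,  # gradient accumulation
    max_steps=10,               # e.g. the 10-iteration visual_bert test case
)
```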

@facebook-github-bot added the CLA Signed label ("Do not delete this pull request or issue due to inactivity.") on Jan 23, 2021
@SeanNaren commented Jan 26, 2021

Apologies for the delay in getting Lightning-AI/pytorch-lightning#4369 merged; a test was added, so it should be merged today and included in the release!

EDIT: @ytsheng this has been merged! It's included in our latest Lightning release (1.1.6), so you just need to `pip install pytorch-lightning -U`.

@hackgoofer (Contributor, Author) commented Jan 26, 2021

Thanks so much @SeanNaren, updated the PR to reflect the new version from pytorch lightning. Many thanks!

@SeanNaren

I gave the PR a read-over and it looks clean, nice work! I saw some custom builders in a few places for optimizers/schedulers; have you thought of using the hydra instantiation methods: https://hydra.cc/docs/next/patterns/instantiate_objects/overview

I know hydra instantiation is not to everyone's taste :) Regardless, the integration looks great!
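For reference, a hedged sketch of the hydra instantiation pattern being suggested (this is not mmf's actual config schema; the config keys and values below are illustrative assumptions):

```python
# Hedged sketch: config-driven optimizer construction via hydra.utils.instantiate,
# as an alternative to hand-written optimizer/scheduler builders.
import torch
from hydra.utils import instantiate
from omegaconf import OmegaConf

cfg = OmegaConf.create(
    {
        "optimizer": {
            "_target_": "torch.optim.Adam",  # import path of the class to build
            "lr": 1e-4,
        }
    }
)

model = torch.nn.Linear(16, 2)
# Hydra builds the object named by `_target_`; extra kwargs (here the model
# parameters) are supplied at call time.
optimizer = instantiate(cfg.optimizer, params=model.parameters())
```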

@hackgoofer (Contributor, Author)

> I gave the PR a read-over and it looks clean, nice work! I saw some custom builders in a few places for optimizers/schedulers; have you thought of using the hydra instantiation methods: https://hydra.cc/docs/next/patterns/instantiate_objects/overview
>
> I know hydra instantiation is not to everyone's taste :) Regardless, the integration looks great!

Hydra is definitely in our pipeline of things to do! We have it planned for H1 of 2021. Stay tuned.

@hackgoofer changed the title from "[feat] pytorch lightning integration - training" to "[feat] PL mvp0 - training" on Jan 29, 2021
@hackgoofer changed the title from "[feat] PL mvp0 - training" to "[feat] PL mvp0: training" on Jan 29, 2021
@hackgoofer changed the base branch from pl_3 to master on January 30, 2021 03:15
  loss = report.losses["loss"].detach().cpu().item()
- self.assertAlmostEqual(loss, 2.6852, 4)
+ self.assertAlmostEqual(loss, 4.4688, 4)
@hackgoofer (Contributor, Author) commented Jan 30, 2021

this was necessary because I changed the loss calculation to force gradients to be big to test grad clipping.
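For context, a hedged sketch (not the actual test in this PR) of the kind of gradient-clipping check described above; the model, loss scaling, and threshold are illustrative assumptions:

```python
# Hedged sketch: after clipping, the total gradient norm should be within
# the configured threshold.
import torch

model = torch.nn.Linear(16, 2)
loss = model(torch.randn(4, 16)).pow(2).sum() * 100.0  # inflate gradients on purpose
loss.backward()

max_norm = 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

total_norm = torch.norm(
    torch.stack([p.grad.norm(2) for p in model.parameters() if p.grad is not None])
)
assert total_norm <= max_norm + 1e-6  # small numerical tolerance
```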

@apsdehal (Contributor) left a comment

Super! This is going great. This needs to be imported internally, and then TARGETS need to be added in fbcode before landing. I have left some general comments on design.

Review threads (resolved):
• mmf/configs/defaults.yaml
• mmf/datasets/base_dataset.py (outdated)
• mmf/datasets/lightning_datamodule.py
• mmf/models/base_model.py (2 outdated threads)
• mmf/trainers/lightning_trainer.py (4 outdated threads)
• tests/lightning/test_lightning_trainer.py (outdated)
@hackgoofer force-pushed the pl_mvp0 branch 4 times, most recently from 92d668f to 9347a80, on February 2, 2021 03:24
@facebook-github-bot (Contributor): @ytsheng has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@apsdehal (Contributor) left a comment

Looking good! Almost there.

Review threads (resolved):
• mmf/datasets/lightning_datamodule.py (2 outdated threads)
• mmf/trainers/lightning_core/loop_callback.py (outdated)
• mmf/trainers/lightning_trainer.py (outdated)
@vedanuj (Contributor) left a comment

Looks good to me overall. Some comments to address before landing.



class LightningLoopCallback(Callback):
    def __init__(self, lightning_trainer):
Missing typings.
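A hedged sketch of one way to address this (the `LightningTrainer` type name and its import path are assumptions; the PR's actual signature may differ):

```python
# Hedged sketch: annotate the constructor argument and return type.
from pytorch_lightning.callbacks import Callback


class LightningLoopCallback(Callback):
    def __init__(self, lightning_trainer: "LightningTrainer") -> None:
        super().__init__()
        # Keep a handle to mmf's lightning trainer for use in callback hooks.
        self.lightning_trainer = lightning_trainer
```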

Review threads (resolved):
• mmf/trainers/lightning_core/loop_callback.py (2 threads, 1 outdated)
• mmf/trainers/lightning_trainer.py (3 outdated threads)
• tests/lightning/__init__.py (outdated)
@facebook-github-bot (Contributor): @ytsheng has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot (Contributor): @ytsheng has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@vedanuj (Contributor) left a comment

Looks good to me! Thanks for addressing all the comments.

@facebook-github-bot (Contributor): @ytsheng has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot (Contributor): @ytsheng has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor): @ytsheng merged this pull request in 0ee1127.

@hackgoofer deleted the pl_mvp0 branch on February 8, 2021 20:41
ultrons pushed a commit to ultrons/mmf that referenced this pull request Feb 8, 2021
Summary:
* pytorch lightning stub, mostly involving training
* Tests for the lightning trainer included
* Built on top of the mmf grad accumulation fix: facebookresearch#747

- [x] MVP 0. Training: Goal - Train a model from scratch and reach similar accuracy as with mmf_trainer
   - [x] Set up the training pipeline: done
   - [x] Training on the right device: done
   - [x] Clip gradients: done
   - [x] Optimizer: done
   - [x] FP16 support: done
   - [x] LR scheduler (incl. warmup, etc.): done
   - [x] Test case: train visual_bert on vqa from scratch for 10 iterations and compare the value: done
- [x] Tests included in this PR (tests cover only the pytorch lightning integration):
   - [x] Vanilla training without gradient accumulation; make sure the loss for 5 iterations is the same between mmf and pl
      - [x] Verifies the optimizer is working as intended as part of this PR.
   - [x] `max_updates` and `max_epochs` calculation
   - [x] Training with gradient accumulation
   - [x] Training with an LR schedule achieves a different value compared to training without an LR schedule
   - [x] Training with an LR schedule for PL is the same as training with an LR schedule for `mmf_trainer`
   - [x] Training with gradient clipping: make sure all grads are within the `grad_clipping` threshold
   - [x] Training with gradient clipping is the same as training with gradient clipping for `mmf_trainer`

Pull Request resolved: facebookresearch#748

Reviewed By: apsdehal, simran2905

Differential Revision: D26192869

Pulled By: ytsheng

fbshipit-source-id: 203a91e893d6b878bbed80ed84960dd059cfc90c