
[Improvement] max_batches support to training log and tqdm progress bar. #1554

Merged — 3 commits merged into Deci-AI:master on Oct 23, 2023

Conversation

hakuryuu96 (Contributor)

Issue description

As @BloodAxe described, if training_hyperparams.max_train/valid_batches are redefined in the CLI, the tqdm progress bar and the training log do not take the change into account and continue to show the full length of the dataloader. E.g. when the user executes something like this:

# Metrics come from super_gradients; net, train_loader, valid_loader,
# loss_fn, optimizer, lr and phase_callbacks are defined earlier in the script.
from super_gradients.training.metrics import Accuracy, Top5

train_params = {
    "max_epochs": 300,
    "phase_callbacks": phase_callbacks,
    "initial_lr": lr,
    "loss": loss_fn,
    "optimizer": optimizer,
    "train_metrics_list": [Accuracy(), Top5()],
    "valid_metrics_list": [Accuracy(), Top5()],
    "metric_to_watch": "Accuracy",
    "greater_metric_to_watch_is_better": True,
    "lr_scheduler_step_type": "epoch",
    "max_train_batches": 24,
    "max_valid_batches": 24,
}

trainer.train(model=net, training_params=train_params, train_loader=train_loader, valid_loader=valid_loader)

the resulting logs and progress bar are the following:

[2023-10-19 13:30:41] INFO - sg_trainer_utils.py - TRAINING PARAMETERS:
    - Mode:                         Single GPU
    - Number of GPUs:               1          (2 available on the machine)
    - Full dataset size:            50000      (len(train_set))
    - Batch size per GPU:           16         (batch_size)
    - Batch Accumulate:             1          (batch_accumulate)
    - Total batch size:             16         (num_gpus * batch_size)
    - Effective Batch size:         16         (num_gpus * batch_size * batch_accumulate)
    - Iterations per epoch:         3125         (len(train_loader))
    - Gradient updates per epoch:   3125         (len(train_loader) / batch_accumulate)

[2023-10-19 13:30:41] WARNING - sg_trainer_utils.py - max_train_batch is set to 24. This limits the number of iterations per epoch and gradient updates per epoch.
[2023-10-19 13:30:41] INFO - sg_trainer.py - Started training for 300 epochs (0/299)

Train epoch 0: 1%|█                        | 24/3125 [00:02<00:00,  9.22it/s, Accuracy=0.0729, CrossEntropyLoss=2.53, Top5=0.508, gpu_mem=0.231]
Validating: 1%|█                      | 24/3125 [00:00<00:00, 126.12it/s]
[2023-10-19 13:30:44] INFO - base_sg_logger.py - Checkpoint saved in /home/phil/deci/super-gradients/checkpoints/Cifar10_external_objects_example/RUN_20231019_133041_486426/ckpt_best.pth
[2023-10-19 13:30:44] INFO - sg_trainer.py - Best checkpoint overriden: validation Accuracy: 0.1002604141831398
===========================================================
SUMMARY OF EPOCH 0
├── Train
│   ├── Crossentropyloss = 2.5337
│   ├── Accuracy = 0.0729
│   └── Top5 = 0.5078
└── Validation
    ├── Crossentropyloss = 2.3486
    ├── Accuracy = 0.1003
    └── Top5 = 0.4596

===========================================================

PR description

This PR addresses the issue above and proposes some improvements. Briefly:

  1. Logs show the actual number of dataloader elements used; additionally, a warning tells the user that the max_batches parameter was set.
  2. The progress bar shows the actual number of steps it will take to finish the epoch.

(A minimal sketch of the underlying idea follows the example logs below.)

E.g. if the max_train/valid_batches parameter is specified:

[2023-10-19 13:30:41] INFO - sg_trainer_utils.py - TRAINING PARAMETERS:
    - Mode:                         Single GPU
    - Number of GPUs:               1          (2 available on the machine)
    - Full dataset size:            50000      (len(train_set))
    - Batch size per GPU:           16         (batch_size)
    - Batch Accumulate:             1          (batch_accumulate)
    - Total batch size:             16         (num_gpus * batch_size)
    - Effective Batch size:         16         (num_gpus * batch_size * batch_accumulate)
    - Iterations per epoch:         24         (len(train_loader) OR max_train_batches)
    - Gradient updates per epoch:   24         (len(train_loader) OR max_train_batches / batch_accumulate)

[2023-10-19 13:30:41] WARNING - sg_trainer_utils.py - max_train_batch is set to 24. This limits the number of iterations per epoch and gradient updates per epoch.
[2023-10-19 13:30:41] INFO - sg_trainer.py - Started training for 300 epochs (0/299)

Train epoch 0: 100%|██████████| 24/24 [00:02<00:00,  9.22it/s, Accuracy=0.0729, CrossEntropyLoss=2.53, Top5=0.508, gpu_mem=0.231]
Validating: 100%|██████████| 24/24 [00:00<00:00, 126.12it/s]
[2023-10-19 13:30:44] INFO - base_sg_logger.py - Checkpoint saved in /home/phil/deci/super-gradients/checkpoints/Cifar10_external_objects_example/RUN_20231019_133041_486426/ckpt_best.pth
[2023-10-19 13:30:44] INFO - sg_trainer.py - Best checkpoint overriden: validation Accuracy: 0.1002604141831398

If it is not specified, the logs behave the same as in previous versions.
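
For reference, the underlying idea can be sketched like this. This is a minimal illustration with hypothetical names (run_epoch and its arguments are not the exact super-gradients internals):

import itertools

from tqdm import tqdm

def run_epoch(loader, max_batches=None):
    # Effective epoch length: the dataloader length, capped by
    # max_train_batches / max_valid_batches when one of them is set.
    total = min(len(loader), max_batches) if max_batches is not None else len(loader)

    # Log which quantity determined the epoch length, mirroring the
    # "(len(train_loader) OR max_train_batches)" line in the logs above.
    source = "max_train_batches" if total < len(loader) else "len(train_loader)"
    print(f"Iterations per epoch: {total} ({source})")

    # islice stops iteration after `total` batches, and tqdm's bar is
    # sized to match, so it reaches 100% instead of stalling at 1%.
    for batch in tqdm(itertools.islice(loader, total), total=total, desc="Train epoch 0"):
        ...  # forward / backward / optimizer step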

Some ideas

IMO it would be cool to consider logging the whole set of training parameters before the run. As a user, it would be nice to double-check all the settings made elsewhere in the project (e.g. when using Hydra and plugging the SG Trainer class into my own pipeline) and to be sure things go smoothly :)
For example, the user could see the following info (a rough sketch of such a summary follows the list):

  • dataset (number of classes, class names, etc)
  • dataloader (batch_size, num_workers, etc)
  • model (model name, number of trainable parameters, etc)
  • optimization info (optimizer, lr, wd, losses, etc)
  • training info (number of epochs, number of gradient updates, EMA, SyncBN usage, etc)
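
As a rough illustration of what such a pre-run summary could look like (log_run_summary is a hypothetical helper, not part of super-gradients):

import logging

logger = logging.getLogger(__name__)

def log_run_summary(model, train_loader, training_params):
    # Hypothetical pre-run dump of the main settings, grouped by topic.
    n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    summary = {
        "model": {"name": type(model).__name__, "trainable_params": n_trainable},
        "dataloader": {"batch_size": train_loader.batch_size, "num_workers": train_loader.num_workers},
        "optimization": {
            "optimizer": type(training_params["optimizer"]).__name__,
            "initial_lr": training_params["initial_lr"],
            "loss": type(training_params["loss"]).__name__,
        },
        "training": {"max_epochs": training_params["max_epochs"]},
    }
    for section, values in summary.items():
        logger.info("%s: %s", section, values)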

hakuryuu96 requested a review from @BloodAxe on October 20, 2023
@BloodAxe (Contributor) left a comment:

LGTM

@Louis-Dupont (Contributor) left a comment:

LGTM

@BloodAxe merged commit 749a9c7 into Deci-AI:master on Oct 23, 2023
3 checks passed
BloodAxe pushed a commit that referenced this pull request Oct 26, 2023
[Improvement] max_batches support to training log and tqdm progress bar. (#1554)

* Added max_batches support to training log and tqdm progress bar.

* Changed the logged string according to which parameter is used (len(loader) or max_batches)

* Replaced stopping condition for the epoch with a smaller one

(cherry picked from commit 749a9c7)
BloodAxe added a commit that referenced this pull request Oct 26, 2023
* [Improvement] max_batches support to training log and tqdm progress bar. (#1554)

* Added max_batches support to training log and tqdm progress bar.

* Changed the logged string according to which parameter is used (len(loader) or max_batches)

* Replaced stopping condition for the epoch with a smaller one

(cherry picked from commit 749a9c7)

* fix (#1558)

Co-authored-by: Eugene Khvedchenya <ekhvedchenya@gmail.com>
(cherry picked from commit 8a1d255)

* fix (#1564)

(cherry picked from commit 24798b0)

* Bugfix of model.export() to work correct with bs>1 (#1551)

(cherry picked from commit 0515496)

* Fixed incorrect automatic variable used (#1565)

$@ is the name of the target being generated, and $^ are the dependencies

Co-authored-by: Louis-Dupont <35190946+Louis-Dupont@users.noreply.github.com>
(cherry picked from commit 43f8bea)

* fix typo in class documentation (#1548)

Co-authored-by: Eugene Khvedchenya <ekhvedchenya@gmail.com>
Co-authored-by: Louis-Dupont <35190946+Louis-Dupont@users.noreply.github.com>
(cherry picked from commit ec21383)

* Feature/sg 1198 mixed precision automatically changed with warning (#1567)

* fix

* work with tmpdir

* minor change of comment

* improve device_config

(cherry picked from commit 34fda6c)

* Fixed issue with torch 1.12 where _scale_fn_ref is missing in CyclicLR (#1575)

(cherry picked from commit 23b4f7a)

* Fixed issue with torch 1.12 issue with arange not supporting fp16 for CPU device. (#1574)

(cherry picked from commit 1f15c76)

---------

Co-authored-by: hakuryuu96 <marchenkophilip@gmail.com>
Co-authored-by: Louis-Dupont <35190946+Louis-Dupont@users.noreply.github.com>
Co-authored-by: Alessandro Ros <aler9.dev@gmail.com>