
TensorBoard and training/validation during training #119

Open
wants to merge 9 commits into main
Conversation

@drewoldag (Collaborator) commented Nov 13, 2024

Two big changes

Tensorboard

We're using tensorboardX to write out metrics to be consumed by the tensorboard UI.

Example of Tensorboard

The following is an example TensorBoard from a 30-epoch run of the example CNN on the CIFAR dataset. The top charts show the training loss computed every 10 batches and the overall loss computed on the validation set at the end of each epoch. The lower charts show GPU utilization and memory usage as percentages, sampled every 0.1 seconds from the beginning to the end of training.
[Screenshot: tensorboard_2]
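
A minimal sketch of the tensorboardX wiring described above (the log directory, metric names, and stand-in training step are placeholders for illustration, not necessarily what this PR uses):

```python
from ignite.engine import Engine, Events
from tensorboardX import SummaryWriter

writer = SummaryWriter("results/tensorboard")  # hypothetical log directory


def train_step(engine, batch):
    # Stand-in for the real training step; the real one returns the batch loss.
    return 0.0


trainer = Engine(train_step)


@trainer.on(Events.ITERATION_COMPLETED(every=10))
def log_training_loss(engine):
    # engine.state.output is whatever train_step returned for the last batch.
    writer.add_scalar("train/loss", engine.state.output, engine.state.iteration)
```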

Note that I've also brought in a new dependency here, GPUtil. It seems well used, and it should be safe to use even when Nvidia GPUs aren't present; it simply won't log any metrics in that case.
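
For reference, a sketch of what a GPUtil-based monitor could look like; the function and metric names here are illustrative, not the PR's actual gpu_monitor.py:

```python
import threading
import time

import GPUtil
from tensorboardX import SummaryWriter


def monitor_gpus(writer: SummaryWriter, stop_event: threading.Event, interval: float = 0.1):
    """Poll GPU utilization/memory every `interval` seconds and log them as percentages."""
    step = 0
    while not stop_event.is_set():
        for gpu in GPUtil.getGPUs():  # returns an empty list when no NVIDIA GPU is visible
            writer.add_scalar(f"gpu{gpu.id}/utilization_pct", gpu.load * 100, step)
            writer.add_scalar(f"gpu{gpu.id}/memory_pct", gpu.memoryUtil * 100, step)
        step += 1
        time.sleep(interval)


# Run the monitor in a daemon thread for the duration of training.
stop_event = threading.Event()
monitor = threading.Thread(
    target=monitor_gpus, args=(SummaryWriter("results/tensorboard"), stop_event), daemon=True
)
monitor.start()
# ... training happens here ...
stop_event.set()
```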

Validation during training

Until now, running fibad train would train a model over a given dataset for N epochs. Traditionally in ML, a user runs a validation dataset through the model at the end of each epoch to gauge performance and watch for overfitting.

This PR begins to implement that approach. When fibad instantiates a dataset in train mode, the dataset will have two pytorch Samplers defined as class attributes (only for the CIFAR dataset right now). These samplers are passed to two pytorch dataloaders, producing a train data loader and a validation data loader.
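
Roughly, the sampler/dataloader wiring looks like the following sketch; the attribute names, batch size, and 80/20 split are assumptions for illustration rather than the exact CIFAR data set code:

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

# Stand-in for the CIFAR data set instantiated in train mode.
dataset = TensorDataset(torch.randn(100, 3, 32, 32), torch.randint(0, 10, (100,)))

# Split the indices 80/20 and attach two samplers to the dataset.
indices = torch.randperm(len(dataset)).tolist()
split = int(0.8 * len(dataset))
dataset.train_sampler = SubsetRandomSampler(indices[:split])
dataset.validation_sampler = SubsetRandomSampler(indices[split:])

# Each sampler drives its own dataloader over the same underlying dataset.
train_loader = DataLoader(dataset, batch_size=32, sampler=dataset.train_sampler)
validation_loader = DataLoader(dataset, batch_size=32, sampler=dataset.validation_sampler)
```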

Then in train.py we create two pytorch ignite engines (one for training over N epochs and one for validating at the end of each epoch). The two are linked together using ignite's .on(...) hooks so that the validator engine runs over the validation dataset at the end of each epoch. We only create a validation Engine if a validation dataset is defined.
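
A minimal sketch of that linkage, reusing train_loader and validation_loader from the previous sketch; the step functions are stand-ins for the model's real train/validation steps:

```python
from ignite.engine import Engine, Events


def train_step(engine, batch):
    # Stand-in: the real step does forward/backward/optimizer.step() and returns the loss.
    return 0.0


def validation_step(engine, batch):
    # Stand-in: the real step does a forward pass only and returns the loss or metrics.
    return 0.0


trainer = Engine(train_step)
validator = Engine(validation_step)


@trainer.on(Events.EPOCH_COMPLETED)
def run_validation(engine):
    # Run the whole validation loader once at the end of every training epoch.
    validator.run(validation_loader, max_epochs=1)


trainer.run(train_loader, max_epochs=30)
```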

Places where I could use some help

Train and Validation Datasets

I haven't touched hsc_data_set.py. There's a lot going on in there to define the train/validation/test data subsets, and I haven't yet gone down that path. I'm hopeful that it will be trivial to create pytorch SubsetRandomSamplers in that dataset, which will plug directly into the rest of the machinery. If it's not trivial, then, of course, we can either adapt the machinery or adapt HSCDataset.

Validation runs the model in train mode

When running the validation dataset through the validation engine, we should really set the model to .eval() mode; however, that isn't happening right now. The open question is where to do that. Perhaps we could use the STARTED and COMPLETED hooks of the validation engine to set the model to .eval() and then back to .train() mode?
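
One possible shape for that idea, as a sketch; model and validator here are stand-ins rather than the PR's actual objects:

```python
import torch
from ignite.engine import Engine, Events

model = torch.nn.Identity()  # stand-in for the real CNN
validator = Engine(lambda engine, batch: None)  # stand-in validation engine


@validator.on(Events.STARTED)
def set_eval_mode(engine):
    # Disable dropout/batchnorm updates while the validation set runs.
    model.eval()


@validator.on(Events.COMPLETED)
def restore_train_mode(engine):
    # Put the model back into training mode before the next training epoch.
    model.train()
```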

All the functions called by hooks

This may or may not be an issue, but there are several different Engine events that trigger function calls. The function definitions and the code that registers those functions on the hooks are all mushed together in create_trainer and create_validator. Maybe that's fine, but it feels very messy.
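
One way this could be untangled, shown as a sketch rather than a proposal for the exact PR structure: define the handlers at module level and keep create_trainer limited to registering them (ignite forwards extra add_event_handler arguments to the handler):

```python
from ignite.engine import Engine, Events
from tensorboardX import SummaryWriter


def log_iteration_loss(engine, writer):
    writer.add_scalar("train/loss", engine.state.output, engine.state.iteration)


def log_epoch_completed(engine):
    print(f"Epoch {engine.state.epoch} complete")


def create_trainer(train_step, writer: SummaryWriter) -> Engine:
    # Only the wiring lives here; the handler bodies are defined above.
    trainer = Engine(train_step)
    trainer.add_event_handler(Events.ITERATION_COMPLETED(every=10), log_iteration_loss, writer)
    trainer.add_event_handler(Events.EPOCH_COMPLETED, log_epoch_completed)
    return trainer
```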

@drewoldag self-assigned this Nov 13, 2024
codecov bot commented Nov 13, 2024

Codecov Report

Attention: Patch coverage is 6.59341% with 85 lines in your changes missing coverage. Please review.

Project coverage is 39.28%. Comparing base (7b3b992) to head (4db10a9).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/fibad/pytorch_ignite.py | 0.00% | 36 Missing ⚠️ |
| src/fibad/gpu_monitor.py | 0.00% | 22 Missing ⚠️ |
| src/fibad/data_sets/example_cifar_data_set.py | 12.50% | 14 Missing ⚠️ |
| src/fibad/train.py | 0.00% | 13 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #119      +/-   ##
==========================================
- Coverage   40.95%   39.28%   -1.67%     
==========================================
  Files          21       22       +1     
  Lines        1697     1774      +77     
==========================================
+ Hits          695      697       +2     
- Misses       1002     1077      +75     

☔ View full report in Codecov by Sentry.

github-actions bot commented Nov 13, 2024

| Before [7b3b992] | After [b252cde] | Ratio | Benchmark (Parameter) |
|---|---|---|---|
| 3.16±0.8s | 2.70±1s | ~0.85 | benchmarks.time_computation |
| 984 | 472 | 0.48 | benchmarks.mem_list |


Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@drewoldag changed the title from "WIP - introducing tensorboard to fibad" to "WIP - tensorboard and validation during training" on Nov 16, 2024
…py`. Replaced pytorch-ignite tensorboard writer with tensorboardX.
@drewoldag marked this pull request as ready for review November 26, 2024 22:49
@@ -94,10 +94,22 @@ filters = false
# Implementation is dataset class dependent. Default is false, meaning no filtering.
filter_catalog = false

#??? Maybe these values belong here, instead of a separate [prepare] section???
Collaborator:
No objection from me, we just need to transplant their documentation and references.

@@ -160,7 +174,71 @@ def log_total_time(evaluator):
return evaluator


def create_trainer(model: torch.nn.Module, config: ConfigDict, results_directory: Path) -> Engine:
#! There will likely be a significant amount of code duplication between the
Collaborator:
Generally agree, but not sure it should block first implementation.

@drewoldag changed the title from "WIP - tensorboard and validation during training" to "TensorBoard and training/validation during training" on Nov 26, 2024
@mtauraso (Collaborator) commented:
Places where I could use some help

Train and Validation Datasets

I haven't touched hsc_data_set.py. There's a lot going on in there to define the train/validation/test data subsets, and I haven't yet gone down that path. I'm hopeful that it will be trivial to create pytorch SubsetRandomSamplers in that dataset, which will plug directly into the rest of the machinery. If it's not trivial, then, of course, we can either adapt the machinery or adapt HSCDataset.

I concur on this being easy. HSCDataset is just a map-style dataset that supports __getitem__(). In our case __getitem__() has been performant enough so far. If it's not in this case, we can tackle that directly, and that is somewhat independent of what code we need to write to connect HSCDataset with SubsetRandomSamplers.
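
For reference, anything shaped like the following minimal map-style dataset (a hypothetical example, not HSCDataset itself) plugs into SubsetRandomSampler and DataLoader the same way the CIFAR example does:

```python
import torch
from torch.utils.data import DataLoader, Dataset, SubsetRandomSampler


class TinyMapStyleDataset(Dataset):
    """Minimal map-style dataset: only __len__ and __getitem__ are required."""

    def __init__(self, n: int = 100):
        self.data = torch.randn(n, 3, 32, 32)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


dataset = TinyMapStyleDataset()
sampler = SubsetRandomSampler(range(len(dataset) // 2))
loader = DataLoader(dataset, batch_size=16, sampler=sampler)
```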

Validation runs the model in train mode

When running the validation dataset through the validation engine, we should really set the model to .eval() mode; however, that isn't happening right now. The open question is where to do that. Perhaps we could use the STARTED and COMPLETED hooks of the validation engine to set the model to .eval() and then back to .train() mode?

This seems not wrong to me. Is there a clear alternative?

All the functions called by hooks

This may or may not be an issue, but there are several different Engine events that trigger function calls. The function definitions and the code that registers those functions on the hooks are all mushed together in create_trainer and create_validator. Maybe that's fine, but it feels very messy.

It is messy, but afaict this flows from ignite's execution model: they own execution, we own creating the objects and hooks for them to execute. Ultimately we have to build those objects somewhere. I think the best we can hope for here is that the construction of those objects in our code is somewhat well organized and well described.

@mtauraso (Collaborator) left a comment:
Overall I like it. I think you have some good points about the ugliness in places, but I'm not convinced any of them should hold up the current implementation.

@drewoldag (Collaborator, Author) commented:
Regarding setting the model to be in train vs. eval mode, there's likely a nice way to do it. The difficult part right now is the train_step definition in the model class. Both the trainer and validator Engines use that method.

In the assorted documentation that I've seen, it's generally the case that the validation engine uses a method other than train_step; for instance, what is shown in the first two example code blocks on this page: https://pytorch.org/ignite/generated/ignite.engine.engine.Engine.html#engine
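
For concreteness, that docs-style pattern looks roughly like the following sketch, with a separate validation step so eval mode and no_grad live where the forward pass happens; model here is a stand-in, not the PR's actual model class:

```python
import torch
from ignite.engine import Engine

model = torch.nn.Identity()  # stand-in for the real CNN


def validation_step(engine, batch):
    model.eval()
    with torch.no_grad():
        x, y = batch
        y_pred = model(x)
        return y_pred, y


validator = Engine(validation_step)
```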

I'll make notes and issues associated with each of the points here just so they aren't forgotten. Before I merge, I'll move the docstrings and update the references to train/validate/test keys in the config.toml file.
