
TensorBoard and training/validation during training #119

Open
wants to merge 9 commits into main
Conversation

@drewoldag (Collaborator) commented Nov 13, 2024

Two big changes

Tensorboard

We're using tensorboardX to write out metrics to be consumed by the tensorboard UI.

Example of Tensorboard

The following is an example TensorBoard from a 30-epoch run of the example CNN on the CIFAR dataset. The top charts show the training loss computed every 10 batches and the overall loss computed on the validation set at the end of each epoch. The lower charts show GPU utilization and memory usage as percentages, sampled every 0.1 seconds from the beginning to the end of training.
[Screenshot: tensorboard_2]
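
A minimal sketch of the tensorboardX wiring described above (the log directory, metric names, and stand-in training step are placeholders for illustration, not necessarily what this PR uses):

```python
from ignite.engine import Engine, Events
from tensorboardX import SummaryWriter

writer = SummaryWriter("results/tensorboard")  # hypothetical log directory


def train_step(engine, batch):
    # Stand-in for the real training step; the real one returns the batch loss.
    return 0.0


trainer = Engine(train_step)


@trainer.on(Events.ITERATION_COMPLETED(every=10))
def log_training_loss(engine):
    # engine.state.output is whatever train_step returned for the last batch.
    writer.add_scalar("train/loss", engine.state.output, engine.state.iteration)
```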

Note that I've also brought in a new dependency here, GPUtil. It seems well used, and it should be safe to use even when Nvidia GPUs aren't present; it simply won't log any metrics in that case.
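
For reference, a sketch of what a GPUtil-based monitor could look like; the function and metric names here are illustrative, not the PR's actual gpu_monitor.py:

```python
import threading
import time

import GPUtil
from tensorboardX import SummaryWriter


def monitor_gpus(writer: SummaryWriter, stop_event: threading.Event, interval: float = 0.1):
    """Poll GPU utilization/memory every `interval` seconds and log them as percentages."""
    step = 0
    while not stop_event.is_set():
        for gpu in GPUtil.getGPUs():  # returns an empty list when no NVIDIA GPU is visible
            writer.add_scalar(f"gpu{gpu.id}/utilization_pct", gpu.load * 100, step)
            writer.add_scalar(f"gpu{gpu.id}/memory_pct", gpu.memoryUtil * 100, step)
        step += 1
        time.sleep(interval)


# Run the monitor in a daemon thread for the duration of training.
stop_event = threading.Event()
monitor = threading.Thread(
    target=monitor_gpus, args=(SummaryWriter("results/tensorboard"), stop_event), daemon=True
)
monitor.start()
# ... training happens here ...
stop_event.set()
```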

Validation during training

Until now, running fibad train would train a model over a given dataset for N epochs. Traditionally in ML, a user runs a validation dataset through the model at the end of each epoch to gauge performance and watch for overfitting.

This PR begins to implement that approach. When fibad instantiates a dataset in train mode, the dataset will have two pytorch Samplers defined as class attributes (only for the CIFAR dataset right now). These samplers are passed to two pytorch dataloaders, producing a train data loader and a validation data loader.
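
Roughly, the sampler/dataloader wiring looks like the following sketch; the attribute names, batch size, and 80/20 split are assumptions for illustration rather than the exact CIFAR data set code:

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

# Stand-in for the CIFAR data set instantiated in train mode.
dataset = TensorDataset(torch.randn(100, 3, 32, 32), torch.randint(0, 10, (100,)))

# Split the indices 80/20 and attach two samplers to the dataset.
indices = torch.randperm(len(dataset)).tolist()
split = int(0.8 * len(dataset))
dataset.train_sampler = SubsetRandomSampler(indices[:split])
dataset.validation_sampler = SubsetRandomSampler(indices[split:])

# Each sampler drives its own dataloader over the same underlying dataset.
train_loader = DataLoader(dataset, batch_size=32, sampler=dataset.train_sampler)
validation_loader = DataLoader(dataset, batch_size=32, sampler=dataset.validation_sampler)
```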

Then in train.py we create two pytorch ignite engines (one for training over N epochs and one for validating at the end of each epoch). The two are linked together using ignite's .on(...) hooks so that the validator engine runs over the validation dataset at the end of each epoch. We only create a validation Engine if a validation dataset is defined.
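
A minimal sketch of that linkage, reusing train_loader and validation_loader from the previous sketch; the step functions are stand-ins for the model's real train/validation steps:

```python
from ignite.engine import Engine, Events


def train_step(engine, batch):
    # Stand-in: the real step does forward/backward/optimizer.step() and returns the loss.
    return 0.0


def validation_step(engine, batch):
    # Stand-in: the real step does a forward pass only and returns the loss or metrics.
    return 0.0


trainer = Engine(train_step)
validator = Engine(validation_step)


@trainer.on(Events.EPOCH_COMPLETED)
def run_validation(engine):
    # Run the whole validation loader once at the end of every training epoch.
    validator.run(validation_loader, max_epochs=1)


trainer.run(train_loader, max_epochs=30)
```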

Places where I could use some help

Train and Validation Datasets

I haven't touched hsc_data_set.py. There's a lot going on in there to define the train/validation/test data subsets, and I haven't yet gone down that path. I'm hopeful that it will be trivial to create pytorch SubsetRandomSamplers in that dataset, which will plug directly into the rest of the machinery. If it's not trivial, then, of course, we can either adapt the machinery or adapt HSCDataset.

Validation runs the model in train mode

When running the validation dataset through the validation engine, we should really set the model to .eval() mode; however, that isn't happening right now. The open question is where to do that. Perhaps we could use the STARTED and COMPLETED hooks of the validation engine to set the model to .eval() and then back to .train() mode?
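
One possible shape for that idea, as a sketch; model and validator here are stand-ins rather than the PR's actual objects:

```python
import torch
from ignite.engine import Engine, Events

model = torch.nn.Identity()  # stand-in for the real CNN
validator = Engine(lambda engine, batch: None)  # stand-in validation engine


@validator.on(Events.STARTED)
def set_eval_mode(engine):
    # Disable dropout/batchnorm updates while the validation set runs.
    model.eval()


@validator.on(Events.COMPLETED)
def restore_train_mode(engine):
    # Put the model back into training mode before the next training epoch.
    model.train()
```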

All the functions called by hooks

This may or may not be an issue, but there are several different Engine events that trigger function calls. The function definitions and the code that registers those functions on the hooks are all mushed together in create_trainer and create_validator. Maybe that's fine, but it feels very messy.
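
One way this could be untangled, shown as a sketch rather than a proposal for the exact PR structure: define the handlers at module level and keep create_trainer limited to registering them (ignite forwards extra add_event_handler arguments to the handler):

```python
from ignite.engine import Engine, Events
from tensorboardX import SummaryWriter


def log_iteration_loss(engine, writer):
    writer.add_scalar("train/loss", engine.state.output, engine.state.iteration)


def log_epoch_completed(engine):
    print(f"Epoch {engine.state.epoch} complete")


def create_trainer(train_step, writer: SummaryWriter) -> Engine:
    # Only the wiring lives here; the handler bodies are defined above.
    trainer = Engine(train_step)
    trainer.add_event_handler(Events.ITERATION_COMPLETED(every=10), log_iteration_loss, writer)
    trainer.add_event_handler(Events.EPOCH_COMPLETED, log_epoch_completed)
    return trainer
```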

@drewoldag self-assigned this Nov 13, 2024
codecov bot commented Nov 13, 2024

Codecov Report

Attention: Patch coverage is 6.59341% with 85 lines in your changes missing coverage. Please review.

Project coverage is 39.28%. Comparing base (7b3b992) to head (4db10a9).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/fibad/pytorch_ignite.py | 0.00% | 36 Missing ⚠️ |
| src/fibad/gpu_monitor.py | 0.00% | 22 Missing ⚠️ |
| src/fibad/data_sets/example_cifar_data_set.py | 12.50% | 14 Missing ⚠️ |
| src/fibad/train.py | 0.00% | 13 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #119      +/-   ##
==========================================
- Coverage   40.95%   39.28%   -1.67%     
==========================================
  Files          21       22       +1     
  Lines        1697     1774      +77     
==========================================
+ Hits          695      697       +2     
- Misses       1002     1077      +75     

☔ View full report in Codecov by Sentry.

github-actions bot commented Nov 13, 2024

| Before [7b3b992] | After [b252cde] | Ratio | Benchmark (Parameter) |
|---|---|---|---|
| 3.16±0.8s | 2.70±1s | ~0.85 | benchmarks.time_computation |
| 984 | 472 | 0.48 | benchmarks.mem_list |


Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@drewoldag changed the title from "WIP - introducing tensorboard to fibad" to "WIP - tensorboard and validation during training" on Nov 16, 2024
…py`. Replaced pytorch-ignite tensorboard writer with tensorboardX.
@drewoldag marked this pull request as ready for review November 26, 2024 22:49
@@ -94,10 +94,22 @@ filters = false
# Implementation is dataset class dependent. Default is false, meaning no filtering.
filter_catalog = false

#??? Maybe these values belong here, instead of a separate [prepare] section???
Collaborator:
No objection from me, we just need to transplant their documentation and references.

@@ -160,7 +174,71 @@ def log_total_time(evaluator):
return evaluator


def create_trainer(model: torch.nn.Module, config: ConfigDict, results_directory: Path) -> Engine:
#! There will likely be a significant amount of code duplication between the
Collaborator:
Generally agree, but not sure it should block first implementation.

@drewoldag changed the title from "WIP - tensorboard and validation during training" to "TensorBoard and training/validation during training" on Nov 26, 2024
@mtauraso (Collaborator) commented:
Places where I could use some help

Train and Validation Datasets

I haven't touched hsc_data_set.py. There's a lot going on in there to define the train/validation/test data subsets, and I haven't yet gone down that path. I'm hopeful that it will be trivial to create pytorch SubsetRandomSamplers in that dataset, which will plug directly into the rest of the machinery. If it's not trivial, then, of course, we can either adapt the machinery or adapt HSCDataset.

I concur on this being easy. HSCDataset is just a map-style dataset that supports __getitem__(). In our case __getitem__() has been performant enough so far. If it's not in this case, we can tackle that directly, and that is somewhat independent of what code we need to write to connect HSCDataset with SubsetRandomSamplers.
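
For reference, anything shaped like the following minimal map-style dataset (a hypothetical example, not HSCDataset itself) plugs into SubsetRandomSampler and DataLoader the same way the CIFAR example does:

```python
import torch
from torch.utils.data import DataLoader, Dataset, SubsetRandomSampler


class TinyMapStyleDataset(Dataset):
    """Minimal map-style dataset: only __len__ and __getitem__ are required."""

    def __init__(self, n: int = 100):
        self.data = torch.randn(n, 3, 32, 32)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


dataset = TinyMapStyleDataset()
sampler = SubsetRandomSampler(range(len(dataset) // 2))
loader = DataLoader(dataset, batch_size=16, sampler=sampler)
```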

Validation runs the model in train mode

When running the validation dataset through the validation engine, we should really set the model to .eval() mode; however, that isn't happening right now. The open question is where to do that. Perhaps we could use the STARTED and COMPLETED hooks of the validation engine to set the model to .eval() and then back to .train() mode?

This seems not wrong to me. Is there a clear alternative?

All the functions called by hooks

This may or may not be an issue, but there are several different Engine events that trigger function calls. The function definitions and the code that registers those functions on the hooks are all mushed together in create_trainer and create_validator. Maybe that's fine, but it feels very messy.

It is messy, but afaict this flows from ignite's execution model: they own execution, we own creating the objects and hooks for them to execute. Ultimately we have to build those objects somewhere. I think the best we can hope for here is that the construction of those objects in our code is somewhat well organized and well described.

@mtauraso (Collaborator) left a comment:
Overall I like it. I think you have some good points about the ugliness in places, but I'm not convinced any of them should hold up the current implementation.

@drewoldag (Collaborator, Author) commented:
Regarding setting the model to be in train vs. eval mode, there's likely a nice way to do it. The difficult part right now is the train_step definition in the model class. Both the trainer and validator Engines use that method.

In the assorted documentation that I've seen, it's generally the case that the validation engine uses a method other than train_step; for instance, what is shown in the first two example code blocks on this page: https://pytorch.org/ignite/generated/ignite.engine.engine.Engine.html#engine
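
For concreteness, that docs-style pattern looks roughly like the following sketch, with a separate validation step so eval mode and no_grad live where the forward pass happens; model here is a stand-in, not the PR's actual model class:

```python
import torch
from ignite.engine import Engine

model = torch.nn.Identity()  # stand-in for the real CNN


def validation_step(engine, batch):
    model.eval()
    with torch.no_grad():
        x, y = batch
        y_pred = model(x)
        return y_pred, y


validator = Engine(validation_step)
```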

I'll make notes and issues associated with each of the points here just so they aren't forgotten. Before I merge, I'll move the docstrings and update the references to train/validate/test keys in the config.toml file.
