TensorBoard and training/validation during training #119
base: main
Conversation
Codecov Report
Attention: Patch coverage is

Coverage Diff (main vs. #119):
- Coverage: 40.95% → 39.28% (-1.67%)
- Files: 21 → 22 (+1)
- Lines: 1697 → 1774 (+77)
- Hits: 695 → 697 (+2)
- Misses: 1002 → 1077 (+75)
…py`. Replaced pytorch-ignite tensorboard writer with tensorboardX.
src/fibad/fibad_default_config.toml

@@ -94,10 +94,22 @@ filters = false
# Implementation is dataset class dependent. Default is false, meaning no filtering.
filter_catalog = false

#??? Maybe these values belong here, instead of a separate [prepare] section???
No objection from me, we just need to transplant their documentation and references.
@@ -160,7 +174,71 @@ def log_total_time(evaluator):
    return evaluator


def create_trainer(model: torch.nn.Module, config: ConfigDict, results_directory: Path) -> Engine:
    #! There will likely be a significant amount of code duplication between the
Generally agree, but not sure it should block first implementation.
I concur on this being easy. HSCDataset is just a Map-style dataset that supports
This seems not wrong to me. Is there a clear alternative?
It is messy, but afaict this flows from ignite's execution model: they own execution, we own creating objects and hooks for them to execute. Ultimately we have to build those objects somewhere. I think the best we can hope for here is that the construction of the objects in our code is somewhat well organized and well-described.
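To make that division of labor concrete, here is a minimal sketch of a Map-style dataset and a train/validation index split. Everything here is a hypothetical stand-in: `ToyMapDataset` plays the role of HSCDataset (a Map-style dataset only needs `__len__` and `__getitem__`), and the index lists produced by `split_indices` are what would seed two `torch.utils.data.SubsetRandomSampler` instances in the real code.

```python
import random


class ToyMapDataset:
    """Stand-in for a Map-style dataset like HSCDataset: just __len__/__getitem__."""

    def __init__(self, items):
        self._items = items

    def __len__(self):
        return len(self._items)

    def __getitem__(self, idx):
        return self._items[idx]


def split_indices(dataset, validation_fraction=0.2, seed=0):
    """Shuffle dataset indices and split into train/validation index lists.

    In the real code these lists would feed two SubsetRandomSamplers, which
    in turn feed the train and validation DataLoaders.
    """
    indices = list(range(len(dataset)))
    random.Random(seed).shuffle(indices)
    n_validation = int(len(indices) * validation_fraction)
    return indices[n_validation:], indices[:n_validation]


dataset = ToyMapDataset(list(range(100)))
train_idx, val_idx = split_indices(dataset)
print(len(train_idx), len(val_idx))  # 80 20
```

The point of the sketch is that the dataset itself stays dumb; all the train/validation policy lives in the objects we construct and hand to the execution framework.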
Overall I like it. I think you have some good points about the ugliness in places, but I'm not convinced any of them should hold up the current implementation.
Regarding setting the model to be in train vs. eval mode, there's likely a nice way to do it. The difficult part right now is the

In the assorted documentation that I've seen, it's generally the case that the validation engine would use a method other than

I'll make notes and issues associated with each of the points here just so they aren't forgotten. Before I merge, I'll move the docstrings and update the references to train/validate/test keys in the config.toml file.
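One possible shape for the mode toggle under discussion, as a sketch: `ToyModel` and `ToyValidator` are hypothetical stand-ins (the real objects would be a `torch.nn.Module` and an ignite `Engine`, with the handlers attached via ignite's `Events.STARTED` and `Events.COMPLETED`).

```python
class ToyModel:
    """Stand-in for a torch.nn.Module's train()/eval() mode flag."""

    def __init__(self):
        self.training = True

    def train(self):
        self.training = True

    def eval(self):
        self.training = False


class ToyValidator:
    """Stand-in for an ignite Engine that fires STARTED/COMPLETED around a run."""

    def __init__(self, step_fn):
        self.step_fn = step_fn
        self.handlers = {"STARTED": [], "COMPLETED": []}

    def on(self, event, handler):
        self.handlers[event].append(handler)

    def run(self, data):
        for handler in self.handlers["STARTED"]:
            handler()
        self.outputs = [self.step_fn(batch) for batch in data]
        for handler in self.handlers["COMPLETED"]:
            handler()


model = ToyModel()


def validation_step(batch):
    # The real step would run the model under torch.no_grad();
    # here we just record which mode the model was in.
    return model.training


validator = ToyValidator(validation_step)
validator.on("STARTED", model.eval)     # enter eval mode before validation
validator.on("COMPLETED", model.train)  # restore train mode afterwards

validator.run([1, 2, 3])
print(validator.outputs, model.training)  # [False, False, False] True
```

This keeps the toggle entirely inside the validator's lifecycle, so the trainer never needs to know about it.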
Two big changes

Tensorboard

We're using `tensorboardX` to write out metrics to be consumed by the tensorboard UI.

Example of Tensorboard

The following is an example Tensorboard from a 30 epoch run of the example CNN using the cifar dataset. Here the top charts are loss calculated over every 10 batches for training, and the overall loss calculated on the validation set at the end of each epoch. The lower charts show GPU utilization and memory usage as percentages every 0.1 seconds from beginning to end of training.
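The "loss every 10 batches" pattern behind the top chart could look roughly like this. As a sketch only: `InMemorySummaryWriter` is a stand-in that mirrors tensorboardX's `add_scalar(tag, value, global_step)` interface so the snippet runs without the dependency, and the tag name and loss values are made up.

```python
class InMemorySummaryWriter:
    """Stand-in mirroring tensorboardX SummaryWriter's add_scalar signature."""

    def __init__(self):
        self.scalars = []

    def add_scalar(self, tag, value, global_step):
        self.scalars.append((tag, value, global_step))


def log_training_losses(writer, losses, log_every=10):
    """Write one training-loss point for every log_every-th batch."""
    for batch_idx, loss in enumerate(losses):
        if batch_idx % log_every == 0:
            writer.add_scalar("training/loss", loss, global_step=batch_idx)


writer = InMemorySummaryWriter()
log_training_losses(writer, losses=[1.0 / (i + 1) for i in range(30)])
print(writer.scalars)  # one point each at batches 0, 10 and 20
```

With the real `tensorboardX.SummaryWriter` in place of the stand-in, these scalars are what the tensorboard UI renders as the training-loss chart.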
Note that I've also brought in a new dependency here, `GPUtil`, which seems well used and should be safe to use even when Nvidia GPUs aren't being used: it simply won't log any metrics.

Validation during training
Until now, running `fibad train` would train a model over a given dataset for N epochs. Traditionally in ML, a user would run a validation dataset through the model at the end of each epoch to gauge performance and watch for overfitting.

This PR begins to implement that approach: when fibad instantiates a dataset in `train` mode, the dataset will have two pytorch Samplers defined as class attributes (only for the CIFAR dataset right now). These samplers will be passed to two pytorch dataloaders, producing a `train` and a `validation` data loader.

Then in `train.py` we create two pytorch ignite engines (one for training over N epochs and one for validating at the end of each epoch). The two are linked together using ignite's `.on(...)` hooks such that the validator engine runs over the validation dataset at the end of each epoch. We only create a validation Engine if there is a validation dataset defined.

Places where I could use some help
Train and Validation Datasets

I haven't touched hsc_data_set.py. There's a lot going on in there to define the train/validation/test data subsets, and I haven't yet gone down that path. I'm hopeful that it will be trivial to create pytorch `SubsetRandomSampler`s in that dataset, which will plug directly into the rest of the machinery. If it's not trivial, then, of course, we can either adapt the machinery or adapt HSCDataset.

Validation runs the model in `train` mode

When running the validation dataset through the validation engine, we should really set the model to `.eval()`
mode; however, that isn't happening right now. The open question is where to do that. Perhaps we could use the `STARTED` and `COMPLETED` hooks of the validation engine to set the model to `.eval()` and then back to `.train()` mode???

All the functions called by hooks
This may or may not be an issue, but there are several different Engine events that trigger function calls. The function definitions and the code that adds the functions to the hooks are all mushed together in `create_trainer` and `create_validator`. Maybe that's fine??? But it feels very messy.
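To make the wiring described above concrete, here is a stand-in sketch of the trainer/validator linkage. This is not the PR's code: `ToyEngine` is a hypothetical replacement for ignite's `Engine`, `on_epoch_completed` stands in for attaching a handler to `Events.EPOCH_COMPLETED`, and the data loaders are plain lists.

```python
class ToyEngine:
    """Stand-in for ignite's Engine: it owns execution and calls our hooks."""

    def __init__(self, step_fn):
        self.step_fn = step_fn
        self.epoch_completed_handlers = []

    def on_epoch_completed(self, handler):
        # Analogous to engine.on(Events.EPOCH_COMPLETED) in ignite.
        self.epoch_completed_handlers.append(handler)

    def run(self, data, max_epochs=1):
        for _ in range(max_epochs):
            for batch in data:
                self.step_fn(batch)
            for handler in self.epoch_completed_handlers:
                handler()


history = []
trainer = ToyEngine(lambda batch: history.append(("train", batch)))
validator = ToyEngine(lambda batch: history.append(("validate", batch)))

validation_loader = ["v0", "v1"]  # stand-in for the validation data loader
if validation_loader:  # only wire the hook when validation data exists
    trainer.on_epoch_completed(lambda: validator.run(validation_loader))

trainer.run(["b0", "b1"], max_epochs=2)
print(history)
```

Running this interleaves a full validation pass after each training epoch, which is exactly the behavior the hooks in `create_trainer`/`create_validator` are set up to produce; the sketch also shows why the handler definitions and the wiring tend to end up in the same place.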