Model_choice.py: refactor model, loss and optimizer instantiation and loading #292

remtav · 2022-03-21T15:58:45Z

Model_choice.py has gotten messy and cluttered over the years and needs a bit of refactoring. This refactoring should be done before addressing #246 and #152.

Current state of things

set_hyperparameters() function has a very vague purpose: "set[ting] hyperparameters based on values provided in yaml config file";
net() function is supposed to "Define the neural net", but in reality it's an all-in-one function that does the following:

Defines net architecture;
Reads a checkpoint to memory with load_checkpoint() (from a .pth.tar file as created by torch.save);
Returns at this point if net() is called from inference; if called in train mode, continues with the following:
If more than one GPU is requested, determines which GPUs are available based on a user-provided threshold for each GPU's available RAM and usage %;
Sets model to DataParallel if more than one GPU is requested and available;
Sets main device with set_device() function;
Pushes model to main device;
Calls set_hyperparameters() (see above);
Pushes loss to device;
Returns 7 (!!) objects: model, model_name, loss, etc.

Suggested solution (high level)

All these steps could be better separated in small, dedicated functions of their own:

read_checkpoint(): renamed version of load_checkpoint (prevents confusion with the load_state_dict function). Although it builds on torch.load(), this function really just reads a checkpoint from a .pth.tar file into memory as a Python dict containing weights, optimizer state, etc.
define_net_architecture(): define the model architecture from config parameters (i.e. create model with randomly initialized weights)
adapt_checkpoint_to_dp_model(): for use in the test loop during training only; adapts a generic checkpoint so it can be loaded into a DataParallel model, as is currently done in load_from_checkpoint (if model is a DataParallel object)
define_loss(): calls verify_weights() and instantiates a loss criterion
define_optimizer(): instantiates optimizer with learning rate, weight decay, etc.
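To illustrate the key adaptation that adapt_checkpoint_to_dp_model() would need to do: nn.DataParallel keeps the wrapped model under its `module` attribute, so the keys of its state dict carry a "module." prefix. Below is a minimal sketch, not the actual implementation. It uses plain dicts in place of tensors, and both the `model_is_dp` flag (standing in for an isinstance check against nn.DataParallel) and the 'model_state_dict' key are assumptions.

```python
from collections import OrderedDict

DP_PREFIX = "module."  # prefix nn.DataParallel adds to every state_dict key


def adapt_checkpoint_to_dp_model(checkpoint: dict, model_is_dp: bool) -> dict:
    """Return a copy of `checkpoint` whose weight keys fit a DataParallel model.

    `model_is_dp` is a hypothetical stand-in for
    `isinstance(model, torch.nn.DataParallel)`.
    """
    if not model_is_dp:
        # Nothing to adapt: a plain model loads generic keys as they are.
        return dict(checkpoint)

    adapted = OrderedDict()
    for key, value in checkpoint["model_state_dict"].items():
        # Add the prefix only where it is missing, so the call is idempotent.
        new_key = key if key.startswith(DP_PREFIX) else DP_PREFIX + key
        adapted[new_key] = value

    new_checkpoint = dict(checkpoint)  # shallow copy, input stays untouched
    new_checkpoint["model_state_dict"] = adapted
    return new_checkpoint
```

Returning a copy rather than renaming keys in place keeps the checkpoint that was read from disk reusable for a non-DataParallel model.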

These functions would be called only when necessary in 3 main places:

Beginning of train_segmentation:

read the checkpoint to be loaded (model weights and optimizer state);
define net architecture;
load model weights with pytorch's [model_object].load_state_dict() method;
define loss;
define optimizer;
load optimizer from checkpoint;

Test loop in train_segmentation:

Load best checkpoint to model (using the dedicated function to adapt checkpoint keys if model is an nn.DataParallel instance);

Beginning of inference:

override architecture, input bands, output classes from checkpoint's params;
define net architecture
load weights from provided checkpoint to model using pytorch's [model_object].load_state_dict() method
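The override step at the beginning of inference could look roughly like the sketch below. The function name, the parameter names in `OVERRIDABLE_KEYS`, and the flat dict layout are all hypothetical; the real config and the params stored in a checkpoint may be structured differently.

```python
from copy import deepcopy

# Hypothetical parameter names; the real config/checkpoint layout may differ.
OVERRIDABLE_KEYS = ("architecture", "input_bands", "output_classes")


def override_from_checkpoint_params(config: dict, checkpoint_params: dict) -> dict:
    """Return a copy of `config` where model-defining values come from the checkpoint."""
    merged = deepcopy(config)
    for key in OVERRIDABLE_KEYS:
        if key in checkpoint_params and checkpoint_params[key] != merged.get(key):
            # Make the override visible: silently changing the architecture
            # behind the user's config would be hard to debug.
            print(f"Overriding '{key}': {merged.get(key)!r} -> {checkpoint_params[key]!r}")
            merged[key] = checkpoint_params[key]
    return merged
```

With the model-defining values taken from the checkpoint, the architecture built at inference matches the weights being loaded, whatever the inference config says.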

* refactor model_choice.py using solution in issue #292
  adapt train_segmentation.py and inference_segmentation.py to new usage
  move state_dict_path param to default_training.yaml
  implement unit tests for model_choice.py functions
  read_checkpoint(): add robustness (covers external checkpoints with only model weights, and complies with torch's save key naming standard 'model_state_dict' and 'optimizer_state_dict' rather than gdl's 'model' and 'optimizer' keys)
  create high level define_model() function using all low level model definition/loading functions from model_choice.py
  test_losses.py: implement class weights test
  softcode strict loading boolean for loading provided state_dict at train_segmentation.py
* bugfix for github actions
* more bugfixes for github actions
* train_segmentation.py: bugfix --> read weights under 'model_state_dict' key
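The robustness added to read_checkpoint() can be pictured as a key-normalization step applied to the dict returned by torch.load(). The sketch below is an assumption about how that could work, not the merged code: `normalize_checkpoint_keys` is a hypothetical helper name, and treating any dict without a known key as weights-only is a heuristic.

```python
def normalize_checkpoint_keys(checkpoint: dict) -> dict:
    """Map a loaded checkpoint onto 'model_state_dict' / 'optimizer_state_dict' keys."""
    if "model_state_dict" in checkpoint:
        # Already follows the torch naming convention.
        return dict(checkpoint)

    if "model" in checkpoint:
        # Older gdl-style checkpoint: rename 'model' and 'optimizer',
        # keep every other entry (epoch, params, ...) untouched.
        normalized = {
            k: v for k, v in checkpoint.items() if k not in ("model", "optimizer")
        }
        normalized["model_state_dict"] = checkpoint["model"]
        if "optimizer" in checkpoint:
            normalized["optimizer_state_dict"] = checkpoint["optimizer"]
        return normalized

    # Neither key found: assume an external checkpoint holding only weights.
    return {"model_state_dict": checkpoint}
```

Downstream code can then always read weights under 'model_state_dict', which is what the last bugfix in the commit message above relies on.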

remtav mentioned this issue Mar 21, 2022

refactor model_choice.py using solution in issue #292 and more #294

Merged

remtav closed this as completed in #294 Mar 22, 2022
