Validation during training (version 2) #828
Conversation
maskrcnn_benchmark/engine/trainer.py (Outdated)
meters_val.update(loss=losses_reduced, **loss_dict_reduced)
synchronize()
logger.info(
    meters.delimiter.join(
Should meters be meters_val?
It does not matter here because meters and meters_val have the same delimiter, but yes, ideally meters_val should be here. I'll fix this.
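To see why it is only a cosmetic issue, here is a small self-contained sketch, assuming both loggers are constructed with the same delimiter string as in trainer.py:

# Both loggers are built the same way, so their delimiters are the same
# string and the joined log line is identical either way.
from maskrcnn_benchmark.utils.metric_logger import MetricLogger

meters = MetricLogger(delimiter="  ")
meters_val = MetricLogger(delimiter="  ")
assert meters.delimiter == meters_val.delimiter

meters_val.update(loss=0.42)
print(meters_val.delimiter.join(["[Validation]", str(meters_val)]))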
SHARE_BOX_FEATURE_EXTRACTOR: False
MASK_ON: True
DATASETS:
  TRAIN: ("coco_2014_train", "coco_2014_valminusminival")
I don't understand why we have to add two datasets in TRAIN: ("coco_2014_train", "coco_2014_valminusminival"). Only one dataset will be returned by the build_dataset function in maskrcnn_benchmark/data/build.py:

# for training, concatenate all datasets into a single one
dataset = datasets[0]
if len(datasets) > 1:
    dataset = D.ConcatDataset(datasets)
return [dataset]

datasets is a list, so dataset is coco_2014_train, right?
And Question 2: why did you delete VAL? From my perspective, TEST is TEST and VAL is VAL; they are datasets with different distributions, right?
Thank you so much for your work.
Regarding Question 1: in the highlighted code snippet the datasets are concatenated when there is more than one dataset in the TRAIN field; the if len(datasets) > 1 branch replaces dataset with a ConcatDataset over all of them, so the returned dataset contains both coco_2014_train and coco_2014_valminusminival.
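To make the concatenation concrete, a tiny self-contained sketch using torch.utils.data directly (the toy datasets are made up for illustration):

from torch.utils.data import ConcatDataset, Dataset

class ToyDataset(Dataset):
    def __init__(self, items):
        self.items = items
    def __len__(self):
        return len(self.items)
    def __getitem__(self, idx):
        return self.items[idx]

train = ToyDataset(["train_0", "train_1"])
valminusminival = ToyDataset(["val_0"])

combined = ConcatDataset([train, valminusminival])
print(len(combined))  # 3 -- samples from both datasets are available
print(combined[2])    # "val_0" -- indices past the first dataset fall into the second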
Regarding Question 2: as discussed in #785 (proposed by @fmassa), a separate validation dataset is rarely needed in this case, because you do not change hyperparameters while the training script is running. After tuning the network you can take the best model variant (evaluated on the validation dataset, which is TEST here) and run tools/test_net.py with another dataset.
Thanks for your patient and kind reply :)
Sorry, has the validation part already been merged into the code?
@xiaohai12,
Hi @osanwe, thanks for this awesome PR, this one's extremely useful! 🎉
Thanks for your implementation. But after evaluation, the training stops. This is strange.
@chenjoya Does it stop without any errors?
@chenjoya Maybe it's a CUDA out-of-memory problem.
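If it is indeed out-of-memory, a hedged sketch of a memory-safer validation pass; run_validation and data_loader_val are hypothetical stand-ins for whatever evaluation call the training loop makes, not the code in this PR:

import torch

def run_validation(model, data_loader_val, device):
    model.eval()
    with torch.no_grad():  # no autograd graph is kept, so peak GPU memory stays lower
        for images, targets, _ in data_loader_val:
            images = images.to(device)
            model(images)  # forward pass only
    torch.cuda.empty_cache()  # optionally release cached blocks
    model.train()  # restore training mode before resuming the main loop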
WEIGHT_DECAY: 0.0001
STEPS: (60000, 80000)
MAX_ITER: 90000
TEST_PERIOD: 2500
I hope this isn't a silly question, but can you explain why you decided to change BASE_LR: 0.02 (default: 0.001), WEIGHT_DECAY: 0.0001 (default: 0.0005), and STEPS: (60000, 80000)? If these have been answered in a previous issue, I wouldn't mind being pointed to that discussion. Thank you for your time!
Sorry, it may be that my CPU resources were not enough. Now it works well. Thanks! ^∀^
After the discussion in another issue thread (#785) and additional code study, I decided to simplify the approach for tracking validation during training, to make earlier stopping possible.
Instead of an additional dataset, the parameter SOLVER.TEST_PERIOD (like SOLVER.CHECKPOINT_PERIOD) is added to specify the number of iterations between validation runs. The same datasets are used for the intermediate and final evaluations. Also, in this version both the losses and AP (via the inference method) are computed.
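For example, a minimal sketch of setting the new option programmatically, assuming the yacs-based cfg object the repository already exposes and the SOLVER.TEST_PERIOD key added by this PR (the 2500 value just mirrors the diff above):

from maskrcnn_benchmark.config import cfg

# TEST_PERIOD is set like any other solver option; presumably a value of 0
# (the assumed default) leaves intermediate validation disabled.
cfg.merge_from_list(["SOLVER.TEST_PERIOD", 2500,
                     "SOLVER.CHECKPOINT_PERIOD", 2500])
print(cfg.SOLVER.TEST_PERIOD)  # 2500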