This repository was archived by the owner on Oct 31, 2023. It is now read-only.

Validation during training (version 2) #828

Merged
merged 4 commits on Sep 29, 2019

Conversation


@osanwe osanwe commented May 27, 2019

After the discussion in another issue thread (#785) and some additional study of the code, I decided to simplify the approach to tracking validation during training, to enable early stopping.

Instead of an additional dataset, the parameter SOLVER.TEST_PERIOD (analogous to SOLVER.CHECKPOINT_PERIOD) is added to specify the number of iterations between validation runs. The same datasets are used for the intermediate and final evaluations.

In this version both the losses and AP (via the inference method) are computed.
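For context, the periodic check that a parameter like this drives can be sketched as follows. This is a minimal illustration, not the PR's exact code; the function name `should_validate` and the 30000/90000 numbers are mine.

```python
# Minimal sketch (not the PR's exact code) of how a SOLVER.TEST_PERIOD-style
# setting can gate validation inside a training loop.
def should_validate(iteration, test_period):
    """True every `test_period` iterations, skipping iteration 0."""
    return test_period > 0 and iteration > 0 and iteration % test_period == 0

# With MAX_ITER 90000 and TEST_PERIOD 30000, validation would run three times:
triggered = [i for i in range(1, 90001) if should_validate(i, 30000)]
# triggered == [30000, 60000, 90000]
```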

@facebook-github-bot facebook-github-bot added the CLA Signed Do not delete this pull request or issue due to inactivity. label May 27, 2019
meters_val.update(loss=losses_reduced, **loss_dict_reduced)
synchronize()
logger.info(
    meters.delimiter.join(


Should meters be meters_val?

Author

It does not matter here because meters and meters_val have the same delimiter, but yes, ideally meters_val should be here. I'll fix this.
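For illustration, here is why the mix-up is harmless in practice. The `Meters` class below is a stand-in sketch I wrote for this explanation, not maskrcnn-benchmark's actual MetricLogger:

```python
# Sketch only: a stand-in for the meter objects discussed above, not the
# library's real MetricLogger class.
class Meters:
    def __init__(self, delimiter="  "):
        self.delimiter = delimiter
        self.values = {}

    def update(self, **kwargs):
        self.values.update(kwargs)

    def format(self):
        return self.delimiter.join(
            f"{k}: {v}" for k, v in sorted(self.values.items())
        )

meters = Meters()      # training meters
meters_val = Meters()  # validation meters
meters_val.update(loss=0.5)
# Both objects are constructed with the same delimiter, so joining with
# meters.delimiter instead of meters_val.delimiter yields an identical
# string -- the output never differed, but meters_val is the right object.
```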

SHARE_BOX_FEATURE_EXTRACTOR: False
MASK_ON: True
DATASETS:
  TRAIN: ("coco_2014_train", "coco_2014_valminusminival")
@qihao-huang qihao-huang commented May 30, 2019

I don't understand why we have to add two datasets in TRAIN: ("coco_2014_train", "coco_2014_valminusminival").

Only one dataset is returned by the build_dataset function in maskrcnn_benchmark/data/build.py:

# for training, concatenate all datasets into a single one
dataset = datasets[0]
if len(datasets) > 1:
    dataset = D.ConcatDataset(datasets)
return [dataset]

datasets is a list, so dataset is coco_2014_train, right?

And, Question 2:
Why did you delete VAL? From my perspective, TEST is TEST and VAL is VAL; they are datasets from different distributions, right?

Thank you so much for your work.

Author

Regarding Question 1:
In the highlighted code snippet, the datasets are concatenated whenever there is more than one dataset in the TRAIN field.

Regarding Question 2:
As discussed in #785 (proposed by @fmassa), in this case a separate validation dataset is rarely needed, because you do not change hyperparameters while the training script is running. After tuning the network you can take the best model variant (evaluated on the validation dataset, which is TEST here) and run tools/test_net.py with another dataset.
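To make the Question 1 answer concrete, here is a self-contained sketch of the concatenation behavior. The `ConcatDataset` below mimics torch.utils.data.ConcatDataset rather than importing it, and plain lists stand in for the actual COCO datasets:

```python
# Illustration only: a minimal ConcatDataset mimicking the behavior used by
# build_dataset; plain lists stand in for the actual COCO datasets.
class ConcatDataset:
    def __init__(self, datasets):
        self.datasets = datasets

    def __len__(self):
        return sum(len(d) for d in self.datasets)

    def __getitem__(self, idx):
        # Walk the datasets in order until idx falls inside one of them.
        for d in self.datasets:
            if idx < len(d):
                return d[idx]
            idx -= len(d)
        raise IndexError(idx)

train = ["t0", "t1", "t2"]   # stands in for coco_2014_train
extra = ["v0", "v1"]         # stands in for coco_2014_valminusminival
dataset = ConcatDataset([train, extra])
# len(dataset) == 5, and indices 3-4 come from the second dataset, so both
# TRAIN entries really are used during training.
```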


Thanks for your patient and kind reply :)

@xiaohai12

Sorry, has the validation part been merged into the code yet?


osanwe commented Jun 14, 2019

@xiaohai12,
Unfortunately not. I think it is waiting for @fmassa's review.

@botcs botcs merged commit 0ce8f6f into facebookresearch:master Sep 29, 2019
Contributor

botcs commented Sep 29, 2019

Hi @osanwe,

Thanks for this awesome PR, this one's extremely useful!🎉

Contributor

chenjoya commented Oct 8, 2019

Thanks for your implementation. But after evaluation, the training stops, which is strange.
Case:
When I train maskrcnn_R_50_FPN_1x with the period set to 30000, at iteration 30000 the program evaluates AP on COCO minival, but then the training seems to stop.
I hope you can take a look.

Contributor

botcs commented Oct 8, 2019

@chenjoya does it stop without any errors?

@elepherai

@chenjoya Maybe it's a CUDA out-of-memory problem.

WEIGHT_DECAY: 0.0001
STEPS: (60000, 80000)
MAX_ITER: 90000
TEST_PERIOD: 2500


I hope this isn't a silly question, but can you explain why you decided to change BASE_LR: 0.02 (default: 0.001), WEIGHT_DECAY: 0.0001 (default: 0.0005), and STEPS: (60000, 80000)? If this has been answered in a previous issue, I wouldn't mind being pointed to that discussion. Thank you for your time!

@chenjoya
Contributor

Sorry, it may be that my CPU resources were not sufficient. It works well now. Thanks! ^∀^
