Hi,

I'm using a dataset with about 3,500 images for training and 400 for validation, across 50 classes, and I'm training in a multi-GPU environment (an HPC machine). At the moment the global AP is about 18; my current settings are the ones that give the best AP I've found so far.

These are the AP and execution times for 2, 4, and 8 GPUs:

2 GPUs: AP 18.6 -- 55 minutes
4 GPUs: AP 18.2 -- 57 minutes
8 GPUs: AP 18.0 -- 56 minutes

So my questions are:

1. Why is the execution time the same whether I run with 2, 4, or 8 GPUs? (The GPUs are all busy; I've checked.)
2. How can I improve the global AP, which is quite low at the moment? I've tried many learning rates, iteration counts, and LR decay schedules (via `cfg.SOLVER.STEPS`), with no success.
3. How can I stop training when the AP has not increased for many iterations? More generally, what is a good criterion for stopping training once the iteration count gets too high?
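On question 1, one thing I've been looking at is the linear scaling rule: if `cfg.SOLVER.IMS_PER_BATCH` stays fixed while the GPU count grows, each GPU just processes a smaller slice of the same global batch, so the number of iterations (and hence the wall time) barely changes. Here is a minimal sketch of the scaling arithmetic I have in mind — the reference values below are placeholders, not my actual settings:

```python
# Sketch of the linear scaling rule for multi-GPU training:
# the global batch size and base LR scale together with the GPU count,
# and the iteration budget shrinks proportionally.
# Reference values are placeholders, not my actual config.

REF_GPUS = 2
REF_IMS_PER_BATCH = 4      # global batch size with 2 GPUs (placeholder)
REF_BASE_LR = 0.0025       # LR tuned for that batch size (placeholder)
REF_MAX_ITER = 30000       # iteration budget at the reference batch size

def scaled_solver(num_gpus):
    """Scale batch size and LR linearly with the number of GPUs."""
    factor = num_gpus / REF_GPUS
    return {
        "IMS_PER_BATCH": int(REF_IMS_PER_BATCH * factor),
        "BASE_LR": REF_BASE_LR * factor,
        # With a larger global batch, the same number of epochs
        # needs proportionally fewer iterations.
        "MAX_ITER": int(REF_MAX_ITER / factor),
    }

for gpus in (2, 4, 8):
    print(gpus, scaled_solver(gpus))
```

If this reasoning is right, keeping `IMS_PER_BATCH` constant across 2/4/8 GPUs would explain the flat wall time I'm seeing.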
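On question 3, the direction I'm considering is a patience-based early stop: halt when the validation AP hasn't improved for N consecutive evaluations. Below is a pure-Python sketch of just the bookkeeping — the wiring into the trainer's periodic evaluation (e.g. a Detectron2 hook) is omitted, and the class name and thresholds are hypothetical:

```python
class EarlyStopper:
    """Track a metric (e.g. validation AP) and signal when it has not
    improved for `patience` consecutive evaluations."""

    def __init__(self, patience=5, min_delta=0.1):
        self.patience = patience      # evaluations to wait before stopping
        self.min_delta = min_delta    # minimum AP gain that counts as progress
        self.best = float("-inf")
        self.bad_evals = 0

    def update(self, metric):
        """Record one evaluation; return True if training should stop."""
        if metric > self.best + self.min_delta:
            self.best = metric
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience

# Usage: call update() after each periodic evaluation.
stopper = EarlyStopper(patience=3, min_delta=0.1)
aps = [17.0, 17.9, 18.2, 18.1, 18.2, 18.0, 18.15]
stops = [stopper.update(ap) for ap in aps]
# The AP plateaus after 18.2, so the stop signal fires on the 6th evaluation.
```

Does this kind of criterion make sense here, or is there a more standard way to pick a stopping point?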