Training baseline model doesn't converge #75

shauloron · 2024-07-23T06:07:01Z

Hi,
I'm trying to train the model with the provided config (R50 256x704) with code pulled on July 10, 2024.
I'm using 4 A100 GPUs with a total batch size of 48.
With the original LR of 6e-4, training diverges and grad_norm goes to NaN after ~20 epochs.
When I lower the LR to 4e-4, the loss goes down and grad_norm is OK, but the final model has AP=0 for all classes after 100 epochs.
I read through all the open and closed issues and checked that the resnet50 pretrained weights load successfully.
I'm using the default aug_config:
data_aug_conf = {
    "resize_lim": (0.40, 0.47),
    "final_dim": input_shape[::-1],
    "bot_pct_lim": (0.0, 0.0),
    "rot_lim": (-5.4, 5.4),
    "H": 900,
    "W": 1600,
    "rand_flip": True,
    "rot3d_range": [-0.3925, 0.3925],
}

Any ideas why my training doesn't converge?
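
For reference, a common heuristic when the total batch size differs from the one a config was tuned for is to scale the learning rate linearly with the batch size. The sketch below only illustrates that arithmetic. The reference batch size of 64 is a hypothetical placeholder, not a value taken from this repository's config, and `scale_lr` is an illustrative helper rather than part of any library.

```python
def scale_lr(base_lr: float, base_batch_size: int, batch_size: int) -> float:
    """Linear scaling rule: keep the ratio lr / batch_size constant."""
    if base_batch_size <= 0 or batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * batch_size / base_batch_size


# ASSUMPTION: 64 is a made-up reference batch size used only for illustration;
# replace it with the batch size the config was actually tuned for.
REFERENCE_BATCH_SIZE = 64

# 6e-4 is the config's original LR and 48 is the total batch size from the post.
scaled = scale_lr(6e-4, REFERENCE_BATCH_SIZE, 48)
print(f"scaled lr: {scaled:.2e}")
```

If the scaled value lands far from the LR that was actually used, that may be a hint about the divergence, though the rule is only a starting point and does not explain AP=0 by itself.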

linxuewu · 2024-07-23T07:26:07Z

Can you obtain correct results (metrics and visualizations) by running inference with the released checkpoint?

shauloron · 2024-07-24T11:17:27Z

I can't seem to find any pretrained model, only pretrained resnet50 weights.
Can you please point me to the pretrained weights you released?
Thanks

shauloron · 2024-07-24T11:24:48Z

OK, found it now: ckpt
I'll check how it works out.

shauloron · 2024-07-24T14:50:57Z

I tried the checkpoint above and get different results from the ones in the tutorial notebook visualizations.
To start, the anchor map doesn't look the same:

and the detection results are partial:

Any ideas?
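
One thing that may be worth ruling out when a released checkpoint gives unexpected results is a partial weight load, where some parameter names in the checkpoint do not match the model's. The sketch below compares two lists of parameter names in plain Python. The key names and the `module.` prefix are hypothetical examples, not names from this repository, and `compare_keys` is an illustrative helper. In PyTorch, `load_state_dict(strict=False)` reports similar information through its `missing_keys` and `unexpected_keys` fields.

```python
def compare_keys(model_keys, ckpt_keys, strip_prefix=""):
    """Return (missing, unexpected) parameter names.

    missing:    names the model expects but the checkpoint lacks
    unexpected: names in the checkpoint the model does not have
    strip_prefix is removed from checkpoint names before comparing
    (for example a wrapper prefix added when the checkpoint was saved).
    """
    normalized = set()
    for key in ckpt_keys:
        if strip_prefix and key.startswith(strip_prefix):
            key = key[len(strip_prefix):]
        normalized.add(key)
    expected = set(model_keys)
    return sorted(expected - normalized), sorted(normalized - expected)


# Hypothetical parameter names, for illustration only.
model_keys = ["backbone.conv1.weight", "head.fc.weight"]
ckpt_keys = ["module.backbone.conv1.weight", "module.extra.bias"]

missing, unexpected = compare_keys(model_keys, ckpt_keys, strip_prefix="module.")
print("missing:", missing)
print("unexpected:", unexpected)
```

Non-empty lists here would suggest that part of the model is running with randomly initialized weights, which could produce partial detections.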

shauloron · 2024-07-28T05:47:28Z

Hi @linxuewu, I would really appreciate any ideas you may have on this one.
