Training baseline model doesn't converge #75

Open
shauloron opened this issue Jul 23, 2024 · 5 comments

@shauloron

Hi,
I'm trying to train the model with the provided config (R50, 256x704), using code pulled on July 10, 2024.
I'm using 4 A100 GPUs with a total batch size of 48.
With the original LR of 6e-4, training diverges and grad_norm becomes NaN after ~20 epochs.
When I lower the LR to 4e-4, the loss goes down and grad_norm is OK, but the final model has AP=0 for all classes after 100 epochs.
I read through all the open and closed issues and checked that the ResNet-50 pretrained weights load successfully.
I'm using the default aug_config:
data_aug_conf = {
    "resize_lim": (0.40, 0.47),
    "final_dim": input_shape[::-1],
    "bot_pct_lim": (0.0, 0.0),
    "rot_lim": (-5.4, 5.4),
    "H": 900,
    "W": 1600,
    "rand_flip": True,
    "rot3d_range": [-0.3925, 0.3925],
}
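
To make the numbers in this config concrete, here is a minimal sketch of how such keys are commonly turned into a resize-and-crop box in LSS/BEVDet-style pipelines. It is an assumption that this repository consumes them the same way, and `sample_crop` is a hypothetical helper, not a function from the codebase. `input_shape = (704, 256)` is also assumed, based on the 256x704 config name.

```python
import random

# Assumed from the "R50 256x704" config name: (width, height).
input_shape = (704, 256)

data_aug_conf = {
    "resize_lim": (0.40, 0.47),
    "final_dim": input_shape[::-1],  # (height, width) = (256, 704)
    "bot_pct_lim": (0.0, 0.0),
    "H": 900,
    "W": 1600,
}


def sample_crop(conf, rng):
    """Hypothetical helper: sample a resize factor and a crop box.

    Returns (resize, (new_w, new_h), (left, top, right, bottom)).
    """
    f_h, f_w = conf["final_dim"]
    resize = rng.uniform(*conf["resize_lim"])
    new_w, new_h = int(conf["W"] * resize), int(conf["H"] * resize)
    # With bot_pct_lim = (0, 0) the crop keeps the bottom rows of the
    # resized image and trims only from the top.
    top = int((1 - rng.uniform(*conf["bot_pct_lim"])) * new_h) - f_h
    # If the resized image is narrower than f_w, left is 0 and the box
    # extends past the right edge (that region would be padded).
    left = int(rng.uniform(0, max(0, new_w - f_w)))
    return resize, (new_w, new_h), (left, top, left + f_w, top + f_h)


resize, new_size, box = sample_crop(data_aug_conf, random.Random(0))
print(resize, new_size, box)
```

Under these assumptions, a resize factor of 0.44 maps 1600x900 to exactly 704x396, so factors below 0.44 yield an image narrower than the 704-pixel crop width.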

Any ideas why my training doesn't converge?
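
For the divergence at LR 6e-4, it can help to pin down the first iteration where grad_norm becomes non-finite or jumps well above its recent average, so that the logs around that step can be inspected. The following is a rough sketch only: `first_divergence`, `spike_factor`, and `window` are hypothetical names and thresholds, and the grad_norm values are assumed to have already been parsed from the training log into a list of floats.

```python
import math


def first_divergence(grad_norms, spike_factor=10.0, window=50):
    """Return the index of the first suspicious grad norm, or None.

    A value is suspicious if it is NaN/inf, or if it exceeds
    spike_factor times the mean of the previous `window` values.
    """
    history = []
    for i, g in enumerate(grad_norms):
        if not math.isfinite(g):
            return i
        if len(history) >= window:
            recent_mean = sum(history[-window:]) / window
            if g > spike_factor * recent_mean:
                return i
        history.append(g)
    return None


# Toy log: stable norms followed by a NaN.
log = [1.0] * 60 + [float("nan")]
print(first_divergence(log))
```

A spike that precedes the NaN by many iterations would point toward an instability that gradient clipping or a longer warmup might address, while an abrupt NaN with no prior growth would be more consistent with a bad sample or a numerical issue in a single op. Both readings are heuristics, not diagnoses.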

@linxuewu
Collaborator

Can you obtain correct results (metrics and visualizations) by running inference with the released checkpoint?

@shauloron
Author

I can't seem to find any pretrained model, only pretrained resnet50 weights.
Can you please point me to the pretrained weights you released?
Thanks

@shauloron
Author

shauloron commented Jul 24, 2024

OK, found it now: ckpt
I'll check how it works out.

@shauloron
Author

I've tried the checkpoint above, and the results differ from the visualizations in the tutorial notebook.
To start, the anchor map doesn't look the same:
image
and the detection results are partial:
image

Any ideas?
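
When a released checkpoint produces partial results, one thing worth ruling out is a silent mismatch between the parameter names in the checkpoint and those in the model. A minimal sketch of that comparison is below. `diff_keys` is a hypothetical helper, and the key names in the example are made up for illustration. The only library fact relied on is that PyTorch's `DataParallel`/`DistributedDataParallel` wrappers prefix parameter names with `module.`.

```python
def diff_keys(model_keys, ckpt_keys, prefix="module."):
    """Compare parameter names after stripping an optional wrapper prefix.

    Returns (missing, unexpected): names the model has but the
    checkpoint lacks, and names only the checkpoint has.
    """
    def strip(name):
        if prefix and name.startswith(prefix):
            return name[len(prefix):]
        return name

    model = {strip(k) for k in model_keys}
    ckpt = {strip(k) for k in ckpt_keys}
    return sorted(model - ckpt), sorted(ckpt - model)


# Illustrative key names only; not taken from the actual model.
missing, unexpected = diff_keys(
    ["backbone.conv1.weight", "head.anchor"],
    ["module.backbone.conv1.weight", "head.cls.weight"],
)
print(missing, unexpected)
```

In practice the two lists would come from `model.state_dict().keys()` and from the loaded checkpoint dictionary (which may nest the weights under a key such as `state_dict`, depending on how it was saved). PyTorch's `load_state_dict(..., strict=False)` also returns `missing_keys` and `unexpected_keys`, which gives the same information directly.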

@shauloron
Author

Hi @linxuewu, I would really appreciate any ideas you may have on this one.
