Training baseline model doesn't converge #75
Comments
Can you obtain correct results (metrics and visualizations) by running inference with the released checkpoint?
I can't seem to find any pretrained model, only pretrained resnet50 weights.
OK, found it now: ckpt
Hi @linxuewu, I would really appreciate any ideas you may have on this one.
Hi,
I'm trying to train the model with the provided config (R50 256x704) with code pulled on July 10, 2024.
I'm using 4 A100 GPUs with a total batch size of 48.
With the original LR of 6e-4, training diverges and grad_norm becomes NaN after ~20 epochs.
When I lower the LR to 4e-4, the loss goes down and grad_norm stays finite, but the final model has AP=0 for all classes after 100 epochs.
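For anyone trying to reproduce this, here is a minimal sketch of how the first non-finite grad_norm could be located in a training log. It is plain Python and does not use the repo's code: `global_grad_norm` and `first_non_finite_step` are hypothetical helper names, and the gradients are assumed to be available as plain lists of floats.

```python
import math


def global_grad_norm(grads):
    """L2 norm over all gradient values.

    `grads` is an iterable of per-parameter gradients, each a list of floats
    (an assumption made for illustration; real frameworks use tensors).
    """
    total = 0.0
    for g in grads:
        for v in g:
            total += v * v
    return math.sqrt(total)


def first_non_finite_step(norm_history):
    """Return the index of the first NaN/inf norm, or None if all are finite."""
    for step, norm in enumerate(norm_history):
        if not math.isfinite(norm):
            return step
    return None


if __name__ == "__main__":
    history = [
        global_grad_norm([[3.0], [4.0]]),
        global_grad_norm([[1.0, float("nan")]]),
    ]
    print(first_non_finite_step(history))
```

Knowing the exact step makes it possible to dump the batch and the loss terms just before the divergence rather than only seeing the NaN in the epoch summary.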
I read through all the open and closed issues and checked that the resnet50 pretrained weights load successfully.
I'm using the default aug_config:
data_aug_conf = {
    "resize_lim": (0.40, 0.47),
    "final_dim": input_shape[::-1],
    "bot_pct_lim": (0.0, 0.0),
    "rot_lim": (-5.4, 5.4),
    "H": 900,
    "W": 1600,
    "rand_flip": True,
    "rot3d_range": [-0.3925, 0.3925],
}
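As a sanity check on these values, the sketch below computes the resized image size at both ends of `resize_lim` and compares it with the target size. It assumes `final_dim` is `(256, 704)` (height, width), as the "R50 256x704" config name suggests, and it assumes a resize-then-crop convention in which the image is scaled by a factor drawn from `resize_lim` and then cropped to `final_dim`. The function `resize_bounds` is a hypothetical helper, not the repo's augmentation code.

```python
def resize_bounds(H, W, final_dim, resize_lim):
    """Resized (height, width) at the min and max scale, versus final_dim.

    Assumes the image is first scaled by a factor in `resize_lim` and then
    cropped to `final_dim` = (fH, fW). This convention is an assumption.
    """
    fH, fW = final_dim
    out = {}
    for name, scale in (("min", resize_lim[0]), ("max", resize_lim[1])):
        new_h, new_w = int(H * scale), int(W * scale)
        out[name] = {
            "resized": (new_h, new_w),
            # True if the resized image is at least as large as the crop.
            "covers_height": new_h >= fH,
            "covers_width": new_w >= fW,
        }
    return out


if __name__ == "__main__":
    bounds = resize_bounds(H=900, W=1600, final_dim=(256, 704),
                           resize_lim=(0.40, 0.47))
    for name, info in bounds.items():
        print(name, info)
```

Under these assumptions, the smallest scale gives a resized width of 640, which is narrower than the 704-pixel crop, so part of the crop window falls outside the image at that scale. Whether and how that region is padded depends on the actual pipeline, so this only shows what to inspect in the augmented inputs, not a confirmed cause.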
Any ideas why my training doesn't converge?