Every time I restart training from the beginning, the obtained model gives a larger error. #38

Closed
jialuwang123321 opened this issue Jan 15, 2021 · 10 comments

Comments

@jialuwang123321

Dear Mr. Samarth Brahmbhatt,
Thank you for your code! It is fantastic. However, I trained MapNet from scratch (without resuming from a checkpoint) three times, training from 0 to 100 epochs each time, and used eval.py to test the trained models.

I trained strictly according to the provided experimental parameters and environment, and the three training runs are completely independent and do not affect each other. So the test results (for example, running eval.py on the epoch_100.pth.tar from the first, second, and third runs) should be almost the same.
Unfortunately, the model (e.g. epoch_100.pth.tar) obtained from the second run gives a clearly larger error than the one from the first run, and the third run's results are worse again compared with the second and first runs.

I am very confused and would like to ask for your opinion. Your suggestions would be very helpful to me. Thank you in advance!

Best
Jialu

@jialuwang123321
Author

jialuwang123321 commented Jan 15, 2021

I am using torch 0.4.1, CUDA 9.2, Python 3.7.3, Ubuntu 18.04.

My training command:
python train.py --dataset RobotCar --scene loop --config_file configs/mapnet.ini --model mapnet --device 0 --learn_beta --learn_gamma

My testing command:
python eval.py --dataset RobotCar --scene loop --model mapnet --weights /project/scripts/logs/RobotCar_loop_mapnet_mapnet_learn_beta_learn_gamma/base/epoch_100.pth.tar --config_file configs/mapnet.ini --val

My config file is:

[training]
n_epochs = 100
batch_size = 20
do_val = no
seed = 7
shuffle = yes
num_workers = 5
snapshot = 5
val_freq = 50
max_grad_norm = 0

[optimization]
opt = adam
lr = 1e-4
weight_decay = 0.0005
;momentum = 0.9
;lr_decay = 0.1
;lr_stepvalues = [60, 80]

[logging]
visdom = no
print_freq = 20

[hyperparameters]
beta = -3.0
gamma = -3.0
dropout = 0.5
skip = 10
variable_skip = no
real = no
steps = 3
color_jitter = 0.7
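
For what it's worth, my understanding is that seed = 7 only makes runs repeatable if every random number generator is seeded and cuDNN is forced to be deterministic. Below is a minimal sketch of what that kind of seeding looks like in PyTorch (illustrative only, not code from this repo):

```python
import random
import numpy as np
import torch

def seed_everything(seed=7):
    """Seed all RNGs that can affect training (illustrative, not repo code)."""
    random.seed(seed)                  # Python RNG (e.g. augmentation choices)
    np.random.seed(seed)               # NumPy RNG (e.g. sampling)
    torch.manual_seed(seed)            # CPU RNG
    torch.cuda.manual_seed_all(seed)   # all GPU RNGs
    # cuDNN can still pick non-deterministic kernels unless forced:
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```

Even with full seeding, multi-worker data loading and some CUDA ops can introduce small run-to-run differences, but nothing dramatic.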

@samarth-robo
Contributor

Hi @jialuwang123321, thanks for using the code!

It is a strange issue. Some suggestions:

  • Make absolutely sure that you are not resuming from the previous training's checkpoint (looks like you are not resuming, based on the training command you mentioned)
  • See if pose_stats.txt somehow has significantly different values for every training run. If that is true and you then use an outdated pose_stats.txt for evaluation, that might produce a high evaluation error. Every training run overwrites pose_stats.txt (see this line), but the computation is not random, so theoretically the values should remain unchanged. But worth checking.
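
A quick way to check (a minimal sketch, not part of the repo; the log directory paths are placeholders) is to compare the pose_stats.txt written by two different runs:

```python
import numpy as np

# Placeholder paths: point these at the pose_stats.txt written by two runs.
stats_run1 = np.loadtxt('logs/run1/pose_stats.txt')
stats_run2 = np.loadtxt('logs/run2/pose_stats.txt')

# If both runs processed the same data, the statistics should be (almost) identical.
if np.allclose(stats_run1, stats_run2, rtol=1e-5):
    print('pose_stats.txt is effectively unchanged between runs')
else:
    print('pose_stats.txt differs:')
    print(stats_run1 - stats_run2)
```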

@jialuwang123321
Author

jialuwang123321 commented Jan 15, 2021

Huge thanks for your answer! I checked pose_stats.txt and it did change (see below).
I am making changes and retraining now.

5736290.7242747 620253.5877489 109.5124567
110.8290592 99.4226409 0.8314870

BTW, I am confused about why it changed. I used 2014-06-26-09-24-58 for both training and evaluation, but I did change two things:

  1. I set color_jitter = 0 in the config file to stop using color jitter during training
  2. I used https://github.com/ori-mrg/robotcar-dataset-sdk to convert the RobotCar dataset's images into color images in advance (see the sketch at the end of this comment)

Do you think this could be the reason why pose_stats.txt changed?
Thank you again for your patient help!
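
For context, the conversion I did was roughly like the sketch below. It relies on the SDK's image loader in python/image.py; the function name and its exact behaviour are from my recollection of the SDK, so please treat them as assumptions and check the SDK source.

```python
import os
from PIL import Image
from image import load_image  # robotcar-dataset-sdk/python/image.py (assumed interface)

# Placeholder directories: raw Bayer-pattern images in, demosaiced RGB images out.
src_dir = 'stereo/centre'
dst_dir = 'stereo/centre_rgb'
if not os.path.exists(dst_dir):
    os.makedirs(dst_dir)

for fname in sorted(os.listdir(src_dir)):
    if not fname.endswith('.png'):
        continue
    rgb = load_image(os.path.join(src_dir, fname))  # demosaic to an RGB array
    Image.fromarray(rgb.astype('uint8')).save(os.path.join(dst_dir, fname))
```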

@samarth-robo
Contributor

@jialuwang123321 you provided one instance of pose_stats.txt, but you should check whether it changes significantly for every training run. Also, the values you posted are not significantly different from the included pose_stats.txt.

@jialuwang123321
Author

Following your suggestion, I checked it for every training run. It remains unchanged for my training and testing datasets.

@samarth-robo
Contributor

@jialuwang123321 OK, then we can rule out pose_stats.txt as the cause of this issue. I don't have other suggestions, unfortunately. How large is the monotonic increase in error?

@jialuwang123321
Author

For example, when evaluating the epoch_100.pth.tar obtained from three independent MapNet trainings (0 to 100 epochs each), I got obviously different results:

First run
Error in translation: median 4.37 m, mean 6.13 m
Error in rotation: median 2.65 degrees, mean 3.51 degrees

Second run
Error in translation: median 255.56 m, mean 256.39 m
Error in rotation: median 118.63 degrees, mean 114.35 degrees

Third run
Error in translation: median 155.40 m, mean 159.66 m
Error in rotation: median 135.14 degrees, mean 129.25 degrees

@samarth-robo
Contributor

Oh, so it is not increasing monotonically. Is the training process somehow changing your images, pose labels, or config files on disk? Is backpropagation somehow disabled after the first run?
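
One quick sanity check (a minimal sketch with stand-in names, not from the repo) is to confirm that gradients actually flow after a backward pass; in practice `model`, `images`, and `targets` would be the MapNet model and one batch from the data loader:

```python
import torch
import torch.nn as nn

# Tiny stand-in model and fake batch, just to show the check itself.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 6))
criterion = nn.MSELoss()
images, targets = torch.randn(4, 8), torch.randn(4, 6)

model.train()
loss = criterion(model(images), targets)
loss.backward()

# Every trainable parameter should have requires_grad=True and a non-zero
# gradient after backward(); otherwise backprop is effectively disabled.
for name, p in model.named_parameters():
    if not p.requires_grad:
        print('frozen parameter:', name)
    elif p.grad is None or float(p.grad.abs().sum()) == 0:
        print('no gradient reaching:', name)
    else:
        print('gradient OK for:', name)
```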

@jialuwang123321
Author

jialuwang123321 commented Jan 19, 2021

The ideas you offered are really helpful! I think it is very likely that I accidentally modified the code somewhere and blocked back-propagation.
Because I was not sure where the accidental change was, I downloaded the code again, and fortunately the problem is now solved.
I will keep looking into what went wrong, and if I find the answer I will share it with you.

@samarth-robo
Contributor

glad you solved the issue!
