Every time I restart training from the beginning, the obtained model gives a larger error. #38
I am using torch 0.4.1, CUDA 9.2, Python 3.7.3, and Ubuntu 18.04. My command for testing and my config file are:
Hi @jialuwang123321, thanks for using the code! It is a strange issue. Some suggestions:
Huge thanks for your answer! I checked pose_stats.txt and it changed (see below):

5736290.7242747 620253.5877489 109.5124567

BTW, I am confused about why it changed. I used 2014-06-26-09-24-58 for both training and evaluation, but I changed two things:
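(As background: a minimal sketch of how a pose_stats.txt file of this kind is typically produced. It is computed from the translation part of the training poses, so its contents depend on which sequences and frames end up in the training split; the input file name, shape, and output format below are assumptions, not the repo's exact code.)

```python
import numpy as np

# Hypothetical sketch: pose_stats.txt stores statistics of the translation
# components of the *training* poses. If the training split changes between
# runs, these numbers change too.
poses = np.loadtxt('poses.txt')        # assumed shape (N, 7): x y z + quaternion
mean_t = poses[:, :3].mean(axis=0)     # mean translation over the training split
std_t = poses[:, :3].std(axis=0)       # spread of the translations

# Written once at training time and re-read at evaluation time, so train and
# test poses are normalized identically.
np.savetxt('pose_stats.txt', np.vstack((mean_t, std_t)), fmt='%8.7f')
```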
@jialuwang123321 you provided one value of pose_stats.txt. Does it change between training runs?
Following your suggestion, I checked it for every training run. It remains unchanged for my training and testing datasets.
@jialuwang123321 OK, then we can rule out pose_stats.txt.
For example, when evaluating the 100_epoch.pth.tar obtained by independently training MapNet from epoch 0 to 100 three separate times, I got obviously different results:

[screenshots of the evaluation results from the first, second, and third runs]
Oh, so it is not increasing monotonically. Is the training process somehow changing your images, pose labels, or config files on disk? Is backpropagation somehow disabled after the first run?
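(The first question can be tested mechanically by fingerprinting the dataset before and after a run. A minimal sketch; the dataset path and file patterns below are placeholders:)

```python
import hashlib
from pathlib import Path

def fingerprint(root, patterns=('*.png', '*.txt', '*.ini')):
    """Hash every image, pose label, and config file under `root` so that
    snapshots taken before and after a training run can be compared."""
    h = hashlib.sha256()
    for pattern in patterns:
        for path in sorted(Path(root).rglob(pattern)):
            h.update(path.read_bytes())
    return h.hexdigest()

before = fingerprint('data/2014-06-26-09-24-58')   # placeholder dataset root
# ... run training here ...
# after = fingerprint('data/2014-06-26-09-24-58')
# assert before == after, 'training modified the data on disk!'
```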
The ideas you offer are really helpful! I think it is very likely that I accidentally modified the code, causing backpropagation to be blocked.
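(A quick way to catch that class of bug is to verify, after a single optimizer step, that gradients reach every trainable parameter and that the weights actually move. A minimal sketch; the model, optimizer, loss, and data are whatever your training script already builds:)

```python
import copy
import torch

def check_one_step(model, optimizer, loss_fn, inputs, targets):
    """Verify that gradients flow and that weights actually move."""
    snapshot = copy.deepcopy(model.state_dict())

    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # Every trainable parameter should have received a gradient.
    for name, param in model.named_parameters():
        if param.requires_grad and param.grad is None:
            print('no gradient reached:', name)

    optimizer.step()

    # At least some weights should differ from the pre-step snapshot.
    moved = any(not torch.equal(snapshot[k], v)
                for k, v in model.state_dict().items())
    print('weights changed after one step:', moved)
```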
Glad you solved the issue!
Dear Mr. Samarth Brahmbhatt,
Thank you for your code! It is fantastic. However, I repeated the following three times: starting over from scratch (without resuming from a checkpoint), training MapNet from epoch 0 to 100, and testing the trained model with eval.py.
In principle, I trained in strict accordance with the provided experimental parameters and environment, and the three training runs are completely independent and do not affect each other.
So the test results (for example, using eval.py to test the 100_epoch.pth.tar from the first, second, and third runs respectively) should be almost the same.
Unfortunately, the trained model from the second run (e.g. epoch_100.pth.tar) gave an obviously larger error than the one from the first run, and the third run's results were worse again compared with the first and second.
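(For reference: even fully independent runs only match closely when every source of randomness is pinned. A minimal sketch of the usual seeding for a PyTorch 0.4.x-era setup:)

```python
import random
import numpy as np
import torch

def seed_everything(seed=0):
    """Pin the common sources of randomness so repeated runs are comparable."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # cuDNN otherwise auto-selects (possibly non-deterministic) algorithms.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```

Even with all of this, a few CUDA operations remain non-deterministic, so small run-to-run differences are normal; errors that grow with every fresh run point to something else, such as the blocked backpropagation identified above.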
I feel very confused, so I would like to ask for your opinion. Your suggestions will be very helpful to me. Thank you in advance!
Best
Jialu