Make trainer resumable #350
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##           master     #350      +/-   ##
==========================================
- Coverage   94.48%   94.41%   -0.07%
==========================================
  Files          53       53
  Lines        3424     3472      +48
==========================================
+ Hits         3235     3278      +43
- Misses        189      194       +5
```
Can we directly restore the metadata from tensorboard? Can we continue writing to the same tensorboard?
Here is my proposal: we can use
I agree with adding an extra fn like
@StephenArk30 could you please help me verify the implementation (though I've added the test)?
But the logger logs train data every n steps (and test data every epoch), and we save the model and the buffer every k epochs. Won't that be a problem? @Trinkle23897
No, you can have a try.
Current approach: restore env_step and gradient_step independently of the epoch. So you will see the following situation:
- epoch 1: train 1000 env_step and 100 gradient_step
- epoch 1 test: save
- epoch 2: train 500 env_step and 50 gradient_step, killed
- restore
- epoch 3: start from env_step=1500 and gradient_step=150, load checkpoint from epoch 1 test
What if `epoch-per-save` = 2?
- epoch 2: train 1000 env_step and 100 gradient_step
- epoch 2 test: save
- epoch 2 save model and buffer
- epoch 3: train 1000 env_step and 100 gradient_step
- epoch 3 test: save
- epoch 4: train 500 env_step and 50 gradient_step, killed
- restore
- epoch 5 (not 4?): start from env_step=3500 and gradient_step=350, load checkpoint from epoch 3 test, load epoch 2 model and buffer
Yeah, but if you don't start at epoch 5, the tensorboard log would be messy because it will overlap with the previous epoch 4.
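For illustration, restoring the counters directly from the TensorBoard event files could look roughly like the sketch below. The `save/...` tag names and the helper function are assumptions made for this sketch, not necessarily the exact merged implementation:

```python
from tensorboard.backend.event_processing import event_accumulator

def restore_counters(log_dir):
    """Read the last saved epoch/env_step/gradient_step back from the event files."""
    ea = event_accumulator.EventAccumulator(log_dir)
    ea.Reload()  # parse all event files under log_dir

    def last_step(tag):
        try:
            return ea.Scalars(tag)[-1].step
        except KeyError:  # tag not written yet (fresh run)
            return 0

    # tag names are illustrative; they just need to match what the logger wrote
    return (last_step("save/epoch"),
            last_step("save/env_step"),
            last_step("save/gradient_step"))
```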
I agree with you. I think the key is to keep the model/buffer in sync with the step counters; otherwise we might restore a model with only 1000 gradient steps while resuming from gradient_step=1500 read from the log. In that case we lose 500 gradient steps and the counters become imprecise. So can we log data only right before `save_train_fn`? For instance:

```python
# offpolicy
for epoch in range(start_epoch + 1, max_epoch + 1):
    while t.n < t.total:
        result = train_collector.collect(n_step)
        env_step += result["n/st"]
        # store env_step and result in a "train_log" batch
        ...  # store gradient_step and losses too
    test_result = test_episode(...)  # test but do not log
    if time_to_log:
        logger.log_train_data(train_log)
        logger.log_test_data(test_result)
    save_train_fn()
```
How about now?
How about
Why?
try this:

```python
# test_c51.py
def stop_fn(mean_rewards):
    return False
    # return mean_rewards >= env.spec.reward_threshold
```

```bash
python test/discrete/test_c51.py --epoch 20 --epoch-per-save 4 --step-per-epoch 1000
# run 14 epoch
python test/discrete/test_c51.py --epoch 20 --epoch-per-save 4 --step-per-epoch 1000 --resume
```

I get this in tensorboard:
@StephenArk30 how about now?
Looks better, thanks. I will try later.
Please ping me if you have tested the code, thanks!
Still get messy log:

```bash
python test/discrete/test_c51.py --epoch 20 --step-per-epoch 500
# run 14 epoch and some step in 15 epoch (7344 steps in total)
python test/discrete/test_c51.py --epoch 20 --step-per-epoch 5000 --resume
```

We need to ensure

Now if I stop at step 7344: when I resume, the train and the test start from 5500, so the tensorboard plot draws backwards. You can debug to see this issue. I suggest putting
No, mine is also 2.5.0. @ChenDRAG @danagi can you try this at your convenience? However, I think this is not a tensorboard issue. If users use a custom logger with a different writer (like plt), they will find it difficult to handle the overlap (which is handled automatically by tensorboard according to the issue mentioned). If we insist on this implementation, we need to warn users about the overlap in the doc.
But tensorflow/tensorboard#4696 fixes this issue. Could you please remove all logs and run again with tensorboard==2.5.0? I ran it several times and only <=2.4.1 causes the messy log.
I don't think we can ignore this issue, because we cannot precisely predict when to save a checkpoint right before the program is killed. Instead, if we change env_step and gradient_step as mentioned above, the actual behavior is similar to the plot in your comment #350 (comment), which is why I switched to the current implementation.
I tested on Mac M1 and the result is exactly the same: 2.5.0 works fine and 2.4.1 does not.
@StephenArk30 do you have any further comments?
Went on a holiday, sorry. No further comments. Thanks for the great work!
- specify tensorboard >= 2.5.0
- add `save_checkpoint_fn` and `resume_from_log` in trainer

Co-authored-by: Trinkle23897 <trinkle23897@gmail.com>
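For context, a rough usage sketch of the two new trainer arguments. The wrapper function, checkpoint layout, and extra keyword arguments are illustrative assumptions, not the exact test script from this PR:

```python
import os
import torch
from tianshou.trainer import offpolicy_trainer

def train_with_resume(policy, optim, train_collector, test_collector,
                      logger, log_path, resume=False, **trainer_kwargs):
    """Sketch: wire up save_checkpoint_fn / resume_from_log for an off-policy run."""

    def save_checkpoint_fn(epoch, env_step, gradient_step):
        # called by the trainer at the configured save interval
        torch.save(
            {"model": policy.state_dict(), "optim": optim.state_dict()},
            os.path.join(log_path, "checkpoint.pth"),
        )

    if resume:
        # reload model/optimizer weights saved by save_checkpoint_fn, if any
        ckpt_path = os.path.join(log_path, "checkpoint.pth")
        if os.path.exists(ckpt_path):
            checkpoint = torch.load(ckpt_path)
            policy.load_state_dict(checkpoint["model"])
            optim.load_state_dict(checkpoint["optim"])

    return offpolicy_trainer(
        policy, train_collector, test_collector,
        save_checkpoint_fn=save_checkpoint_fn,
        resume_from_log=resume,  # restore epoch/env_step/gradient_step from the logger
        logger=logger,
        **trainer_kwargs,  # max_epoch, step_per_epoch, episode_per_test, batch_size, ...
    )
```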
This is my simple idea: put everything (best_epoch and so on) in a class (`TrainLog`), and load/save all these things with hooked functions. I use a param `epoch-per-save` to save all these things every `epoch-per-save` epochs. But the problem is that the loggers log every n steps/epochs, and we restart our training process from `k * epoch-per-save * step-per-epoch` env_step, so the loggers will write duplicate logs. I wonder if we can log only when saving the training data.
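A minimal sketch of that idea, with a hypothetical `TrainLog` container and hook names (illustrative only, not the implementation that was merged):

```python
# Hypothetical sketch of the proposal: bundle the resumable state in one
# object and save/load it via hooked functions every `epoch-per-save` epochs.
import os
import pickle
from dataclasses import dataclass

@dataclass
class TrainLog:
    best_epoch: int = 0
    best_reward: float = float("-inf")
    env_step: int = 0
    gradient_step: int = 0

def save_train_fn(train_log, buffer, path):
    # hooked save: dump the counters and the replay buffer together
    with open(os.path.join(path, "train_log.pkl"), "wb") as f:
        pickle.dump({"log": train_log, "buffer": buffer}, f)

def load_train_fn(path):
    # hooked load: restore the counters and the replay buffer on resume
    with open(os.path.join(path, "train_log.pkl"), "rb") as f:
        data = pickle.load(f)
    return data["log"], data["buffer"]
```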