
Make trainer resumable #350

Merged
merged 17 commits into thu-ml:master on May 6, 2021

Conversation

StephenArk30
Contributor

My idea is simple: put everything (best_epoch and so on) in a class (TrainLog) and load/save all of it with hook functions. A parameter epoch-per-save controls how often that state is saved (every epoch-per-save epochs).

The problem is that the loggers log every n steps/epochs, while we restart training from k * epoch-per-save * step-per-epoch env_step, so the loggers will write duplicate log entries. I wonder if we can log only when saving the training data.
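A minimal sketch of the idea above, assuming PyTorch; TrainLog, its fields, and the save/load helpers are illustrative names, not the PR's final implementation:

import os
from dataclasses import dataclass, asdict

import torch


@dataclass
class TrainLog:
    """Illustrative container for the resumable training state."""
    epoch: int = 0
    env_step: int = 0
    gradient_step: int = 0
    best_epoch: int = 0
    best_reward: float = float("-inf")
    best_reward_std: float = 0.0


def save_train_state(log: TrainLog, policy: torch.nn.Module, path: str) -> None:
    # Persist the bookkeeping state together with the model weights,
    # called every epoch_per_save epochs.
    torch.save({"train_log": asdict(log), "model": policy.state_dict()}, path)


def load_train_state(policy: torch.nn.Module, path: str) -> TrainLog:
    # Restore the bookkeeping state and model weights if a checkpoint exists.
    if not os.path.exists(path):
        return TrainLog()
    ckpt = torch.load(path)
    policy.load_state_dict(ckpt["model"])
    return TrainLog(**ckpt["train_log"])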

@StephenArk30 StephenArk30 mentioned this pull request Apr 23, 2021
@Trinkle23897 Trinkle23897 linked an issue Apr 23, 2021 that may be closed by this pull request
@codecov-commenter

codecov-commenter commented Apr 23, 2021


Codecov Report

Attention: Patch coverage is 97.43590% with 2 lines in your changes missing coverage. Please review.

Project coverage is 94.41%. Comparing base (f4e05d5) to head (fc2d3fc).
Report is 579 commits behind head on master.

Files with missing lines        Patch %   Lines
tianshou/utils/log_tools.py     94.28%    2 Missing ⚠️


Additional details and impacted files
@@            Coverage Diff             @@
##           master     #350      +/-   ##
==========================================
- Coverage   94.48%   94.41%   -0.07%     
==========================================
  Files          53       53              
  Lines        3424     3472      +48     
==========================================
+ Hits         3235     3278      +43     
- Misses        189      194       +5     
Flag        Coverage Δ
unittests   94.41% <97.43%> (-0.07%) ⬇️


@Trinkle23897
Collaborator

Can we directly restore the metadata from tensorboard? Can we continue writing to the same tensorboard log?

@Trinkle23897 Trinkle23897 changed the title Make offpolicy trainer resumable Make trainer resumable Apr 24, 2021
@Trinkle23897
Collaborator

Here is my proposal: we can use tensorboard.backend.event_processing.event_accumulator to read the existing log and extract the metadata before each trainer's initialization.

  • best_epoch/best_reward/best_reward_std: load from the existing tensorboard (offpolicy/onpolicy and offline)
  • env_step: load from the existing tensorboard (the latest record) for offpolicy/onpolicy; not applicable for offline
  • gradient_step: load from the existing tensorboard (the latest record) for offpolicy/onpolicy and offline
  • last_rew/last_len: load from the existing tensorboard (the latest record) for offpolicy/onpolicy; not applicable for offline

I agree with adding an extra fn like epoch_save_fn, but we don't actually need load_train_fn: since it only executes once, why not have the user call it explicitly before constructing the trainer?
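A minimal sketch of reading the latest recorded values back with EventAccumulator; the tag names save/epoch, save/env_step, and save/gradient_step follow the convention discussed later in this thread and should be treated as assumptions:

from tensorboard.backend.event_processing import event_accumulator


def restore_from_tensorboard(log_dir: str):
    """Return (epoch, env_step, gradient_step) recovered from an existing log dir."""
    ea = event_accumulator.EventAccumulator(log_dir)
    ea.Reload()  # parse the event files on disk

    def latest(tag: str, default: int = 0) -> int:
        # ea.Scalars(tag) returns ScalarEvent(wall_time, step, value) records.
        try:
            return int(ea.Scalars(tag)[-1].value)
        except KeyError:  # tag has never been written
            return default

    return latest("save/epoch"), latest("save/env_step"), latest("save/gradient_step")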

@Trinkle23897 Trinkle23897 marked this pull request as ready for review April 25, 2021 07:19
@Trinkle23897
Collaborator

@StephenArk30 could you please help me verify the implementation (though I've added the test)?

@Trinkle23897 Trinkle23897 requested review from ChenDRAG and danagi April 26, 2021 00:23
@StephenArk30
Contributor Author

StephenArk30 commented Apr 26, 2021

But the logger logs training data every n steps (and test data every epoch), and we save the model and the buffer every k epochs. Won't that be a problem? @Trinkle23897

@Trinkle23897
Collaborator

No, give it a try.
The current approach restores env_step and gradient_step independently of the epoch, so you will see the following situation:

  • epoch 1: train 1000 env_step and 100 gradient_step
  • epoch 1 test: save
  • epoch 2: train 500 env_step and 50 gradient_step, killed
  • restore
  • epoch 3: start from env_step=1500 and gradient_step=150, load checkpoint from epoch 1 test

@StephenArk30 StephenArk30 (Contributor Author) left a comment

> No, give it a try.
> The current approach restores env_step and gradient_step independently of the epoch, so you will see the following situation:
>
>   • epoch 1: train 1000 env_step and 100 gradient_step
>   • epoch 1 test: save
>   • epoch 2: train 500 env_step and 50 gradient_step, killed
>   • restore
>   • epoch 3: start from env_step=1500 and gradient_step=150, load checkpoint from epoch 1 test

What if epoch-per-save = 2?

  • epoch 2: train 1000 env_step and 100 gradient_step
  • epoch 2 test: save
  • epoch 2 save model and buffer
  • epoch 3: train 1000 env_step and 100 gradient_step
  • epoch 3 test: save
  • epoch 4: train 500 env_step and 50 gradient_step, killed
  • restore
  • epoch 5 (not 4?): start from env_step=3500 and gradient_step=350, load checkpoint from epoch 3 test, load epoch 2 model and buffer

@Trinkle23897
Collaborator

> epoch 5 (not 4?): start from env_step=3500 and gradient_step=350, load checkpoint from epoch 3 test, load epoch 2 model and buffer

Yeah, but if you don't start at epoch 5, the tensorboard log would be messy because it would overlap with the previous epoch 4.

@StephenArk30
Contributor Author

StephenArk30 commented Apr 26, 2021

I agree with you. I think the key is to keep the model/buffer in sync with the step counters; otherwise we might restore a model with only 1000 gradient_step while resuming from the 1500 gradient_step read from the log. In that case we lose 500 gradient steps and the records become imprecise.

So can we log data only before the save_train_fn? For instance:

# offpolicy (schematic)
for epoch in range(start_epoch + 1, max_epoch + 1):
    train_log = []  # buffer train statistics instead of logging immediately
    while t.n < t.total:
        result = train_collector.collect(n_step=n_step)
        env_step += int(result["n/st"])  # number of env steps just collected
        train_log.append((env_step, result))
        ...  # also accumulate gradient_step and the losses
    test_result = test_episode(...)  # test, but do not log yet
    if time_to_log:  # e.g. every epoch_per_save epochs
        logger.log_train_data(train_log)
        logger.log_test_data(test_result)
        save_train_fn()

@Trinkle23897
Collaborator

How about now?
BTW, is there any other name better than save_train_fn?

@danagi
Collaborator

danagi commented Apr 26, 2021

> How about now?
> BTW, is there any other name better than save_train_fn?

How about save_checkpoint_fn and epoch_per_checkpoint?

@StephenArk30
Contributor Author

save_checkpoint_fn would be great. I think save/epoch still can't solve the problem: the model and the buffer are still not in sync with the log. Or maybe I have some misunderstanding... I will try some examples later @Trinkle23897. @danagi please review the commits and run some tests if time permits.

@Trinkle23897
Collaborator

> I think save/epoch still can't solve the problem: the model and the buffer are still not in sync with the log.

Why? save/epoch, save/env_step, and save/gradient_step are only written when logger.save_data is called.
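For concreteness, a hedged sketch of what such a checkpoint callback could look like; the (epoch, env_step, gradient_step) argument order, the log_path/policy/optim/buffer names, and the file layout are assumptions for illustration rather than the exact API merged here:

import os

import torch


def save_checkpoint_fn(epoch, env_step, gradient_step):
    # Assumed to be called by the logger via logger.save_data every
    # epoch_per_checkpoint epochs, together with the save/epoch,
    # save/env_step, and save/gradient_step scalars, so the checkpoint
    # stays in sync with the log.
    ckpt_path = os.path.join(log_path, "checkpoint.pth")
    torch.save(
        {"model": policy.state_dict(), "optim": optim.state_dict()},
        ckpt_path,
    )
    # Persist the replay buffer as well so a resumed run keeps its data.
    buffer.save_hdf5(os.path.join(log_path, "train_buffer.hdf5"))

The merged commit summary at the end of this thread confirms the trainer gained save_checkpoint_fn and resume_from_log parameters, so a resumed run would pass the same callback together with resume_from_log.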

@StephenArk30
Contributor Author

try this (stop_fn in test_c51.py modified so training does not stop early):

# test_c51.py
def stop_fn(mean_rewards):
    return False
    # return mean_rewards >= env.spec.reward_threshold

python test/discrete/test_c51.py --epoch 20 --epoch-per-save 4 --step-per-epoch 1000
# run 14 epochs
python test/discrete/test_c51.py --epoch 20 --epoch-per-save 4 --step-per-epoch 1000 --resume

I get this in tensorboard:

[tensorboard screenshot]

@Trinkle23897
Collaborator

Trinkle23897 commented Apr 27, 2021

> I get this in tensorboard:

@StephenArk30 how about now?

@StephenArk30
Contributor Author

> I get this in tensorboard:
>
> @StephenArk30 how about now?

Looks better, thanks. I will try later.

ChenDRAG previously approved these changes Apr 28, 2021
@Trinkle23897
Collaborator

> Looks better, thanks. I will try later.

Please ping me if you have tested the code, thanks!

@StephenArk30
Contributor Author

StephenArk30 commented Apr 29, 2021

Still getting a messy log:

python test/discrete/test_c51.py --epoch 20 --step-per-epoch 500
# run 14 epochs and some steps into epoch 15 (7344 steps in total)
python test/discrete/test_c51.py --epoch 20 --step-per-epoch 5000 --resume

We need to ensure step in train == step in test == save/env_step.

Now if I stop at step 7344:

  • step in train: 7344
  • step in test: 7056 (test after epoch 14)
  • step in save: 5500 (save at epoch 11)

When I resume, train and test start from 5500, so the tensorboard plot draws backwards. You can debug to see this issue.

I suggest putting log_train_data and log_test_data into (or right after) save_data, or logging save/* in log_test_data @Trinkle23897.

@Trinkle23897
Collaborator

Trinkle23897 commented Apr 29, 2021

Which version of tensorboard do you use now? I use 2.5.0 and everything looks fine here.
[tensorboard screenshot]

Update: 2.4.1 is messy, so I now specify tensorboard>=2.5.0 in setup.py.

@StephenArk30
Contributor Author

No, mine is also 2.5.0. @ChenDRAG @danagi can you try this at your convenience?

However, I think this is not a tensorboard issue. If users use a custom logger with a different writer (like plt), they will find it difficult to handle the overlap (which is handled automatically by tensorboard according to the issue mentioned). If we insist on this implementation, we need to warn users about the overlap in the doc.

@Trinkle23897
Collaborator

Trinkle23897 commented Apr 29, 2021

But tensorflow/tensorboard#4696 fixes this issue. Could you please remove all logs and run again with tensorboard==2.5.0? I have run it several times and only <=2.4.1 produces messy plots.

> However, I think this is not a tensorboard issue. If users use a custom logger with a different writer (like plt), they will find it difficult to handle the overlap (which is handled automatically by tensorboard according to the issue mentioned).

I don't think we can ignore this issue, because we cannot precisely predict when to save a checkpoint right before the program is killed. Instead, if we change env_step and gradient_step as mentioned above, the actual behavior is similar to the plot in your comment #350 (comment), which is why I switched to the current implementation.

@Trinkle23897
Collaborator

I tested on a Mac M1 and the result is exactly the same: 2.5.0 works fine and 2.4.1 does not.

@Trinkle23897
Collaborator

@StephenArk30 do you have any further comments?

danagi previously approved these changes May 1, 2021
@Trinkle23897 Trinkle23897 merged commit 84f5863 into thu-ml:master May 6, 2021
@StephenArk30
Contributor Author

> @StephenArk30 do you have any further comments?

Went on a holiday, sorry. No further comments. Thanks for the great work!

BFAnas pushed a commit to BFAnas/tianshou that referenced this pull request May 5, 2024
- specify tensorboard >= 2.5.0
- add `save_checkpoint_fn` and `resume_from_log` in trainer

Co-authored-by: Trinkle23897 <trinkle23897@gmail.com>
Development

Successfully merging this pull request may close these issues.

Resumable trainer?
5 participants