Make trainer resumable #350
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##           master     #350      +/-   ##
==========================================
- Coverage   94.48%   94.41%   -0.07%
==========================================
  Files          53       53
  Lines        3424     3472      +48
==========================================
+ Hits         3235     3278      +43
- Misses        189      194       +5
```
Can we directly restore the metadata from tensorboard? Can we continue writing to the same tensorboard?
Here is my proposal: we can use
I agree with adding an extra fn like
@StephenArk30 could you please help me verify the implementation (though I've added the test)?
But the logger logs train data every n steps (and test data every epoch), and we save the model and the buffer every k epochs. Won't that be a problem? @Trinkle23897
No, you can have a try.
Current approach: restore env_step and gradient_step independently of the epoch. So you will see the following situation:
- epoch 1: train 1000 env_step and 100 gradient_step
- epoch 1 test: save
- epoch 2: train 500 env_step and 50 gradient_step, killed
- restore
- epoch 3: start from env_step=1500 and gradient_step=150, load checkpoint from epoch 1 test
What if `epoch-per-save` = 2?
- epoch 2: train 1000 env_step and 100 gradient_step
- epoch 2 test: save
- epoch 2 save model and buffer
- epoch 3: train 1000 env_step and 100 gradient_step
- epoch 3 test: save
- epoch 4: train 500 env_step and 50 gradient_step, killed
- restore
- epoch 5 (not 4?): start from env_step=3500 and gradient_step=350, load checkpoint from epoch 3 test, load epoch 2 model and buffer
Yeah, but if you don't start at epoch 5, the tensorboard log would be messy because it will overlap with the previous epoch 4.
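For illustration, restoring the counters directly from the TensorBoard event files could look roughly like the sketch below. The `save/...` tag names and the helper function are assumptions made for this sketch, not necessarily the exact merged implementation:

```python
from tensorboard.backend.event_processing import event_accumulator

def restore_counters(log_dir):
    """Read the last saved epoch/env_step/gradient_step back from the event files."""
    ea = event_accumulator.EventAccumulator(log_dir)
    ea.Reload()  # parse all event files under log_dir

    def last_step(tag):
        try:
            return ea.Scalars(tag)[-1].step
        except KeyError:  # tag not written yet (fresh run)
            return 0

    # tag names are illustrative; they just need to match what the logger wrote
    return (last_step("save/epoch"),
            last_step("save/env_step"),
            last_step("save/gradient_step"))
```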
I agree with you. I think the key is to keep the model/buffer in sync with the step counters; otherwise we might restore a model with only 1000 gradient steps while resuming from gradient_step=1500 read from the log. In that case we lose 500 gradient steps and the counters become imprecise. So can we log data only right before `save_train_fn`? For instance:

```python
# offpolicy
for epoch in range(start_epoch + 1, max_epoch + 1):
    while t.n < t.total:
        result = train_collector.collect(n_step)
        env_step += result["n/st"]
        # store env_step and result in a "train_log" batch
        ...  # store gradient_step and losses too
    test_result = test_episode(...)  # test but do not log
    if time_to_log:
        logger.log_train_data(train_log)
        logger.log_test_data(test_result)
    save_train_fn()
```
How about now?
How about
Why?
try this:

```python
# test_c51.py
def stop_fn(mean_rewards):
    return False
    # return mean_rewards >= env.spec.reward_threshold
```

```bash
python test/discrete/test_c51.py --epoch 20 --epoch-per-save 4 --step-per-epoch 1000
# run 14 epoch
python test/discrete/test_c51.py --epoch 20 --epoch-per-save 4 --step-per-epoch 1000 --resume
```

I get this in tensorboard:
@StephenArk30 how about now?
Looks better, thanks. I will try later.
Please ping me if you have tested the code, thanks!
Still get messy log:

```bash
python test/discrete/test_c51.py --epoch 20 --step-per-epoch 500
# run 14 epoch and some step in 15 epoch (7344 steps in total)
python test/discrete/test_c51.py --epoch 20 --step-per-epoch 5000 --resume
```

We need to ensure

Now if I stop at step 7344: when I resume, the train and the test start from 5500, so the tensorboard plot draws backwards. You can debug to see this issue. I suggest putting
No, mine is also 2.5.0. @ChenDRAG @danagi can you try this at your convenience? However, I think this is not a tensorboard issue. If users use a custom logger with a different writer (like plt), they will find it difficult to handle the overlap (which is handled automatically by tensorboard according to the issue mentioned). If we insist on this implementation, we need to warn users about the overlap in the doc.
But tensorflow/tensorboard#4696 fixes this issue. Could you please remove all logs and run again with tensorboard==2.5.0? I ran it several times and only <=2.4.1 causes the messy log.
I don't think we can ignore this issue, because we cannot precisely predict when to save a checkpoint right before the program is killed. Instead, if we change env_step and gradient_step as mentioned above, the actual behavior is similar to the plot in your comment #350 (comment), which is why I switched to the current implementation.
I tested on Mac M1 and the result is exactly the same: 2.5.0 works fine and 2.4.1 does not.
@StephenArk30 do you have any further comments?
Went on a holiday, sorry. No further comments. Thanks for the great work!
- specify tensorboard >= 2.5.0
- add `save_checkpoint_fn` and `resume_from_log` in trainer

Co-authored-by: Trinkle23897 <trinkle23897@gmail.com>
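For context, a rough usage sketch of the two new trainer arguments. The wrapper function, checkpoint layout, and extra keyword arguments are illustrative assumptions, not the exact test script from this PR:

```python
import os
import torch
from tianshou.trainer import offpolicy_trainer

def train_with_resume(policy, optim, train_collector, test_collector,
                      logger, log_path, resume=False, **trainer_kwargs):
    """Sketch: wire up save_checkpoint_fn / resume_from_log for an off-policy run."""

    def save_checkpoint_fn(epoch, env_step, gradient_step):
        # called by the trainer at the configured save interval
        torch.save(
            {"model": policy.state_dict(), "optim": optim.state_dict()},
            os.path.join(log_path, "checkpoint.pth"),
        )

    if resume:
        # reload model/optimizer weights saved by save_checkpoint_fn, if any
        ckpt_path = os.path.join(log_path, "checkpoint.pth")
        if os.path.exists(ckpt_path):
            checkpoint = torch.load(ckpt_path)
            policy.load_state_dict(checkpoint["model"])
            optim.load_state_dict(checkpoint["optim"])

    return offpolicy_trainer(
        policy, train_collector, test_collector,
        save_checkpoint_fn=save_checkpoint_fn,
        resume_from_log=resume,  # restore epoch/env_step/gradient_step from the logger
        logger=logger,
        **trainer_kwargs,  # max_epoch, step_per_epoch, episode_per_test, batch_size, ...
    )
```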
This is my simple idea: put everything (best_epoch and so on) in a class (`TrainLog`), and load/save all these things with hooked functions. I use a param `epoch-per-save` to save all these things every `epoch-per-save` epochs. But the problem is that the loggers log every n steps/epochs, and we restart our training process from `k * epoch-per-save * step-per-epoch` env_step, so the loggers will write duplicate logs. I wonder if we can log only when saving the training data.
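A minimal sketch of that idea, with a hypothetical `TrainLog` container and hook names (illustrative only, not the implementation that was merged):

```python
# Hypothetical sketch of the proposal: bundle the resumable state in one
# object and save/load it via hooked functions every `epoch-per-save` epochs.
import os
import pickle
from dataclasses import dataclass

@dataclass
class TrainLog:
    best_epoch: int = 0
    best_reward: float = float("-inf")
    env_step: int = 0
    gradient_step: int = 0

def save_train_fn(train_log, buffer, path):
    # hooked save: dump the counters and the replay buffer together
    with open(os.path.join(path, "train_log.pkl"), "wb") as f:
        pickle.dump({"log": train_log, "buffer": buffer}, f)

def load_train_fn(path):
    # hooked load: restore the counters and the replay buffer on resume
    with open(os.path.join(path, "train_log.pkl"), "rb") as f:
        data = pickle.load(f)
    return data["log"], data["buffer"]
```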