Critics and optimizers are wrongly saved - SAC #570

Closed · jacekplocharczyk opened this issue Mar 15, 2022 · 7 comments · Fixed by #575
Labels: question (Further information is requested)

jacekplocharczyk commented Mar 15, 2022

  • I have marked all applicable categories:
    • exception-raising bug
    • RL algorithm bug
    • documentation request (i.e. "X is missing from the documentation.")
    • new feature request
  • I have visited the source website
  • I have searched through the issue tracker for duplicates
  • I have mentioned version numbers, operating system and environment, where applicable:
    import tianshou, torch, numpy, sys
    print(tianshou.__version__, torch.__version__, numpy.__version__, sys.version, sys.platform)
    0.4.5 1.7.1.post2 1.19.5 3.8.10 | packaged by conda-forge | (default, May 11 2021, 07:01:05) 
    [GCC 9.3.0] linux

Hey guys,
I found a bug where the networks and optimizers for the SAC critics differ between the saved and the in-memory (unsaved) objects.
The issue doesn't occur with the actor, the actor optimizer, or alpha.

Gist with code to replicate the problem:
https://gist.github.com/jacekplocharczyk/a85964c5ad227b88af00dfb1a4dfd769
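In short, the gist does something like the following (a minimal sketch; the variable names are illustrative, not taken from the gist):

import torch

# Save the trained policy's state dict, then load it into an identically
# constructed (but untrained) policy.
torch.save(policy.state_dict(), policy_path)
reloaded_policy.load_state_dict(torch.load(policy_path, map_location=args.device))

# Compare the first weight tensor of critic1 between the live and the reloaded policy.
original = next(policy.critic1.parameters())
reloaded = next(reloaded_policy.critic1.parameters())
print(torch.equal(original, reloaded))  # expected True, but for the critics this comes out False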

Trinkle23897 (Collaborator) commented Mar 16, 2022

     reloaded_policy.load_state_dict(torch.load(policy_path, map_location=args.device))
+    print((policy.critic1_optim.param_groups[0]["params"][0][0] == reloaded_policy.critic1_optim.param_groups[0]["params"][0][0]).all())
+    print(policy.critic1_optim.param_groups[0]["params"][0][0], reloaded_policy.critic1_optim.param_groups[0]["params"][0][0])
Final reward: -1290.1767704030294, length: 200.0
tensor(True, device='cuda:0')
tensor([ 0.1090, -0.1627, -0.0721, -0.0690], device='cuda:0',
       grad_fn=<SelectBackward0>) tensor([ 0.1090, -0.1627, -0.0721, -0.0690], device='cuda:0',
       grad_fn=<SelectBackward0>)

Did you use the same version of the code to generate this model?

jacekplocharczyk (Author) commented Mar 16, 2022

Yes, it's the same version of the code.
My output is below (I added the same prints as you did).

Observations shape: (3,)
Actions shape: (1,)
Action range: -2.0 2.0
/root/miniconda3/envs/rl/lib/python3.8/site-packages/tianshou/data/collector.py:61: UserWarning: Single environment detected, wrap to DummyVectorEnv.
  warnings.warn("Single environment detected, wrap to DummyVectorEnv.")
Epoch #1: 2it [00:00, 45.47it/s, env_step=1, len=0, loss/actor=0.600, loss/critic1=51.337, loss/critic2=56.436, n/ep=0, n/st=1, rew=0.00]                           
Epoch #1: test_reward: -1344.290911 ± 46.330156, best_reward: -1322.881265 ± 114.243298 in #0
Epoch #2: 2it [00:00, 92.99it/s, env_step=2, len=0, loss/actor=0.879, loss/critic1=44.331, loss/critic2=49.411, n/ep=0, n/st=1, rew=0.00]                           
Epoch #2: test_reward: -1420.681549 ± 97.445743, best_reward: -1322.881265 ± 114.243298 in #0
Epoch #3: 2it [00:00, 73.36it/s, env_step=3, len=0, loss/actor=1.139, loss/critic1=40.423, loss/critic2=45.485, n/ep=0, n/st=1, rew=0.00]                           
Epoch #3: test_reward: -1363.087723 ± 95.064194, best_reward: -1322.881265 ± 114.243298 in #0
Epoch #4: 2it [00:00, 91.21it/s, env_step=4, len=0, loss/actor=1.430, loss/critic1=37.392, loss/critic2=42.584, n/ep=0, n/st=1, rew=0.00]                           
Epoch #4: test_reward: -1346.489505 ± 286.613257, best_reward: -1322.881265 ± 114.243298 in #0
Epoch #5: 2it [00:00, 96.18it/s, env_step=5, len=0, loss/actor=1.690, loss/critic1=34.437, loss/critic2=39.559, n/ep=0, n/st=1, rew=0.00]                           
Epoch #5: test_reward: -1416.217160 ± 266.236718, best_reward: -1322.881265 ± 114.243298 in #0
{'best_result': '-1322.88 ± 114.24',
 'best_reward': -1322.8812650934738,
 'duration': '4.10s',
 'test_episode': 60,
 'test_speed': '3128.53 step/s',
 'test_step': 12000,
 'test_time': '3.84s',
 'train_episode': 0,
 'train_speed': '19.07 step/s',
 'train_step': 5,
 'train_time/collector': '0.01s',
 'train_time/model': '0.25s'}
Final reward: -1331.6370587230358, length: 200.0
tensor(False)
tensor([ 0.1098, -0.1629, -0.0713, -0.0658], grad_fn=<SelectBackward>) tensor([ 0.1052, -0.1665, -0.0758, -0.0703], grad_fn=<SelectBackward>)

Could you check whether not using the GPU changes the behaviour?

On version 0.4.6.post1 the behaviour is the same as on 0.4.5.

Trinkle23897 (Collaborator) commented Mar 16, 2022

[Screenshot: 2022-03-16 10-47-08]

In [1]: import tianshou, gym, torch, numpy, sys
   ...: print(tianshou.__version__, gym.__version__, torch.__version__, numpy.__version__, sys.version, sys.platform)
0.4.6.post1 0.23.1 1.10.2+cu102 1.21.5 3.8.10 (default, Nov 26 2021, 20:14:08) 
[GCC 9.3.0] linux

jacekplocharczyk (Author) commented Mar 16, 2022

You can replicate this using the following Dockerfile:

FROM pytorch/pytorch:1.9.0-cuda10.2-cudnn7-runtime

RUN pip install --upgrade tianshou gym pygame

COPY tianshou_loading_example.py .

ENTRYPOINT [ "python", "tianshou_loading_example.py"]

Just put your test.py file in the same dir as the Dockerfile and run

docker build -t critictest .
docker run critictest

You also need to change Pendulum-v0 to Pendulum-v1. I've also updated the gist, so you can just grab the fresh one.

Trinkle23897 (Collaborator) commented Mar 16, 2022

Interesting. torch==1.9.0 gives False and torch==1.9.1 gives True.

  • False:
  • True:
  • random: 1.4.0, 1.5.0, 1.5.1, 1.6.0, 1.7.0, 1.7.1, 1.8.0, 1.8.1, 1.9.0, 1.9.1, 1.10.0, 1.10.1, 1.10.2, 1.11.0

I'll check it later. (probably this weekend)

Trinkle23897 added the "bug (Something isn't working)" label on Mar 16, 2022
Trinkle23897 (Collaborator) commented Mar 19, 2022

It's not saved incorrectly. There are two arguments: save_fn and save_checkpoint_fn (I know the names are a bit confusing).

  • save_fn only saves the policy when it reaches the highest average reward so far;
  • save_checkpoint_fn saves the policy at the end of each epoch.

So with save_fn, a run may look like this:

epoch1: reward = -1500 (save)
epoch2: reward = -1000 (save)
epoch3: reward = -800 (save)

and the final policy is the best policy.

However, it may also look like this:

epoch1: reward = -1000 (save)
epoch2: reward = -800 (save)
epoch3: reward = -1200 (do nothing)

and when you load the policy from disk, it is actually the one from epoch2, not epoch3. If you want to use epoch3's result when loading a policy, you should use save_checkpoint_fn instead of save_fn.

save_checkpoint_fn example (from #350):

import os
import pickle

import torch

# policy, optim, train_collector, log_path and args come from the surrounding training script.

def save_checkpoint_fn(epoch, env_step, gradient_step):
    # see also: https://pytorch.org/tutorials/beginner/saving_loading_models.html
    torch.save(
        {
            'model': policy.state_dict(),
            'optim': optim.state_dict(),
        }, os.path.join(log_path, 'checkpoint.pth')
    )
    pickle.dump(
        train_collector.buffer,
        open(os.path.join(log_path, 'train_buffer.pkl'), "wb")
    )

if args.resume:
    # load from existing checkpoint
    print(f"Loading agent under {log_path}")
    ckpt_path = os.path.join(log_path, 'checkpoint.pth')
    if os.path.exists(ckpt_path):
        checkpoint = torch.load(ckpt_path, map_location=args.device)
        policy.load_state_dict(checkpoint['model'])
        policy.optim.load_state_dict(checkpoint['optim'])
        print("Successfully restored policy and optim.")
    else:
        print("Failed to restore policy and optim.")
    buffer_path = os.path.join(log_path, 'train_buffer.pkl')
    if os.path.exists(buffer_path):
        train_collector.buffer = pickle.load(open(buffer_path, "rb"))
        print("Successfully restored buffer.")
    else:
        print("Failed to restore buffer.")

Trinkle23897 added the "question (Further information is requested)" label and removed the "bug (Something isn't working)" label on Mar 19, 2022
jacekplocharczyk (Author) commented:

That should be it. Thanks a lot!
