Add offline trainer and discrete BCQ algorithm #263
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master     #263      +/-   ##
==========================================
+ Coverage   94.09%   94.31%   +0.21%
==========================================
  Files          42       44       +2
  Lines        2762     2866     +104
==========================================
+ Hits         2599     2703     +104
  Misses        163      163
This error is not from my PR. Can you help here? @Trinkle23897
I'll have a look this afternoon.
Looking forward to testing the offline trainer.
@zhujl1991 I used the author's version and d3rlpy to play with CartPole-v0 (given expert data), but neither of them could train a DiscreteBCQ agent that reaches the expert level. It's very weird.
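For readers who want to try the same kind of experiment, here is a rough sketch of the workflow under discussion: fill a ReplayBuffer with expert demonstrations, then hand that fixed buffer to the offline trainer added in this PR. The `expert_policy` and `bcq_policy` names, the import path of `offline_trainer`, and the trainer keyword arguments are placeholders for illustration, not the exact setup used above.

```python
import gym

from tianshou.data import Collector, ReplayBuffer
from tianshou.env import DummyVectorEnv
from tianshou.trainer import offline_trainer  # import path assumed from this PR

# `expert_policy` is assumed to be an already-trained policy (e.g. a DQN agent)
# and `bcq_policy` a DiscreteBCQPolicy instance; both are placeholders here.
env = DummyVectorEnv([lambda: gym.make("CartPole-v0")])
buffer = ReplayBuffer(size=20000)

# 1. collect expert demonstrations into the fixed buffer
expert_collector = Collector(expert_policy, env, buffer)
expert_collector.collect(n_step=20000)

# 2. train offline from the buffer; evaluation still needs an env for now
test_collector = Collector(bcq_policy, env)
result = offline_trainer(
    bcq_policy, buffer, test_collector,
    max_epoch=5, update_per_epoch=1000,  # illustrative hyperparameters
    episode_per_test=10, batch_size=64,
)
```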
I think that to commit a new policy, experimental results and a quantitative analysis/comparison with the original paper should at least be provided to ensure the correctness of the algorithm.
It would be nice, yes, but apart from documenting the algorithms, I don't know of any framework that actually provides this kind of analysis (I mean extensively), since it is extremely time consuming. Yet it could be possible to compare the performance with respect to a single other algorithm that is considered to be the state of the art.
A standard implementation of the BCQ algorithm is provided here; I think it is not hard to provide at least a fair and detailed comparison on one single environment.
Of course, but that's the whole point. Fair and detailed seems like too much work to me. Correctness and implementation design must be clear to the user, but the user is also expected to inform themselves and find articles analyzing such benchmarks.
I do not mean to add extra work for developers; any experiment that can prove correctness to a certain degree is acceptable. But here is what I believe: an algorithm whose correctness cannot be assured is actually a burden to Tianshou if officially supported.

Currently, Tianshou doesn't achieve very good results on large environments. Most of its policies are only demonstrated on toy environments like CartPole, but many problems cannot be exposed in toy environments. (Even on toy environments, I see there are still arguments about whether this algorithm works.) In other words, Tianshou lacks experiments and is a little hard to use for research for now (some issues on GitHub are already talking about this). This is a critical problem and, I believe, of the highest priority.

I'm currently working on benchmarking MuJoCo environments using Tianshou and found some small problems in the policies and in tianshou/data (will make a PR soon). I also found that it is actually very hard to modify the code, because for any small change in tianshou/data you have to take care of all policies officially supported by Tianshou, even if you do not really understand those algorithms. That's why I suggested at first that a new policy could go to /test or /examples, because adding code there causes far fewer problems (you don't even need to change the docs, etc.), and users can be given time to try the new policy. We can consider making it officially supported after some time. If urgently needed, graphs that show the correctness or efficiency of the code on one single environment do not seem like too much of a burden?
I'm working on the results of BCQ with Atari games, don't worry about that :)
@zhujl1991 Have you successfully reproduced the results in the paper with the author's code? I tried Pong and Breakout. It seems that Pong can easily reach +20, but Breakout cannot reach above 100 (most of the time it is around 30~60 and keeps going down).
I haven't tried to reproduce the result in the paper. We directly use BCQ for our own problem, which gives pretty much the same result as imitation learning. |
The Atari result should be updated after
def offline_trainer(
    policy: BasePolicy,
    buffer: ReplayBuffer,
    test_collector: Collector,
@Trinkle23897 I just noticed that the test_collector, which needs to be initialized with an env, is not optional here. But in practice, the main reason to use these offline algorithms is the lack of an env, so it might be better to make it optional. I'm not sure, though, what the alternative way to do the test would be, given that we don't have an env.
Hmm... removing the test_collector or making it an optional argument means we don't have any evaluation metric to measure the performance of the current policy. If users don't have any runnable envs, they can give a self-defined fake env to the test_collector.
All right. But I feel like that is sort of hacky. Anyway, let's leave it as it is here.
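To make the trade-off discussed above concrete, here is a minimal sketch of what an optional test_collector could look like, assuming the trainer simply skips online evaluation and returns the last training statistics when no collector is given. The update loop and the argument names beyond the four shown in the diff are simplified placeholders, not the code in this PR.

```python
from typing import Optional

from tianshou.data import Collector, ReplayBuffer
from tianshou.policy import BasePolicy


def offline_trainer(
    policy: BasePolicy,
    buffer: ReplayBuffer,
    test_collector: Optional[Collector] = None,  # optional in this sketch
    max_epoch: int = 10,              # remaining arguments are illustrative
    update_per_epoch: int = 1000,
    batch_size: int = 64,
) -> dict:
    """Train purely from a fixed buffer; evaluate only if a collector is given."""
    stats: dict = {}
    for epoch in range(max_epoch):
        for _ in range(update_per_epoch):
            # gradient step on a minibatch sampled from the offline dataset
            stats = policy.update(batch_size, buffer)
        if test_collector is not None:
            # online evaluation, which requires a (possibly fake) env
            result = test_collector.collect(n_episode=10)
            print(f"epoch {epoch}: {result}")
    # without an env, callers only see the final training statistics
    return stats
```

Whether a fake env (as suggested above) or an optional collector is the cleaner interface is exactly the design question in this thread.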
The result needs to be tuned after the `done` issue is fixed.
Co-authored-by: n+e <trinkle23897@gmail.com>
Discrete BCQ: https://arxiv.org/abs/1910.01708
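For reference, the core of the linked paper's discrete BCQ is its constrained action selection: actions whose probability under the behavior-cloning head falls below a fraction τ of the most likely action are masked out before taking the greedy action over Q-values (the paper uses τ = 0.3 for Atari; τ = 0 recovers standard Q-learning, and larger τ pushes the policy toward pure imitation). A minimal sketch of that rule, with illustrative tensor names rather than this PR's code:

```python
import torch


def bcq_select_action(
    q_values: torch.Tensor,          # (batch, n_actions) Q-values from the value head
    imitation_logits: torch.Tensor,  # (batch, n_actions) logits of the behavior-cloning head
    tau: float = 0.3,                # unlikely-action threshold
) -> torch.Tensor:
    # normalize imitation probabilities by the most likely action per state
    probs = imitation_logits.softmax(dim=-1)
    ratio = probs / probs.max(dim=-1, keepdim=True).values
    # mask actions the behavior-cloning model considers too unlikely
    masked_q = torch.where(
        ratio > tau, q_values, torch.full_like(q_values, float("-inf"))
    )
    # greedy action among the remaining candidates
    return masked_q.argmax(dim=-1)
```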
Offline trainer discussion: #248 (comment)
Will implement a test_imitation.py in the next PR.