
Add offline trainer and discrete BCQ algorithm #263

Merged
Trinkle23897 merged 29 commits into thu-ml:master on Jan 20, 2021

Conversation

zhujl1991
Contributor

Discrete BCQ: https://arxiv.org/abs/1910.01708
Offline trainer discussion: #248 (comment)

Will implement a test_imitation.py in the next PR.
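
For context, discrete BCQ constrains the greedy action to actions the behavior policy is likely to take. A minimal sketch of that action-selection rule follows (illustrative only, not the code added in this PR; q_net, imitation_net, and tau are placeholder names):

import torch
import torch.nn.functional as F

def select_action(q_net, imitation_net, obs: torch.Tensor, tau: float = 0.3):
    # Q-values and behavior-cloning logits for each discrete action
    q = q_net(obs)
    log_prob = F.log_softmax(imitation_net(obs), dim=-1)
    # relative probability of each action under the behavior policy
    ratio = (log_prob - log_prob.max(dim=-1, keepdim=True).values).exp()
    # filter out actions whose relative probability is below the threshold tau,
    # then take the greedy action among the remaining ones
    mask = (ratio < tau).float()
    return (q - mask * 1e8).argmax(dim=-1)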

@Trinkle23897

@codecov-io

codecov-io commented Jan 6, 2021

Codecov Report

Merging #263 (0b291de) into master (a633a6a) will increase coverage by 0.21%.
The diff coverage is 99.23%.

@@            Coverage Diff             @@
##           master     #263      +/-   ##
==========================================
+ Coverage   94.09%   94.31%   +0.21%     
==========================================
  Files          42       44       +2     
  Lines        2762     2866     +104     
==========================================
+ Hits         2599     2703     +104     
  Misses        163      163              
Flag Coverage Δ
unittests 94.31% <99.23%> (+0.21%) ⬆️

Impacted Files Coverage Δ
tianshou/policy/modelfree/sac.py 87.00% <ø> (-0.13%) ⬇️
tianshou/policy/imitation/discrete_bcq.py 98.41% <98.41%> (ø)
tianshou/policy/__init__.py 100.00% <100.00%> (ø)
tianshou/policy/imitation/base.py 100.00% <100.00%> (ø)
tianshou/policy/modelfree/a2c.py 86.20% <100.00%> (ø)
tianshou/policy/modelfree/c51.py 89.06% <100.00%> (ø)
tianshou/policy/modelfree/dqn.py 98.68% <100.00%> (+1.28%) ⬆️
tianshou/policy/modelfree/ppo.py 96.51% <100.00%> (ø)
tianshou/trainer/__init__.py 100.00% <100.00%> (ø)
tianshou/trainer/offline.py 100.00% <100.00%> (ø)
... and 5 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@zhujl1991
Contributor Author

$ python test/discrete/test_bcq.py
Traceback (most recent call last):
  File "test/discrete/test_bcq.py", line 1, in <module>
    from tianshou.policy import BCQPolicy
  File "/home/jialu.zhu/work/tianshou/tianshou/__init__.py", line 1, in <module>
    from tianshou import data, env, utils, policy, trainer, exploration
  File "/home/jialu.zhu/work/tianshou/tianshou/data/__init__.py", line 2, in <module>
    from tianshou.data.utils.converter import to_numpy, to_torch, to_torch_as
  File "/home/jialu.zhu/work/tianshou/tianshou/data/utils/converter.py", line 1, in <module>
    import h5py
ModuleNotFoundError: No module named 'h5py'

This error is not from my PR. Can you help here? @Trinkle23897

@Trinkle23897
Collaborator

OK. Do you have any feedback on this PR? I'll fix the other mypy errors except the one above.

I'll have a look this afternoon.

This error is not from my PR. Can you help here? @Trinkle23897

Just pip install h5py: #261 introduced buffer save/load in the HDF5 data format, and setup.py has been changed accordingly.

@lorepieri8

Looking forward to testing the offline trainer.

@Trinkle23897
Collaborator

Trinkle23897 commented Jan 13, 2021

@zhujl1991 I used the author's version and d3rlpy to play with CartPole-v0 (given expert data), but neither of them can train a DiscreteBCQ agent that reaches the expert level. It's very weird.

duburcqa
duburcqa previously approved these changes Jan 14, 2021
@ChenDRAG
Collaborator

ChenDRAG commented Jan 14, 2021

I think that to commit a new policy, experimental results and a quantitative analysis/comparison with the original paper should at least be provided to ensure the correctness of the algorithm.
Also, I worry that the way offline_trainer trains an agent isn't the general meaning of "offline training" in the RL literature. (From what I understand, offline training usually means that you update the agent after whole-episode data is collected, rather than collecting all the data first, as in imitation learning.) So I suggest that the offline trainer not be added to tianshou/trainer, because this is not what people usually use (and it is not too hard to implement, as the rough sketch below shows, so it does not need to be officially supported). One option is to add it in test/discrete/test_bcq.py. If a lot of users need this 'offline trainer', we can then consider officially supporting it.
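
For concreteness, such an offline trainer boils down to roughly the following (a minimal sketch only; the function and parameter names are illustrative and not the actual API added by this PR):

def offline_train(policy, buffer, test_collector, max_epoch, step_per_epoch, batch_size):
    for epoch in range(max_epoch):
        policy.train()
        for _ in range(step_per_epoch):
            # one gradient update on a minibatch sampled from the fixed buffer;
            # no environment interaction happens during training
            losses = policy.update(batch_size, buffer)
        policy.eval()
        # evaluation still needs an environment, via the test collector
        result = test_collector.collect(n_episode=10)
        print(f"epoch {epoch}: mean reward {result['rew']:.2f}, last losses {losses}")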

@duburcqa
Collaborator

I think that to commit a new policy, experimental results and a quantitative analysis/comparison with the original paper should at least be provided to ensure the correctness of the algorithm.

It would be nice, yes, but apart from documenting the algorithms, I don't know of any framework actually providing this kind of analysis (I mean extensively), since it is extremely time-consuming. Yet it could be possible to compare the performance with respect to a single other algorithm that is considered to be the state of the art.

@ChenDRAG
Collaborator

ChenDRAG commented Jan 15, 2021

Yet it could be possible to compare the performance with respect to a single other algorithm that is considered to be the state of the art.

A standard implementation of the BCQ algorithm is provided here; I think it is not hard to provide at least a fair and detailed comparison on one single environment.

@duburcqa
Collaborator

A standard implementation of the BCQ algorithm is provided here; I think it is not hard to provide at least a fair and detailed comparison on one single environment.

Of course, but that's the whole point. Fair and detailed seems like too much work to me. Correctness and implementation design must be clear to the user, but the user is also expected to do their own research and find articles analyzing such benchmarks.

@ChenDRAG
Collaborator

A standard implementation of the BCQ algorithm is provided here; I think it is not hard to provide at least a fair and detailed comparison on one single environment.

Of course, but that's the whole point. Fair and detailed seems like too much work to me. Correctness and implementation design must be clear to the user, but the user is also expected to do their own research and find articles analyzing such benchmarks.

I do not mean to add extra work for developers; any experiment that can demonstrate correctness to a reasonable degree is acceptable. But here is what I believe: an algorithm whose correctness cannot be assured is actually a burden for Tianshou if it is officially supported.

Currently, Tianshou doesn't actually achieve very good results on large environments. Most of its policies are only demonstrated on toy environments like CartPole, but many problems cannot be exposed in toy environments. (Even on toy environments, I see there are still arguments about whether this algorithm works.) In other words, Tianshou lacks experiments and is a little hard to use for research for now (some issues on GitHub are already talking about this). This is a critical problem and, I believe, the one with the highest priority.

I'm currently working on benchmarking MuJoCo environments using Tianshou, and I found some small problems in the policies and in tianshou/data (will make a PR soon). I also found that it is actually very hard to modify the code, because for any small change in tianshou/data you have to take care of all the policies officially supported by Tianshou, even if you do not really understand that algorithm. That's why I suggest that, at first, a new policy go to /test or /examples, because adding code there causes far fewer problems (you don't even need to change the docs, etc.) and it gives users time to try the new policy. We can consider making it officially supported after some time. If urgently needed, graphs that show the correctness or efficiency of the code on one single environment do not seem like too much of a burden?

@Trinkle23897
Collaborator

I'm working on the results of BCQ on Atari games, don't worry about that :)

@Trinkle23897
Collaborator

Trinkle23897 commented Jan 15, 2021

@zhujl1991 have you successfully reproduced the result in the paper with the author's code? I tried with Pong and Breakout. It seems that Pong can easily reach +20, but Breakout cannot reach above 100 (most of the time it is around 30~60 and keeps going down).
I used 1e7 max timesteps for training the DQN behavior agent with https://github.com/sfujim/BCQ/blob/master/discrete_BCQ/main.py.

@zhujl1991
Contributor Author

zhujl1991 commented Jan 15, 2021

@zhujl1991 have you successfully reproduced the result in the paper with the author's code? I tried with Pong and Breakout. It seems that Pong can easily reach +20, but Breakout cannot reach above 100 (most of the time it is around 30~60 and keeps going down).
I used 1e7 max timesteps for training the DQN behavior agent with https://github.com/sfujim/BCQ/blob/master/discrete_BCQ/main.py.

I haven't tried to reproduce the result in the paper. We directly use BCQ for our own problem, which gives pretty much the same result as imitation learning.

@Trinkle23897
Collaborator

Trinkle23897 commented Jan 17, 2021

The Atari result should be updated after the `done` issue is fixed. Marked as a TODO (@Trinkle23897, @ChenDRAG); currently the PR is ready to be merged (after the utils.network update PR #275).

@ChenDRAG ChenDRAG mentioned this pull request Jan 20, 2021
@Trinkle23897 Trinkle23897 requested a review from duburcqa January 20, 2021 09:39
@Trinkle23897 Trinkle23897 merged commit a511cb4 into thu-ml:master Jan 20, 2021
def offline_trainer(
    policy: BasePolicy,
    buffer: ReplayBuffer,
    test_collector: Collector,
Contributor Author

@Trinkle23897 I just noticed that the test_collector, which needs to be initialized with an env, is not optional here. But in practice, the main reason to use these offline algorithms is the lack of an env, so it might be better to make it optional. That said, I'm not sure what the alternative way to do the test would be, given that we don't have an env.

Collaborator

Trinkle23897 commented Jan 23, 2021

Hmm... removing the test_collector or making it an optional argument means we don't have any evaluation metric to measure the performance of the current policy. If users don't have any runnable envs, they can give a self-defined fake env to the test_collector.

Contributor Author

All right. But I feel like that is sort of hacky. Anyway, let's leave it as it is here.
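
For reference, the "self-defined fake env" workaround could look roughly like the stub below (a hypothetical sketch; the class name, observation/action spaces, and shapes are made up, and any evaluation numbers it yields are meaningless):

import gym
import numpy as np
from tianshou.data import Collector
from tianshou.env import DummyVectorEnv

class StubEnv(gym.Env):
    # Returns random observations and zero reward; only a placeholder so that
    # offline_trainer's required test_collector can be constructed.
    observation_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)
    action_space = gym.spaces.Discrete(2)

    def reset(self):
        return self.observation_space.sample()

    def step(self, action):
        return self.observation_space.sample(), 0.0, True, {}

# test_collector = Collector(policy, DummyVectorEnv([StubEnv for _ in range(2)]))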

@Trinkle23897 Trinkle23897 mentioned this pull request Apr 20, 2021
BFAnas pushed a commit to BFAnas/tianshou that referenced this pull request May 5, 2024
The result needs to be tuned after `done` issue fixed.

Co-authored-by: n+e <trinkle23897@gmail.com>

Successfully merging this pull request may close these issues.

Questions about the imitation learning
The best practice using Tianshou for offline RL?
6 participants