Add offline trainer and discrete BCQ algorithm #263
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master     #263      +/-   ##
==========================================
+ Coverage   94.09%   94.31%   +0.21%
==========================================
  Files          42       44       +2
  Lines        2762     2866     +104
==========================================
+ Hits         2599     2703     +104
  Misses        163      163
This error is not from my PR. Can you help here? @Trinkle23897
I'll have a look this afternoon.
Looking forward to testing the offline trainer.
@zhujl1991 I used the author's version and d3rlpy to play with CartPole-v0 (given expert data), but neither of them could train a DiscreteBCQ agent that reaches the expert level. It's very weird.
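For readers who want to try the same kind of experiment, here is a rough sketch of the workflow under discussion: fill a ReplayBuffer with expert demonstrations, then hand that fixed buffer to the offline trainer added in this PR. The `expert_policy` and `bcq_policy` names, the import path of `offline_trainer`, and the trainer keyword arguments are placeholders for illustration, not the exact setup used above.

```python
import gym

from tianshou.data import Collector, ReplayBuffer
from tianshou.env import DummyVectorEnv
from tianshou.trainer import offline_trainer  # import path assumed from this PR

# `expert_policy` is assumed to be an already-trained policy (e.g. a DQN agent)
# and `bcq_policy` a DiscreteBCQPolicy instance; both are placeholders here.
env = DummyVectorEnv([lambda: gym.make("CartPole-v0")])
buffer = ReplayBuffer(size=20000)

# 1. collect expert demonstrations into the fixed buffer
expert_collector = Collector(expert_policy, env, buffer)
expert_collector.collect(n_step=20000)

# 2. train offline from the buffer; evaluation still needs an env for now
test_collector = Collector(bcq_policy, env)
result = offline_trainer(
    bcq_policy, buffer, test_collector,
    max_epoch=5, update_per_epoch=1000,  # illustrative hyperparameters
    episode_per_test=10, batch_size=64,
)
```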
I think that to commit a new policy, experimental results and a quantitative analysis/comparison with the original paper should at least be provided to ensure the correctness of the algorithm.
It would be nice, yes, but apart from documenting the algorithms, I don't know of any framework that actually provides this kind of analysis (I mean extensively), since it is extremely time consuming. Yet it could be possible to compare the performance with respect to a single other algorithm that is considered to be the state of the art.
A standard implementation of the BCQ algorithm is provided here; I think it is not hard to provide at least a fair and detailed comparison on one single environment.
Of course, but that's the whole point. Fair and detailed seems like too much work to me. Correctness and implementation design must be clear to the user, but the user is also expected to inform themselves and find articles analyzing such benchmarks.
I do not mean to add extra work for developers; any experiment that can prove correctness to a certain degree is acceptable. But here is what I believe: an algorithm whose correctness cannot be assured is actually a burden to Tianshou if officially supported.

Currently, Tianshou doesn't achieve very good results on large environments. Most of its policies are only demonstrated on toy environments like CartPole, but many problems cannot be exposed in toy environments. (Even on toy environments, I see there are still arguments about whether this algorithm works.) In other words, Tianshou lacks experiments and is a little hard to use for research for now (some issues on GitHub are already talking about this). This is a critical problem and, I believe, of the highest priority.

I'm currently working on benchmarking MuJoCo environments using Tianshou and found some small problems in the policies and in tianshou/data (will make a PR soon). I also found that it is actually very hard to modify the code, because for any small change in tianshou/data you have to take care of all policies officially supported by Tianshou, even if you do not really understand those algorithms. That's why I suggested at first that a new policy could go to /test or /examples, because adding code there causes far fewer problems (you don't even need to change the docs, etc.), and users can be given time to try the new policy. We can consider making it officially supported after some time. If urgently needed, graphs that show the correctness or efficiency of the code on one single environment do not seem like too much of a burden?
I'm working on the results of BCQ with Atari games, don't worry about that :)
@zhujl1991 Have you successfully reproduced the results in the paper with the author's code? I tried Pong and Breakout. It seems that Pong can easily reach +20, but Breakout cannot reach above 100 (most of the time it is around 30~60 and keeps going down).
I haven't tried to reproduce the result in the paper. We directly use BCQ for our own problem, which gives pretty much the same result as imitation learning. |
The Atari result should be updated after
def offline_trainer(
    policy: BasePolicy,
    buffer: ReplayBuffer,
    test_collector: Collector,
@Trinkle23897 I just noticed that the test_collector, which needs to be initialized with an env, is not optional here. But in practice, the main reason to use these offline algorithms is the lack of an env, so it might be better to make it optional. I'm not sure, though, what the alternative way to do the test would be, given that we don't have an env.
Hmm... removing the test_collector or making it an optional argument means we don't have any evaluation metric to measure the performance of the current policy. If users don't have any runnable envs, they can give a self-defined fake env to the test_collector.
All right. But I feel like that is sort of hacky. Anyway, let's leave it as it is here.
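To make the trade-off discussed above concrete, here is a minimal sketch of what an optional test_collector could look like, assuming the trainer simply skips online evaluation and returns the last training statistics when no collector is given. The update loop and the argument names beyond the four shown in the diff are simplified placeholders, not the code in this PR.

```python
from typing import Optional

from tianshou.data import Collector, ReplayBuffer
from tianshou.policy import BasePolicy


def offline_trainer(
    policy: BasePolicy,
    buffer: ReplayBuffer,
    test_collector: Optional[Collector] = None,  # optional in this sketch
    max_epoch: int = 10,              # remaining arguments are illustrative
    update_per_epoch: int = 1000,
    batch_size: int = 64,
) -> dict:
    """Train purely from a fixed buffer; evaluate only if a collector is given."""
    stats: dict = {}
    for epoch in range(max_epoch):
        for _ in range(update_per_epoch):
            # gradient step on a minibatch sampled from the offline dataset
            stats = policy.update(batch_size, buffer)
        if test_collector is not None:
            # online evaluation, which requires a (possibly fake) env
            result = test_collector.collect(n_episode=10)
            print(f"epoch {epoch}: {result}")
    # without an env, callers only see the final training statistics
    return stats
```

Whether a fake env (as suggested above) or an optional collector is the cleaner interface is exactly the design question in this thread.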
The result needs to be tuned after the `done` issue is fixed.
Co-authored-by: n+e <trinkle23897@gmail.com>
Discrete BCQ: https://arxiv.org/abs/1910.01708
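For reference, the core of the linked paper's discrete BCQ is its constrained action selection: actions whose probability under the behavior-cloning head falls below a fraction τ of the most likely action are masked out before taking the greedy action over Q-values (the paper uses τ = 0.3 for Atari; τ = 0 recovers standard Q-learning, and larger τ pushes the policy toward pure imitation). A minimal sketch of that rule, with illustrative tensor names rather than this PR's code:

```python
import torch


def bcq_select_action(
    q_values: torch.Tensor,          # (batch, n_actions) Q-values from the value head
    imitation_logits: torch.Tensor,  # (batch, n_actions) logits of the behavior-cloning head
    tau: float = 0.3,                # unlikely-action threshold
) -> torch.Tensor:
    # normalize imitation probabilities by the most likely action per state
    probs = imitation_logits.softmax(dim=-1)
    ratio = probs / probs.max(dim=-1, keepdim=True).values
    # mask actions the behavior-cloning model considers too unlikely
    masked_q = torch.where(
        ratio > tau, q_values, torch.full_like(q_values, float("-inf"))
    )
    # greedy action among the remaining candidates
    return masked_q.argmax(dim=-1)
```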
Offline trainer discussion: #248 (comment)
Will implement a test_imitation.py in the next PR.