Although it is now common to solve most problems with the discounted-reward objective, this does not always match the real problem (non-episodic, long-horizon tasks), where it is important to use algorithms that optimize the average reward instead.
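For concreteness, these are the two objectives (standard definitions from the average-reward RL literature, not taken from the paper itself): the discounted return versus the long-run average reward per step.

```math
J_\gamma(\pi) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t\right],
\qquad
J_{\mathrm{avg}}(\pi) = \lim_{T \to \infty} \frac{1}{T}\,\mathbb{E}_\pi\left[\sum_{t=0}^{T-1} r_t\right]
```

In long-horizon, non-episodic tasks the discount factor effectively truncates the horizon (roughly 1/(1-γ) steps), which is why the average-reward objective can be the more faithful one.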
There are only two adaptations of modern algorithms to the average-reward setting: A-TRPO and A-PPO. A-PPO is almost the same as regular PPO and much easier to implement than A-TRPO. Besides, it also addresses some problems [1, 2] of regular PPO (e.g. sampling from the undiscounted state distribution).
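To illustrate the kind of change involved, here is a minimal sketch of how the advantage computation might differ: PPO's discounted TD error is replaced by a differential (average-reward) TD error. This is a generic sketch based on standard average-reward RL, not the exact estimator from the paper; the function name and signature are hypothetical.

```python
import numpy as np

def differential_advantages(rewards, values, next_value, rho, lam=0.95):
    """GAE-style advantages for the average-reward setting (sketch).

    PPO's discounted TD error
        delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    is replaced by the differential TD error
        delta_t = r_t - rho + V(s_{t+1}) - V(s_t),
    where rho is an estimate of the policy's average reward per step
    and V is the differential value function. Accumulation uses gamma = 1.
    """
    values = np.append(values, next_value)  # bootstrap with V(s_T)
    advantages = np.zeros(len(rewards))
    last_adv = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] - rho + values[t + 1] - values[t]
        last_adv = delta + lam * last_adv
        advantages[t] = last_adv
    return advantages
```

Here `rho` could be estimated from the batch mean reward or tracked with an exponential moving average across updates; which choice the paper actually makes would need to be checked against [1, 2].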
In my free time I tried to reproduce the results of the paper, and it mostly worked. Nevertheless, I think it is important to compare it with the other algorithms, especially PPO, and that will be easier if they share the same code base and common implementation tricks. So it seems to me that adding it to CleanRL would help with this. In addition, I think the average-reward setting is very underrated, and this would help popularize it (if A-PPO really works).
This looks pretty interesting - APPO would be a new algorithm. I think it's great to have it. Would you be up to going through the new algorithm contribution checklist? See #186 as an example.
Yes, but it will take some time, especially documentation and testing. I think it would be reasonable to start from apo_continuous_action.py and compare it only with ppo_continuous_action.py as a test.