Add Bigger, Regularized, Optimistic (BRO) #60
base: master
Conversation
@naumix thanks for the PR, but you don't have to close it to update it, just push to the same branch.
Thanks, first time merging from a fork.
Hello, I've been playing with BRO on different environments (comparing it to Simba and variants), and so far I have the feeling that BRO vs Simba is similar to REDQ vs DroQ, in the sense that BRO paved the way for Simba (the same way REDQ paved the way for DroQ), but Simba is more practical/simpler (just replace the network architecture). In all the small tests I've done (different envs, different simulators), TQC + Simba + RR=10 + policy_delay=10 consistently performs on par with or better than BRO while having a similar runtime (I used TQC to also have a distributional RL algo). That's from a practical point of view; from a research point of view, BRO was a helpful step that enabled Simba. Side note: it also seems that resets are currently not implemented?
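To make the comparison concrete, here is a rough sketch of how such a configuration could be set up in sbx. This is illustrative only: the environment, the network size, and especially the `policy_delay` keyword on TQC are assumptions for the sake of the example, not the exact setup used for the tests above.

```python
# Rough sketch of "TQC + bigger network + RR=10 + policy_delay=10" in sbx.
# Keyword names/values are assumptions, not the exact commands used above.
import gymnasium as gym
from sbx import TQC

env = gym.make("HalfCheetah-v4")

model = TQC(
    "MlpPolicy",
    env,
    # Replay ratio ~10: ten gradient steps per environment step.
    train_freq=1,
    gradient_steps=10,
    # Delay actor updates so the critic is trained 10x more often (assumed kwarg).
    policy_delay=10,
    # Larger critic network in the spirit of Simba/BroNet scaling.
    policy_kwargs=dict(net_arch=[512, 512]),
    verbose=1,
)
model.learn(total_timesteps=100_000)
```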
Hey! I'm happy you found the scaled architectures working well on your problems. Times of bigger RL models are coming 😄
Indeed, Simba is based on small changes to the BroNet architecture, which is part of the BRO algorithm. Whereas BRO can be described as SAC + Quantile Q-learning + BroNet + RR = 2 or 10, the BroNet architecture can be used with any other algorithm, just like Simba.
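For illustration, here is a minimal Flax sketch of a BroNet-style network (a dense input projection with layer norm, followed by residual blocks of dense -> layer norm -> ReLU -> dense -> layer norm). The layer sizes and exact layout are assumptions based on the paper's description, not necessarily the code in this PR.

```python
# Minimal BroNet-style sketch in Flax; sizes and layout are illustrative.
import flax.linen as nn
import jax.numpy as jnp


class BroNetBlock(nn.Module):
    hidden_dim: int

    @nn.compact
    def __call__(self, x: jnp.ndarray) -> jnp.ndarray:
        residual = x
        x = nn.Dense(self.hidden_dim)(x)
        x = nn.LayerNorm()(x)
        x = nn.relu(x)
        x = nn.Dense(self.hidden_dim)(x)
        x = nn.LayerNorm()(x)
        # Skip connection around the two dense layers.
        return x + residual


class BroNet(nn.Module):
    hidden_dim: int = 512
    n_blocks: int = 2

    @nn.compact
    def __call__(self, x: jnp.ndarray) -> jnp.ndarray:
        # Input projection with layer norm, then residual blocks.
        x = nn.Dense(self.hidden_dim)(x)
        x = nn.LayerNorm()(x)
        x = nn.relu(x)
        for _ in range(self.n_blocks):
            x = BroNetBlock(self.hidden_dim)(x)
        return x
```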
In terms of applied RL, I think a practitioner would try out a suite of SOTA methods to find which one fits their problem best. Since BRO was shown to perform very competitively in three independent papers, it might be a go-to for at least a few people - since it is not common recorded knowledge that TQC + Simba + RR=10 + policy_delay=10 is a good proposition, it may not be an off-the-shelf choice.

In terms of academic RL, I think researchers might be interested in using BRO to benchmark future improvements. Since the contemporary reviewing process expects researchers to use established algorithms as baselines, it is easier to ask them to run BRO than to run uncommon variations of modern algorithms (e.g. TQC + Simba + RR=10 + policy_delay=10). What also makes BRO results easier to contextualize is that we ran all environments with a single set of hyperparameters, whereas Simba made some tiny but impactful changes. For example, the Simba authors use a single critic network for DMC and MyoSuite, and two critic networks for HumanoidBench. While this seems like a minor choice, it determines whether the critic is updated with pessimistic Clipped Double Q-learning targets or with the optimistic mean of a single target network - a design choice that was shown to have a great impact on DMC performance (https://arxiv.org/abs/2403.00514).

In terms of raw code features, BRO offers an implementation of BroNet, as well as a more general implementation of the pessimistic updates used in algorithms like TD3/SAC/TQC. Our implementation allows users to experiment with different levels of pessimism in the Q-value updates, a feature which was shown to greatly impact the performance of actor-critic algorithms (e.g. https://arxiv.org/abs/2102.03765). The implementation of pessimism used in BRO is particularly well suited for SBX since it is robust with respect to the critic ensemble size (https://arxiv.org/abs/2110.03375).
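As a small illustration of that kind of ensemble-size-robust pessimism: instead of taking the minimum over two target critics (Clipped Double Q-learning), the ensemble mean can be penalized by a tunable number of standard deviations. The function and variable names below are made up for this example and are not taken from the PR.

```python
# Illustrative pessimism-controlled ensemble target (not the PR's code):
# pessimism=0 gives the optimistic ensemble mean, larger values give
# increasingly conservative targets, independent of the ensemble size.
import jax.numpy as jnp


def pessimistic_target(q_ensemble: jnp.ndarray, pessimism: float) -> jnp.ndarray:
    """q_ensemble has shape (n_critics, batch_size)."""
    q_mean = q_ensemble.mean(axis=0)
    q_std = q_ensemble.std(axis=0)
    return q_mean - pessimism * q_std


# Example: two target critics evaluated on a batch of three transitions.
q_values = jnp.array([[1.0, 2.0, 3.0],
                      [0.5, 2.5, 2.0]])
print(pessimistic_target(q_values, pessimism=0.0))  # optimistic mean
print(pessimistic_target(q_values, pessimism=1.0))  # mean minus one std
```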
Yes, I did not implement resets in this PR since I wanted maximal simplicity, and the reset implementation turned out only semi-elegant. However, I am happy to update the PR and add resets - this has a big performance impact for RR > 2 (btw, when comparing BRO to TQC + Simba + RR=10 + policy_delay=10, did you increase BRO's replay ratio to 10?)
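For reference, here is a minimal sketch of what the reset mechanism amounts to: every fixed number of gradient steps, the network parameters are re-initialized while the replay buffer is kept, which is what makes high replay ratios (RR > 2) work well. The helper name and the reset interval are illustrative, not part of this PR.

```python
# Minimal sketch of periodic parameter resets (illustrative names/values).
import jax
import jax.numpy as jnp
import flax.linen as nn


def maybe_reset_params(step: int, reset_interval: int, params,
                       module: nn.Module, rng: jax.Array,
                       sample_input: jnp.ndarray):
    """Return freshly initialized parameters every `reset_interval` steps."""
    if step > 0 and step % reset_interval == 0:
        return module.init(rng, sample_input)
    return params


# Usage sketch: reset a small critic head every 250k gradient steps.
critic = nn.Dense(1)
rng = jax.random.PRNGKey(0)
dummy_obs = jnp.zeros((1, 8))
params = critic.init(rng, dummy_obs)
params = maybe_reset_params(250_000, 250_000, params, critic, rng, dummy_obs)
```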
Addition of the BRO algorithm (https://arxiv.org/abs/2405.16158)
Description
Motivation and Context
Types of changes
Checklist:
- I have reformatted the code using `make format` (required)
- I have checked the codestyle using `make check-codestyle` and `make lint` (required)
- I have ensured `make pytest` and `make type` both pass (required)
- I have checked that the documentation builds using `make doc` (required)

Note: You can run most of the checks using `make commit-checks`.
Note: we are using a maximum length of 127 characters per line.