Add Bigger, Regularized, Optimistic (BRO) #60
base: master
Conversation
@naumix thanks for the PR, but you don't have to close it to update it, just push to the same branch.
Thanks, first time merging from a fork.
Hello, I've been playing with BRO on different environments (comparing it to Simba and variants), and so far I have the feeling that BRO vs Simba is similar to REDQ vs DroQ, in the sense that BRO paved the way for Simba (the same way REDQ paved the way for DroQ), but Simba is more practical/simpler (just replace the network architecture). In all the small tests I've done (different envs, different simulators), TQC + Simba + RR=10 + policy_delay=10 consistently performs on par with or better than BRO while having a similar runtime (I used TQC to also have a distributional RL algo). That's from a practical point of view; from a research point of view, BRO was a helpful step that enabled Simba. Side note: it also seems that resets are currently not implemented?
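To make the comparison concrete, here is a rough sketch of how such a configuration could be set up in sbx. This is illustrative only: the environment, the network size, and especially the `policy_delay` keyword on TQC are assumptions for the sake of the example, not the exact setup used for the tests above.

```python
# Rough sketch of "TQC + bigger network + RR=10 + policy_delay=10" in sbx.
# Keyword names/values are assumptions, not the exact commands used above.
import gymnasium as gym
from sbx import TQC

env = gym.make("HalfCheetah-v4")

model = TQC(
    "MlpPolicy",
    env,
    # Replay ratio ~10: ten gradient steps per environment step.
    train_freq=1,
    gradient_steps=10,
    # Delay actor updates so the critic is trained 10x more often (assumed kwarg).
    policy_delay=10,
    # Larger critic network in the spirit of Simba/BroNet scaling.
    policy_kwargs=dict(net_arch=[512, 512]),
    verbose=1,
)
model.learn(total_timesteps=100_000)
```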
Hey! I'm happy you found the scaled architectures working well on your problems. Times of bigger RL models are coming 😄
Indeed, Simba is based on small changes to the BroNet architecture, which is part of the BRO algorithm. Whereas BRO can be described as SAC + Quantile Q-learning + BroNet + RR = 2 or 10, the BroNet architecture can be used with any other algorithm, just like Simba.
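For illustration, here is a minimal Flax sketch of a BroNet-style network (a dense input projection with layer norm, followed by residual blocks of dense -> layer norm -> ReLU -> dense -> layer norm). The layer sizes and exact layout are assumptions based on the paper's description, not necessarily the code in this PR.

```python
# Minimal BroNet-style sketch in Flax; sizes and layout are illustrative.
import flax.linen as nn
import jax.numpy as jnp


class BroNetBlock(nn.Module):
    hidden_dim: int

    @nn.compact
    def __call__(self, x: jnp.ndarray) -> jnp.ndarray:
        residual = x
        x = nn.Dense(self.hidden_dim)(x)
        x = nn.LayerNorm()(x)
        x = nn.relu(x)
        x = nn.Dense(self.hidden_dim)(x)
        x = nn.LayerNorm()(x)
        # Skip connection around the two dense layers.
        return x + residual


class BroNet(nn.Module):
    hidden_dim: int = 512
    n_blocks: int = 2

    @nn.compact
    def __call__(self, x: jnp.ndarray) -> jnp.ndarray:
        # Input projection with layer norm, then residual blocks.
        x = nn.Dense(self.hidden_dim)(x)
        x = nn.LayerNorm()(x)
        x = nn.relu(x)
        for _ in range(self.n_blocks):
            x = BroNetBlock(self.hidden_dim)(x)
        return x
```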
In terms of applied RL, I think a practitioner would try out a suite of SOTA methods to find which one fits their problem best. Since BRO was shown to perform very competitively in three independent papers, it might be a go-to for at least a few people - since it is not common recorded knowledge that TQC + Simba + RR=10 + policy_delay=10 is a good proposition, it may not be an off-the-shelf choice.

In terms of academic RL, I think researchers might be interested in using BRO to benchmark future improvements. Since the contemporary reviewing process expects researchers to use established algorithms as baselines, it is easier to ask them to run BRO than to run uncommon variations of modern algorithms (e.g. TQC + Simba + RR=10 + policy_delay=10). What also makes BRO results easier to contextualize is that we ran all environments with a single set of hyperparameters, whereas Simba made some tiny but impactful changes. For example, the Simba authors use a single critic network for DMC and MyoSuite, and two critic networks for HumanoidBench. While this seems like a minor choice, it determines whether the critic is updated with pessimistic Clipped Double Q-learning targets or with the optimistic mean of a single target network - a design choice that was shown to have a great impact on DMC performance (https://arxiv.org/abs/2403.00514).

In terms of raw code features, BRO offers an implementation of BroNet, as well as a more general implementation of the pessimistic updates used in algorithms like TD3/SAC/TQC. Our implementation allows users to experiment with different levels of pessimism in the Q-value updates, a feature which was shown to greatly impact the performance of actor-critic algorithms (e.g. https://arxiv.org/abs/2102.03765). The implementation of pessimism used in BRO is particularly well suited for SBX since it is robust with respect to the critic ensemble size (https://arxiv.org/abs/2110.03375).
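As a small illustration of that kind of ensemble-size-robust pessimism: instead of taking the minimum over two target critics (Clipped Double Q-learning), the ensemble mean can be penalized by a tunable number of standard deviations. The function and variable names below are made up for this example and are not taken from the PR.

```python
# Illustrative pessimism-controlled ensemble target (not the PR's code):
# pessimism=0 gives the optimistic ensemble mean, larger values give
# increasingly conservative targets, independent of the ensemble size.
import jax.numpy as jnp


def pessimistic_target(q_ensemble: jnp.ndarray, pessimism: float) -> jnp.ndarray:
    """q_ensemble has shape (n_critics, batch_size)."""
    q_mean = q_ensemble.mean(axis=0)
    q_std = q_ensemble.std(axis=0)
    return q_mean - pessimism * q_std


# Example: two target critics evaluated on a batch of three transitions.
q_values = jnp.array([[1.0, 2.0, 3.0],
                      [0.5, 2.5, 2.0]])
print(pessimistic_target(q_values, pessimism=0.0))  # optimistic mean
print(pessimistic_target(q_values, pessimism=1.0))  # mean minus one std
```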
Yes, I did not implement resets in this PR since I wanted maximal simplicity, and the reset implementation turned out only semi-elegant. However, I am happy to update the PR and add resets - this has a big performance impact for RR > 2 (btw, when comparing BRO to TQC + Simba + RR=10 + policy_delay=10, did you increase BRO's replay ratio to 10?)
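For reference, here is a minimal sketch of what the reset mechanism amounts to: every fixed number of gradient steps, the network parameters are re-initialized while the replay buffer is kept, which is what makes high replay ratios (RR > 2) work well. The helper name and the reset interval are illustrative, not part of this PR.

```python
# Minimal sketch of periodic parameter resets (illustrative names/values).
import jax
import jax.numpy as jnp
import flax.linen as nn


def maybe_reset_params(step: int, reset_interval: int, params,
                       module: nn.Module, rng: jax.Array,
                       sample_input: jnp.ndarray):
    """Return freshly initialized parameters every `reset_interval` steps."""
    if step > 0 and step % reset_interval == 0:
        return module.init(rng, sample_input)
    return params


# Usage sketch: reset a small critic head every 250k gradient steps.
critic = nn.Dense(1)
rng = jax.random.PRNGKey(0)
dummy_obs = jnp.zeros((1, 8))
params = critic.init(rng, dummy_obs)
params = maybe_reset_params(250_000, 250_000, params, critic, rng, dummy_obs)
```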
Addition of the BRO algorithm (https://arxiv.org/abs/2405.16158)
Description
Motivation and Context
Types of changes
Checklist:
- I have reformatted the code using `make format` (required)
- I have checked the codestyle using `make check-codestyle` and `make lint` (required)
- I have ensured `make pytest` and `make type` both pass (required)
- I have checked that the documentation builds using `make doc` (required)

Note: You can run most of the checks using `make commit-checks`.
Note: we are using a maximum length of 127 characters per line.