
[Question] What is the difference between a custom feature extractor and a custom policy? #347

Closed
outdoteth opened this issue Mar 9, 2021 · 5 comments · Fixed by #354
Labels
documentation (Improvements or additions to documentation), question (Further information is requested)

Comments

@outdoteth

outdoteth commented Mar 9, 2021

Question

When should I use a custom feature extractor vs a custom policy? It's a little unclear in the docs what the differences are. If I want to use a custom neural net, should I replace the feature extractor, define it as a custom policy, or define both a custom policy AND a custom feature extractor?

Checklist

  • I have read the documentation (required)
  • I have checked that there is no similar issue in the repo (required)
@outdoteth outdoteth added the question Further information is requested label Mar 9, 2021
@araffin araffin added the documentation Improvements or additions to documentation label Mar 10, 2021
@Miffyli
Collaborator

Miffyli commented Mar 10, 2021

Feature extractors only concern themselves with processing inputs (of whatever shape) into nice 1D vectors. Policies then take this 1D vector and map it into value/pi predictions, etc. The policy holds the feature extractor and also handles initialization and such.
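(For illustration, a minimal feature-extractor sketch in the spirit of the custom CNN example from the SB3 docs. `CustomCNN`, the layer sizes and the env are placeholders; `BaseFeaturesExtractor` and the `policy_kwargs` keys are the actual SB3 API.)

```python
import torch as th
import torch.nn as nn
from stable_baselines3 import PPO
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor


class CustomCNN(BaseFeaturesExtractor):
    """Turns image observations (C, H, W) into a flat `features_dim` vector."""

    def __init__(self, observation_space, features_dim: int = 128):
        super().__init__(observation_space, features_dim)
        n_input_channels = observation_space.shape[0]
        self.cnn = nn.Sequential(
            nn.Conv2d(n_input_channels, 32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Flatten(),
        )
        # Infer the flattened size with one dummy forward pass
        with th.no_grad():
            sample = th.as_tensor(observation_space.sample()[None]).float()
            n_flatten = self.cnn(sample).shape[1]
        self.linear = nn.Sequential(nn.Linear(n_flatten, features_dim), nn.ReLU())

    def forward(self, observations: th.Tensor) -> th.Tensor:
        return self.linear(self.cnn(observations))


# The policy then builds its pi/value heads on top of this 1D feature vector
policy_kwargs = dict(
    features_extractor_class=CustomCNN,
    features_extractor_kwargs=dict(features_dim=128),
)
# model = PPO("CnnPolicy", your_image_env, policy_kwargs=policy_kwargs)  # your_image_env: any env with image observations
```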

So, to change most of the network, you probably want to define a new feature extractor. If your observations are 1D vectors, then you can use the net_arch argument to change the network. If you want something more custom, however, then you need to create a custom policy.
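(Something along these lines; note the exact net_arch format has changed between SB3 versions, this is the list-of-dict form used around the time of this issue.)

```python
from stable_baselines3 import PPO

# 1D observations: keep the default (flatten) feature extractor and only
# change the sizes of the policy (pi) and value (vf) heads via net_arch.
policy_kwargs = dict(net_arch=[dict(pi=[128, 128], vf=[128, 128])])
model = PPO("MlpPolicy", "CartPole-v1", policy_kwargs=policy_kwargs)
```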

If some part of the docs was unclear, do point it out so it can be refined for clarity.

@araffin
Member

araffin commented Mar 10, 2021

I think we should update the doc. I created two diagrams to explain things faster:

[Diagram: CustomPolicy]
[Diagram: SB3Policy]

The feature extractor is usually shared between the networks to save computation (this can be disabled), and the network architecture comes afterward.
As mentioned in the docs, SB3 abuses the term "Policy": in the code it refers to all the networks + optimizer + target networks, whereas in RL it refers to the actor only (the part taking actions).
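(You can see this directly from code; the attribute names below come from ActorCriticPolicy in SB3 1.x and may differ slightly between versions.)

```python
from stable_baselines3 import PPO

model = PPO("MlpPolicy", "CartPole-v1")
# "Policy" here is the whole bundle, not just the actor:
print(model.policy.features_extractor)  # shared feature extractor
print(model.policy.mlp_extractor)       # pi/vf network bodies
print(model.policy.action_net)          # actor head
print(model.policy.value_net)           # critic head
print(model.policy.optimizer)           # the optimizer is held by the policy too
```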

@outdoteth
Author

outdoteth commented Mar 10, 2021

OK, I see. So if I want true customisation, I should follow the "Advanced" example here: https://stable-baselines3.readthedocs.io/en/master/guide/custom_policy.html#on-policy-algorithms

And then in addition to that I should also create a custom feature extractor.

So to summarise: if I create a feature extractor F and a custom policy P, then the parameters of F will be shared between the actor and the critic, that is, the actor and critic will use the same F. Then, inside P, there are no constraints other than that there must be two outputs (one for the value function and one for the policy).

@Miffyli
Collaborator

Miffyli commented Mar 11, 2021

If you define a custom policy, there is no need for a custom feature extractor. The separation is only done in the default policies to make it clear which part processes the input into a 1D vector and which part maps it to pi/value, and also to make it easier to change this preprocessing network without having to touch other parts.

For P the only real constraint is that you implement the public functions correctly (see the original ActorCriticPolicy). Other than that you are free to do almost anything. The example behind the link you shared is somewhat limited but a good starting point.
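(Roughly, a trimmed-down version of that advanced docs example: the custom trunk only has to expose latent_dim_pi / latent_dim_vf and return the two latent vectors, everything else is inherited from ActorCriticPolicy. The CustomNetwork name and layer sizes are placeholders.)

```python
import torch as th
import torch.nn as nn
from stable_baselines3 import PPO
from stable_baselines3.common.policies import ActorCriticPolicy


class CustomNetwork(nn.Module):
    """Two separate MLPs producing the latent vectors for the pi and value heads."""

    def __init__(self, feature_dim: int, pi_dim: int = 64, vf_dim: int = 64):
        super().__init__()
        # ActorCriticPolicy reads these to size the action/value heads
        self.latent_dim_pi = pi_dim
        self.latent_dim_vf = vf_dim
        self.policy_net = nn.Sequential(nn.Linear(feature_dim, pi_dim), nn.ReLU())
        self.value_net = nn.Sequential(nn.Linear(feature_dim, vf_dim), nn.ReLU())

    def forward(self, features: th.Tensor):
        return self.forward_actor(features), self.forward_critic(features)

    def forward_actor(self, features: th.Tensor) -> th.Tensor:
        return self.policy_net(features)

    def forward_critic(self, features: th.Tensor) -> th.Tensor:
        return self.value_net(features)


class CustomActorCriticPolicy(ActorCriticPolicy):
    # Feature extractor, distributions, heads and optimizer are all inherited;
    # we only swap the network sitting between the features and the heads.
    def _build_mlp_extractor(self) -> None:
        self.mlp_extractor = CustomNetwork(self.features_dim)


model = PPO(CustomActorCriticPolicy, "CartPole-v1")
```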

@araffin araffin mentioned this issue Mar 16, 2021
@pengzhi1998

pengzhi1998 commented May 24, 2022

Thank you for the clear explanations!

However, I'm still confused about a few points. I'm actually using PPO, and from the paper it seems that when training a network that shares parameters between the actor and critic, the loss should be different:
[Screenshot: the combined PPO objective L^{CLIP+VF+S} from the PPO paper, which adds the value-function loss and entropy bonus to the clipped objective when the actor and critic share parameters]

If I use a CNN as a features extractor shared between the actor and critic networks and train all three parts together, do I need to change the loss? Or would the proper way be to write a custom policy that does not share the CNN's parameters?

Thank you!
