Question about the shared feature extractor for PPO #922

Closed
pengzhi1998 opened this issue May 26, 2022 · 4 comments

Labels: question (Further information is requested)

Comments

@pengzhi1998

Question

I'm extremely sorry to keep bothering you. I found an issue from a year ago regarding the custom feature extractor and custom policy, but I'm still confused about a few points. I'm using PPO. I'd like to add a CNN part, feed its output vector into the linear net_arch layers, and train them together. From the paper, though, it seems that when the network shares parameters between the actor and the critic, the loss should be different:
[Screenshot of the combined (policy surrogate + value error + entropy bonus) loss term from the PPO paper for the shared-parameter case]

If I use the CNN as a shared feature extractor in front of the actor/critic networks and train all three parts (shared custom feature extractor, separate actor network, separate critic network) together, do I need to change the loss manually? Or would the proper way be to write a custom policy that does not share the CNN's parameters?

I REALLY REALLY appreciate your help!

Checklist

  • [yes] I have read the documentation (required)
  • [yes] I have checked that there is no similar issue in the repo (required)
pengzhi1998 added the question label on May 26, 2022
@Miffyli (Collaborator) commented May 26, 2022

Hey. Generally people have used the same loss function when using shared parameters, but you are right: having two losses update the same parameters may cause some problems (e.g., slower learning). Hence I would recommend using the defaults, which are generally OK. Some older papers reported better results when sharing the CNN between the policy and value heads, since both are doing a similar task and "support each other", but your mileage may vary :).
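
For reference, a minimal sketch of that single combined objective (the coefficient names mirror SB3's vf_coef and ent_coef hyperparameters; this is a paraphrase, not the exact library code):

```python
import torch as th

def combined_ppo_loss(policy_loss: th.Tensor,
                      value_loss: th.Tensor,
                      entropy: th.Tensor,
                      vf_coef: float = 0.5,
                      ent_coef: float = 0.0) -> th.Tensor:
    # One scalar objective is backpropagated through everything (shared or not):
    # clipped surrogate loss + weighted value error - weighted entropy bonus.
    return policy_loss + vf_coef * value_loss - ent_coef * entropy
```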

But yes, if you are worried about this, you should create a custom policy network where there are two separate feature extractors. If you do not use CNNs, you can use the net_arch argument.
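
For the CNN + net_arch route, here is a minimal sketch adapted from the custom feature extractor example in the docs (the CNN layout, features_dim, and the Atari env id are just placeholders). Note that with this setup the features extractor is still shared between actor and critic by default for on-policy algorithms; making it fully separate is what requires a custom policy:

```python
import gym
import torch as th
import torch.nn as nn

from stable_baselines3 import PPO
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor


class CustomCNN(BaseFeaturesExtractor):
    """CNN that maps image observations to a flat feature vector."""

    def __init__(self, observation_space: gym.spaces.Box, features_dim: int = 128):
        super().__init__(observation_space, features_dim)
        n_input_channels = observation_space.shape[0]
        self.cnn = nn.Sequential(
            nn.Conv2d(n_input_channels, 32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Flatten(),
        )
        # Infer the flattened size with one forward pass on a dummy observation
        with th.no_grad():
            n_flatten = self.cnn(
                th.as_tensor(observation_space.sample()[None]).float()
            ).shape[1]
        self.linear = nn.Sequential(nn.Linear(n_flatten, features_dim), nn.ReLU())

    def forward(self, observations: th.Tensor) -> th.Tensor:
        return self.linear(self.cnn(observations))


policy_kwargs = dict(
    features_extractor_class=CustomCNN,
    features_extractor_kwargs=dict(features_dim=128),
    # Separate actor (pi) and critic (vf) MLP heads on top of the extractor
    net_arch=[dict(pi=[64, 64], vf=[64, 64])],
)
model = PPO("CnnPolicy", "BreakoutNoFrameskip-v4", policy_kwargs=policy_kwargs, verbose=1)
```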

araffin closed this as completed on Jun 10, 2022
@pengzhi1998 (Author)

Sorry to bother you again, but would using the two losses to update the shared parameters cause instability?

@Miffyli (Collaborator) commented Jun 20, 2022

Potentially, but it might still be beneficial. Phasic Policy Gradient is one work that discusses this.

@pengzhi1998 (Author)

@Miffyli I'm extremely sorry to keep bothering you, but I'm still confused about a few things regarding the custom policy.

From your previous answer in this issue:

But yes, if you are worried of this, you should create a custom policy network where there are two separate feature extractors.

Are you referring to this custom policy network? In issue #347, though, you mentioned that:

If you define a custom policy there is no need to do a custom feature extractor.

I'm kind of confused by the two comments: when defining a custom policy, do I need to explicitly define a custom feature extractor, or two separate custom feature extractors?

Actually, I'm trying to construct an attention block (it doesn't include sequential parts, but it's not a simple MLP either) for PPO. Its actor and critic networks (both of which include the attention block) don't share parameters with each other. So I'm thinking a custom feature extractor might not be enough (per the note in your user guide, by default the custom feature extractor is shared between the actor and critic networks for on-policy algorithms).

But if I use this advanced example, it seems that the network architecture is limited to an MLP (the advanced example uses _build_mlp_extractor). I'm thinking of borrowing from your advanced example and the _build_mlp_extractor function, but directly defining my attention block for the policy and value networks in the forward_actor and forward_critic functions of the CustomNetwork class (sketched below). May I have your suggestions?

Thank you again for your great help!!
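
Concretely, I mean something like the following minimal sketch adapted from that advanced example (the AttentionBranch module is only a placeholder for my actual attention block, and CartPole-v1 is just an example env):

```python
from typing import Tuple

import torch as th
from torch import nn

from stable_baselines3 import PPO
from stable_baselines3.common.policies import ActorCriticPolicy


class AttentionBranch(nn.Module):
    """Placeholder attention block: single-head self-attention over the
    feature vector (treated as a length-1 sequence), then a projection."""

    def __init__(self, feature_dim: int, out_dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=feature_dim, num_heads=1, batch_first=True)
        self.proj = nn.Sequential(nn.Linear(feature_dim, out_dim), nn.ReLU())

    def forward(self, features: th.Tensor) -> th.Tensor:
        seq = features.unsqueeze(1)             # (batch, 1, feature_dim)
        attended, _ = self.attn(seq, seq, seq)  # self-attention
        return self.proj(attended.squeeze(1))


class CustomNetwork(nn.Module):
    """Replaces the default MlpExtractor: separate (non-shared) attention
    branches for the policy and the value function."""

    def __init__(self, feature_dim: int, latent_dim_pi: int = 64, latent_dim_vf: int = 64):
        super().__init__()
        # Required by ActorCriticPolicy to size the action and value heads
        self.latent_dim_pi = latent_dim_pi
        self.latent_dim_vf = latent_dim_vf
        self.policy_net = AttentionBranch(feature_dim, latent_dim_pi)
        self.value_net = AttentionBranch(feature_dim, latent_dim_vf)

    def forward(self, features: th.Tensor) -> Tuple[th.Tensor, th.Tensor]:
        return self.forward_actor(features), self.forward_critic(features)

    def forward_actor(self, features: th.Tensor) -> th.Tensor:
        return self.policy_net(features)

    def forward_critic(self, features: th.Tensor) -> th.Tensor:
        return self.value_net(features)


class CustomActorCriticPolicy(ActorCriticPolicy):
    def _build_mlp_extractor(self) -> None:
        # Swap in the attention-based extractor; the features extractor
        # in front of it (Flatten/CNN) is left unchanged.
        self.mlp_extractor = CustomNetwork(self.features_dim)


model = PPO(CustomActorCriticPolicy, "CartPole-v1", verbose=1)
```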
