Feature Control as Intrinsic Motivation for Hierarchical Reinforcement Learning


Authors:

Nat Dilokthanakul, Christos Kaplanis, Nick Pawlowski, Murray Shanahan

Important Notes:

In a recent version of the paper, the authors present evidence that a (non-hierarchical) A3C agent trained with shaped rewards and intrinsic motivation achieves the same performance they report on Montezuma's Revenge. These results raise questions about the benefit of the proposed hierarchical approach, since the decisions made by the Meta-Controller may not contribute to the agent's success. The reported results may stem more from the use of additional auxiliary rewards than from the architecture itself.

Summary:

  • Considering that one of the main problems in HRL is to find useful and generalizable skills, the authors propose the ability to control features of the environment as an inherently powerful skill for an agent to have.

  • The paper presents an agent which is intrinsically motivated to control aspects of its environment.

  • The agent is inspired by FeUdal RL (discussed here) and shares its two-level hierarchy:

    1. Meta-Controller: learns to maximize extrinsic reward and tells the Sub-Controller which feature of the environment it should control.
    2. Sub-Controller: receives an intrinsic reward for successfully changing the given feature together with an extrinsic reward from the environment.
  • The main difference between this paper and the papers from Kulkarni et al. (here) and Vezhnevets et al. (here) lies in how the subgoals are designed and incorporated into the learning process.

Model:

The Sub-Controller is responsible for choosing actions and interacting with the environment. The Meta-Controller operates at a lower temporal resolution and influences the behavior of the Sub-Controller by setting subgoals gt. The Meta-Controller also provides an intrinsic reward signal to the Sub-Controller for successfully completing the subgoal. The behavior of the Sub-Controller is thus biased towards completing the provided subgoals, while the Meta-Controller learns to select sequences of goals that maximize the cumulative extrinsic reward.
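A minimal sketch of this two-level interaction, assuming a fixed decision interval `k` for the Meta-Controller and illustrative `meta_controller` / `sub_controller` / `intrinsic_reward` interfaces; the names and the reward-mixing weight are assumptions, not the authors' implementation:

```python
def run_episode(env, meta_controller, sub_controller, intrinsic_reward, k=20, beta=0.5):
    """One episode of the hierarchical control loop (illustrative, not the authors' code)."""
    obs = env.reset()
    done = False
    while not done:
        # Meta-Controller operates at a lower temporal resolution: it picks a
        # subgoal (one-hot over pixel patches / feature maps) every k steps.
        goal = meta_controller.select_goal(obs)
        extrinsic_return = 0.0
        for _ in range(k):
            action = sub_controller.act(obs, goal)
            next_obs, r_ext, done, _ = env.step(action)
            # Sub-Controller is trained on the extrinsic reward plus an intrinsic
            # reward for successfully changing the goal feature.
            r_int = intrinsic_reward(obs, next_obs, goal)
            sub_controller.observe(obs, goal, action, r_ext + beta * r_int, next_obs, done)
            extrinsic_return += r_ext
            obs = next_obs
            if done:
                break
        # Meta-Controller learns to select goal sequences that maximize
        # cumulative extrinsic reward.
        meta_controller.observe(goal, extrinsic_return, obs, done)
```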

  • The authors propose and evaluate two methods for delivering subgoals to the Sub-Controller and computing the intrinsic reward. In both methods, the proposed goal gt is given as a one-hot vector indicating which "pixel patch" or which "feature map" should be modified (rough sketches of both reward signals follow this list).

    1. Pixel control: ability to control a given subset of pixels in the visual input. The authors discretize the input image into smaller "patches" and define the intrinsic reward in terms of the pixel change between two consecutive frames. The Sub-Controller is thus encouraged to maximize the change in pixel values within the given patch relative to the entire screen.

    2. Feature control: ability to control the activation of specific neurons in the model. The intrinsic reward is given by a feature-selectivity measure on the second convolutional layer. This approach should give the model more "flexible and abstract" control over the environment.
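A hedged NumPy sketch of the pixel-control reward, assuming it is the pixel change inside the chosen patch normalised by the change over the whole screen; the function name, patch size, and exact normalisation are assumptions rather than the paper's formulation:

```python
import numpy as np

# Illustrative pixel-control intrinsic reward (assumed formulation, not the authors' code).
def pixel_control_reward(frame, next_frame, goal_patch, patch_size=4):
    """Reward the Sub-Controller for changing pixels in the patch chosen by the Meta-Controller."""
    diff = np.abs(next_frame.astype(np.float32) - frame.astype(np.float32))
    row, col = goal_patch  # (row, col) index of the target patch
    patch_change = diff[row * patch_size:(row + 1) * patch_size,
                        col * patch_size:(col + 1) * patch_size].sum()
    total_change = diff.sum() + 1e-8  # avoid division by zero on static frames
    # Change inside the chosen patch, relative to the change over the entire screen.
    return float(patch_change / total_change)
```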
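Similarly, a sketch of a feature-selectivity reward over convolutional feature maps; the selectivity measure below (change in the chosen map relative to the total change across all maps) is an assumption and may differ from the measure used in the paper:

```python
import numpy as np

# Illustrative feature-control intrinsic reward (assumed selectivity measure).
def feature_control_reward(features, next_features, goal_map):
    """Reward for selectively changing one feature map of the second convolutional layer.

    features, next_features: activations of shape (num_maps, height, width).
    goal_map: index of the feature map chosen by the Meta-Controller.
    """
    delta = np.abs(next_features - features)               # per-unit activation change
    per_map_change = delta.reshape(delta.shape[0], -1).sum(axis=1)
    # Selectivity: change in the chosen map relative to the total change across all maps.
    return float(per_map_change[goal_map] / (per_map_change.sum() + 1e-8))
```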

Experiments:

The authors evaluate the agent on several Atari games and compare it with the FeUdal Networks and Option-Critic architectures. They present different types of experiments in order to assess different characteristics of the proposed architecture.

Results:

  • It seems like introducing a certain proportion of intrinsic reward in the Sub-Controller has a positive effect in sparse reward environments.

  • Pixel-controlled agents learn faster than feature-controlled ones. However, this seems to be because the feature-controlled model first needs to identify useful features before the influence of the Meta-Controller becomes meaningful. Once it does, the generated subgoals are of higher quality than the hard-coded ones.

  • The authors had some problems tuning the length of the BPTT roll-outs for the LSTM layers. Longer roll-outs performed better in some environments, such as Montezuma's Revenge, but deteriorated performance in others, such as Frostbite.

  • The model obtains performance similar to FeUdal Networks on Montezuma's Revenge while learning much more quickly, but it does not perform as well in other environments.

Future work:

  • Incorporate termination conditions for the sub-policies, thus allowing the instructions from the Meta-Controller to be of variable length and more temporally precise.