- Recent related work in DRL:
- Unifying Count-Based Exploration and Intrinsic Motivation. (2016)
- Variational information maximization for intrinsically motivated reinforcement learning. (2015)
- From the paper: "... intrinsic reward signal based on prediction error of the agent's knowledge about its environment that scales to high-dimensional continuous state spaces like images, bypasses the hard problems of predicting pixels and is unaffected by the unpredictable aspects of the environment that do not affect the agent."
- They evaluate how curiosity can be used to transfer knowledge between different scenarios.
-
Intrinsic reward is generated based on how hard it is for the agent to predict the outcome of its own actions. However, the system attempts to predict only those changes in the environment that could be caused by its actions (or that affect the agent in some way), and ignores the rest.
-
Instead of attempting to predict raw sensory information (such as pixels), they predict a feature representation in which only the relevant information is encoded.
-
A neural network is trained with self-supervision to learn the inverse dynamics: predicting which action was taken given the previous and current states. Since only the action is predicted, the network has no incentive to learn irrelevant features, i.e. those that do not affect the agent or are not controlled by its actions.
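A minimal sketch of what such an inverse-dynamics model could look like (PyTorch; the encoder architecture, layer sizes, and names are illustrative assumptions, not the paper's exact setup):

```python
import torch
import torch.nn as nn


class InverseDynamicsModel(nn.Module):
    """Self-supervised inverse model: predict the action taken between two
    consecutive observations. Training it shapes the feature encoder phi so
    that it only keeps aspects of the observation that are relevant to (or
    controlled by) the agent."""

    def __init__(self, obs_dim: int, feat_dim: int, n_actions: int):
        super().__init__()
        # Feature encoder phi: raw observation -> compact feature vector.
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, feat_dim),
        )
        # Inverse model: (phi(s_t), phi(s_{t+1})) -> logits over actions.
        self.inverse = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, n_actions),
        )

    def forward(self, obs_t, obs_tp1):
        phi_t = self.encoder(obs_t)
        phi_tp1 = self.encoder(obs_tp1)
        action_logits = self.inverse(torch.cat([phi_t, phi_tp1], dim=-1))
        return action_logits, phi_t, phi_tp1
```

The cross-entropy loss between `action_logits` and the action actually taken is what pushes the encoder to drop features that are irrelevant to the agent.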
-
A second model (the forward dynamics model) is then trained to predict the feature representation of the next state given the current state representation and the selected action. The prediction error is then given to the agent as an intrinsic reward to encourage curiosity.
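Continuing the sketch above (same caveats: layer sizes, `eta`, and names are assumptions), the forward model and the resulting curiosity bonus could look like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ForwardDynamicsModel(nn.Module):
    """Predicts phi(s_{t+1}) from phi(s_t) and the chosen action."""

    def __init__(self, feat_dim: int, n_actions: int):
        super().__init__()
        self.n_actions = n_actions
        self.net = nn.Sequential(
            nn.Linear(feat_dim + n_actions, 256), nn.ReLU(),
            nn.Linear(256, feat_dim),
        )

    def forward(self, phi_t, action):
        # One-hot encode the discrete action before concatenating.
        a = F.one_hot(action, self.n_actions).float()
        return self.net(torch.cat([phi_t, a], dim=-1))


def intrinsic_reward(fwd_model, phi_t, phi_tp1, action, eta=0.5):
    """Curiosity bonus: the forward model's prediction error in feature
    space (eta is an assumed scaling hyperparameter)."""
    with torch.no_grad():
        phi_pred = fwd_model(phi_t, action)
        return eta * 0.5 * (phi_pred - phi_tp1).pow(2).sum(dim=-1)
```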
-
The authors mention that there is currently no known computationally feasible mechanism for measuring learning progress. (?)
-
The authors propose to divide all sources that can modify the agent's observations into the following three cases, and argue that a good feature space for curiosity should model (1) and (2) while being unaffected by (3) (see the training sketch after the list):
- Things that can be controlled by the agent.
- Things that the agent cannot control but that can affect the agent.
- Things out of the agent's control and not affecting the agent.
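As a rough sketch of how this plays out in training, the two models sketched above can be optimized with a single weighted loss. Detaching the features in the forward term is an implementation choice assumed here (it guarantees the encoder is shaped only by the inverse loss), and `beta` is an assumed trade-off hyperparameter:

```python
import torch.nn.functional as F


def icm_update(inv_model, fwd_model, obs_t, obs_tp1, action, beta=0.2):
    """Combined loss: the inverse term shapes the features (cases 1 and 2),
    while case-3 sources carry no information about the action and are
    effectively ignored by the encoder."""
    logits, phi_t, phi_tp1 = inv_model(obs_t, obs_tp1)
    inverse_loss = F.cross_entropy(logits, action)
    # Forward loss on detached features: no gradient flows into the encoder.
    phi_pred = fwd_model(phi_t.detach(), action)
    forward_loss = 0.5 * (phi_pred - phi_tp1.detach()).pow(2).sum(-1).mean()
    return (1.0 - beta) * inverse_loss + beta * forward_loss
```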
-
From the paper: "An interesting direction of future research is to use the learned exploration behavior/skill as a motor primitive/low-level policy in a more complex, hierarchical system. For example, our VizDoom agent learns to walk along corridors instead of bumping into walls. This could be a useful primitive for a navigation system."
-
From the paper: "While the rich and diverse real world provides ample opportunities for interaction, reward signals are sparse. Our approach excels in this setting and converts unexpected interactions that affect the agent into intrinsic rewards. However, our approach does not directly extend to the scenarios where “opportunities for interactions” are also rare. In theory, one could save such events in a replay memory and use them to guide exploration. However, we leave this extension for future work."
-
From the paper: "In Mario our agent crosses more than 30% of Level-1 without any rewards from the game. One reason why our agent is unable to go beyond this limit is the presence of a pit at 38% of the game that requires a very specific sequence of 15-20 key presses in order to jump across it. If the agent is unable to execute this sequence, it falls in the pit and dies, receiving no further rewards from the environment. Therefore it receives no gradient information indicating that there is a world beyond the pit that could potentially be explored. This issue is somewhat orthogonal to developing models of curiosity, but presents a challenging problem for policy learning."