Traditionally, we learn a policy and determine an action at every time-step. However, in many cases it is also viable to simply repeat an action for multiple time-steps rather than computing a new action each time. This repeat factor is usually tuned manually and kept constant. We hypothesize that keeping it constant may not be ideal, as an environment can contain scenarios that need fine-grained control as well as scenarios where larger-step control is sufficient. For example, in LunarLander we may need fine-grained control when close to the ground and attempting to land, whereas high above the surface a large action-repeat may be feasible.
In this work, we learn a policy that outputs an action as well as the number of time-steps for which this action should be repeated. This gives the policy the ability to exercise both large-step and fine-step control. We also hypothesize that learning to repeat an action may lead to better sample efficiency. Our work uses TD3 as the core learning algorithm and extends it to maintain a Q-value for each action-repeat; we therefore call it Variable-TD3.
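To make this concrete, here is a minimal sketch of the per-repeat Q-value idea, assuming PyTorch (this is not the repository's actual implementation): the critic outputs one Q-value for every candidate repeat length, and at interaction time the agent holds its action for the repeat length with the highest Q-value. `VariableCritic`, `select_repeat`, and `REPEAT_CHOICES` are illustrative names, not names from this codebase.

```python
import torch
import torch.nn as nn

REPEAT_CHOICES = [1, 2, 4, 8]  # hypothetical candidate action-repeat lengths

class VariableCritic(nn.Module):
    """TD3-style critic extended to output one Q-value per action-repeat."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, len(REPEAT_CHOICES)),  # Q(s, a, k) for each k
        )

    def forward(self, state, action):
        # -> (batch, len(REPEAT_CHOICES)): one Q-value per repeat length
        return self.net(torch.cat([state, action], dim=-1))

def select_repeat(critic, state, action):
    """Pick the repeat length with the highest Q-value; the chosen action
    is then held in the environment for that many time-steps."""
    with torch.no_grad():
        q = critic(state, action)
    return REPEAT_CHOICES[q.argmax(dim=-1).item()]
```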
Status:
- Work has been halted.
- Code is provided as-is; no major updates are expected.
- This work is similar to the paper "Learning to Repeat: Fine Grained Action Repetition for Deep Reinforcement Learning".
Installation:
1. Install conda.
2. For classic tasks, do the following:
```
conda env create -f env.yml   # creates env with name "vtd3"
```
3. Optional: for gym mujoco environments (requires mjpro150 and a mujoco license):
```
conda activate vtd3           # activates the conda env created in (2)
pip install 'gym[mujoco]'
```
4. Optional: for dm_control, create a separate conda env with mujoco 2.0 (requires mujoco200 and a mujoco license):
```
conda env create -f env_mj2.yml   # creates env with name "vtd3_mj2"
```
Having trouble during installation? Please refer here.
Usage:
```
$ conda activate <env_name>
```
Train:
```
$ python main.py --case classic_control --env Pendulum-v0 --opr train
```
Test:
```
$ python main.py --case classic_control --env Pendulum-v0 --opr test
```
| Required Arguments | Description |
| --- | --- |
| `--case {classic_control,box2d,mujoco,dm_control}` | Used for switching between different domains (and configs) |
| `--env` | Name of the environment |
| `--opr {train,test}` | Selects the operation to be performed |

Environments corresponding to each case:
- classic_control: {Pendulum-v0, MountainCarContinuous-v0}
- box2d: {LunarLanderContinuous-v2, BipedalWalker-v3, BipedalWalkerHardcore-v3}
- mujoco: (refer here)
- dm_control: (refer here)
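For example, to train on one of the box2d environments listed above:
```
$ python main.py --case box2d --env LunarLanderContinuous-v2 --opr train
```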
Visualize Results:
```
tensorboard --logdir=./results
```
Summarize plots in Plotly:
```
$ cd scripts
$ python summary_graphs.py --logdir=../results/classic_control --opr extract_summary
$ python summary_graphs.py --logdir=../results/classic_control --opr plot
```