A PyTorch implementation for the experiments in "Identifying and Addressing Delusions for Target-Directed Decision Making", authored by Mingde "Harry" Zhao, Tristan Sylvain, Doina Precup, and Yoshua Bengio.

This repo was implemented by Harry Zhao (@PwnerHarry), mostly adapted from Skipper.

This work was done during Harry's Mitacs internship at Borealis AI (RBC), under the mentorship of Tristan Sylvain (@TiSU32).
Environment setup:

- Create a virtual environment with conda or venv (we used Python 3.10)
- Install PyTorch according to the official guidelines; make sure it recognizes your accelerators
- Run `pip install -r requirements.txt` to install the dependencies
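A minimal setup sketch, assuming conda and a CUDA build of PyTorch (the environment name and the exact PyTorch install command are placeholders; follow the official install selector for your platform and accelerator):

```bash
# create and activate a Python 3.10 environment ("delusions" is a hypothetical name)
conda create -n delusions python=3.10
conda activate delusions
# install PyTorch per the official guidelines; this line assumes a CUDA 12.1 wheel
pip install torch --index-url https://download.pytorch.org/whl/cu121
# install the remaining dependencies
pip install -r requirements.txt
```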
To visualize the training records with TensorBoard, run `tensorboard --logdir=tb_records`.
The experiment entry points (example invocations follow the list):

- `run_minigrid_mp.py`: a multi-processed experiment initializer for Skipper variants
- `run_minigrid.py`: a single-processed experiment initializer for Skipper variants
- `run_leap_pretrain_vae.py`: a single-processed experiment initializer for pretraining the generator for the LEAP agent
- `run_leap_pretrain_rl.py`: a single-processed experiment initializer for pretraining the distance estimator (policy) for the LEAP agent
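As a sketch, assuming the scripts can run with the defaults defined in `runtime.py`:

```bash
# single-process Skipper experiment
python run_minigrid.py
# LEAP: pretrain the generator (VAE) first, then the distance estimator (policy)
python run_leap_pretrain_vae.py
python run_leap_pretrain_rl.py
```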
Please read the argument definitions in `runtime.py` carefully and pass the desired arguments. Use `--hindsight_strategy` to specify the hindsight relabeling strategy; an example invocation follows the option list. The options are:
- `future`: same as the "future" variant in the paper
- `episode`: same as the "episode" variant in the paper
- `pertask`: same as the "pertask" variant in the paper
- `future+episode`: corresponds to the "F-E" variant in the paper
- `future+pertask`: corresponds to the "F-P" variant in the paper
- `future+episode@0.5`: corresponds to the "F-(E+P)" variant in the paper, where `0.5` controls the mixture ratio of `pertask`
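For example, a hypothetical single-process run with the "F-(E+P)" strategy (all other arguments left at their `runtime.py` defaults):

```bash
python run_minigrid.py --hindsight_strategy future+episode@0.5
```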
To use the "generate" strategy for estimator training, use --prob_relabel_generateJIT
to specify the probability of replacing the relabeled target:
-
--hindsight_strategy future+episode --prob_relabel_generateJIT 1.0
: correspond to "F-G" variant in paper -
--hindsight_strategy future+episode --prob_relabel_generateJIT 0.5
: correspond to "F-(E+G)" variant in paper -
--hindsight_strategy future+episode@0.333 --prob_relabel_generateJIT 0.25
: correspond to "F-(E+P+G)" variant in paper
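Putting it together, a hypothetical multi-process launch of the "F-G" variant (other arguments left at their defaults):

```bash
python run_minigrid_mp.py --hindsight_strategy future+episode --prob_relabel_generateJIT 1.0
```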
Use `--game SwordShieldMonster --size_world 12 --num_envs_train 50` to configure the environments: `--game` can be switched to `RandDistShift` (RDS), and `--size_world` should be >= 8.
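For example, a hypothetical run on the RDS environments at the minimum supported world size:

```bash
python run_minigrid_mp.py --game RandDistShift --size_world 8 --num_envs_train 50
```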
Notes:

- There is a potential `CUDA_INDEX_ASSERTION` error that could cause hanging at the beginning of *Skipper* runs. We do not yet know how to fix it.
- The Dynamic Programming solutions for environment ground truth are only compatible with deterministic experiments.