Official JAX/Flax implementation of ARP-DT.
Guide Your Agent with Adaptive Multimodal Rewards (Accepted to NeurIPS 2023)
Changyeon Kim1, Younggyo Seo2, Hao Liu3, Lisa Lee4
Jinwoo Shin1, Honglak Lee5,6, Kimin Lee1
1KAIST 2Dyson Robot Learning Lab 3UC Berkeley
4Google DeepMind 5University of Michigan 6LG AI Research
This implementation has been tested on Nvidia GPUs.
The code supports the following methods and baselines:
- InstructRL: Training a transformer policy on pre-trained multimodal MAE encodings, with and without instructions.
- ARP-DT: Training a transformer policy conditioned on multimodal rewards from pre-trained CLIP, on top of pre-trained multimodal MAE encodings.
- ARP-DT+: Training a transformer policy conditioned on multimodal rewards from fine-tuned CLIP, on top of pre-trained multimodal MAE encodings.
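ARP-DT is a Decision Transformer-style, return-conditioned agent: the per-step multimodal rewards are accumulated into returns that condition the policy together with the MAE observation encodings. Below is a rough illustration of that conditioning signal (a sketch only; the actual tokenization and conditioning are implemented in the training code):

```python
import numpy as np

def returns_to_go(rewards: np.ndarray) -> np.ndarray:
    """Suffix sums of per-step multimodal rewards: R_t = r_t + r_{t+1} + ... + r_T.
    Illustrative helper only; the repo's dataloader builds the real conditioning."""
    return np.cumsum(rewards[::-1])[::-1]

# Example: four frames with CLIP-based rewards.
print(returns_to_go(np.array([0.2, 0.3, 0.5, 0.9])))  # [1.9 1.7 1.4 0.9]
```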
Install the dependencies with pip.
conda create -n arp python=3.8
conda activate arp
cat requirements.txt | xargs -n 1 -L 1 pip install
We support the following environments with various environment types.
Env name | Env type |
---|---|
CoinRun | none, aisc, aisc_gem |
Maze | none, aisc, yellowline, redline_yellowgem, reddiag_redstraight_yellowgem |
For experiments in our paper, we use the following environments.
Task | Train (env_name/env_type) | Test (env_name/env_type) |
---|---|---|
CoinRun | coinrun / none | coinrun / aisc |
CoinRun-bluegem | coinrun / none | coinrun / aisc_gem |
Maze I | maze / aisc | maze / none |
Maze II | maze / yellowline | maze / redline |
Maze III | maze / redline_yellowgem | maze / reddiag_redstraight_yellowgem |
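For reference, the underlying environments follow Procgen's standard Gym interface. A minimal sketch of instantiating an environment is shown below (assuming the procgen package is installed; the custom env_type variants above, e.g. aisc or yellowline, are provided by this repo's modified Procgen build rather than vanilla procgen):

```python
import gym

# Training environment: the same levels later used for expert data collection.
train_env = gym.make(
    "procgen:procgen-coinrun-v0",
    num_levels=500,
    start_level=0,
    distribution_mode="hard",
)

# Evaluation environment: num_levels=0 samples from the full level distribution.
eval_env = gym.make(
    "procgen:procgen-coinrun-v0",
    num_levels=0,
    start_level=0,
    distribution_mode="hard",
)

obs = train_env.reset()
print(obs.shape)  # Procgen observations are 64x64x3 RGB arrays.
```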
We provide expert datasets used for training in our Procgen experiments.
Task | Link (Google Drive) |
---|---|
CoinRun / CoinRun-bluegem | Link |
Maze I | Link |
Maze II / Maze III | Link |
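Each dataset is an HDF5 file (e.g., data_train.hdf5 in the examples below). A quick way to inspect one after downloading is sketched here (the path and key layout are assumptions based on the examples in this README):

```python
import h5py

# Hypothetical path following the directory layout used in the examples below.
path = "./data/coinrun/coinrun_hard_level0to500_num500_frame4/data_train.hdf5"

with h5py.File(path, "r") as f:
    # Recursively print every dataset with its shape and dtype.
    def show(name, obj):
        if isinstance(obj, h5py.Dataset):
            print(f"{name}: shape={obj.shape}, dtype={obj.dtype}")
    f.visititems(show)
```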
For those interested in constructing their own dataset, we provide trained PPG checkpoints in ./data/checkpoints.
You need to collect the following demonstrations:
- Training demonstrations (data_type: train)
- Validation demonstrations (data_type: val)
- (Optional) Test demonstrations (data_type: test, only needed for comparisons with goal-conditioned baselines)
Collect demonstrations with the following commands:
cd ./data/PPG
python -m collect_procgen_data --model_dir {path of saved model file} --num_demonstrations {number of demonstrations} --env_name {procgen env name} --env_type {env_type} --data_type {train/val/test} --num_levels {number of levels} --start_level {level to start} --distribution_mode {distribution mode} --output_dir {path of saved demonstrations}
# Example in CoinRun environments.
CUDA_VISIBLE_DEVICES=0 python -m collect_procgen_data --model_dir ./checkpoints/coinrun_hard_level500/model1000_IC100007936.jd --num_demonstrations 100 --env_name coinrun --env_type none --data_type train --num_levels 500 --start_level 0 --distribution_mode hard --output_dir ./data/coinrun
Next, label the demonstrations with multimodal rewards using the following command:
python -m arp_dt.label_reward --env_name {env_name} --env_type {env_type} --data_dir {data hdf5 file path} --model_type {clip/clip_ft} --model_ckpt_dir {checkpoint path of fine-tuned CLIP}
# Example in CoinRun experiments for training demonstrations.
CUDA_VISIBLE_DEVICES=0 python -m arp_dt.label_reward --env_name coinrun --env_type none --data_dir ./data/coinrun/coinrun_hard_level0to500_num500_frame4/data_train.hdf5 --model_type clip_ft --model_ckpt_dir ./data/coinrun/coinrun_hard_level0to500_num500_frame4/clip_ft_checkpoints/best_checkpoint.pt
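Conceptually, the multimodal reward attached to each frame is a similarity score between the visual observation and the task instruction in CLIP's joint embedding space. A minimal sketch of that computation on precomputed embeddings follows (illustrative only; the actual embeddings and normalization come from the CLIP models used by label_reward):

```python
import numpy as np

def multimodal_reward(image_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Cosine similarity between per-frame image embeddings (T, D) and a single
    instruction embedding (D,). Hypothetical helper, not the repo's implementation."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return image_emb @ text_emb  # one reward per frame, shape (T,)

# Example with random stand-in embeddings (CLIP ViT-B models use 512-d joint embeddings).
rewards = multimodal_reward(np.random.randn(4, 512), np.random.randn(512))
```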
You can fine-tune CLIP on expert demonstrations from the training environments with the following command:
python3 -m finetune_module.finetune --data.path {training data path} --output_dir {directory for saving checkpoints} --env_name {env_name} --data.train_env_type {env_type} --data.num_demonstrations {number of training demonstrations} --lambda_id {scaling hyperparameter for inverse dynamics loss}
# Example in CoinRun experiments.
CUDA_VISIBLE_DEVICES=0 python3 -m finetune_module.finetune --data.path ./data/maze --default_root_dir ./debug --epochs 20 --model_type clip_multiscale_ensemble --game_name maze --data.train_env_type aisc --data.image_key "ob" --data.num_demonstrations 500 --lambda_id 1.5
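The fine-tuning objective pairs a CLIP-style contrastive loss with an auxiliary inverse-dynamics loss weighted by lambda_id. A conceptual sketch of such a combined objective is shown below (an assumption about the loss structure; names like id_logits are hypothetical, and the actual implementation lives in finetune_module):

```python
import torch
import torch.nn.functional as F

def finetune_loss(image_emb, text_emb, id_logits, actions, lambda_id, temperature=0.07):
    """Hypothetical sketch: symmetric InfoNCE between image/text embeddings
    plus an inverse-dynamics cross-entropy term weighted by lambda_id."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    contrastive = 0.5 * (F.cross_entropy(logits, targets) +
                         F.cross_entropy(logits.t(), targets))
    # Inverse dynamics: predict the action taken between consecutive observations.
    id_loss = F.cross_entropy(id_logits, actions)
    return contrastive + lambda_id * id_loss
```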
Experiments can be launched via the following commands.
To train InstructRL or GC-InstructRL agents, set {use VL model} to False.
To train or evaluate agents with goal images, you first need to collect test demonstrations (see 1.3).
sh ./jobs/train_procgen.sh {data_dir} {env_name} {env_type of training environments} {env_type of evaluation environments} {augmentation} {use VL model: True | False} {VL model_type: BC | clip | clip_ft | clip_goal_conditioned | GCBC} {VL model checkpoint path (only for ARP-DT+)} {seed} {comment on experiment} {lambda for return prediction} {evaluation with goal images}
# Example in CoinRun experiments.
CUDA_VISIBLE_DEVICES=0,1 sh ./jobs/train_procgen.sh ./data/coinrun coinrun none aisc "color_jitter, rotate" True clip "" 0 "ARP-DT-coinrun" 0.01 False
You can evaluate the trained agent in new environments with the following command:
sh ./jobs/eval_procgen.sh {model checkpoint path} {data_dir} {env_name} {env_type of training environments} {env_type of evaluation environments} {use levels used for collecting expert demonstrations: True | False} {use VL model: True | False} {VL model_type: BC | clip | clip_ft | clip_goal_conditioned | GCBC} {VL model checkpoint path (only for ARP-DT+)} {comment on experiment}
# Example for evaluating in CoinRun test environments.
CUDA_VISIBLE_DEVICES=0 sh ./jobs/eval_procgen.sh /checkpoints/model_epoch49.pkl ./data/coinrun/ coinrun none aisc_gem True True clip "" "ARP-DT-coinrun_test"
- The transformer agent and training code are largely based on InstructRL.
- The multimodal MAE implementation is largely based on m3ae.
- The CLIP implementation is largely based on scenic.
@inproceedings{
kim2023guide,
title={Guide Your Agent with Adaptive Multimodal Rewards},
author={Kim, Changyeon and Seo, Younggyo and Liu, Hao and Lee, Lisa and Shin, Jinwoo and Lee, Honglak and Lee, Kimin},
booktitle={Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS)},
year={2023}
}