Code for "Bootstrap Your Own Skills: Learning to Solve New Tasks with Large Language Model Guidance"
[Project Website] [Paper] [OpenReview]
Jesse Zhang1, Jiahui Zhang1, Karl Pertsch1, Ziyi Liu1, Xiang Ren1, Minsuk Chang2, Shao-Hua Sun3, Joseph J. Lim4
1University of Southern California 2Google AI 3National Taiwan University 4KAIST
This is the official PyTorch implementation of the CoRL 2023 paper "Bootstrap Your Own Skills: Learning to Solve New Tasks with Large Language Model Guidance".
This is the code for running simulated experiments on our ALFRED benchmarks.
The environment can be installed either through pip or conda.
Pip install:
pip3 install -r requirements.txt
OR
Conda install:
conda env create -f environment.yml
Then, you must pip install the boss package and download spaCy's en_core_web_md model, which is used for language processing:
conda activate boss
pip install -e .
python -m spacy download en_core_web_md
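If you want to verify the spaCy model installed correctly, a quick sanity check (a hypothetical snippet, not part of the repo) is to load it and compare two instructions:

```python
import spacy

# en_core_web_md ships with word vectors, so Doc.similarity works as expected.
nlp = spacy.load("en_core_web_md")
doc_a = nlp("put the mug in the microwave")
doc_b = nlp("place the cup inside the microwave")
print(doc_a.similarity(doc_b))  # cosine similarity; higher means more similar
```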
The ALFRED environment requires some additional dependencies -- installing may require sudo access. Read the ALFRED README for more details if this doesn't work:
conda activate boss
cd boss/alfred/scripts
sh install_deps.sh
All results are logged to WandB. Before running any of the commands below, create a WandB account, then change the WandB entity and project name at the top of boss/utils/wandb_info.py to match your account and the project that will hold this repo's runs.
Add the location where you git cloned BOSS to your ~/.bashrc:
export BOSS=[BOSS_DOWNLOAD_LOCATION]
You need to pre-train models to run zero-shot or fine-tuning experiments. If you don't want to pre-train a model yourself, you can skip to step 3, as you won't need the pre-training dataset file.
Download the ALFRED dataset here: Google Drive Link.
You can use gdown to download the dataset directly to your server/computer at the desired location (19GB download):
cd [BOSS_REPO_LOCATION]
mkdir data
cd data
pip3 install gdown
gdown 1ZgKDgG9Fv491GVb9rxIVNJpViPNKFWMF
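If you prefer to script the download, gdown also exposes a Python API; this call is equivalent to the command above (same file id):

```python
import gdown

# Download the 19GB offline dataset archive by its Google Drive file id.
gdown.download(id="1ZgKDgG9Fv491GVb9rxIVNJpViPNKFWMF",
               output="boss_offline_dataset.tar.gz")
```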
Once the dataset is downloaded (boss_offline_dataset.tar.gz), simply untar it from inside the data/ directory created above (40GB after extraction):
tar -xvzf boss_offline_dataset.tar.gz
cd ..
To run evals and fine-tuning/skill bootstrapping experiments, you must extract the ALFRED evaluation data we have processed (Google Drive Link):
cd [BOSS_REPO_LOCATION]
cd boss/alfred/data
gdown 1MHDrKSRmyag-DwipyLj-i-BbKU_dxbne
tar -xvzf json_2.1.0_merge_goto.tar.gz
We log using WandB. First, create a WandB account if you don't already have one here. Then, run wandb login to log in to your account on the machine. Finally, fill in WANDB_ENTITY_NAME and WANDB_PROJECT_NAME in the file boss/utils/wandb_info.py, where WANDB_ENTITY_NAME is your WandB account name and WANDB_PROJECT_NAME is the name of the WandB project you want to log results to.
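For reference, here is a minimal sketch of the two values to fill in at the top of boss/utils/wandb_info.py (the values below are placeholders; substitute your own):

```python
# boss/utils/wandb_info.py -- fill in with your own account and project.
WANDB_ENTITY_NAME = "my-wandb-username"    # placeholder: your WandB account/entity name
WANDB_PROJECT_NAME = "boss-experiments"    # placeholder: the project that will hold these runs
```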
You can either pre-train a model yourself or download a pre-trained checkpoint. Pre-trained model checkpoints can be found here: Google Drive Link.
Otherwise, run the following command from the base BOSS repo location to pre-train our model, BOSS:
python boss/pretrain.py --experiment_name [WANDB_EXP_NAME] --run_group [WANDB_RUN_GROUP]
--experiment_name and --run_group name the experiment and the group of runs in WandB. Experiments in the same run_group appear grouped together on WandB for easier comparison, but grouping runs this way is completely optional.
All models are saved to saved_models/ by default. You can pass a --save_dir flag to specify a different location.
We used Meta's open-source LLaMA-13B model for the paper. This repo now supports Llama-3-8B, since the original LLaMA-13B is fully deprecated (Meta no longer provides download links) and is unsupported by recent Hugging Face versions. To install, follow these shortened instructions:
- First, go to the Llama website to request access to Meta Llama.
- Then, request permission on Hugging Face.
- Finally, run huggingface-cli login and enter your Hugging Face API token.
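If you prefer to authenticate from Python instead of the CLI, huggingface_hub provides an equivalent login call:

```python
from huggingface_hub import login

# Equivalent to `huggingface-cli login`; prompts for your Hugging Face access token.
login()
```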
To run BOSS' skill bootstrapping, run the following command. Make sure you have enough GPU memory to use Llama-3-8B. If not, you can play with model sizes, which GPUs to expose, or LLM batch size options, or add 8-bit inference to boss/models/large_language_model.py.
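As a starting point for the 8-bit option, here is a hypothetical standalone sketch using the transformers and bitsandbytes APIs; the repo's actual loading logic lives in boss/models/large_language_model.py and may differ:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Hypothetical sketch: load Llama-3-8B with 8-bit weights to roughly halve GPU
# memory vs. fp16. Requires `pip install bitsandbytes` and approved gated access.
model_id = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",  # spreads layers across the GPUs you expose
)
```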
There are 4 total floorplans, with 10 evaluation tasks each, for a total of 40 tasks. Choose which one to train on with --which_floorplan [0-3]:
python boss/run_skill_bootstrapping.py --which_floorplan [0-3] --experiment_name [WANDB_EXP_NAME] --run_group [WANDB_RUN_GROUP] --load_model_path [PRETRAINED_MODEL_LOCATION]/alfred_action_model_149.pth --llm_gpus [LLM_GPU]
We train on all 4 floorplans and aggregate results across all.
NOTE: Currently, due to possible LLM differences or other code changes introduced during the large refactoring effort for this release, the results don't fully reproduce the performance in the original paper. The code works and the agent learns new skills, but performance isn't as good. I am investigating this, but have released the code in the meantime so others can build upon BOSS.
To fine-tune an oracle model (given ground-truth evaluation primitive skill guidance and rewards) on a specific floorplan, run:
python boss/run_skill_bootstrapping.py --gpus 2 --which_floorplan [0-3] --experiment_name [WANDB_EXP_NAME] --run_group [WANDB_RUN_GROUP] --load_model_path [PRETRAINED_MODEL_LOCATION]/alfred_action_model_149.pth --no_bootstrap True
Run SayCan+P (+P = with our skill proposal mechanism, which works better than regular SayCan). --load_model_path should point to an offline pre-trained checkpoint (i.e., for regular SayCan pre-trained on the same offline data, point it to the pre-trained models linked earlier, the same ones used for bootstrapping):
python boss/saycan_eval.py --which_floorplan [0-3] --experiment_name [WANDB_EXP_NAME] --run_group [WANDB_RUN_GROUP] --load_model_path [MODEL_PATH] --llm_gpus [GPU]
To run regular SayCan without our skill proposal mechanism, just add the flag --skill_match_with_dataset False.
In the paper, our results report IQMs (interquartile means) of oracle-normalized returns and oracle-normalized success rates, averaged over all 4 floorplans.
For reference, the numbers we used in the paper: the oracle return was 1.21 and the oracle success rate was 11.7%. You should get approximately the same numbers if you re-run the oracle.
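For clarity, here is a short sketch of how an IQM over oracle-normalized results can be computed (the per-floorplan numbers below are made up for illustration; this is not the repo's evaluation code):

```python
import numpy as np

def iqm(values):
    """Interquartile mean: the mean of the middle 50% of sorted values."""
    values = np.sort(np.asarray(values, dtype=float))
    n = len(values)
    return values[n // 4 : n - n // 4].mean()

# Hypothetical per-floorplan returns, normalized by the oracle's return
# (1.0 = matches the oracle), then aggregated with the IQM.
agent_returns = np.array([0.9, 1.4, 1.1, 1.3])
oracle_return = 1.21  # the oracle return reported above
print(iqm(agent_returns / oracle_return))
```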
@inproceedings{
zhang2023bootstrap,
title={Bootstrap Your Own Skills: Learning to Solve New Tasks with Large Language Model Guidance},
author={Jesse Zhang and Jiahui Zhang and Karl Pertsch and Ziyi Liu and Xiang Ren and Minsuk Chang and Shao-Hua Sun and Joseph J Lim},
booktitle={7th Annual Conference on Robot Learning},
year={2023},
url={https://openreview.net/forum?id=a0mFRgadGO}
}