Abstract: Natural language-conditioned reinforcement learning (NLC-RL) empowers embodied agents to complete various tasks by following human instructions. However, unbounded natural language instructions introduce considerable complexity for an agent solving concrete RL tasks, which can distract policy learning from completing the task. Consequently, extracting an effective task representation from human instructions emerges as the critical component of NLC-RL. While previous methods have attempted to address this issue by learning task-related representations with large language models (LLMs), they rely heavily on pre-collected task data and require extra training procedures. In this study, we uncover the inherent capability of LLMs to generate task representations and present a novel method, in-context learning embedding as task representation (InCLET). InCLET is grounded on the finding that LLM in-context learning over trajectories greatly helps represent tasks. We therefore first employ the LLM to imagine task trajectories following the natural language instruction, then use the LLM's in-context learning to generate task representations, and finally aggregate and project them into a compact, low-dimensional task representation. This representation is then used to train an instruction-following agent. We conduct experiments on various embodied control environments, and the results show that InCLET produces effective task representations. Furthermore, these representations significantly improve RL training efficiency compared with baseline methods.
- Please complete the following steps to install the conda environment and related Python packages.
- Package install:

  pip install -r requirements.txt
- The environments used in this work require MuJoCo, the CLEVR-Robot Environment, and an LLM as dependencies. Please set them up following these instructions (a minimal setup sketch follows this list):
  - Instructions for MuJoCo: https://mujoco.org/
  - Instructions for CLEVR-Robot Environment: https://github.com/google-research/clevr_robot_env
  - Llama3-8B: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
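The linked pages are authoritative; as a rough sketch of the non-LLM dependencies (the package name, clone location, and PYTHONPATH handling below are assumptions, not taken from this repo):

```bash
# Rough sketch only -- follow the linked instructions for the authoritative steps.
# MuJoCo: recent releases ship official Python bindings on PyPI
# (older setups may use mujoco-py instead).
pip install mujoco

# CLEVR-Robot Environment: clone the repo and make it importable.
git clone https://github.com/google-research/clevr_robot_env.git
export PYTHONPATH="$PWD:$PYTHONPATH"
```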
Before training the Goal-Conditioned Policy (GCP), we need to train a TL translator using the process described in step 4. When training of the TL translator model is complete, please place the model in the designated location (see the sketch below):
<project_path>/models/
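A minimal sketch, assuming the translator training step produces a single checkpoint file (the filename below is hypothetical):

```bash
# Hypothetical checkpoint name -- use whatever the TL translator training step produces.
mkdir -p <project_path>/models/
cp tl_translator.pt <project_path>/models/
```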
- Before beginning the training process, please ensure that you have downloaded Llama3-8B from https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct and update the model path in the following file (a download sketch follows):
algorithms/translation/llm_encoder.py
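A minimal download sketch using the Hugging Face CLI; the local directory is an assumption, and the exact variable to change inside llm_encoder.py is not specified here, so locate it in the file:

```bash
# Llama 3 is a gated model, so authenticate first with an approved token.
huggingface-cli login
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct \
  --local-dir models/Meta-Llama-3-8B-Instruct
# Then point the model path in algorithms/translation/llm_encoder.py at that directory;
# the exact variable name in the file may differ, e.g. find candidates with:
grep -n -i "llama" algorithms/translation/llm_encoder.py
```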
- FrankaKitchen
  - Instruction-following policy: run the command below in a shell (a usage example follows this list):

    python kitchen_train.py --seed <SEED>

  - The trained goal-conditioned-policy models will be saved at `kitchen_model`.
  - The TensorBoard logs of the goal-conditioned policy will be saved at `kitchen_train`.
  - The evaluation results of the goal-conditioned policy will be saved at `kitchen_callback`.
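For example, with a concrete seed (the seed value is arbitrary; TensorBoard is only needed for monitoring):

```bash
# Train the FrankaKitchen instruction-following policy with seed 0,
# then inspect the training curves written to kitchen_train.
python kitchen_train.py --seed 0
tensorboard --logdir kitchen_train
```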
- CLEVR-Robot
  - Instruction-following policy: run the command below in a shell (a usage example follows this list):

    python ball_train.py --seed <SEED>

  - The trained goal-conditioned-policy models will be saved at `ball_model`.
  - The TensorBoard logs of the goal-conditioned policy will be saved at `ball_train`.
  - The evaluation results of the goal-conditioned policy will be saved at `ball_callback`.
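For example, to sweep several seeds (the seed values are illustrative):

```bash
# Train the CLEVR-Robot instruction-following policy over a few seeds,
# then inspect the training curves written to ball_train.
for seed in 0 1 2; do
  python ball_train.py --seed "$seed"
done
tensorboard --logdir ball_train
```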