We provide a simple example of PPO finetuning of an LLM that is asked to generate specific token sequences. As in RLHF, each generated token is a separate action.
We use LoRA through the PEFT library for lightweight finetuning, and we leverage Lamorel's custom module functions and updaters to add a value head on top of the LLM and finetune the weights with the PPO loss.
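In the example itself, the value head is plugged in through Lamorel's custom module functions. Independently of Lamorel's exact interface, the sketch below illustrates the underlying idea in plain PyTorch/Transformers/PEFT: wrap the LLM with LoRA adapters and read a scalar value estimate from the last hidden state, while the LM logits provide the per-token policy. The model name (`gpt2`), LoRA hyperparameters, and head sizes are illustrative placeholders, not the example's actual settings.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Illustrative base model and LoRA settings (not the example's actual config)
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
hidden_size = model.config.hidden_size
lora_config = LoraConfig(
    task_type="CAUSAL_LM", r=16, lora_alpha=32, lora_dropout=0.05, target_modules=["c_attn"]
)
model = get_peft_model(model, lora_config)  # only LoRA adapters are trainable

# Scalar value head on top of the LLM's last hidden state (PPO critic)
value_head = torch.nn.Sequential(
    torch.nn.Linear(hidden_size, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 1),
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("Generate the sequence: a b c", return_tensors="pt")
outputs = model(**inputs)
# Policy logits for the next token: one action per generated token, as in RLHF
next_token_logits = outputs.logits[:, -1, :]
# State-value estimate for the current context
value = value_head(outputs.hidden_states[-1][:, -1, :])
```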
1. Install the required packages: `pip install -r requirements.txt`
To launch the example using a single GPU on a local machine:
- Spawn both processes (the RL script collecting data and the LLM server):
```bash
python -m lamorel_launcher.launch \
  --config-path PROJECT_PATH/examples/PPO_finetuning/ \
  --config-name local_gpu_config \
  rl_script_args.path=PROJECT_PATH/examples/PPO_finetuning/main.py \
  rl_script_args.output_dir=YOUR_OUTPUT_DIR
```
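For reference, the script pointed to by `rl_script_args.path` is itself a Hydra entry point: the launcher spawns it alongside the LLM process, and the script talks to the LLM through a `lamorel.Caller`. The sketch below shows this general pattern following Lamorel's documented usage, not the exact contents of `main.py`; the decorator arguments and the `output_dir` access are placeholders matching the overrides in the command above.

```python
import hydra
from lamorel import Caller, lamorel_init

lamorel_init()  # set up Lamorel's distributed communication before anything else

@hydra.main(config_path="config", config_name="config")  # placeholder values for this sketch
def main(config_args):
    # Overrides such as rl_script_args.output_dir from the launch command land here
    output_dir = config_args.rl_script_args.output_dir
    # Connect to the LLM process spawned by lamorel_launcher
    lm_server = Caller(config_args.lamorel_args)
    # ... collect rollouts, score/generate with lm_server, run PPO updates ...
    lm_server.close()

if __name__ == "__main__":
    main()
```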