We evaluated the models trained with OpenRLHF (see the checkpoints) using GPT-4 as the judge on 160 prompts from LMSys. The basic idea is to compare each RLHF fine-tuned model against the Orca SFT model using the following procedure:
- Generate responses
deepspeed batch_inference.py \
--eval_task generate \
--pretrain meta-llama/Llama-2-7b-hf \
--bf16 \
--max_len 2048 \
--dataset ./data/benchmark.jsonl \
--dataset_probs 1.0 \
--max_samples 160 \
--zero_stage 0 \
--micro_batch_size 16 \
--greedy_sampling \
--load_model {model_path} \
--output_path {output_json_path}
# add --enable_dt when evaluating Decision Transformer (DT) models
- GPT-4 evaluation
pip install alpaca_eval
# convert the .jsonl output into a JSON array (.json)
sed -i '1s/^/[/; $!s/$/,/; $s/$/]/' {output_json_path}
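# rename the "input" field to "instruction", the key alpaca_eval expects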
sed -i 's/"input"/"instruction"/g' {output_json_path}
alpaca_eval --model_outputs {output_json_path} --annotators_config alpaca_eval_gpt4 --reference_outputs sft.json
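The two `sed` commands above edit the file in place; if they are hard to follow, the same conversion can be done with a short Python script. The sketch below uses hypothetical file names (replace them with your actual `{output_json_path}`): it reads the generated `.jsonl`, renames the `"input"` key to `"instruction"` as `alpaca_eval` expects, and writes the records out as a single JSON array.

```python
import json

# Hypothetical placeholder paths; substitute the actual {output_json_path} from the generation step.
jsonl_path = "generate_output.jsonl"
json_path = "generate_output.json"

records = []
with open(jsonl_path, "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # alpaca_eval expects the prompt under "instruction" rather than "input"
        if "input" in record:
            record["instruction"] = record.pop("input")
        records.append(record)

# write a single JSON array, matching what the sed commands above produce
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```

The `sft.json` file passed to `--reference_outputs` is presumably produced the same way: run the generation step against the SFT checkpoint and apply the same conversion.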
Below are the results compared to the Orca SFT model:
| Model | Win | Lose | Tie |
|---|---|---|---|
| PPO | | | |
| DPO | | | |
| DT | | | |