We evaluated the models trained with OpenRLHF (see the checkpoints) using GPT-4 as the judge on 160 prompts from LMSys. The basic idea is to compare each RLHF fine-tuned model against the Orca SFT model using the following procedure:
- Generate responses
deepspeed batch_inference.py \
--eval_task generate \
--pretrain meta-llama/Llama-2-7b-hf \
--bf16 \
--max_len 2048 \
--dataset ./data/benchmark.jsonl \
--dataset_probs 1.0 \
--max_samples 160 \
--zero_stage 0 \
--micro_batch_size 16 \
--greedy_sampling \
--load_model {model_path} \
--output_path {output_json_path}
# add --enable_dt when evaluating Decision Transformer (DT) models
- GPT-4 evaluation
pip install alpaca_eval
# convert the .jsonl output into a JSON array (.json)
sed -i '1s/^/[/; $!s/$/,/; $s/$/]/' {output_json_path}
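# rename the "input" field to "instruction", the key alpaca_eval expects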
sed -i 's/"input"/"instruction"/g' {output_json_path}
alpaca_eval --model_outputs {output_json_path} --annotators_config alpaca_eval_gpt4 --reference_outputs sft.json
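The two `sed` commands above edit the file in place; if they are hard to follow, the same conversion can be done with a short Python script. The sketch below uses hypothetical file names (replace them with your actual `{output_json_path}`): it reads the generated `.jsonl`, renames the `"input"` key to `"instruction"` as `alpaca_eval` expects, and writes the records out as a single JSON array.

```python
import json

# Hypothetical placeholder paths; substitute the actual {output_json_path} from the generation step.
jsonl_path = "generate_output.jsonl"
json_path = "generate_output.json"

records = []
with open(jsonl_path, "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # alpaca_eval expects the prompt under "instruction" rather than "input"
        if "input" in record:
            record["instruction"] = record.pop("input")
        records.append(record)

# write a single JSON array, matching what the sed commands above produce
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```

The `sft.json` file passed to `--reference_outputs` is presumably produced the same way: run the generation step against the SFT checkpoint and apply the same conversion.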
Below are the results compared to the Orca SFT model:
| Model | Win | Lose | Tie |
|---|---|---|---|
| PPO | | | |
| DPO | | | |
| DT | | | |