reward model: results from do_predict differ from direct API deployment #5967

Open · 1 task done
vxfla opened this issue Nov 8, 2024 · 0 comments

Labels
pending This problem is yet to be addressed

vxfla commented Nov 8, 2024

Reminder

  • I have read the README and searched the existing issues.

System Info

  • llamafactory version: 0.8.4.dev0
  • Platform: Linux-5.15.0-88-generic-x86_64-with-glibc2.35
  • Python version: 3.9.18
  • PyTorch version: 2.3.0 (GPU)
  • Transformers version: 4.41.2
  • Datasets version: 2.18.0
  • Accelerate version: 0.32.0
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA A100 80GB PCIe
  • DeepSpeed version: 0.15.0
  • vLLM version: 0.5.0

Reproduction

The following two methods produce inconsistent scores on the same batch of data:
Method 1:
Deploy a trained reward model locally:
API_PORT=8001 llamafactory-cli api --model_name_or_path xxx --template qwen --stage rm

Then obtain the score as follows:

    prompt = "You are a helpful assistant."
    messages = [
        {"role": "system", "content": prompt},
        {"role": "user", "content": instruct},
        {"role": "assistant", "content": output}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    return text

def get_score(instruct, output):
    text = make_text(instruct, output)
    data = {
                "model": "qwen2.5_3B_style_rm_3k",
                "messages": [
                    text
                ]
            }
    r = requests.post("http://127.0.0.1:8001/v1/score/evaluation", data=json.dumps(data))
    return json.loads(r.text)["scores"][0]```
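
For context, the 60% figure reported below was obtained by scoring each preference pair through this endpoint. A minimal sketch of that computation, assuming `pairs` is a hypothetical list of (instruction, chosen, rejected) tuples loaded from the eval set:

# Hypothetical sketch: `pairs` holds (instruction, chosen, rejected) tuples
# from the eval set; count how often the chosen response outscores the rejected one.
wins = sum(
    get_score(instr, chosen) > get_score(instr, rejected)
    for instr, chosen, rejected in pairs
)
print(f"chosen > rejected ratio: {wins / len(pairs):.1%}")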

Method 2:
llamafactory-cli train xxx.yaml

Contents of the YAML file:

model_name_or_path: xxx

stage: rm
do_train: false
do_eval: false
do_predict: true

eval_dataset: xxx
template: qwen
cutoff_len: 1024
max_samples: 10000
overwrite_cache: true
preprocessing_num_workers: 16

output_dir: xxx

per_device_eval_batch_size: 1
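
With do_predict, the scores are written into output_dir, and the 100% figure reported below comes from that output. A minimal sketch of the computation, assuming the predictions land in generated_predictions.jsonl with "chosen" and "rejected" score fields (the exact file name and keys may vary across llamafactory versions):

import json

# Assumption: do_predict wrote one JSON object per line with "chosen" and
# "rejected" reward scores; adjust the path and keys to your version's output.
with open("xxx/generated_predictions.jsonl") as f:
    records = [json.loads(line) for line in f]

wins = sum(r["chosen"] > r["rejected"] for r in records)
print(f"chosen > rejected ratio: {wins / len(records):.1%}")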


### Expected behavior

Method 1 gives relatively low scores, and chosen > rejected holds for only 60% of the pairs.
Method 2 gives higher scores, and chosen > rejected holds for 100% of the pairs.

I would like to know whether the problem lies in my deployment or in the evaluation.

### Others

_No response_
github-actions bot added the pending (This problem is yet to be addressed) label on Nov 8, 2024