(1) The Eurus-RM-7b cannot predict the score correctly. (2) Eurus-7b-kto performs poorly in coding. #10

Open · liuqi8827 opened this issue on May 28, 2024 · 2 comments

liuqi8827 commented May 28, 2024

The Eurus-RM-7b cannot predict the score correctly.

  1. I run:
from transformers import AutoTokenizer, AutoModel
import torch
def test(model_path):
    dataset = [  # cases in webgpt; we use the same template as Mistral-Instruct-v0.2
        {
            "chosen": "[INST] Sural relates to which part of the body? [/INST] The sural region is the muscular swelling of the back of the leg below the knee, formed chiefly by the bellies of the gastrocnemius and soleus muscles [1,2].",
            "rejected": "[INST] Sural relates to which part of the body? [/INST] The Sural nerve runs down the side of the leg near the small saphenous vein, then passes forward below the lateral malleolus and continues on the outside of the foot as the lateral dorsal cutaneous nerve, which then communicates with the intermediate dorsal cutaneous nerve, which branches off to the side of the foot. [1]",
        }
    ]

    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    model = AutoModel.from_pretrained(model_path, trust_remote_code=True)
    with torch.no_grad():
        for example in dataset:
            inputs = tokenizer(example["chosen"], return_tensors="pt")
            chosen_reward = model(**inputs).item()
            inputs = tokenizer(example["rejected"], return_tensors="pt")
            rejected_reward = model(**inputs).item()
            print(f"chosen_reward: {chosen_reward} | rejected_reward: {rejected_reward} | diff: {chosen_reward - rejected_reward}")

test("/workspace/xxx/models/Eurus-RM-7b")

  2. Its output is:
    chosen_reward: -626.8788452148438 | rejected_reward: -405.09423828125 | diff: -221.78460693359375

  3. The chosen_reward is smaller than the rejected_reward. However, the model card (https://huggingface.co/openbmb/Eurus-RM-7b) reports Output: 47.4404296875 for this example.

  4. Can you give me some suggestions?

liuqi8827 changed the title from "The Eurus-RM-7b cannot predict the score correctly" to "(1) The Eurus-RM-7b cannot predict the score correctly. (2) Eurus-7b-kto performs poorly in coding." on May 29, 2024
liuqi8827 (Author) commented:

"Eurus-7b-kto" performs poorly in coding.

  1. I run bash run.sh.
  2. I got the result in /Eurus/eval/result/mbpp/result.txt. It shows:
    {'accuracy': 0.0, 'exec_error': 0.0, 'format_error': 100.0}
    However, the paper's Table 3 reports:
    [screenshot of Table 3 from the paper]
    Thus, result.txt does not match the paper's reported performance. (format_error: 100.0 suggests the harness could not parse any of the generations; see the sketch below.)
  3. I also got Eurus/eval/result/leetcode/samples.jsonl and /Eurus/eval/result/human_eval/samples.jsonl.
    Can you tell me how to reproduce the performance reported in the paper's Table 3?
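
For context on format_error: 100.0 above, here is a hypothetical illustration of the kind of format check an evaluation harness performs before executing generated code. This is not the repo's actual harness; the file path, the completion field name, and the fenced-block regex are all assumptions.

import json
import re

# If the harness expects a fenced ```python ... ``` block and the model emits
# something else (plain code, a different fence, extra prose), every sample is
# counted as a format error instead of being executed.
def extract_code(generation):
    match = re.search(r"```python\n(.*?)```", generation, re.DOTALL)
    return match.group(1) if match else None

def format_error_rate(samples_path):
    total, errors = 0, 0
    with open(samples_path) as f:
        for line in f:
            sample = json.loads(line)  # one generation per line
            total += 1
            if extract_code(sample.get("completion", "")) is None:
                errors += 1
    return 100.0 * errors / max(total, 1)

# e.g. format_error_rate("Eurus/eval/result/human_eval/samples.jsonl")  # hypothetical field layout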

lifan-yuan (Collaborator) commented:

Hi,

Thanks for your interest and sorry for the trouble.

Re RM: the example on the HF page is outdated because we previously adopted an incorrect template. It should be the Mistral template ([INST], [/INST]), but we made a typo and used ([INST], [\INST]) on the HF page, which leads to incorrect results. I haven't had time to test it again yet, but your usage as shown above looks correct.
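
For reference, a minimal sketch of scoring one response with the corrected template, reusing the loading pattern from the report above (the question/answer strings are the chosen case from that report; the incorrect variant is shown only for contrast, and exact reward values will depend on the checkpoint and dtype):

from transformers import AutoTokenizer, AutoModel
import torch

model_path = "openbmb/Eurus-RM-7b"  # or a local copy, as in the report above
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)

question = "Sural relates to which part of the body?"
answer = "The sural region is the muscular swelling of the back of the leg below the knee, formed chiefly by the bellies of the gastrocnemius and soleus muscles [1,2]."

# Correct Mistral-Instruct template: the closing tag uses a forward slash.
correct = f"[INST] {question} [/INST] {answer}"
# The old typo from the HF page used a backslash, which leads to incorrect scores.
typo = f"[INST] {question} [\\INST] {answer}"

with torch.no_grad():
    for name, text in [("correct", correct), ("typo", typo)]:
        inputs = tokenizer(text, return_tensors="pt")
        print(f"{name}: {model(**inputs).item()}")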

Re evaluation: the previous version of the eval code may have been buggy. We have updated the code, and it should now reproduce the results.
