(1) The Eurus-RM-7b cannot predict the score correctly. (2) Eurus-7b-kto performs poorly in coding. #10

Open · liuqi8827 opened this issue on May 28, 2024 · 2 comments

liuqi8827 commented May 28, 2024

The Eurus-RM-7b cannot predict the score correctly.

  1. I run:
from transformers import AutoTokenizer, AutoModel
import torch
def test(model_path):
    dataset = [  # cases in webgpt; we use the same template as Mistral-Instruct-v0.2
        {
            "chosen": "[INST] Sural relates to which part of the body? [/INST] The sural region is the muscular swelling of the back of the leg below the knee, formed chiefly by the bellies of the gastrocnemius and soleus muscles [1,2].",
            "rejected": "[INST] Sural relates to which part of the body? [/INST] The Sural nerve runs down the side of the leg near the small saphenous vein, then passes forward below the lateral malleolus and continues on the outside of the foot as the lateral dorsal cutaneous nerve, which then communicates with the intermediate dorsal cutaneous nerve, which branches off to the side of the foot. [1]",
        }
    ]

    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    model = AutoModel.from_pretrained(model_path, trust_remote_code=True)
    with torch.no_grad():
        for example in dataset:
            inputs = tokenizer(example["chosen"], return_tensors="pt")
            chosen_reward = model(**inputs).item()
            inputs = tokenizer(example["rejected"], return_tensors="pt")
            rejected_reward = model(**inputs).item()
            print(f"chosen_reward: {chosen_reward} | rejected_reward: {rejected_reward} | diff: {chosen_reward - rejected_reward}")

test("/workspace/xxx/models/Eurus-RM-7b")

  2. Its output is:
    chosen_reward: -626.8788452148438 | rejected_reward: -405.09423828125 | diff: -221.78460693359375

  3. The chosen_reward is smaller than the rejected_reward. However, the model card (https://huggingface.co/openbmb/Eurus-RM-7b) reports Output: 47.4404296875 for this example.

  4. Can you give me some suggestions?

liuqi8827 changed the title from "The Eurus-RM-7b cannot predict the score correctly" to "(1) The Eurus-RM-7b cannot predict the score correctly. (2) Eurus-7b-kto performs poorly in coding." on May 29, 2024
liuqi8827 (Author) commented:

"Eurus-7b-kto" performs poorly in coding.

  1. I run bash run.sh.
  2. I got the result in /Eurus/eval/result/mbpp/result.txt. It shows:
    {'accuracy': 0.0, 'exec_error': 0.0, 'format_error': 100.0}
    However, the paper's Table 3 reports:
    [screenshot of Table 3 from the paper]
    Thus, result.txt does not match the paper's reported performance. (format_error: 100.0 suggests the harness could not parse any of the generations; see the sketch below.)
  3. I also got Eurus/eval/result/leetcode/samples.jsonl and /Eurus/eval/result/human_eval/samples.jsonl.
    Can you tell me how to reproduce the performance reported in the paper's Table 3?
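
For context on format_error: 100.0 above, here is a hypothetical illustration of the kind of format check an evaluation harness performs before executing generated code. This is not the repo's actual harness; the file path, the completion field name, and the fenced-block regex are all assumptions.

import json
import re

# If the harness expects a fenced ```python ... ``` block and the model emits
# something else (plain code, a different fence, extra prose), every sample is
# counted as a format error instead of being executed.
def extract_code(generation):
    match = re.search(r"```python\n(.*?)```", generation, re.DOTALL)
    return match.group(1) if match else None

def format_error_rate(samples_path):
    total, errors = 0, 0
    with open(samples_path) as f:
        for line in f:
            sample = json.loads(line)  # one generation per line
            total += 1
            if extract_code(sample.get("completion", "")) is None:
                errors += 1
    return 100.0 * errors / max(total, 1)

# e.g. format_error_rate("Eurus/eval/result/human_eval/samples.jsonl")  # hypothetical field layout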

lifan-yuan (Collaborator) commented:

Hi,

Thanks for your interest and sorry for the trouble.

Re RM: the example on the HF page is outdated because we previously adopted an incorrect template. It should be the Mistral template ([INST], [/INST]), but we made a typo and used ([INST], [\INST]) on the HF page, which leads to incorrect results. I haven't had time to test it again yet, but your usage as shown above looks correct.
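
For reference, a minimal sketch of scoring one response with the corrected template, reusing the loading pattern from the report above (the question/answer strings are the chosen case from that report; the incorrect variant is shown only for contrast, and exact reward values will depend on the checkpoint and dtype):

from transformers import AutoTokenizer, AutoModel
import torch

model_path = "openbmb/Eurus-RM-7b"  # or a local copy, as in the report above
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)

question = "Sural relates to which part of the body?"
answer = "The sural region is the muscular swelling of the back of the leg below the knee, formed chiefly by the bellies of the gastrocnemius and soleus muscles [1,2]."

# Correct Mistral-Instruct template: the closing tag uses a forward slash.
correct = f"[INST] {question} [/INST] {answer}"
# The old typo from the HF page used a backslash, which leads to incorrect scores.
typo = f"[INST] {question} [\\INST] {answer}"

with torch.no_grad():
    for name, text in [("correct", correct), ("typo", typo)]:
        inputs = tokenizer(text, return_tensors="pt")
        print(f"{name}: {model(**inputs).item()}")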

Re evaluation: the previous version of the eval code may have been buggy. We have updated the code, and it should now reproduce the results.
