How do we run this code? #28

Open
atahanuz opened this issue Nov 5, 2024 · 2 comments
Comments

@atahanuz commented Nov 5, 2024

Yes, how do we run this code to evaluate a language model?

@hank0316

I'm also wondering how to run the evaluations on different models like llama-3. Can anyone clarify the process for evaluating models other than GPT and Claude?

I think we first need to implement a sampler ourselves, then add the model we want to evaluate, together with its sampler, to the models dict in simple_evals.py. But I'm not sure whether this is the correct way to evaluate local models.
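
For what it's worth, here is a minimal sketch of what such a sampler might look like for a local, OpenAI-compatible endpoint (e.g. vLLM or Ollama serving llama-3). It assumes the sampler contract in this repo boils down to "a callable that takes a list of {role, content} message dicts and returns the model's text". The class name LocalChatSampler, the base URL, and the model name are placeholders, so check sampler/chat_completion_sampler.py for the real SamplerBase interface before relying on this:

```python
# Hypothetical sketch, not the repo's official API: a sampler for a local
# OpenAI-compatible server (e.g. vLLM/Ollama serving llama-3). Assumes the
# sampler contract is "message list in, response text out"; verify against
# sampler/chat_completion_sampler.py.
from openai import OpenAI


class LocalChatSampler:
    def __init__(
        self,
        model: str = "llama-3-8b-instruct",          # placeholder model name
        base_url: str = "http://localhost:8000/v1",  # placeholder endpoint
        temperature: float = 0.0,
        max_tokens: int = 1024,
    ):
        # Most local servers ignore the API key, but the client requires one.
        self.client = OpenAI(base_url=base_url, api_key="EMPTY")
        self.model = model
        self.temperature = temperature
        self.max_tokens = max_tokens

    def _pack_message(self, role: str, content: str) -> dict:
        # Helper mirroring the {role, content} message format the evals use.
        return {"role": role, "content": content}

    def __call__(self, message_list: list[dict]) -> str:
        # Send the conversation to the local endpoint and return the text.
        response = self.client.chat.completions.create(
            model=self.model,
            messages=message_list,
            temperature=self.temperature,
            max_tokens=self.max_tokens,
        )
        return response.choices[0].message.content
```

If that matches the actual interface, registering an instance under a new key in the models dict in simple_evals.py should let the existing evaluation loop pick it up unchanged.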

@zhejunliux

In this repo, GPT and Claude act as scoring (grading) models, not as the only models that can be evaluated. What the code publishes is the evaluation data and the scoring methods. To evaluate LLaMA-3, the following steps are recommended:

  1. Use the SimpleQA dataset to query LLaMA-3 and collect its answers.
  2. Score those answers with SimpleQA's own grading method (a sketch of both steps follows below).
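
Here is a rough sketch of those two steps wired together. It assumes the eval classes follow the pattern eval_obj = SimpleQAEval(grader_model=...); result = eval_obj(candidate_sampler); the class names, constructor arguments, and import paths are my reading of simpleqa_eval.py and sampler/chat_completion_sampler.py, so double-check them. LocalChatSampler is the hypothetical local sampler sketched in the comment above:

```python
# Hypothetical sketch: evaluate a local llama-3 on SimpleQA while a GPT
# model does the grading. Class names, arguments, and import paths are
# assumptions -- confirm them against simpleqa_eval.py and simple_evals.py.
from sampler.chat_completion_sampler import ChatCompletionSampler
from simpleqa_eval import SimpleQAEval

# Step 1: the candidate model under test (the local sampler from above).
candidate = LocalChatSampler(model="llama-3-8b-instruct")

# Step 2: a scoring model grades the candidate's answers against the
# SimpleQA reference answers.
grader = ChatCompletionSampler(model="gpt-4o-mini")

simpleqa = SimpleQAEval(grader_model=grader, num_examples=100)
result = simpleqa(candidate)  # queries the candidate, then grades each answer

print(result.score)    # aggregate score (assumed field on the result object)
print(result.metrics)  # per-metric breakdown, if available
```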
