I'm also wondering how to run the evaluations on other models, such as Llama-3. Can anyone clarify the process for evaluating models other than GPT and Claude?
I think we first need to implement a sampler ourselves, and then add the model we want to evaluate, together with its sampler, to the models dict in simple_evals.py (roughly as in the sketch below). I'm not sure whether this is the correct way to evaluate local models, though.
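Here is a minimal sketch of such a sampler, assuming the sampler interface is a callable that takes a list of chat messages (dicts with "role" and "content") and returns the reply text, and assuming the Llama-3 model is served behind an OpenAI-compatible endpoint (e.g. vLLM or Ollama). The class name, base_url, and default parameters are illustrative, not part of the repo; in practice you would probably subclass the repo's own sampler base class instead.

```python
# Hypothetical sampler for a locally served Llama-3 model behind an
# OpenAI-compatible endpoint. Class name, base_url, and defaults are
# assumptions for illustration only.
from openai import OpenAI


class LlamaChatCompletionSampler:
    """Takes a list of chat messages, returns the model's reply as a string."""

    def __init__(
        self,
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        base_url="http://localhost:8000/v1",
        temperature=0.5,
        max_tokens=1024,
    ):
        # Local OpenAI-compatible server; the api_key value is ignored by most
        # local servers but required by the client.
        self.client = OpenAI(base_url=base_url, api_key="EMPTY")
        self.model = model
        self.temperature = temperature
        self.max_tokens = max_tokens

    def _pack_message(self, role, content):
        return {"role": role, "content": content}

    def __call__(self, message_list):
        response = self.client.chat.completions.create(
            model=self.model,
            messages=message_list,
            temperature=self.temperature,
            max_tokens=self.max_tokens,
        )
        return response.choices[0].message.content


# Then register it in the models dict in simple_evals.py, e.g.:
# models = {
#     ...
#     "llama-3-8b-instruct": LlamaChatCompletionSampler(),
# }
```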
GPT and Claude act here as scoring (grader) models, not as the models under evaluation. What this repo publishes is the evaluation data and the scoring methods. To evaluate LLaMA-3, the recommended steps are (see the sketch after the list):
Query LLaMA-3 with the SimpleQA dataset to obtain its answers.
Score those answers with SimpleQA's grading method, using GPT or Claude as the grader.
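A sketch of those two steps, assuming the SimpleQAEval class in simpleqa_eval.py accepts a grader sampler and is invoked with the sampler under test, and that the result object exposes an aggregate score; check simpleqa_eval.py for the exact constructor arguments. The LlamaChatCompletionSampler here is the hypothetical sampler from the sketch in the earlier comment.

```python
# Illustrative only: constructor arguments and result fields are assumptions
# based on how the eval classes appear to be used in simple_evals.py.
from simpleqa_eval import SimpleQAEval
from sampler.chat_completion_sampler import ChatCompletionSampler

# Step 1: the sampler that queries LLaMA-3 on each SimpleQA question
# (hypothetical class defined in the earlier sketch).
llama_sampler = LlamaChatCompletionSampler()

# Step 2: a strong model grades the answers against the gold answers using
# SimpleQA's grading prompt; this model only scores, it is not being evaluated.
grader = ChatCompletionSampler(model="gpt-4o")

eval_obj = SimpleQAEval(grader_model=grader, num_examples=100)
result = eval_obj(llama_sampler)
print(result.score)  # aggregate SimpleQA score for LLaMA-3
```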
Yes, how do we run this code to evaluate a language model?