I'm also wondering how to run the evaluations on other models, such as Llama-3. Can anyone clarify the process for evaluating models other than GPT and Claude?
I think we first need to implement a sampler ourselves, and then add the model we want to evaluate, together with its sampler, to the models dict in simple_evals.py (roughly as in the sketch below). I'm not sure whether this is the correct way to evaluate local models, though.
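Here is a minimal sketch of such a sampler, assuming the sampler interface is a callable that takes a list of chat messages (dicts with "role" and "content") and returns the reply text, and assuming the Llama-3 model is served behind an OpenAI-compatible endpoint (e.g. vLLM or Ollama). The class name, base_url, and default parameters are illustrative, not part of the repo; in practice you would probably subclass the repo's own sampler base class instead.

```python
# Hypothetical sampler for a locally served Llama-3 model behind an
# OpenAI-compatible endpoint. Class name, base_url, and defaults are
# assumptions for illustration only.
from openai import OpenAI


class LlamaChatCompletionSampler:
    """Takes a list of chat messages, returns the model's reply as a string."""

    def __init__(
        self,
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        base_url="http://localhost:8000/v1",
        temperature=0.5,
        max_tokens=1024,
    ):
        # Local OpenAI-compatible server; the api_key value is ignored by most
        # local servers but required by the client.
        self.client = OpenAI(base_url=base_url, api_key="EMPTY")
        self.model = model
        self.temperature = temperature
        self.max_tokens = max_tokens

    def _pack_message(self, role, content):
        return {"role": role, "content": content}

    def __call__(self, message_list):
        response = self.client.chat.completions.create(
            model=self.model,
            messages=message_list,
            temperature=self.temperature,
            max_tokens=self.max_tokens,
        )
        return response.choices[0].message.content


# Then register it in the models dict in simple_evals.py, e.g.:
# models = {
#     ...
#     "llama-3-8b-instruct": LlamaChatCompletionSampler(),
# }
```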
GPT and Claude act here as scoring (grader) models, not as the models under evaluation. What this repo publishes is the evaluation data and the scoring methods. To evaluate LLaMA-3, the recommended steps are (see the sketch after the list):
Query LLaMA-3 with the SimpleQA dataset to obtain its answers.
Score those answers with SimpleQA's grading method, using GPT or Claude as the grader.
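A sketch of those two steps, assuming the SimpleQAEval class in simpleqa_eval.py accepts a grader sampler and is invoked with the sampler under test, and that the result object exposes an aggregate score; check simpleqa_eval.py for the exact constructor arguments. The LlamaChatCompletionSampler here is the hypothetical sampler from the sketch in the earlier comment.

```python
# Illustrative only: constructor arguments and result fields are assumptions
# based on how the eval classes appear to be used in simple_evals.py.
from simpleqa_eval import SimpleQAEval
from sampler.chat_completion_sampler import ChatCompletionSampler

# Step 1: the sampler that queries LLaMA-3 on each SimpleQA question
# (hypothetical class defined in the earlier sketch).
llama_sampler = LlamaChatCompletionSampler()

# Step 2: a strong model grades the answers against the gold answers using
# SimpleQA's grading prompt; this model only scores, it is not being evaluated.
grader = ChatCompletionSampler(model="gpt-4o")

eval_obj = SimpleQAEval(grader_model=grader, num_examples=100)
result = eval_obj(llama_sampler)
print(result.score)  # aggregate SimpleQA score for LLaMA-3
```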
Yes, how do we run this code to evaluate a language model?