Set Model Temperature to 0 for Consistent Leaderboard Results #500

Closed
HuanzhiMao opened this issue Jul 5, 2024 · 5 comments · Fixed by #574
Comments

@HuanzhiMao
Collaborator

The current model generation script (model_handlers) uses a default temperature of 0.7 for inference. This introduces some degree of randomness into the model output generation, leading to potential variability in the evaluation scores from run to run.
For benchmarking purposes, we should set it to 0 for consistency and reliability of the evaluation results.
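
For illustration, a minimal sketch of what pinning the temperature in a handler's inference call could look like (the `query_model` function and the OpenAI client usage here are illustrative, not the actual BFCL `model_handler` code):

```python
# Illustrative sketch only -- not the actual BFCL model_handler code.
# The idea: pass an explicit near-zero temperature instead of relying on a
# default of 0.7, so repeated runs produce (near-)identical outputs.
from openai import OpenAI

client = OpenAI()

def query_model(messages, model="gpt-4o", temperature=0.0):
    """Hypothetical inference call with a pinned temperature for reproducible evals."""
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature,  # 0.0 (or a small positive value) for benchmarking
    )
    return response.choices[0].message.content
```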

@ShishirPatil
Owner

What do folks think about this? I'm mostly OK with it as long as we are consistent across all models.

@alonsosilvaallende
Contributor

I fully agree with a much lower temperature; 0.7 is extremely high. However, if I remember correctly, some models I have tried require the temperature to be strictly positive rather than exactly zero. Therefore, I think a small positive value would be better, for example temperature=0.01.
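
A small guard along these lines could absorb that backend difference (purely a sketch; `MIN_TEMPERATURE` and `effective_temperature` are made-up names for illustration):

```python
# Hypothetical helper, not from the BFCL codebase.
# Some inference backends reject temperature == 0, so clamp the requested
# value to a tiny positive floor while keeping decoding effectively greedy.
MIN_TEMPERATURE = 0.01  # illustrative floor; the exact value is a judgment call

def effective_temperature(requested: float) -> float:
    """Return a temperature value the backend will accept."""
    return max(requested, MIN_TEMPERATURE)
```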

@hexists

hexists commented Jul 25, 2024

Hello. I think lowering the temperature is a good idea. Setting it to 0, or at least a low value close to 0, would help reproducibility.
Thank you for providing the BFCL.

@ShishirPatil
Owner

Thank you @aastroza and @hexists for weighing in. OK @HuanzhiMao, let's go with a lower temperature then, maybe something like 0.1? But this would fundamentally change all the numbers on the leaderboard. So, once we land all the existing PRs, we can do this? I'll keep this issue open.

@HuanzhiMao
Collaborator Author

> Thank you @aastroza and @hexists for weighing in. OK @HuanzhiMao, let's go with a lower temperature then, maybe something like 0.1? But this would fundamentally change all the numbers on the leaderboard. So, once we land all the existing PRs, we can do this? I'll keep this issue open.

Yeah, agree. Let's wait till all PRs are merged and then update this.

Repository owner locked and limited conversation to collaborators Jul 29, 2024
@ShishirPatil ShishirPatil converted this issue into discussion #562 Jul 29, 2024
ShishirPatil added a commit that referenced this issue Aug 10, 2024
The current model response generation script uses a default temperature
of 0.7 for inference. This introduces some degree of randomness into the
model output generation, leading to potential variability in the
evaluation scores from run to run.
For benchmarking purposes, we set it to 0.001 for consistency and
reliability of the evaluation results.

resolves #500 , resolves #562 

This will affect the leaderboard score. We will update it shortly.

---------

Co-authored-by: Shishir Patil <30296397+ShishirPatil@users.noreply.github.com>
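
For context, the landed change amounts to making a near-zero temperature the handlers' default; a minimal sketch of the idea (the constant name and handler shape are illustrative, not the actual #574 diff):

```python
# Illustrative sketch of the idea behind the fix -- not the actual #574 diff.
# A default of 0.001 keeps decoding effectively deterministic while remaining
# acceptable to backends that reject temperature == 0.
DEFAULT_TEMPERATURE = 0.001

class ModelHandler:  # hypothetical handler shape
    def __init__(self, model_name: str, temperature: float = DEFAULT_TEMPERATURE):
        self.model_name = model_name
        self.temperature = temperature
```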
aw632 pushed a commit to vinaybagade/gorilla that referenced this issue Aug 22, 2024

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
