
Leaderboard Update April 1 #299

Merged
merged 11 commits on Apr 1, 2024
Conversation

HuanzhiMao
Collaborator

@HuanzhiMao HuanzhiMao commented Mar 31, 2024

This PR is for the Leaderboard April 1 update.
This update comes with new models (Claude-3-Haiku, Databrick-DBRX-Instruct), a more advanced AST evaluation process, and updated evaluation datasets. Cost and latency statistics are also measured during evaluation. We also released the manual that our evaluation is based on.

Does this affect the leaderboard score?
Yes! Read the updated blog post 8 - leaderboard to learn more!


Co-authored-by: Charlie Cheng-Jie Ji <charliechengjieji@berkeley.edu>
Co-authored-by: Fanjia Yan <fanjiayan@berkeley.edu>

@HuanzhiMao HuanzhiMao changed the title Leaderboard V2 release Leaderboard Update April 1 Apr 1, 2024
Owner

@ShishirPatil ShishirPatil left a comment


LGTM

@ShishirPatil ShishirPatil merged commit 6971033 into ShishirPatil:main Apr 1, 2024
@HuanzhiMao HuanzhiMao mentioned this pull request Apr 9, 2024
ShishirPatil pushed a commit that referenced this pull request Apr 11, 2024
This PR is for the leaderboard April 8th release:

1. Fixed an oversight introduced in #299. For function-calling (FC)
models that cannot accept the `float` type as input, when a parameter's
type is `float`, the evaluation procedure now converts that type to
`number` in the model input and notes in the parameter description
that `This is a float type value.` An additional field `format: float`
is also included in the model input to make the type explicit.
2. Updated the model handlers for Claude, Mistral, and OSS models to
better parse model output. This patches the handlers released in #299,
which sometimes failed to parse output even when it was valid. This
affects only the prompting models; the FC models are unaffected.
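A minimal sketch of the float-to-number rewrite described in item 1. The function name, the document shape, and the field names other than `type`, `format`, and `description` are illustrative assumptions, not the repository's actual code:

```python
def convert_float_params(function_doc: dict) -> dict:
    """Hypothetical sketch: for FC models that lack a `float` type,
    rewrite float-typed parameters as `number`, add a `format: float`
    hint, and append an explanatory note to the description."""
    params = function_doc.get("parameters", {}).get("properties", {})
    for spec in params.values():
        if spec.get("type") == "float":
            spec["type"] = "number"
            spec["format"] = "float"
            spec["description"] = (
                spec.get("description", "") + " This is a float type value."
            ).strip()
    return function_doc


# Usage on a toy function document (names are made up for illustration):
doc = {
    "name": "get_temperature",
    "parameters": {
        "properties": {
            "celsius": {"type": "float", "description": "Temp."}
        }
    },
}
converted = convert_float_params(doc)
# converted["parameters"]["properties"]["celsius"] is now:
# {"type": "number", "format": "float",
#  "description": "Temp. This is a float type value."}
```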


This PR **DOES** change the leaderboard score. We will update the
leaderboard website shortly, in a different PR.

---------

Co-authored-by: Charlie Cheng-Jie Ji <charliechengjieji@berkeley.edu>
Co-authored-by: Fanjia Yan <fanjiayan@berkeley.edu>
devanshamin pushed a commit to devanshamin/gorilla that referenced this pull request Jul 9, 2024
devanshamin pushed a commit to devanshamin/gorilla that referenced this pull request Jul 9, 2024
devanshamin pushed a commit to devanshamin/gorilla that referenced this pull request Jul 9, 2024
aw632 pushed a commit to vinaybagade/gorilla that referenced this pull request Aug 22, 2024
aw632 pushed a commit to vinaybagade/gorilla that referenced this pull request Aug 22, 2024