
BFCL April 19th Release (Dataset & Pipeline) #377

Merged 25 commits into ShishirPatil:main from executable-overhaul on Apr 25, 2024

Conversation

@HuanzhiMao (Collaborator) commented Apr 21, 2024

This PR is for the BFCL April 19th Release. In this release:

  • Bug fix for the evaluation dataset in the executable test categories. This includes updates to both prompts and function docs.
  • The evaluation_result field has been removed to accommodate the variability in API execution results across different evaluation runs. Instead, a human-verified ground_truth is now included for the executable test categories. During each evaluation run, evaluation_result is generated anew using the ground_truth, and then compared against the model output.
  • A stricter metric has been adopted for the structural_match (aka type match) evaluation criterion: for list results, the lengths are compared; for dict results, the keys are matched. This accounts for the fast-changing nature of some real-time API results while keeping the evaluation meaningful.
  • Added another evaluation criterion, real_time_match, for the executable category. It is a looser form of exact_match for numerical execution results: the execution result must be within a percentage threshold (20%) of the expected result, to accommodate live updates of API responses. Users can change this threshold in eval_checker_constant.py (see the sketch after this list).
  • Added support to distinguish Cohere's optimized score vs. original score.
  • Resolved issue #363 (Leaderboard evaluations issues).
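
As a rough illustration of the two criteria above, here is a minimal Python sketch. The function names and the threshold constant are hypothetical stand-ins, not the repository's actual identifiers (the real threshold value lives in eval_checker_constant.py).

```python
# Hypothetical stand-in for the threshold constant defined in eval_checker_constant.py.
REAL_TIME_MATCH_THRESHOLD = 0.20  # 20%


def structural_match(expected, actual) -> bool:
    """Type-level comparison: list lengths must agree and dict keys must match."""
    if type(expected) is not type(actual):
        return False
    if isinstance(expected, list):
        return len(expected) == len(actual)
    if isinstance(expected, dict):
        return set(expected.keys()) == set(actual.keys())
    return True  # other types only need to agree on type


def real_time_match(expected: float, actual: float) -> bool:
    """Looser exact_match for numerical results: pass if within a percentage threshold."""
    if expected == 0:
        return actual == 0
    return abs(actual - expected) / abs(expected) <= REAL_TIME_MATCH_THRESHOLD
```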

This PR DOES change the leaderboard score. We will update the leaderboard shortly, in a different PR.
We will also update our HuggingFace dataset accordingly.


Co-authored-by: Charlie Cheng-Jie Ji charliechengjieji@berkeley.edu
Co-authored-by: Fanjia Yan fanjiayan@berkeley.edu

@HuanzhiMao HuanzhiMao marked this pull request as ready for review April 21, 2024 00:30
@CharlieJCJ (Collaborator) commented:

WIP: Undergoing thorough testing

ShishirPatil pushed a commit that referenced this pull request Apr 25, 2024
In this PR: 
1. Update the evaluation metric for BFCL, in sync with #377.
2. Change the button layout on the landing page. 

This PR **does not** change the leaderboard value.

---------

Co-authored-by: Charlie Cheng-Jie Ji <CharlieJCJ@users.noreply.github.com>
@CharlieJCJ (Collaborator) commented Apr 25, 2024

Fully tested on model-generated results

  • API sanity check (Non-REST and REST)
  • python ./eval_runner.py on all categories (i.e. ast and executable)

Actions taken:

  • Improved Rapid API stability during both the sanity check and real-time execution by allowing max_retry attempts during function calls (see the retry sketch below).
  • Improved geocode API execution stability by passing in a free API key from function_credentials.
  • Corrected get_time_zone_by_coord.
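
For reference, a minimal sketch of the max_retry behavior described above, assuming a generic wrapper around each API call; the names call_with_retry and backoff_seconds are illustrative and not the repository's actual code.

```python
import time


def call_with_retry(func, *args, max_retry=3, backoff_seconds=1.0, **kwargs):
    """Call an API-backed function, retrying on failure up to max_retry times."""
    last_error = None
    for attempt in range(max_retry):
        try:
            return func(*args, **kwargs)
        except Exception as error:  # e.g. transient Rapid API failures
            last_error = error
            time.sleep(backoff_seconds * (attempt + 1))  # simple linear backoff
    raise last_error
```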

@CharlieJCJ (Collaborator) left a comment:

LGTM

@ShishirPatil (Owner) left a comment:

LGTM

@ShishirPatil ShishirPatil merged commit 28a0f42 into ShishirPatil:main Apr 25, 2024
@HuanzhiMao HuanzhiMao deleted the executable-overhaul branch April 25, 2024 18:13
ShishirPatil pushed a commit that referenced this pull request Apr 26, 2024
…se (#387)

- As mentioned in #377, this PR updates the leaderboard to reflect the
score changes resulting from the updates in the executable test category
evaluation pipeline.
- As mentioned in #386, this PR also adds five new models to the
leaderboard.
- It also adds a `last_updated` field to the leaderboard. 

This PR **DOES** change the leaderboard score.

---------

Co-authored-by: Charlie Cheng-Jie Ji <charliechengjieji@berkeley.edu>
devanshamin pushed a commit to devanshamin/gorilla that referenced this pull request Jul 9, 2024
aw632 pushed a commit to vinaybagade/gorilla that referenced this pull request Aug 22, 2024