
BFCL April 19th Release (Dataset & Pipeline) #377

Merged 25 commits into ShishirPatil:main from executable-overhaul on Apr 25, 2024

Conversation

@HuanzhiMao (Collaborator) commented Apr 21, 2024

This PR is for the BFCL April 19th Release. In this release:

  • Bug fix for the evaluation dataset in the executable test categories. This includes updates to both prompts and function docs.
  • The evaluation_result field has been removed to accommodate the variability in API execution results across different evaluation runs. Instead, a human-verified ground_truth is now included for the executable test categories. During each evaluation run, evaluation_result is generated anew using the ground_truth, and then compared against the model output.
  • A stricter metric has been adopted for the structural_match (aka type match) evaluation criterion: for list results, the lengths are compared; for dict results, the keys are matched. This accounts for the fast-changing nature of some real-time API results while keeping the evaluation meaningful.
  • Added another evaluation criterion, real_time_match, for the executable category. It is a looser form of exact_match for numerical execution results: the execution result must be within a percentage threshold (20%) of the expected result, to accommodate live updates of API responses. Users can change this threshold in eval_checker_constant.py (see the sketch after this list).
  • Added support to distinguish Cohere's optimized score vs. original score.
  • Resolved issue #363 (Leaderboard evaluations issues).
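
As a rough illustration of the two criteria above, here is a minimal Python sketch. The function names and the threshold constant are hypothetical stand-ins, not the repository's actual identifiers (the real threshold value lives in eval_checker_constant.py).

```python
# Hypothetical stand-in for the threshold constant defined in eval_checker_constant.py.
REAL_TIME_MATCH_THRESHOLD = 0.20  # 20%


def structural_match(expected, actual) -> bool:
    """Type-level comparison: list lengths must agree and dict keys must match."""
    if type(expected) is not type(actual):
        return False
    if isinstance(expected, list):
        return len(expected) == len(actual)
    if isinstance(expected, dict):
        return set(expected.keys()) == set(actual.keys())
    return True  # other types only need to agree on type


def real_time_match(expected: float, actual: float) -> bool:
    """Looser exact_match for numerical results: pass if within a percentage threshold."""
    if expected == 0:
        return actual == 0
    return abs(actual - expected) / abs(expected) <= REAL_TIME_MATCH_THRESHOLD
```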

This PR DOES change the leaderboard score. We will update the leaderboard shortly, in a different PR.
We will also update our HuggingFace dataset accordingly.


Co-authored-by: Charlie Cheng-Jie Ji charliechengjieji@berkeley.edu
Co-authored-by: Fanjia Yan fanjiayan@berkeley.edu

@HuanzhiMao HuanzhiMao marked this pull request as ready for review April 21, 2024 00:30
@CharlieJCJ (Collaborator) commented:

WIP: Undergoing thorough testing

ShishirPatil pushed a commit that referenced this pull request Apr 25, 2024
In this PR: 
1. Update the evaluation metric for BFCL, in sync with #377.
2. Change the button layout on the landing page. 

This PR **does not** change the leaderboard value.

---------

Co-authored-by: Charlie Cheng-Jie Ji <CharlieJCJ@users.noreply.github.com>
@CharlieJCJ (Collaborator) commented Apr 25, 2024

Fully tested on model-generated results

  • API sanity check (Non-REST and REST)
  • python ./eval_runner.py on all categories (i.e. ast and executable)

Actions taken:

  • Improved Rapid API stability during both the sanity check and real-time execution by allowing max_retry attempts during function calls (see the retry sketch below).
  • Improved geocode API execution stability by passing in a free API key from function_credentials.
  • Corrected get_time_zone_by_coord.
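
For reference, a minimal sketch of the max_retry behavior described above, assuming a generic wrapper around each API call; the names call_with_retry and backoff_seconds are illustrative and not the repository's actual code.

```python
import time


def call_with_retry(func, *args, max_retry=3, backoff_seconds=1.0, **kwargs):
    """Call an API-backed function, retrying on failure up to max_retry times."""
    last_error = None
    for attempt in range(max_retry):
        try:
            return func(*args, **kwargs)
        except Exception as error:  # e.g. transient Rapid API failures
            last_error = error
            time.sleep(backoff_seconds * (attempt + 1))  # simple linear backoff
    raise last_error
```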

@CharlieJCJ (Collaborator) left a comment:

LGTM

@ShishirPatil (Owner) left a comment:

LGTM

@ShishirPatil ShishirPatil merged commit 28a0f42 into ShishirPatil:main Apr 25, 2024
@HuanzhiMao HuanzhiMao deleted the executable-overhaul branch April 25, 2024 18:13
ShishirPatil pushed a commit that referenced this pull request Apr 26, 2024
…se (#387)

- As mentioned in #377, this PR updates the leaderboard to reflect the
score changes resulting from the updates in the executable test category
evaluation pipeline.
- As mentioned in #386, this PR also adds five new models to the
leaderboard.
- It also adds a `last_updated` field to the leaderboard. 

This PR **DOES** change the leaderboard score.

---------

Co-authored-by: Charlie Cheng-Jie Ji <charliechengjieji@berkeley.edu>
devanshamin pushed a commit to devanshamin/gorilla that referenced this pull request Jul 9, 2024
aw632 pushed a commit to vinaybagade/gorilla that referenced this pull request Aug 22, 2024