[BFCL] Add BFCL_V2_Live Dataset #580

Merged (31 commits), Aug 19, 2024

@HuanzhiMao (Collaborator) commented Aug 13, 2024

This release introduces the BFCL-Live dataset, which consists of 2.2k real-world function calling scenarios. With it, we hope to provide insight into whether models overfit to the public BFCL dataset. The dataset is categorized into simple, multiple function, parallel function, parallel multiple function, and relevance detection groups, all evaluated through AST (Abstract Syntax Tree).

By comparing scores across the two BFCL datasets, we aim to identify any signs of data contamination. This will help ensure our model's performance is both robust and reliable across different data environments.

To read more about the composition and construction of this live dataset, please refer to our blog (https://gorilla.cs.berkeley.edu/blogs/12_bfcl_v2_live.html).

Thanks to @yixinhuang48 and @JasonHuang1103 for helping clean the dataset.


Also in this PR:

  1. Update to BFCL Dataset Format:

    • In the V1 version of BFCL, the question field represented the user's query. With the introduction of V2_Live, the format has been updated to accommodate system prompts, user prompts, and assistant responses.
    • To ensure consistency, messages from the V1 dataset have been converted to the V2_Live format. For example, a V1 entry like "What is the weather like in Berkeley, CA" is now represented as [{"role": "user", "content": "What is the weather like in Berkeley, CA"}].
    • Consequently, all V1 datasets have been renamed to V2 to reflect this change, signaling that they are not backward-compatible.
    • All model handlers and the eval checker have been updated accordingly.
  2. Update to the overall_accuracy calculation formula:

    • For the BFCL V2 Leaderboard, the overall accuracy will be the unweighted average over the following sub-categories:

      • "exec_simple", "exec_parallel", "exec_multiple", "exec_parallel_multiple", "simple", "irrelevance", "parallel", "multiple", "parallel_multiple", "java", "javascript", "rest", "live_simple", "live_multiple", "live_parallel", "live_parallel_multiple", "live_irrelevance", "live_relevance"
    • For the BFCL V2 Live Leaderboard (which contains only the Live categories), the overall accuracy will be the weighted average over the following Live sub-categories:

      • "live_simple", "live_multiple", "live_parallel", "live_parallel_multiple", "live_irrelevance", "live_relevance"
  3. Simplification of Claude Handlers:

    • Previously, the codebase included two separate handlers: ClaudeFCHandler (for Claude models in FC mode) and ClaudePromptingHandler (for Claude models in prompting mode).
    • This PR merges these into a single ClaudeHandler, streamlining the code without altering functionality.
  4. Improve Error Log Readability

  5. Resolves [BFCL] Evaluation with Correct Precision Settings for Locally-Hosted Models (#575)

  6. Resolves [BFCL] Get rid of legacy naming convention for LLM generated files (#485)
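
The dataset format change in item 1 can be sketched as follows. This is only an illustration, not the converter used in this PR: the helper name `to_v2_question` is hypothetical, and it assumes a V1 entry stores the user's query as a plain string under `question`.

```python
import json


def to_v2_question(entry: dict) -> dict:
    """Return a copy of the entry with `question` as a list of messages.

    Hypothetical helper: assumes a V1 entry holds the query as a plain string.
    """
    question = entry["question"]
    if isinstance(question, str):
        # Wrap the bare query in a single user message.
        question = [{"role": "user", "content": question}]
    return {**entry, "question": question}


v1_entry = {"question": "What is the weather like in Berkeley, CA"}
v2_entry = to_v2_question(v1_entry)
print(json.dumps(v2_entry["question"]))
```

Entries that already hold a message list pass through unchanged, so the helper can be applied more than once without harm.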


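To make the difference between the two averages in item 2 concrete, here is a small sketch. The function names and the numbers are made up for illustration, and weighting each sub-category by its number of test entries is an assumption; the description above does not state what the weights are.

```python
def unweighted_accuracy(scores: dict) -> float:
    """Plain mean of the per-category accuracies."""
    return sum(acc for acc, _ in scores.values()) / len(scores)


def weighted_accuracy(scores: dict) -> float:
    """Mean of the per-category accuracies, weighted by entry count (assumed weights)."""
    total = sum(count for _, count in scores.values())
    return sum(acc * count for acc, count in scores.values()) / total


# category -> (accuracy, number of test entries); illustrative values only
live_scores = {
    "live_simple": (0.80, 200),
    "live_multiple": (0.70, 1000),
    "live_parallel": (0.50, 20),
}
print(unweighted_accuracy(live_scores))
print(weighted_accuracy(live_scores))
```

When category sizes differ, large categories dominate the weighted average, while every category counts equally in the unweighted one.
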
Co-authored-by: Charlie Cheng-Jie Ji <charliechengjieji@berkeley.edu>
Co-authored-by: Fanjia Yan <fanjiayan@berkeley.edu>
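
The handler merge in item 3 follows a common pattern: a single handler that picks its mode from the model name. The sketch below is hypothetical and does not reproduce the actual `ClaudeHandler`. In particular, detecting FC mode from an `-FC` suffix is an assumption based on model names such as `gpt-4o-2024-08-06-FC` used in this thread, and the method bodies are placeholders.

```python
class MergedHandlerSketch:
    """Hypothetical single handler covering both FC and prompting modes."""

    def __init__(self, model_name: str):
        self.model_name = model_name
        # Assumption: FC-mode models carry an "-FC" suffix in their name.
        self.fc_mode = model_name.endswith("-FC")

    def inference(self, messages: list) -> str:
        # Dispatch to the mode-specific path instead of using two classes.
        if self.fc_mode:
            return self._fc_inference(messages)
        return self._prompting_inference(messages)

    def _fc_inference(self, messages: list) -> str:
        return "fc"  # placeholder for a native function-calling request

    def _prompting_inference(self, messages: list) -> str:
        return "prompting"  # placeholder for a prompt-based request


print(MergedHandlerSketch("some-model-FC").inference([]))
print(MergedHandlerSketch("some-model").inference([]))
```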

@HuanzhiMao marked this pull request as ready for review August 15, 2024 16:54
@Fanjia-Yan (Collaborator) commented:

The change looks good to me in general. I will start spot testing to verify the functionality.

@Fanjia-Yan (Collaborator) left a comment

🚢

@CharlieJCJ (Collaborator) left a comment

Spot-checked on gpt-4o-2024-08-06-FC, the Mistral models, and yi-large-fc. All runs succeeded end-to-end.

Example commands used during testing:

❯ python openfunctions_evaluation.py --model gpt-4o-2024-08-06-FC --test-category v2_live --num-threads 8

❯ python eval_runner.py --model gpt-4o-2024-08-06-FC --test-category v2_live

@ShishirPatil merged commit 30124c4 into ShishirPatil:main Aug 19, 2024
ShishirPatil pushed a commit that referenced this pull request Aug 19, 2024
This PR updates the leaderboard with the new BFCL V2 dataset score from #580.
@HuanzhiMao deleted the bfcl_v2_live branch August 19, 2024 17:24
aw632 pushed a commit to vinaybagade/gorilla that referenced this pull request Aug 22, 2024
@HuanzhiMao added the BFCL-General (General BFCL Issue) label Aug 25, 2024