
[BFCL] Improve Warning Message when Aggregating Results #517

Merged: 19 commits merged into ShishirPatil:main from data-aggregate-warning on Aug 10, 2024

Conversation

HuanzhiMao (Collaborator)

As mentioned in #506, this PR makes the warning messages more informative so that users know what actions to take when aggregating leaderboard results.

@HuanzhiMao HuanzhiMao marked this pull request as ready for review July 10, 2024 21:42
@ShishirPatil (Owner)

@CharlieJCJ can you review this and LGTM if ready?

@CharlieJCJ (Collaborator) commented Jul 30, 2024

@CharlieJCJ can you review this and LGTM if ready?

The current PR doesn't add the "actionable error message" yet. In general, I think we should pair error messages with actionable items across all functionalities, not just here.

I will review this version (which improves the warning message when aggregating results) today, and we can add actionable error messages in a new issue/PR after the refactored version, making these UX enhancements gradually going forward.

@CharlieJCJ (Collaborator) commented Jul 31, 2024

Testing with gorilla-openfunctions-v2; replicable with other models:

Example behaviors:

Case 1: If everything is presented in result and score

(BFCL): /data/gorilla/berkeley-function-call-leaderboard/eval_checker]$ python eval_runner.py --model gorilla-openfunctions-v2 --test-category simple
🦍 Model: gorilla-openfunctions-v2
🔍 Running test: simple
✅ Test completed: simple. 🎯 Accuracy: 0.9475
📈 Aggregating data to generate leaderboard score table...

============================== Model Category Status ==============================

Model: gorilla-openfunctions-v2
  All categories are present and evaluated.


🎉 All categories are present and evaluated for all models!
📈 Leaderboard statistics generated successfully! See /data/gorilla/berkeley-function-call-leaderboard/score/data.csv for evaluation results.
🏁 Evaluation completed. See /data/gorilla/berkeley-function-call-leaderboard/score/data.csv for evaluation results.

Case 2: If I deleted /score/gorilla-openfunctions-v2/java_score.json from Case 1

(BFCL):/data/gorilla/berkeley-function-call-leaderboard/eval_checker]$ python eval_runner.py --model gorilla-openfunctions-v2 --test-category simple
🦍 Model: gorilla-openfunctions-v2
🔍 Running test: simple
✅ Test completed: simple. 🎯 Accuracy: 0.9475
📈 Aggregating data to generate leaderboard score table...

============================== Model Category Status ==============================

Model: gorilla-openfunctions-v2
  Unevaluated results for 1 categories:
    - java

======================================== Recommended Actions ========================================

To address these issues, run the following commands from the project root directory:

cd eval_checker && \
python eval_runner.py --model gorilla-openfunctions-v2 --test-category java

====================================================================================================

📈 Leaderboard statistics generated successfully! See /data/gorilla/berkeley-function-call-leaderboard/score/data.csv for evaluation results.
🏁 Evaluation completed. See /data/gorilla/berkeley-function-call-leaderboard/score/data.csv for evaluation results.

Case 3: If I then deleted /result/gorilla-openfunctions-v2/gorilla_openfunctions_v1_test_rest_result.json from Case 2

(BFCL):/data/gorilla/berkeley-function-call-leaderboard/eval_checker]$ python eval_runner.py --model gorilla-openfunctions-v2 --test-category simple
🦍 Model: gorilla-openfunctions-v2
🔍 Running test: simple
✅ Test completed: simple. 🎯 Accuracy: 0.9475
📈 Aggregating data to generate leaderboard score table...

============================== Model Category Status ==============================

Model: gorilla-openfunctions-v2
  Missing results for 1 categories:
    - rest
  Unevaluated results for 1 categories:
    - java

======================================== Recommended Actions ========================================

To address these issues, run the following commands from the project root directory:

cd .. && \
python openfunctions_evaluation.py --model gorilla-openfunctions-v2 --test-category rest && \
cd eval_checker && \
python eval_runner.py --model gorilla-openfunctions-v2 --test-category rest && \
python eval_runner.py --model gorilla-openfunctions-v2 --test-category java

====================================================================================================

📈 Leaderboard statistics generated successfully! See /data/gorilla/berkeley-function-call-leaderboard/score/data.csv for evaluation results.
🏁 Evaluation completed. See /data/gorilla/berkeley-function-call-leaderboard/score/data.csv for evaluation results.
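The three cases above boil down to a set comparison between the expected categories, the result files present in /result, and the score files present in /score. Here is a minimal Python sketch of that logic; the function names and signatures are assumptions for illustration, not the actual eval_runner.py internals.

```python
# Sketch (not the actual BFCL source) of the checks behind the output above.
# Function names and the category/file layout are assumptions for illustration.

def check_model_category_status(all_categories, result_categories, score_categories):
    """Split expected categories into 'missing' (no result file in /result)
    and 'unevaluated' (result file exists but no score file in /score)."""
    missing = sorted(all_categories - result_categories)
    unevaluated = sorted((result_categories & all_categories) - score_categories)
    return missing, unevaluated


def recommended_actions(model, missing, unevaluated):
    """Build the command list shown under 'Recommended Actions':
    regenerate missing results from the project root, then (re)run the
    eval checker for every missing or unevaluated category."""
    if not missing and not unevaluated:
        return []
    cmds = []
    if missing:
        cmds.append("cd .. && \\")
        for cat in missing:
            cmds.append(
                f"python openfunctions_evaluation.py --model {model} --test-category {cat} && \\"
            )
    cmds.append("cd eval_checker && \\")
    for cat in missing + unevaluated:
        cmds.append(f"python eval_runner.py --model {model} --test-category {cat} && \\")
    # The last command ends the chain, so drop its trailing '&& \'.
    cmds[-1] = cmds[-1].removesuffix(" && \\")
    return cmds
```

Under these assumptions, Case 3 (rest result deleted, java score deleted) yields missing=["rest"] and unevaluated=["java"], and the generated command chain matches the Recommended Actions block shown above.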

@CharlieJCJ (Collaborator) left a review comment:

LGTM

@HuanzhiMao HuanzhiMao added the DO NOT MERGE Not ready to be merged label Aug 5, 2024
@HuanzhiMao HuanzhiMao removed the DO NOT MERGE Not ready to be merged label Aug 7, 2024
@HuanzhiMao (Collaborator, Author) left a review comment:

Tested. LGTM

@ShishirPatil ShishirPatil merged commit 379db26 into ShishirPatil:main Aug 10, 2024
@HuanzhiMao HuanzhiMao deleted the data-aggregate-warning branch August 13, 2024 06:04
aw632 pushed a commit to vinaybagade/gorilla that referenced this pull request Aug 22, 2024
…#517)

As mentioned in ShishirPatil#506, this PR makes the warning messages more informative
so that users know what actions to take when aggregating leaderboard results.

---------

Co-authored-by: CharlieJCJ <charliechengjieji@berkeley.edu>