[BFCL] Improve Warning Message when Aggregating Results #517
Conversation
berkeley-function-call-leaderboard/eval_checker/eval_runner_helper.py
@CharlieJCJ can you review this and LGTM if ready?
The current PR doesn't add the "actionable error message" yet. I think in general we should add error messages + actionable items across all functionalities, not just this place. I will review this version today for now (which …
…ing' into data-aggregate-warning
Testing with example behaviors:

Case 1: If everything is present:

(BFCL): /data/gorilla/berkeley-function-call-leaderboard/eval_checker]$ python eval_runner.py --model gorilla-openfunctions-v2 --test-category simple
🦍 Model: gorilla-openfunctions-v2
🔍 Running test: simple
✅ Test completed: simple. 🎯 Accuracy: 0.9475
📈 Aggregating data to generate leaderboard score table...
============================== Model Category Status ==============================
Model: gorilla-openfunctions-v2
All categories are present and evaluated.
🎉 All categories are present and evaluated for all models!
📈 Leaderboard statistics generated successfully! See /data/gorilla/berkeley-function-call-leaderboard/score/data.csv for evaluation results.
🏁 Evaluation completed. See /data/gorilla/berkeley-function-call-leaderboard/score/data.csv for evaluation results.

Case 2: If I deleted:

(BFCL):/data/gorilla/berkeley-function-call-leaderboard/eval_checker]$ python eval_runner.py --model gorilla-openfunctions-v2 --test-category simple
🦍 Model: gorilla-openfunctions-v2
🔍 Running test: simple
✅ Test completed: simple. 🎯 Accuracy: 0.9475
📈 Aggregating data to generate leaderboard score table...
============================== Model Category Status ==============================
Model: gorilla-openfunctions-v2
Unevaluated results for 1 categories:
- java
======================================== Recommended Actions ========================================
To address these issues, run the following commands from the project root directory:
cd eval_checker && \
python eval_runner.py --model gorilla-openfunctions-v2 --test-category java
====================================================================================================
📈 Leaderboard statistics generated successfully! See /data/gorilla/berkeley-function-call-leaderboard/score/data.csv for evaluation results.
🏁 Evaluation completed. See /data/gorilla/berkeley-function-call-leaderboard/score/data.csv for evaluation results.

Case 3: If I then deleted:

(BFCL):/data/gorilla/berkeley-function-call-leaderboard/eval_checker]$ python eval_runner.py --model gorilla-openfunctions-v2 --test-category simple
🦍 Model: gorilla-openfunctions-v2
🔍 Running test: simple
✅ Test completed: simple. 🎯 Accuracy: 0.9475
📈 Aggregating data to generate leaderboard score table...
============================== Model Category Status ==============================
Model: gorilla-openfunctions-v2
Missing results for 1 categories:
- rest
Unevaluated results for 1 categories:
- java
======================================== Recommended Actions ========================================
To address these issues, run the following commands from the project root directory:
cd .. && \
python openfunctions_evaluation.py --model gorilla-openfunctions-v2 --test-category rest && \
cd eval_checker && \
python eval_runner.py --model gorilla-openfunctions-v2 --test-category rest && \
python eval_runner.py --model gorilla-openfunctions-v2 --test-category java
====================================================================================================
📈 Leaderboard statistics generated successfully! See /data/gorilla/berkeley-function-call-leaderboard/score/data.csv for evaluation results.
🏁 Evaluation completed. See /data/gorilla/berkeley-function-call-leaderboard/score/data.csv for evaluation results.
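The "Recommended Actions" block in Cases 2 and 3 chains generation commands (run from the project root) and evaluation commands (run from `eval_checker/`) with `&&`, so missing categories are generated before anything is evaluated. A minimal sketch of how that command chain could be composed; the function and variable names here are assumptions for illustration, not the actual `eval_runner_helper.py` implementation:

```python
# Hypothetical sketch of assembling the "Recommended Actions" command chain.
# Names (build_recommendation, missing, unevaluated) are assumptions, not the
# actual eval_runner_helper.py code.

def build_recommendation(model: str, missing: list[str], unevaluated: list[str]) -> str:
    """Compose one shell command chain covering all follow-up steps.

    Missing categories have no result file, so they need generation first
    (from the project root) and then evaluation; unevaluated categories
    already have results and only need evaluation (from eval_checker/).
    """
    commands = []
    if missing:
        # Generation step runs from the project root.
        commands.append("cd .. && \\")
        for category in missing:
            commands.append(
                f"python openfunctions_evaluation.py --model {model} --test-category {category} && \\"
            )
    # Evaluation step runs from inside eval_checker/.
    commands.append("cd eval_checker && \\")
    for category in missing + unevaluated:
        commands.append(
            f"python eval_runner.py --model {model} --test-category {category} && \\"
        )
    # The final command should not end with a continuation.
    commands[-1] = commands[-1].removesuffix(" && \\")
    return "\n".join(commands)
```

Under these assumptions, `build_recommendation("gorilla-openfunctions-v2", [], ["java"])` reproduces the Case 2 block, and passing `["rest"]` as `missing` reproduces the Case 3 block with generation preceding evaluation.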
LGTM
…e user asks for, and also give generation and evaluation of multiple test categories in one line instead of multiple lines.
…ing' into data-aggregate-warning
Tested. LGTM
…#517) As mentioned in ShishirPatil#506, this PR makes the warning messages more informative so users know the action items when aggregating leaderboard results. --------- Co-authored-by: CharlieJCJ <charliechengjieji@berkeley.edu>
As mentioned in #506, this PR makes the warning messages more informative so users know the action items when aggregating leaderboard results.