
[BFCL] Improve Warning Message when Aggregating Results #517

Merged: 19 commits merged into ShishirPatil:main from data-aggregate-warning on Aug 10, 2024

Conversation

HuanzhiMao (Collaborator)

As mentioned in #506, this PR makes the warning messages more informative so that users know what actions to take when aggregating leaderboard results.

@HuanzhiMao HuanzhiMao marked this pull request as ready for review July 10, 2024 21:42
@ShishirPatil (Owner)

@CharlieJCJ can you review this and LGTM if ready?

@CharlieJCJ (Collaborator) commented Jul 30, 2024

@CharlieJCJ can you review this and LGTM if ready?

The current PR doesn't add the "actionable error message" yet. In general, I think we should pair error messages with actionable items across all functionalities, not just here.

I will review this version (which improves the warning message when aggregating results) today, and we can add actionable error messages in a new issue/PR after the refactored version, making these UX enhancements gradually going forward.

@CharlieJCJ (Collaborator) commented Jul 31, 2024

Testing with gorilla-openfunctions-v2; replicable with other models:

Example behaviors:

Case 1: If everything is presented in result and score

(BFCL): /data/gorilla/berkeley-function-call-leaderboard/eval_checker]$ python eval_runner.py --model gorilla-openfunctions-v2 --test-category simple
🦍 Model: gorilla-openfunctions-v2
🔍 Running test: simple
✅ Test completed: simple. 🎯 Accuracy: 0.9475
📈 Aggregating data to generate leaderboard score table...

============================== Model Category Status ==============================

Model: gorilla-openfunctions-v2
  All categories are present and evaluated.


🎉 All categories are present and evaluated for all models!
📈 Leaderboard statistics generated successfully! See /data/gorilla/berkeley-function-call-leaderboard/score/data.csv for evaluation results.
🏁 Evaluation completed. See /data/gorilla/berkeley-function-call-leaderboard/score/data.csv for evaluation results.

Case 2: If I deleted /score/gorilla-openfunctions-v2/java_score.json from Case 1

(BFCL):/data/gorilla/berkeley-function-call-leaderboard/eval_checker]$ python eval_runner.py --model gorilla-openfunctions-v2 --test-category simple
🦍 Model: gorilla-openfunctions-v2
🔍 Running test: simple
✅ Test completed: simple. 🎯 Accuracy: 0.9475
📈 Aggregating data to generate leaderboard score table...

============================== Model Category Status ==============================

Model: gorilla-openfunctions-v2
  Unevaluated results for 1 categories:
    - java

======================================== Recommended Actions ========================================

To address these issues, run the following commands from the project root directory:

cd eval_checker && \
python eval_runner.py --model gorilla-openfunctions-v2 --test-category java

====================================================================================================

📈 Leaderboard statistics generated successfully! See /data/gorilla/berkeley-function-call-leaderboard/score/data.csv for evaluation results.
🏁 Evaluation completed. See /data/gorilla/berkeley-function-call-leaderboard/score/data.csv for evaluation results.

Case 3: If I then deleted /result/gorilla-openfunctions-v2/gorilla_openfunctions_v1_test_rest_result.json from Case 2

(BFCL):/data/gorilla/berkeley-function-call-leaderboard/eval_checker]$ python eval_runner.py --model gorilla-openfunctions-v2 --test-category simple
🦍 Model: gorilla-openfunctions-v2
🔍 Running test: simple
✅ Test completed: simple. 🎯 Accuracy: 0.9475
📈 Aggregating data to generate leaderboard score table...

============================== Model Category Status ==============================

Model: gorilla-openfunctions-v2
  Missing results for 1 categories:
    - rest
  Unevaluated results for 1 categories:
    - java

======================================== Recommended Actions ========================================

To address these issues, run the following commands from the project root directory:

cd .. && \
python openfunctions_evaluation.py --model gorilla-openfunctions-v2 --test-category rest && \
cd eval_checker && \
python eval_runner.py --model gorilla-openfunctions-v2 --test-category rest && \
python eval_runner.py --model gorilla-openfunctions-v2 --test-category java

====================================================================================================

📈 Leaderboard statistics generated successfully! See /data/gorilla/berkeley-function-call-leaderboard/score/data.csv for evaluation results.
🏁 Evaluation completed. See /data/gorilla/berkeley-function-call-leaderboard/score/data.csv for evaluation results.
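The three cases above boil down to a set comparison between the expected categories, the result files present in /result, and the score files present in /score. Here is a minimal Python sketch of that logic; the function names and signatures are assumptions for illustration, not the actual eval_runner.py internals.

```python
# Sketch (not the actual BFCL source) of the checks behind the output above.
# Function names and the category/file layout are assumptions for illustration.

def check_model_category_status(all_categories, result_categories, score_categories):
    """Split expected categories into 'missing' (no result file in /result)
    and 'unevaluated' (result file exists but no score file in /score)."""
    missing = sorted(all_categories - result_categories)
    unevaluated = sorted((result_categories & all_categories) - score_categories)
    return missing, unevaluated


def recommended_actions(model, missing, unevaluated):
    """Build the command list shown under 'Recommended Actions':
    regenerate missing results from the project root, then (re)run the
    eval checker for every missing or unevaluated category."""
    if not missing and not unevaluated:
        return []
    cmds = []
    if missing:
        cmds.append("cd .. && \\")
        for cat in missing:
            cmds.append(
                f"python openfunctions_evaluation.py --model {model} --test-category {cat} && \\"
            )
    cmds.append("cd eval_checker && \\")
    for cat in missing + unevaluated:
        cmds.append(f"python eval_runner.py --model {model} --test-category {cat} && \\")
    # The last command ends the chain, so drop its trailing '&& \'.
    cmds[-1] = cmds[-1].removesuffix(" && \\")
    return cmds
```

Under these assumptions, Case 3 (rest result deleted, java score deleted) yields missing=["rest"] and unevaluated=["java"], and the generated command chain matches the Recommended Actions block shown above.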

@CharlieJCJ (Collaborator) left a review comment:

LGTM

@HuanzhiMao HuanzhiMao added the DO NOT MERGE Not ready to be merged label Aug 5, 2024
@HuanzhiMao HuanzhiMao removed the DO NOT MERGE Not ready to be merged label Aug 7, 2024
@HuanzhiMao (Collaborator, Author) left a review comment:

Tested. LGTM

@ShishirPatil ShishirPatil merged commit 379db26 into ShishirPatil:main Aug 10, 2024
@HuanzhiMao HuanzhiMao deleted the data-aggregate-warning branch August 13, 2024 06:04
aw632 pushed a commit to vinaybagade/gorilla that referenced this pull request Aug 22, 2024
…#517)

As mentioned in ShishirPatil#506, this PR makes the warning messages more informative
so that users know what actions to take when aggregating leaderboard results.

---------

Co-authored-by: CharlieJCJ <charliechengjieji@berkeley.edu>