Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix some bugs in test case prompts/ground truths #608

Merged
merged 7 commits into from
Aug 29, 2024

Conversation

aw632
Copy link
Contributor

@aw632 aw632 commented Aug 26, 2024

Summary of the changes:

ID Reason for Change
simple_5 Allowing 'all' roots as a valid answer, as it includes real roots
simple_96 Clarification needed in the prompt
simple_238 Expanding acceptable answers to include all years of the Civil War (1861-1865)
simple_122 Mismatch between specification (array of integers) and ground truth (array of array of integers)
simple_309 Adding Tampa Bay Buccaneers to ground truth due to Tom Brady's 2020 NFL season
simple_235 Including Lisbon, Portugal as an acceptable answer
simple_267 Accepting "New York" as per the location parameter in the prompt
simple_308 Addressing grammar issues in the prompt
simple_316 Adding "female" as an acceptable answer for Serena Williams
simple_375 Removing LA from ground truth as it's not mentioned in the prompt
simple_379 Including "Australia/Sydney" as a valid timezone identifier
multiple_88 Accepting PC as a platform for the WOW game
multiple_119 Same issue as simple_96, needs clarification in the prompt
multiple_153 Same issue as simple_235, including Lisbon, Portugal as an acceptable answer

This PR does change the leaderboard values and will be updated in a separate PR.

@aw632 aw632 force-pushed the aw632/fix_test_cases branch from 5ad4947 to ccd28be Compare August 26, 2024 18:46
Copy link
Collaborator

@HuanzhiMao HuanzhiMao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @aw632,
Thanks for the PR!

Regarding simple_5: The question did not specify which kind of roots it is looking for, so the default value "real" should be used. Thus, the possible answer should be ["", "real"].
Same thing with multiple_88.

The rest looks good to me.

@aw632
Copy link
Contributor Author

aw632 commented Aug 27, 2024

Hi @aw632, Thanks for the PR!

Regarding simple_5: The question did not specify which kind of roots it is looking for, so the default value "real" should be used. Thus, the possible answer should be ["", "real"]. Same thing with multiple_88.

The rest looks good to me.

Hi, thanks for your comment. W.r.t simple_5, in this case, should we change the prompt so the default is "all" instead? Mathematically, that makes more sense as a default.

@HuanzhiMao HuanzhiMao added the BFCL-Dataset BFCL Dataset-Related Issue label Aug 27, 2024
@HuanzhiMao
Copy link
Collaborator

Hi @aw632, Thanks for the PR!
Regarding simple_5: The question did not specify which kind of roots it is looking for, so the default value "real" should be used. Thus, the possible answer should be ["", "real"]. Same thing with multiple_88.
The rest looks good to me.

Hi, thanks for your comment. W.r.t simple_5, in this case, should we change the prompt so the default is "all" instead? Mathematically, that makes more sense as a default.

Fair point. I will update the prompt to eliminate ambiguity.

Copy link
Collaborator

@HuanzhiMao HuanzhiMao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Total number of entries affected: 21
  • Simple: 16 entry
    • simple_5, simple_13, simple_96, simple_122, simple_156, simple_183, simple_235, simple_238, simple_267, simple_308, simple_309, simple_316, simple_375, simple_379, simple_389, simple_398
  • Multiple: 5 entry
    • multiple_74, multiple_88, multiple_119, multiple_153, multiple_183

Copy link
Owner

@ShishirPatil ShishirPatil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR @aw632 and welcome!

@ShishirPatil ShishirPatil merged commit eca516a into ShishirPatil:main Aug 29, 2024
@aw632 aw632 deleted the aw632/fix_test_cases branch August 30, 2024 06:54
ShishirPatil pushed a commit that referenced this pull request Sep 15, 2024
…627, #635, and #638. (#639)

This PR updates the leaderboard to reflect the change in score due to
the following PR merge:

1. #608
2. #600
3. #616 
4. #623
5. #626
6. #627
7. #635 
8. #638

---------

Co-authored-by: Charlie Cheng-Jie Ji <charliechengjieji@berkeley.edu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BFCL-Dataset BFCL Dataset-Related Issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants