forked from EleutherAI/lm-evaluation-harness
update bbh, gsm8k, mmlu parsing logic and prompts (Orca2 bbh_cot_zeroshot 0% -> 42%) (EleutherAI#1356)

* update bbh, gsm8k, mmlu parsing logic and prompts
* remove the formatting prompt (bbh) + minor update (mmlu)
* update bbh, gsm8k, mmlu zeroshot, revert fewshots
* update bbh, gsm8k, mmlu version, forward changes to gsm8k-cot
* remove take_last, update to use docs parameters
* add newline
* ruff formatting
* Update pyproject.toml
* fix format

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Showing 67 changed files with 1,420 additions and 53 deletions.
@@ -1,5 +1,18 @@
 "dataset_name": "boolean_expressions"
 "description": "Evaluate the result of a random Boolean expression.\n\n"
-"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
+"doc_to_text": "Q: {{input}}\nA: Let's think step by step."
 "include": "_cot_zeroshot_template_yaml"
 "task": "bbh_cot_zeroshot_boolean_expressions"
+
+filter_list:
+  - name: "flexible-extract"
+    filter:
+      - function: "regex"
+        group_select: -1
+        regex_pattern: "\\b(True|False)\\b"
+      - function: "take_first"
+  - name: "strict-match"
+    filter:
+      - function: "regex"
+        regex_pattern: "((?<=The answer is )(.*)(?=.)|(?<=the answer is )(.*)(?=.)|(?<=The answer: )(.*)(?=.)|(?<=The final answer: )(.*)(?=.))"
+      - function: "take_first"
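To make the two filters above concrete, here is a minimal Python sketch of what the patterns match. It is not the harness's implementation: the helper names `strict_extract` and `flexible_extract` are hypothetical, and it assumes that `group_select: -1` means "take the last match", which this diff does not show.

```python
import re

# Pattern copied from the strict-match filter (YAML escaping removed).
STRICT = re.compile(
    r"((?<=The answer is )(.*)(?=.)|(?<=the answer is )(.*)(?=.)"
    r"|(?<=The answer: )(.*)(?=.)|(?<=The final answer: )(.*)(?=.))"
)
# Pattern copied from this task's flexible-extract filter.
FLEXIBLE = re.compile(r"\b(True|False)\b")


def strict_extract(text):
    """Hypothetical helper: first strict-match hit, or None."""
    m = STRICT.search(text)
    return m.group(1) if m else None


def flexible_extract(text):
    """Hypothetical helper: last True/False token (assumed meaning of group_select: -1)."""
    hits = FLEXIBLE.findall(text)
    return hits[-1] if hits else None


print(strict_extract("not True is False. So the answer is False."))
print(strict_extract("The answer is False"))
print(flexible_extract("not True is False, and False or True is True"))
```

Note that the `.` in `(?=.)` is unescaped, so the lookahead accepts any character rather than only a literal period. The greedy `(.*)` therefore always gives up the last character of the line: `False.` is extracted as `False`, but an answer without a trailing period, such as `False`, is extracted as `Fals`.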
@@ -1,5 +1,18 @@
 "dataset_name": "causal_judgement"
 "description": "Answer questions about causal attribution.\n\n"
-"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
+"doc_to_text": "Q: {{input}}\nA: Let's think step by step."
 "include": "_cot_zeroshot_template_yaml"
 "task": "bbh_cot_zeroshot_causal_judgement"
+
+filter_list:
+  - name: "flexible-extract"
+    filter:
+      - function: "regex"
+        group_select: -1
+        regex_pattern: "\\b(Yes|No|yes|no)\\b"
+      - function: "take_first"
+  - name: "strict-match"
+    filter:
+      - function: "regex"
+        regex_pattern: "((?<=The answer is )(.*)(?=.)|(?<=the answer is )(.*)(?=.)|(?<=The answer: )(.*)(?=.)|(?<=The final answer: )(.*)(?=.))"
+      - function: "take_first"
@@ -1,5 +1,20 @@
 "dataset_name": "date_understanding"
 "description": "Infer the date from context.\n\n"
-"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
+"doc_to_text": "Q: {{input}}\nA: Let's think step by step."
 "include": "_cot_zeroshot_template_yaml"
 "task": "bbh_cot_zeroshot_date_understanding"
+
+filter_list:
+  - name: "flexible-extract"
+    filter:
+      - function: !function utils.MultiChoiceRegexFilter
+        group_select: -1
+        ignore_case: true
+        ignore_punctuation: true
+        regex_pattern: "(\\([A-Z]\\))"
+      - function: "take_first"
+  - name: "strict-match"
+    filter:
+      - function: "regex"
+        regex_pattern: "((?<=The answer is )(.*)(?=.)|(?<=the answer is )(.*)(?=.)|(?<=The answer: )(.*)(?=.)|(?<=The final answer: )(.*)(?=.))"
+      - function: "take_first"
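The multiple-choice tasks use the pattern `(\([A-Z]\))` through `utils.MultiChoiceRegexFilter`, whose source is not part of this excerpt. The sketch below illustrates only the regex itself, again assuming `group_select: -1` picks the last match; `last_choice` is a hypothetical helper, and whatever `ignore_case` and `ignore_punctuation` do inside the real filter is not modelled here.

```python
import re

# Pattern copied from the flexible-extract filter (YAML escaping removed).
CHOICE = re.compile(r"(\([A-Z]\))")


def last_choice(text):
    """Hypothetical helper: last parenthesised option letter, or None."""
    hits = CHOICE.findall(text)
    return hits[-1] if hits else None


print(last_choice("Option (A) is a week too early, so the answer is (C)."))
print(last_choice("The answer is C"))
```

Taking the last match rather than the first matters for chain-of-thought output, where the model often discusses several options before committing to one. The bare pattern finds nothing when the letter is written without parentheses.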
@@ -1,5 +1,20 @@
 "dataset_name": "disambiguation_qa"
 "description": "Clarify the meaning of sentences with ambiguous pronouns.\n\n"
-"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
+"doc_to_text": "Q: {{input}}\nA: Let's think step by step."
 "include": "_cot_zeroshot_template_yaml"
 "task": "bbh_cot_zeroshot_disambiguation_qa"
+
+filter_list:
+  - name: "flexible-extract"
+    filter:
+      - function: !function utils.MultiChoiceRegexFilter
+        group_select: -1
+        ignore_case: true
+        ignore_punctuation: true
+        regex_pattern: "(\\([A-Z]\\))"
+      - function: "take_first"
+  - name: "strict-match"
+    filter:
+      - function: "regex"
+        regex_pattern: "((?<=The answer is )(.*)(?=.)|(?<=the answer is )(.*)(?=.)|(?<=The answer: )(.*)(?=.)|(?<=The final answer: )(.*)(?=.))"
+      - function: "take_first"
@@ -1,5 +1,17 @@
 "dataset_name": "dyck_languages"
 "description": "Correctly close a Dyck-n word.\n\n"
-"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
+"doc_to_text": "Q: {{input}}\nA: Let's think step by step."
 "include": "_cot_zeroshot_template_yaml"
 "task": "bbh_cot_zeroshot_dyck_languages"
+filter_list:
+  - name: "flexible-extract"
+    filter:
+      - function: "regex"
+        group_select: -1
+        regex_pattern: "(?<= )([\" \\[\\(<{}>\\)\\]]+)|([\" \\[\\(<{}>\\)\\]]+)"
+      - function: "take_first"
+  - name: "strict-match"
+    filter:
+      - function: "regex"
+        regex_pattern: "((?<=The answer is )(.*)(?=.)|(?<=the answer is )(.*)(?=.)|(?<=The answer: )(.*)(?=.)|(?<=The final answer: )(.*)(?=.))"
+      - function: "take_first"
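The Dyck pattern has two capture groups, so `re.findall` returns a tuple per match rather than a string. The following sketch shows one way to read it, assuming the last match is selected and the non-empty group is kept; the helper `last_bracket_run` and the final `strip()` are illustrative choices, not behaviour confirmed by this diff.

```python
import re

# Pattern copied from the flexible-extract filter (YAML escaping removed).
DYCK = re.compile(r'(?<= )([" \[\(<{}>\)\]]+)|([" \[\(<{}>\)\]]+)')


def last_bracket_run(text):
    """Hypothetical helper: last run of bracket/space/quote characters, stripped."""
    matches = DYCK.findall(text)  # one 2-tuple per match
    if not matches:
        return None
    # Exactly one group of the alternation is non-empty per match.
    groups = [g for g in matches[-1] if g]
    return groups[0].strip() if groups else None


print(last_bracket_run("answer: ] ) >"))
print(last_bracket_run("close with ) ]"))
```

Because the character class includes the space character, the pattern also matches the single spaces between ordinary words. Selecting the last match is what keeps the extraction on the closing bracket sequence at the end of the response.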
@@ -1,5 +1,18 @@
 "dataset_name": "formal_fallacies"
 "description": "Distinguish deductively valid arguments from formal fallacies.\n\n"
-"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
+"doc_to_text": "Q: {{input}}\nA: Let's think step by step."
 "include": "_cot_zeroshot_template_yaml"
 "task": "bbh_cot_zeroshot_formal_fallacies"
+
+filter_list:
+  - name: "flexible-extract"
+    filter:
+      - function: "regex"
+        group_select: -1
+        regex_pattern: "\\b(valid|invalid)\\b"
+      - function: "take_first"
+  - name: "strict-match"
+    filter:
+      - function: "regex"
+        regex_pattern: "((?<=The answer is )(.*)(?=.)|(?<=the answer is )(.*)(?=.)|(?<=The answer: )(.*)(?=.)|(?<=The final answer: )(.*)(?=.))"
+      - function: "take_first"
@@ -1,5 +1,20 @@
 "dataset_name": "geometric_shapes"
 "description": "Name geometric shapes from their SVG paths.\n\n"
-"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
+"doc_to_text": "Q: {{input}}\nA: Let's think step by step."
 "include": "_cot_zeroshot_template_yaml"
 "task": "bbh_cot_zeroshot_geometric_shapes"
+
+filter_list:
+  - name: "flexible-extract"
+    filter:
+      - function: !function utils.MultiChoiceRegexFilter
+        group_select: -1
+        ignore_case: true
+        ignore_punctuation: true
+        regex_pattern: "(\\([A-Z]\\))"
+      - function: "take_first"
+  - name: "strict-match"
+    filter:
+      - function: "regex"
+        regex_pattern: "((?<=The answer is )(.*)(?=.)|(?<=the answer is )(.*)(?=.)|(?<=The answer: )(.*)(?=.)|(?<=The final answer: )(.*)(?=.))"
+      - function: "take_first"
@@ -1,5 +1,20 @@
 "dataset_name": "hyperbaton"
 "description": "Order adjectives correctly in English sentences.\n\n"
-"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
+"doc_to_text": "Q: {{input}}\nA: Let's think step by step."
 "include": "_cot_zeroshot_template_yaml"
 "task": "bbh_cot_zeroshot_hyperbaton"
+
+filter_list:
+  - name: "flexible-extract"
+    filter:
+      - function: !function utils.MultiChoiceRegexFilter
+        group_select: -1
+        ignore_case: true
+        ignore_punctuation: true
+        regex_pattern: "(\\([A-Z]\\))"
+      - function: "take_first"
+  - name: "strict-match"
+    filter:
+      - function: "regex"
+        regex_pattern: "((?<=The answer is )(.*)(?=.)|(?<=the answer is )(.*)(?=.)|(?<=The answer: )(.*)(?=.)|(?<=The final answer: )(.*)(?=.))"
+      - function: "take_first"
lm_eval/tasks/bbh/cot_zeroshot/logical_deduction_five_objects.yaml (16 changes: 15 additions & 1 deletion)
@@ -1,5 +1,19 @@
 "dataset_name": "logical_deduction_five_objects"
 "description": "A logical deduction task which requires deducing the order of a sequence of objects.\n\n"
-"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
+"doc_to_text": "Q: {{input}}\nA: Let's think step by step."
 "include": "_cot_zeroshot_template_yaml"
 "task": "bbh_cot_zeroshot_logical_deduction_five_objects"
+filter_list:
+  - name: "flexible-extract"
+    filter:
+      - function: !function utils.MultiChoiceRegexFilter
+        group_select: -1
+        ignore_case: true
+        ignore_punctuation: true
+        regex_pattern: "(\\([A-Z]\\))"
+      - function: "take_first"
+  - name: "strict-match"
+    filter:
+      - function: "regex"
+        regex_pattern: "((?<=The answer is )(.*)(?=.)|(?<=the answer is )(.*)(?=.)|(?<=The answer: )(.*)(?=.)|(?<=The final answer: )(.*)(?=.))"
+      - function: "take_first"
lm_eval/tasks/bbh/cot_zeroshot/logical_deduction_seven_objects.yaml (16 changes: 15 additions & 1 deletion)
@@ -1,5 +1,19 @@
 "dataset_name": "logical_deduction_seven_objects"
 "description": "A logical deduction task which requires deducing the order of a sequence of objects.\n\n"
-"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
+"doc_to_text": "Q: {{input}}\nA: Let's think step by step."
 "include": "_cot_zeroshot_template_yaml"
 "task": "bbh_cot_zeroshot_logical_deduction_seven_objects"
+filter_list:
+  - name: "flexible-extract"
+    filter:
+      - function: !function utils.MultiChoiceRegexFilter
+        group_select: -1
+        ignore_case: true
+        ignore_punctuation: true
+        regex_pattern: "(\\([A-Z]\\))"
+      - function: "take_first"
+  - name: "strict-match"
+    filter:
+      - function: "regex"
+        regex_pattern: "((?<=The answer is )(.*)(?=.)|(?<=the answer is )(.*)(?=.)|(?<=The answer: )(.*)(?=.)|(?<=The final answer: )(.*)(?=.))"
+      - function: "take_first"
lm_eval/tasks/bbh/cot_zeroshot/logical_deduction_three_objects.yaml (16 changes: 15 additions & 1 deletion)
@@ -1,5 +1,19 @@
 "dataset_name": "logical_deduction_three_objects"
 "description": "A logical deduction task which requires deducing the order of a sequence of objects.\n\n"
-"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
+"doc_to_text": "Q: {{input}}\nA: Let's think step by step."
 "include": "_cot_zeroshot_template_yaml"
 "task": "bbh_cot_zeroshot_logical_deduction_three_objects"
+filter_list:
+  - name: "flexible-extract"
+    filter:
+      - function: !function utils.MultiChoiceRegexFilter
+        group_select: -1
+        ignore_case: true
+        ignore_punctuation: true
+        regex_pattern: "(\\([A-Z]\\))"
+      - function: "take_first"
+  - name: "strict-match"
+    filter:
+      - function: "regex"
+        regex_pattern: "((?<=The answer is )(.*)(?=.)|(?<=the answer is )(.*)(?=.)|(?<=The answer: )(.*)(?=.)|(?<=The final answer: )(.*)(?=.))"
+      - function: "take_first"
@@ -1,5 +1,19 @@
 "dataset_name": "movie_recommendation"
 "description": "Recommend movies similar to the given list of movies.\n\n"
-"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
+"doc_to_text": "Q: {{input}}\nA: Let's think step by step."
 "include": "_cot_zeroshot_template_yaml"
 "task": "bbh_cot_zeroshot_movie_recommendation"
+filter_list:
+  - name: "flexible-extract"
+    filter:
+      - function: !function utils.MultiChoiceRegexFilter
+        group_select: -1
+        ignore_case: true
+        ignore_punctuation: true
+        regex_pattern: "(\\([A-Z]\\))"
+      - function: "take_first"
+  - name: "strict-match"
+    filter:
+      - function: "regex"
+        regex_pattern: "((?<=The answer is )(.*)(?=.)|(?<=the answer is )(.*)(?=.)|(?<=The answer: )(.*)(?=.)|(?<=The final answer: )(.*)(?=.))"
+      - function: "take_first"
lm_eval/tasks/bbh/cot_zeroshot/multistep_arithmetic_two.yaml (16 changes: 15 additions & 1 deletion)
@@ -1,5 +1,19 @@
 "dataset_name": "multistep_arithmetic_two"
 "description": "Solve multi-step arithmetic problems.\n\n"
-"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
+"doc_to_text": "Q: {{input}}\nA: Let's think step by step."
 "include": "_cot_zeroshot_template_yaml"
 "task": "bbh_cot_zeroshot_multistep_arithmetic_two"
+
+filter_list:
+  - name: "flexible-extract"
+    filter:
+      - function: !function utils.NumberParseRegexFilter
+        group_select: -1
+        regex_pattern: "([-0-9]+)"
+      - function: "take_first"
+  - name: "strict-match"
+    filter:
+      - function: "regex"
+        regex_pattern: "((?<=The answer is )(.*)(?=.)|(?<=the answer is )(.*)(?=.)|(?<=The answer: )(.*)(?=.)|(?<=The final answer: )(.*)(?=.))"
+      - function: "take_first"
+
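The numeric tasks pass `([-0-9]+)` to `utils.NumberParseRegexFilter`, a custom filter whose source is not shown in this excerpt. As a rough illustration of the pattern alone, under the same assumption that `group_select: -1` takes the last match, a hypothetical helper `last_number_token` could look like this:

```python
import re

# Pattern copied from the flexible-extract filter.
NUMBER = re.compile(r"([-0-9]+)")


def last_number_token(text):
    """Hypothetical helper: last run of digits and hyphens, or None."""
    hits = NUMBER.findall(text)
    return hits[-1] if hits else None


print(last_number_token("((3 + 4) * -2) = -14"))
print(last_number_token("so we get 10-3"))
```

The character class treats `-` like a digit, so it captures negative results such as `-14`, but it also returns an unevaluated expression like `10-3` as a single token and does not match decimal points.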
@@ -1,5 +1,17 @@
 "dataset_name": "navigate"
 "description": "Given a series of navigation instructions, determine whether one would end up back at the starting point.\n\n"
-"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
+"doc_to_text": "Q: {{input}}\nA: Let's think step by step."
 "include": "_cot_zeroshot_template_yaml"
 "task": "bbh_cot_zeroshot_navigate"
+filter_list:
+  - name: "flexible-extract"
+    filter:
+      - function: "regex"
+        group_select: -1
+        regex_pattern: "\\b(Yes|No|yes|no)\\b"
+      - function: "take_first"
+  - name: "strict-match"
+    filter:
+      - function: "regex"
+        regex_pattern: "((?<=The answer is )(.*)(?=.)|(?<=the answer is )(.*)(?=.)|(?<=The answer: )(.*)(?=.)|(?<=The final answer: )(.*)(?=.))"
+      - function: "take_first"
@@ -1,5 +1,18 @@
 "dataset_name": "object_counting"
 "description": "Questions that involve enumerating objects and asking the model to count them.\n\n"
-"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
+"doc_to_text": "Q: {{input}}\nA: Let's think step by step."
 "include": "_cot_zeroshot_template_yaml"
 "task": "bbh_cot_zeroshot_object_counting"
+filter_list:
+  - name: "flexible-extract"
+    filter:
+      - function: !function utils.NumberParseRegexFilter
+        group_select: -1
+        regex_pattern: "([-0-9]+)"
+      - function: "take_first"
+  - name: "strict-match"
+    filter:
+      - function: "regex"
+        regex_pattern: "((?<=The answer is )(.*)(?=.)|(?<=the answer is )(.*)(?=.)|(?<=The answer: )(.*)(?=.)|(?<=The final answer: )(.*)(?=.))"
+      - function: "take_first"
+
@@ -1,5 +1,19 @@
 "dataset_name": "penguins_in_a_table"
 "description": "Answer questions about a table of penguins and their attributes.\n\n"
-"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
+"doc_to_text": "Q: {{input}}\nA: Let's think step by step."
 "include": "_cot_zeroshot_template_yaml"
 "task": "bbh_cot_zeroshot_penguins_in_a_table"
+filter_list:
+  - name: "flexible-extract"
+    filter:
+      - function: !function utils.MultiChoiceRegexFilter
+        group_select: -1
+        ignore_case: true
+        ignore_punctuation: true
+        regex_pattern: "(\\([A-Z]\\))"
+      - function: "take_first"
+  - name: "strict-match"
+    filter:
+      - function: "regex"
+        regex_pattern: "((?<=The answer is )(.*)(?=.)|(?<=the answer is )(.*)(?=.)|(?<=The answer: )(.*)(?=.)|(?<=The final answer: )(.*)(?=.))"
+      - function: "take_first"