Created new task for testing Llama on Asdiv (EleutherAI#2236)

* Created DUP eval code for gsm8k * asdiv * Fixed fewshot=8 issue * added results to .gitignore * reverted unnecessary changes and moved results + gsm8k_dup out of repo to prepare for pull req * fixed whitespace and unintentional hardcoded version change information * created mbpp task * Reverted changes re. mbpp to save for a future Pull req * reverted metrics.py to previous commit * updated asdiv readme to include informaiton about new asdiv_cot_llama task * Apply suggestions from code review --------- Co-authored-by: Alexander Detkov <alexander.d.detkov@gmail.com> Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
artemorloff · Aug 23, 2024 · aab42ba · aab42ba
1 parent 5ad23ec
commit aab42ba
Show file tree

Hide file tree

Showing 2 changed files with 93 additions and 0 deletions.
diff --git a/lm_eval/tasks/asdiv/README.md b/lm_eval/tasks/asdiv/README.md
@@ -41,6 +41,11 @@ Homepage: https://github.com/chaochun/nlu-asdiv-dataset
 #### Tasks
 
 * `asdiv`
+* `asdiv_cot_llama`: ASDIV with prompt formatting modified to conform to the evaluation settings described by Meta here: https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-8B-Instruct-evals/viewer/Meta-Llama-3.1-8B-Instruct-evals__gsm8k__details?row=0
+    - Note that the CoT prompt from (https://arxiv.org/pdf/2201.11903) is used exactly as in GSM8k-CoT
+    - This file is setup to run identically to the task `gsm8k_cot_llama` but for asdiv.
+    - Use this task with --fewshot_as_multiturn and --apply_chat_template to run correctly with Llama Instruct models.
+
 
 ### Checklist
 

diff --git a/lm_eval/tasks/asdiv/asdiv-cot-llama.yaml b/lm_eval/tasks/asdiv/asdiv-cot-llama.yaml
@@ -0,0 +1,88 @@
+dataset_path: EleutherAI/asdiv
+doc_to_target: "{{answer.split(' (')[0] if answer is defined else target}}"
+doc_to_text: "Given the following problem, reason and give a final answer to the problem.\nProblem: {{body if body is defined}} {{question}}\nYour response should end with \"The final answer is [answer]\" where [answer] is the response to the problem.\n"
+fewshot_config:
+  sampler: first_n
+  samples:
+  - question: There are 15 trees in the grove. Grove workers will plant trees in the
+      grove today. After they are done, there will be 21 trees. How many trees did
+      the grove workers plant today?
+    target: There are 15 trees originally. Then there were 21 trees after some more
+      were planted. So there must have been 21 - 15 = 6. The final answer is 6
+  - question: If there are 3 cars in the parking lot and 2 more cars arrive, how many
+      cars are in the parking lot?
+    target: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The final answer
+      is 5
+  - question: Leah had 32 chocolates and her sister had 42. If they ate 35, how many
+      pieces do they have left in total?
+    target: Originally, Leah had 32 chocolates. Her sister had 42. So in total they
+      had 32 + 42 = 74. After eating 35, they had 74 - 35 = 39. The final answer is 39
+  - question: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12
+      lollipops. How many lollipops did Jason give to Denny?
+    target: Jason started with 20 lollipops. Then he had 12 after giving some to Denny.
+      So he gave Denny 20 - 12 = 8. The final answer is 8
+  - question: Shawn has five toys. For Christmas, he got two toys each from his mom and
+      dad. How many toys does he have now?
+    target: Shawn started with 5 toys. If he got 2 toys each from his mom and dad,
+      then that is 4 more toys. 5 + 4 = 9. The final answer is 9
+  - question: There were nine computers in the server room. Five more computers were
+      installed each day, from monday to thursday. How many computers are now in the
+      server room?
+    target: There were originally 9 computers. For each of 4 days, 5 more computers
+      were added. So 5 * 4 = 20 computers were added. 9 + 20 is 29. The final answer is
+      29
+  - question: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday,
+      he lost 2 more. How many golf balls did he have at the end of wednesday?
+    target: Michael started with 58 golf balls. After losing 23 on tuesday, he had
+      58 - 23 = 35. After losing 2 more, he had 35 - 2 = 33 golf balls. The final answer
+      is 33
+  - question: Olivia has $23. She bought five bagels for $3 each. How much money does
+      she have left?
+    target: Olivia had 23 dollars. 5 bagels for 3 dollars each will be 5 x 3 = 15
+      dollars. So she has 23 - 15 dollars left. 23 - 15 is 8. The final answer is 8
+filter_list:
+- filter:
+  - function: regex
+    group_select: -1
+    regex_pattern: The final answer is ((-?[$0-9.,]{2,})|(-?[0-9]+))
+  - function: take_first
+  name: strict-match
+- filter:
+  - function: regex
+    group_select: -1
+    regex_pattern: (-?[$0-9.,]{2,})|(-?[0-9]+)
+  - function: take_first
+  name: flexible-extract
+generation_kwargs:
+  do_sample: false
+  until:
+  - '<|eot_id|>'
+  - '<|start_header_id|>user<|end_header_id|>'
+  - 'Q:'
+  - </s>
+  - <|im_end|>
+tag:
+- chain_of_thought
+metadata:
+  version: 1.0
+metric_list:
+- aggregation: mean
+  higher_is_better: true
+  ignore_case: true
+  ignore_punctuation: false
+  metric: exact_match
+  regexes_to_ignore:
+  - ','
+  - \$
+  - '(?s).*#### '
+  - \.$
+num_fewshot: 8
+output_type: generate_until
+repeats: 1
+task: asdiv_cot_llama
+validation_split: validation
+test_split: validation
+should_decontaminate: true
+doc_to_decontamination_query: "{{body}} {{question}}"
+dataset_kwargs:
+  trust_remote_code: true