update bbh, gsm8k, mmlu parsing logic and prompts (Orca2 bbh_cot_zeroshot 0% -> 42%) #1356

thnkinbtfly · 2024-01-26T03:59:31Z

Hi, I find zero-shot performance of generative tasks with given prompts and parsing logic yields poor performance.

For example, Orca2-7B yields 0% on mmlu or bbh_cot_zeroshot (Llama2-7B, Mistral-7B also performed poor).

I tried to inspect the outputs and changed the parsing logic, and here is the updated performance (I added gsm8k_cot_zeroshot.yml):

	bbh_cot_zeroshot	gsm8k_cot_zeroshot	mmlu_flan_cot_zeroshot	bbh_zeroshot	gsm8k	mmlu_flan_n_shot_generative
original	0.000461	0	0	0.086315	0.009856	0
changed	0.416219	0.428355	0.535598	0.376133	0.073541	0.485259

To evaluate, ran the commands somewhat similar to the following (with bf16)

accelerate launch -m lm_eval --model hf --tasks bbh_cot_zeroshot --batch_size 1 --num_fewshot=0 --model_args pretrained=Orca-2-7b,attn_implementation=sdpa,dtype=bfloat16 --gen_kwargs temperature=0.2,do_sample=True,max_gen_toks=1024

I agree that the performance does not yet match those from paper, but this PR definitely improves the evaluation. Feel free to comment or change this commit. I believe we need to enhance the parsing logic and the prompts to evaluate zero-shot generative tasks with this repo.

CLAassistant · 2024-01-26T03:59:36Z

All committers have signed the CLA.

lintangsutawika · 2024-01-26T04:33:15Z

Thanks!

llm-evaluation is finicky with different prompt resulting in different results. But the prompts here are intended to match the original work. It might be better to make these prompts as an alternative prompt + metric extraction process by making an alternate_prompts folder in each task directory.

lintangsutawika · 2024-01-26T04:39:54Z

@haileyschoelkopf happy to hear your thoughts on this

haileyschoelkopf · 2024-01-26T14:39:36Z

Thank you for the PR!

I think we can likely merge the majority of these. I'll try to go through each one by one, but at a high level:

Adding more flexible answer extraction is good and we should merge these changes. I've been considering reporting both a "strict" and "flexible" answer extraction score for generative tasks--e.g. for GSM8k we keep the strict reporting, and report both that old score and also the more flexibly extracted new score. Would you be ok with this change, or do you think it would simply confuse users more to have these several scores to understand?
For big-bench-hard, the whitespace changes and answer parsing are good, but I'm unsure about the other changes such as the extra specifications of format--we matched the big-bench-hard prompts to their original paper's implementation here (https://github.com/suzgunmirac/BIG-Bench-Hard/tree/main/bbh) , and so deviations from this need to be justified e.g. with precedent.

I hope that this makes sense!

Also, any tasks that have scores updated should have their versions bumped.

lintangsutawika · 2024-01-26T15:22:00Z

It seems we agree that in the case of merging this, some tasks would need to be renamed to indicate it being an alternate version?

haileyschoelkopf · 2024-01-26T15:28:52Z

possibly some--want to be very careful about giving people way too many variants exposed and causing confusion.

thnkinbtfly · 2024-01-27T12:19:07Z

Thanks for the consideration! I'll test again without the extra specifications of format. I think performance won't drop much

haileyschoelkopf · 2024-01-27T12:54:26Z

Awesome! Any thoughts on say, reporting both the old and new GSM8k score to users? (Disentangling following the formatting from giving the correct numeric answer

thnkinbtfly · 2024-01-27T22:35:27Z

The result is as follows. (I pushed the yml files I used for test)

	bbh_cot_zeroshot	bbh_zeroshot
original	0.0004	0.0863
changed (w/o formatting prompt)	0.3878	0.3767
changed (w/ formatting prompt)	0.4162	0.3761

I totally agree for disentangling them. But I think this PR should not be the final version; This should be improved by contribution from everyone. But then the scores would change every time there is improvement. Any thoughts on this?

haileyschoelkopf · 2024-01-28T02:07:22Z

I totally agree for disentangling them. But I think this PR should not be the final version; This should be improved by contribution from everyone. But then the scores would change every time there is improvement. Any thoughts on this?

This (iteratively improving benchmark implementations based on observed edge cases) is something we haven’t yet dealt with in this repo, in part because of the past focus on loglikelihood-based multiple choice.

Our design philosophy expressly is against, say, optimizing a prompt for each tested model, but in the case of answer extraction it seems there is definitely a case to be made for trying to separate matching the formatting from providing the correct answer.

I think having a “strict/ stable” and “loose” frequently updated score reported for generative tasks (via multiple different filter/postprocessing pipelines on one task) might achieve this? And we could report versioning separately for the two scores.

I’m still a bit fuzzy on this though, and feedback from the community would be certainly welcome.

lm_eval/filters/selection.py

lm_eval/tasks/bbh/zeroshot/_zeroshot_template_yaml

haileyschoelkopf · 2024-01-29T15:58:18Z

Here's a rough example of how to do the multiple-filters on GSM8k:

filter_list:
  - name: "strict-match"
    filter:
      - function: "regex"
        group_select: -1
        regex_pattern: "(-?[$0-9.,]{2,})|(-?[$0-9,]+)"
      - function: "take_last"
  - name: "flexible-extract"
    filter:
      - function: "regex"
        regex_pattern: "The answer is (\\-?[0-9\\.\\,]+)."
      - function: "take_first"

this would report both metrics separately under these respective filter names.

To merge this we'd want:

all the tasks report both this tweaked better extraction, and their previous very-strict extraction
applying these changes also to BBH (CoT/non-CoT) fewshot

(Later PRs can examine other tasks)

Happy to help as desired on this, let me know!

thnkinbtfly · 2024-01-29T16:41:33Z

@haileyschoelkopf Great! But how should I split the generation_kwargs (#1356 (comment)), or regexes_to_ignore?

Or.. would making another task group be better?

haileyschoelkopf · 2024-01-29T16:45:12Z

I would just update the generation_kwargs as you've done in this PR, and re: regexes_to_ignore: we can either use the updated ones, or perhaps bring the extra ones inside a filter?

Would prefer not to make an extra task group, I think, because we don't want to have loads of different tasks to choose from based on what works for different models.

haileyschoelkopf · 2024-02-06T14:45:02Z

hi @thnkinbtfly , let me know if you need any help bringing forward this PR!

thnkinbtfly · 2024-02-07T08:29:45Z

@haileyschoelkopf Sorry for being late.

Made further updates to parse multichoice, word sorting, english numbers, web of lies, sports understandings. Changed the number format of gsm8k to avoid parsing single symbol.

Now the mmlu performance is similar to the Orca2 paper, but bbh, gsm8k needs to be improved a bit..

	bbh_cot_zeroshot	gsm8k_cot_zeroshot	mmlu_flan_cot_zeroshot	bbh_zeroshot	gsm8k	mmlu_flan_n_shot_generative
original	0.000461	0	0	0.086315	0.009856	0
updated(old)	0.3878	0.443518	0.517309	0.3767	0.263078	0.486398
updated	0.418215	0.446550	0.522534	0.406644	0.263836	0.540165
Orca2 (paper)				0.4593	0.4723	0.5370

I also separated the filters as you suggested.

I'll try the greedy decoding as Orca2 paper did (just checked..)

thnkinbtfly · 2024-02-07T08:37:17Z

Furthermore, while debugging, I found that in multigpu evaluation, the order of resps and docs in the following function seemed not to be preserved.

lm-evaluation-harness/lm_eval/api/filter.py

Lines 22 to 30 in 756eeb6

    
               @abstractmethod 
        
               def apply(self, resps: Union[List, Iterable], docs: List[dict]) -> Iterable: 
        
                   """ 
        
                   Defines the operation to perform on a list of the `inst.resps` properties of `Instance` objects. 
        
                   Should return the list of (filtered) response lists *in the same order as they were input*, e.g. 
        
                   if pass in [<inst.resps for instance 0>, <inst.resps for instance 1>] should return 
        
                   [<filtered resps for instance 0>, <filtered resps for instance 1>] 
        
                   """ 
        
                   return resps

That is why I changed filter.py to deal with some inputs.

thnkinbtfly · 2024-02-08T10:17:08Z

@haileyschoelkopf I tried greedy decoding, vllm, and couldn't see much change.
I think to follow the Orca2 paper, I need to tweak the prompt (for example, If I use the prompt template written in the Orca2 paper, I could see up to 0.3995 for gsm8k_zeroshot..) But optimizing prompt is not desired for this repo, thus I think I'll stop here. I think I'm done with improving the answer parsing.

I totally agree for disentangling them. But I think this PR should not be the final version; This should be improved by contribution from everyone. But then the scores would change every time there is improvement. Any thoughts on this?

This (iteratively improving benchmark implementations based on observed edge cases) is something we haven’t yet dealt with in this repo, in part because of the past focus on loglikelihood-based multiple choice.

Our design philosophy expressly is against, say, optimizing a prompt for each tested model, but in the case of answer extraction it seems there is definitely a case to be made for trying to separate matching the formatting from providing the correct answer.

I think having a “strict/ stable” and “loose” frequently updated score reported for generative tasks (via multiple different filter/postprocessing pipelines on one task) might achieve this? And we could report versioning separately for the two scores.

I’m still a bit fuzzy on this though, and feedback from the community would be certainly welcome.

haileyschoelkopf · 2024-02-08T13:01:50Z

Furthermore, while debugging, I found that in multigpu evaluation, the order of resps and docs in the following function seemed not to be preserved.

Does merging from main resolve this? #1369 should solve this I believe.

I tried greedy decoding, vllm, and couldn't see much change.
I think to follow the Orca2 paper, I need to tweak the prompt (for example, If I use the prompt template written in the Orca2 paper, I could see up to 0.3995 for gsm8k_zeroshot..) But optimizing prompt is not desired for this repo, thus I think I'll stop here. I think I'm done with improving the answer parsing.

Amazing! Thank you for your work on this : )

Two last things before we merge:

Could you bump the versions of all edited tasks by +1 ?
Could you bring forward the changes from gsm8k_cot_zeroshot to gsm8k_cot ?

# Conflicts: # lm_eval/api/filter.py

thnkinbtfly · 2024-02-13T21:11:44Z

@haileyschoelkopf Merging main did resolve the issue. I preserved gsm8k_cot_zeroshot, since original gsm8k does not seem to be evaluating it.

haileyschoelkopf · 2024-02-14T14:45:10Z

Thank you! Going to investigate the test failures and subsequently merge this. Would also like to use the re module instead of introducing regex as a dependency if that's alright

haileyschoelkopf · 2024-02-19T18:47:37Z

Unit test failures appear unrelated, tested on this branch locally and they pass. Linter issue is a known isort config problem that we will fix in a later PR.

Thank you very much @thnkinbtfly for bringing this PR to completion! We really appreciate your help, and if you'd like to work on anything more in this repo (on other generative tasks or otherwise) please let us know!

RicardoDominguez · 2024-02-24T12:15:54Z

Hi, did this PR request only fix bbh_cot_fewshot? Evaluating bbh_fewshot instead gives close 0% accuracy for most models that I test. bbh/bbh_cot_fewshot now reports exact_match,get-answer whereas bbh_fewshot reports exact_match,none.

…shot 0% -> 42%) (EleutherAI#1356) * update bbh, gsm8k, mmlu parsing logic and prompts * remove the formatting prompt (bbh) + minor update (mmlu) * update bbh, gsm8k, mmlu zeroshot, revert fewshots * update bbh, gsm8k, mmlu version, forward changes to gsm8k-cot * remove take_last, update to use docs parameters * add newline * ruff formatting * Update pyproject.toml * fix format --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

update bbh, gsm8k, mmlu parsing logic and prompts

853d9e2

thnkinbtfly requested review from haileyschoelkopf and lintangsutawika as code owners January 26, 2024 03:59

remove the formatting prompt (bbh) + minor update (mmlu)

18f12c6

thnkinbtfly force-pushed the update_bbh_gsm8k_mmlu branch from c0e34e1 to 18f12c6 Compare January 27, 2024 22:37

lintangsutawika reviewed Jan 29, 2024

View reviewed changes

lm_eval/filters/selection.py Outdated Show resolved Hide resolved

lm_eval/tasks/bbh/zeroshot/_zeroshot_template_yaml Show resolved Hide resolved

update bbh, gsm8k, mmlu zeroshot, revert fewshots

ea9ec69

thnkinbtfly added 5 commits February 14, 2024 05:54

Merge remote-tracking branch 'github/main' into update_bbh_gsm8k_mmlu

c2148b0

# Conflicts: # lm_eval/api/filter.py

update bbh, gsm8k, mmlu version, forward changes to gsm8k-cot

2aaa003

remove take_last, update to use docs parameters

ffbaf30

add newline

e906ac5

ruff formatting

ba19ddf

haileyschoelkopf added 2 commits February 19, 2024 13:06

Update pyproject.toml

00ad229

fix format

77f3c2d

haileyschoelkopf approved these changes Feb 19, 2024

View reviewed changes

haileyschoelkopf merged commit 89deeea into EleutherAI:main Feb 19, 2024
3 of 8 checks passed

This was referenced Feb 19, 2024

PR fixing the issue #1391 (wrong contexts in the mgsm task) #1440

Merged

Running BBH CoT #1134

Closed

haileyschoelkopf mentioned this pull request Feb 26, 2024

Transfer zero-shot BBH parsing improvements to few-shot BBH #1481

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

update bbh, gsm8k, mmlu parsing logic and prompts (Orca2 bbh_cot_zeroshot 0% -> 42%) #1356

update bbh, gsm8k, mmlu parsing logic and prompts (Orca2 bbh_cot_zeroshot 0% -> 42%) #1356

thnkinbtfly commented Jan 26, 2024

CLAassistant commented Jan 26, 2024 •

edited

Loading

lintangsutawika commented Jan 26, 2024 •

edited

Loading

lintangsutawika commented Jan 26, 2024

haileyschoelkopf commented Jan 26, 2024

lintangsutawika commented Jan 26, 2024

haileyschoelkopf commented Jan 26, 2024

thnkinbtfly commented Jan 27, 2024

haileyschoelkopf commented Jan 27, 2024

thnkinbtfly commented Jan 27, 2024

haileyschoelkopf commented Jan 28, 2024

haileyschoelkopf commented Jan 29, 2024 •

edited

Loading

thnkinbtfly commented Jan 29, 2024

haileyschoelkopf commented Jan 29, 2024

haileyschoelkopf commented Feb 6, 2024

thnkinbtfly commented Feb 7, 2024 •

edited

Loading

thnkinbtfly commented Feb 7, 2024

thnkinbtfly commented Feb 8, 2024

haileyschoelkopf commented Feb 8, 2024

thnkinbtfly commented Feb 13, 2024

haileyschoelkopf commented Feb 14, 2024

haileyschoelkopf commented Feb 19, 2024

RicardoDominguez commented Feb 24, 2024

update bbh, gsm8k, mmlu parsing logic and prompts (Orca2 bbh_cot_zeroshot 0% -> 42%) #1356

update bbh, gsm8k, mmlu parsing logic and prompts (Orca2 bbh_cot_zeroshot 0% -> 42%) #1356

Conversation

thnkinbtfly commented Jan 26, 2024

CLAassistant commented Jan 26, 2024 • edited Loading

lintangsutawika commented Jan 26, 2024 • edited Loading

lintangsutawika commented Jan 26, 2024

haileyschoelkopf commented Jan 26, 2024

lintangsutawika commented Jan 26, 2024

haileyschoelkopf commented Jan 26, 2024

thnkinbtfly commented Jan 27, 2024

haileyschoelkopf commented Jan 27, 2024

thnkinbtfly commented Jan 27, 2024

haileyschoelkopf commented Jan 28, 2024

haileyschoelkopf commented Jan 29, 2024 • edited Loading

thnkinbtfly commented Jan 29, 2024

haileyschoelkopf commented Jan 29, 2024

haileyschoelkopf commented Feb 6, 2024

thnkinbtfly commented Feb 7, 2024 • edited Loading

thnkinbtfly commented Feb 7, 2024

thnkinbtfly commented Feb 8, 2024

haileyschoelkopf commented Feb 8, 2024

thnkinbtfly commented Feb 13, 2024

haileyschoelkopf commented Feb 14, 2024

haileyschoelkopf commented Feb 19, 2024

RicardoDominguez commented Feb 24, 2024

CLAassistant commented Jan 26, 2024 •

edited

Loading

lintangsutawika commented Jan 26, 2024 •

edited

Loading

haileyschoelkopf commented Jan 29, 2024 •

edited

Loading

thnkinbtfly commented Feb 7, 2024 •

edited

Loading