update bbh, gsm8k, mmlu parsing logic and prompts (Orca2 bbh_cot_zeroshot 0% -> 42%) #1356

Merged: 10 commits merged into EleutherAI:main on Feb 19, 2024

Conversation

@thnkinbtfly (Contributor)

Hi, I found that zero-shot evaluation of generative tasks with the given prompts and parsing logic yields poor performance.

For example, Orca2-7B yields 0% on mmlu or bbh_cot_zeroshot (Llama2-7B and Mistral-7B also performed poorly).

I inspected the outputs and changed the parsing logic; here is the updated performance (I also added gsm8k_cot_zeroshot.yml):

| | bbh_cot_zeroshot | gsm8k_cot_zeroshot | mmlu_flan_cot_zeroshot | bbh_zeroshot | gsm8k | mmlu_flan_n_shot_generative |
|---|---|---|---|---|---|---|
| original | 0.000461 | 0 | 0 | 0.086315 | 0.009856 | 0 |
| changed | 0.416219 | 0.428355 | 0.535598 | 0.376133 | 0.073541 | 0.485259 |

To evaluate, I ran commands similar to the following (with bf16):

accelerate launch -m lm_eval --model hf --tasks bbh_cot_zeroshot --batch_size 1 --num_fewshot=0 --model_args pretrained=Orca-2-7b,attn_implementation=sdpa,dtype=bfloat16 --gen_kwargs temperature=0.2,do_sample=True,max_gen_toks=1024

I agree that the performance does not yet match the numbers reported in the paper, but this PR definitely improves the evaluation. Feel free to comment on or change these commits. I believe we need to enhance the parsing logic and prompts to evaluate zero-shot generative tasks with this repo.

@CLAassistant commented on Jan 26, 2024

CLA assistant check
All committers have signed the CLA.

@lintangsutawika (Contributor) commented on Jan 26, 2024

Thanks!

LLM evaluation is finicky: different prompts yield different results. But the prompts here are intended to match the original work. It might be better to treat these prompts as an alternative prompt + metric-extraction process by adding an alternate_prompts folder in each task directory.

@lintangsutawika (Contributor)

@haileyschoelkopf happy to hear your thoughts on this

@haileyschoelkopf (Collaborator)

Thank you for the PR!

I think we can likely merge the majority of these. I'll try to go through each one by one, but at a high level:

  • Adding more flexible answer extraction is good, and we should merge these changes. I've been considering reporting both a "strict" and a "flexible" answer-extraction score for generative tasks: e.g. for GSM8k we keep the strict reporting, and report both that old score and the more flexibly extracted new score. Would you be OK with this change, or do you think having several scores to understand would simply confuse users more?
  • For big-bench-hard, the whitespace changes and answer parsing are good, but I'm unsure about the other changes such as the extra specifications of format. We matched the big-bench-hard prompts to their original paper's implementation (https://github.com/suzgunmirac/BIG-Bench-Hard/tree/main/bbh), so deviations from it need to be justified, e.g. with precedent.

I hope that this makes sense!

Also, any tasks that have scores updated should have their versions bumped.

@lintangsutawika (Contributor)

It seems we agree that, if we merge this, some tasks would need to be renamed to indicate that they are alternate versions?

@haileyschoelkopf (Collaborator)

Possibly some; I want to be careful about exposing too many variants and causing confusion.

@thnkinbtfly (Contributor, Author)

Thanks for the consideration! I'll test again without the extra format specifications; I don't think performance will drop much.

@haileyschoelkopf (Collaborator)

Awesome! Any thoughts on, say, reporting both the old and new GSM8k scores to users? (Disentangling following the formatting from giving the correct numeric answer.)

@thnkinbtfly (Contributor, Author)

The results are as follows. (I pushed the yml files I used for testing.)

| | bbh_cot_zeroshot | bbh_zeroshot |
|---|---|---|
| original | 0.0004 | 0.0863 |
| changed (w/o formatting prompt) | 0.3878 | 0.3767 |
| changed (w/ formatting prompt) | 0.4162 | 0.3761 |

I totally agree with disentangling them. But I don't think this PR should be the final version; it should be improved by contributions from everyone. But then the scores would change every time there is an improvement. Any thoughts on this?

@haileyschoelkopf (Collaborator)

> I totally agree with disentangling them. But I don't think this PR should be the final version; it should be improved by contributions from everyone. But then the scores would change every time there is an improvement. Any thoughts on this?

This (iteratively improving benchmark implementations based on observed edge cases) is something we haven’t yet dealt with in this repo, in part because of the past focus on loglikelihood-based multiple choice.

Our design philosophy is expressly against, say, optimizing a prompt for each tested model, but in the case of answer extraction there is definitely a case to be made for trying to separate matching the formatting from providing the correct answer.

I think having a "strict/stable" score and a "loose", frequently updated score reported for generative tasks (via multiple different filter/postprocessing pipelines on one task) might achieve this? And we could report versioning separately for the two scores.

I'm still a bit fuzzy on this though, and feedback from the community would certainly be welcome.

@haileyschoelkopf (Collaborator) commented on Jan 29, 2024

Here's a rough example of how to do multiple filters on GSM8k:

filter_list:
  - name: "strict-match"
    filter:
      - function: "regex"
        regex_pattern: "The answer is (\\-?[0-9\\.\\,]+)."
      - function: "take_first"
  - name: "flexible-extract"
    filter:
      - function: "regex"
        group_select: -1
        regex_pattern: "(-?[$0-9.,]{2,})|(-?[$0-9,]+)"
      - function: "take_last"

This would report both metrics separately under these respective filter names.
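To make the behaviour of the two pipelines concrete, here is an editor's sketch in Python (not code from this PR; the sample completion is invented) that mimics what the two regexes above extract:

import re

sample = "Let's think step by step. 12 + 30 = 42. The answer is 42."

# strict-match: only fires when the model follows the "The answer is X." format
strict = re.findall(r"The answer is (\-?[0-9\.\,]+).", sample)
print(strict)  # ['42']

# flexible-extract: grab number-like spans anywhere and keep the last one
matches = re.findall(r"(-?[$0-9.,]{2,})|(-?[$0-9,]+)", sample)
numbers = [a or b for a, b in matches]  # collapse the two alternation groups
print(numbers[-1].strip(".,$"))  # '42' (trailing punctuation stripped here for readability)

A model that never writes "The answer is ..." scores zero under strict-match but can still be credited under flexible-extract.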

To merge this we'd want:

  • all the tasks to report both this tweaked, more flexible extraction and their previous very strict extraction
  • these changes applied to BBH (CoT/non-CoT) fewshot as well

(Later PRs can examine other tasks.)

Happy to help as desired on this, let me know!

@thnkinbtfly (Contributor, Author)

@haileyschoelkopf Great! But how should I split the generation_kwargs (#1356 (comment)), or regexes_to_ignore?

Or would making another task group be better?

@haileyschoelkopf (Collaborator)

I would just update the generation_kwargs as you've done in this PR, and re: regexes_to_ignore: we can either use the updated ones, or perhaps bring the extra ones inside a filter?

Would prefer not to make an extra task group, I think, because we don't want to have loads of different tasks to choose from based on what works for different models.

@haileyschoelkopf (Collaborator)

hi @thnkinbtfly , let me know if you need any help bringing forward this PR!

@thnkinbtfly (Contributor, Author) commented on Feb 7, 2024

@haileyschoelkopf Sorry for the delay.

I made further updates to parse multiple choice, word sorting, English numbers, web of lies, and sports understanding. I also changed the number format for gsm8k to avoid parsing a single symbol (see the sketch at the end of this comment).

Now the mmlu performance is similar to the Orca2 paper, but bbh and gsm8k still need some improvement:

| | bbh_cot_zeroshot | gsm8k_cot_zeroshot | mmlu_flan_cot_zeroshot | bbh_zeroshot | gsm8k | mmlu_flan_n_shot_generative |
|---|---|---|---|---|---|---|
| original | 0.000461 | 0 | 0 | 0.086315 | 0.009856 | 0 |
| updated (old) | 0.3878 | 0.443518 | 0.517309 | 0.3767 | 0.263078 | 0.486398 |
| updated | 0.418215 | 0.446550 | 0.522534 | 0.406644 | 0.263836 | 0.540165 |
| Orca2 (paper) | 0.4593 | 0.4723 | 0.5370 | | | |

I also separated the filters as you suggested.

  • I'll try greedy decoding, as the Orca2 paper did (just checked).
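As a side note on the "avoid parsing single symbol" change above, here is an editor's sketch of the failure mode (the exact regexes used in this PR may differ; this only illustrates the idea):

import re

loose = r"(-?[$0-9.,]{2,})|(-?[$0-9,]+)"  # second branch can match a lone "$" or ","
tight = r"(-?[$0-9.,]{2,})|(-?[0-9]+)"    # requiring a digit avoids that

text = "I cannot compute the total cost in $ here."
print(re.findall(loose, text))  # prints [('', '$')]: the bare "$" is picked up as a candidate answer
print(re.findall(tight, text))  # prints []: no spurious match

With the loose pattern, a completion that contains no number at all can still yield a "parsed" answer consisting of a single symbol, which then gets compared against the gold number.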

@thnkinbtfly (Contributor, Author)

Furthermore, while debugging, I found that in multi-GPU evaluation the order of resps and docs in the following function seemed not to be preserved:

@abstractmethod
def apply(self, resps: Union[List, Iterable], docs: List[dict]) -> Iterable:
    """
    Defines the operation to perform on a list of the `inst.resps` properties of `Instance` objects.
    Should return the list of (filtered) response lists *in the same order as they were input*, e.g.
    if pass in [<inst.resps for instance 0>, <inst.resps for instance 1>] should return
    [<filtered resps for instance 0>, <filtered resps for instance 1>]
    """
    return resps

That is why I changed filter.py to handle such inputs.
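For reference, a minimal order-preserving filter might look like the sketch below (an editor's illustration of the contract documented in the docstring above, not the actual filter.py change from this PR; the class name and regex are made up):

from typing import Iterable, List, Union
import re

class LastNumberFilter:
    """Toy filter: keep the last number in each response, preserving instance order."""

    _pattern = re.compile(r"-?[0-9][0-9.,]*")

    def apply(self, resps: Union[List, Iterable], docs: List[dict]) -> Iterable:
        # docs is unused here but kept to match the signature above
        filtered = []
        for inst_resps in resps:      # one entry per instance, in the original order
            kept = []
            for text in inst_resps:   # one entry per generation for that instance
                matches = self._pattern.findall(text)
                kept.append(matches[-1] if matches else "[invalid]")
            filtered.append(kept)
        return filtered               # same order as the input resps

If the harness hands apply() resps and docs whose orders disagree (as seemed to happen in the multi-GPU case above), any filter that relies on zipping them together will silently mismatch responses and documents.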

@thnkinbtfly (Contributor, Author)

@haileyschoelkopf I tried greedy decoding and vLLM, and couldn't see much change. I think that to match the Orca2 paper I would need to tweak the prompt (for example, if I use the prompt template from the Orca2 paper, I see up to 0.3995 for gsm8k_zeroshot). But prompt optimization is not desired for this repo, so I'll stop here. I think I'm done with improving the answer parsing.


@haileyschoelkopf (Collaborator)

> Furthermore, while debugging, I found that in multi-GPU evaluation the order of resps and docs in the following function seemed not to be preserved.

Does merging from main resolve this? #1369 should solve this I believe.

> I tried greedy decoding and vLLM, and couldn't see much change. I think that to match the Orca2 paper I would need to tweak the prompt (for example, if I use the prompt template from the Orca2 paper, I see up to 0.3995 for gsm8k_zeroshot). But prompt optimization is not desired for this repo, so I'll stop here. I think I'm done with improving the answer parsing.

Amazing! Thank you for your work on this : )

Two last things before we merge:

  • Could you bump the versions of all edited tasks by +1?
  • Could you bring forward the changes from gsm8k_cot_zeroshot to gsm8k_cot?

@thnkinbtfly (Contributor, Author)

@haileyschoelkopf Merging main did resolve the issue. I preserved gsm8k_cot_zeroshot, since the original gsm8k does not seem to evaluate it.

@haileyschoelkopf (Collaborator)

Thank you! I'm going to investigate the test failures and then merge this. I would also like to use the re module instead of introducing regex as a dependency, if that's alright.
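As a small editor's aside (not from the PR): the extraction patterns shown earlier in this thread use only plain groups and alternation, which the standard-library re engine supports, e.g.:

import re  # standard library; no third-party `regex` package needed

print(re.findall(r"The answer is (\-?[0-9\.\,]+).", "The answer is 7."))  # ['7']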

@haileyschoelkopf (Collaborator)

The unit test failures appear unrelated; I tested on this branch locally and they pass. The linter issue is a known isort config problem that we will fix in a later PR.

Thank you very much @thnkinbtfly for bringing this PR to completion! We really appreciate your help, and if you'd like to work on anything more in this repo (on other generative tasks or otherwise) please let us know!

@haileyschoelkopf merged commit 89deeea into EleutherAI:main on Feb 19, 2024 (3 of 8 checks passed).
@RicardoDominguez

Hi, did this PR only fix bbh_cot_fewshot? Evaluating bbh_fewshot instead gives close to 0% accuracy for most models that I test. bbh/bbh_cot_fewshot now reports exact_match,get-answer, whereas bbh_fewshot reports exact_match,none.


wx-zhang pushed a commit to wx-zhang/lm-evaluation-harness that referenced this pull request Mar 13, 2024
update bbh, gsm8k, mmlu parsing logic and prompts (Orca2 bbh_cot_zeroshot 0% -> 42%) (EleutherAI#1356)

* update bbh, gsm8k, mmlu parsing logic and prompts

* remove the formatting prompt (bbh) + minor update (mmlu)

* update bbh, gsm8k, mmlu zeroshot, revert fewshots

* update bbh, gsm8k, mmlu version, forward changes to gsm8k-cot

* remove take_last, update to use docs parameters

* add newline

* ruff formatting

* Update pyproject.toml

* fix format

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
djstrong pushed a commit to speakleash/lm-evaluation-harness that referenced this pull request Aug 2, 2024