update bbh, gsm8k, mmlu parsing logic and prompts (Orca2 bbh_cot_zeroshot 0% -> 42%) #1356
Conversation
Thanks! LLM evaluation is finicky, with different prompts producing different results. But the prompts here are intended to match the original work. It might be better to make these prompts an alternative prompt + metric-extraction process by making an
@haileyschoelkopf happy to hear your thoughts on this
Thank you for the PR! I think we can likely merge the majority of these. I'll try to go through them one by one, but at a high level:
I hope that this makes sense! Also, any tasks that have scores updated should have their versions bumped.
It seems we agree that, if this is merged, some tasks would need to be renamed to indicate that they are alternate versions?
Possibly some -- I want to be very careful about exposing way too many variants and causing confusion.
Thanks for the consideration! I'll test again without the extra format specifications. I think performance won't drop much.
Awesome! Any thoughts on, say, reporting both the old and new GSM8k scores to users? (Disentangling following the formatting from giving the correct numeric answer.)
The results are as follows. (I pushed the yml files I used for testing.)
I totally agree with disentangling them. But I think this PR should not be the final version; it should be improved by contributions from everyone. But then the scores would change every time there is an improvement. Any thoughts on this?
(force-pushed from c0e34e1 to 18f12c6)
This (iteratively improving benchmark implementations based on observed edge cases) is something we haven't yet dealt with in this repo, in part because of the past focus on loglikelihood-based multiple choice. Our design philosophy is expressly against, say, optimizing a prompt for each tested model, but in the case of answer extraction there is definitely a case to be made for separating matching the formatting from providing the correct answer. I think having a "strict/stable" score and a "loose", frequently updated score reported for generative tasks (via multiple different filter/postprocessing pipelines on one task) might achieve this? And we could report versioning separately for the two scores. I'm still a bit fuzzy on this though, and feedback from the community would certainly be welcome.
Here's a rough example of how to do the multiple-filter setup on GSM8k:
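Roughly, the idea is to declare two named filter pipelines over the same generations in the task YAML, so each metric is reported once per pipeline; the names and regex patterns below are an illustrative sketch rather than the exact config that was merged:

```yaml
filter_list:
  - name: "strict-match"        # answer must follow the canonical "#### <number>" format
    filter:
      - function: "regex"
        regex_pattern: "#### (\\-?[0-9\\.\\,]+)"
      - function: "take_first"
  - name: "flexible-extract"    # fall back to the last number-like span in the output
    filter:
      - function: "regex"
        group_select: -1
        regex_pattern: "(-?[$0-9.,]{2,})|(-?[0-9]+)"
      - function: "take_first"
```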
This would report both metrics separately under their respective filter names. To merge this we'd want:
(Later PRs can examine other tasks.) Happy to help as desired on this -- let me know!
@haileyschoelkopf Great! But how should I split the generation_kwargs (#1356 (comment)) or regexes_to_ignore? Or would making another task group be better?
I would just update the existing task. Would prefer not to make an extra task group, I think, because we don't want to have loads of different tasks to choose from based on what works for different models.
Hi @thnkinbtfly, let me know if you need any help bringing this PR forward!
@haileyschoelkopf Sorry for being late. I made further updates to parse multiple choice, word sorting, English numbers, web of lies, and sports understanding. I changed the number format of gsm8k to avoid parsing a single symbol (see the sketch below). Now the mmlu performance is similar to the Orca2 paper, but bbh and gsm8k need to be improved a bit.
I also separated the filters as you suggested.
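To illustrate what "avoid parsing a single symbol" means, here is a small, self-contained sketch; the pattern and helper name are illustrative, not necessarily the exact regex in the task config:

```python
import re
from typing import Optional

# Illustrative pattern only: the {2,} quantifier in the first alternative keeps a
# lone symbol such as "$" or "," from being extracted as the answer, while still
# matching "1,234.50"; the second alternative still allows a bare single digit.
NUMBER_RE = re.compile(r"(-?[$0-9.,]{2,})|(-?[0-9]+)")


def extract_last_number(generation: str) -> Optional[str]:
    """Return the last number-like span in a model's output, if any."""
    matches = NUMBER_RE.findall(generation)
    if not matches:
        return None
    first_group, second_group = matches[-1]
    return first_group or second_group


print(extract_last_number("So the answer is $ 1,250"))  # -> "1,250", not "$"
```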
Furthermore, while debugging, I found that in multi-GPU evaluation the order of resps and docs in the following function does not seem to be preserved: lm-evaluation-harness/lm_eval/api/filter.py, lines 22 to 30 in 756eeb6.
That is why I changed filter.py to handle such inputs.
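For context, the filter stage pairs each response list with its source doc and passes both through every filter; a minimal sketch of that pattern (assumed names, not the exact upstream code at that commit) is:

```python
from typing import Iterable, List


class TakeFirstFilter:
    """Toy filter: keep only the first candidate response per document."""

    def apply(self, resps: Iterable[List[str]], docs: Iterable[dict]) -> List[List[str]]:
        # resps[i] and docs[i] are assumed to describe the same example; if a
        # multi-GPU gather reorders resps relative to docs, any filter that
        # actually reads the doc silently post-processes the wrong example.
        return [resp[:1] for resp, _doc in zip(resps, docs)]


def apply_filters(filters, instances) -> None:
    """Rough sketch of applying a filter ensemble to a batch of instances."""
    resps = [inst.resps for inst in instances]  # generated continuations
    docs = [inst.doc for inst in instances]     # source documents
    for f in filters:
        resps = f.apply(resps, docs)
    for inst, resp in zip(instances, resps):
        inst.filtered_resps = resp
```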
@haileyschoelkopf I tried greedy decoding and vllm, and couldn't see much change.
Does merging from main resolve this? #1369 should solve it, I believe.
Amazing! Thank you for your work on this :) Two last things before we merge:
@haileyschoelkopf Merging main did resolve the issue. I preserved gsm8k_cot_zeroshot, since the original gsm8k does not seem to evaluate it.
Thank you! Going to investigate the test failures and subsequently merge this. Would also like to use the
Unit test failures appear unrelated; I tested on this branch locally and they pass. The linter issue is a known one. Thank you very much @thnkinbtfly for bringing this PR to completion! We really appreciate your help, and if you'd like to work on anything more in this repo (on other generative tasks or otherwise), please let us know!
update bbh, gsm8k, mmlu parsing logic and prompts (Orca2 bbh_cot_zeroshot 0% -> 42%) (EleutherAI#1356)
* update bbh, gsm8k, mmlu parsing logic and prompts
* remove the formatting prompt (bbh) + minor update (mmlu)
* update bbh, gsm8k, mmlu zeroshot, revert fewshots
* update bbh, gsm8k, mmlu version, forward changes to gsm8k-cot
* remove take_last, update to use docs parameters
* add newline
* ruff formatting
* Update pyproject.toml
* fix format
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Hi, I find that zero-shot evaluation of generative tasks with the given prompts and parsing logic yields poor performance.
For example, Orca2-7B yields 0% on mmlu or bbh_cot_zeroshot (Llama2-7B and Mistral-7B also performed poorly).
I inspected the outputs and changed the parsing logic; here is the updated performance (I added gsm8k_cot_zeroshot.yml):
To evaluate, I ran commands similar to the following (with bf16):
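A representative invocation (the model path, task selection, batch size, and device below are illustrative, not the exact command used) would be:

```bash
# Illustrative only: substitute the model, tasks, and batch size you want to test.
lm_eval --model hf \
  --model_args pretrained=microsoft/Orca-2-7b,dtype=bfloat16 \
  --tasks bbh_cot_zeroshot,gsm8k_cot_zeroshot \
  --batch_size 8 \
  --device cuda:0
```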
I agree that the performance does not yet match the numbers from the paper, but this PR definitely improves the evaluation. Feel free to comment on or change this commit. I believe we need to enhance the parsing logic and the prompts to evaluate zero-shot generative tasks with this repo.