
refactor: limit usage of scipy and skilearn dependencies #2097

Merged: 3 commits merged into EleutherAI:main on Aug 1, 2024

Conversation

nathan-weinberg
Contributor

@nathan-weinberg nathan-weinberg commented Jul 12, 2024

Impact

This PR does the following:

  • Removes some unused dependencies
  • Makes all `scipy` and `scikit-learn` imports local to the functions that use them
  • Moves the duplicated function `weighted_f1_score` into `lm_eval.utils`, with relative imports in the local task utils so existing YAML files keep working

This is all meant to resolve #2059
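The import relocation described above follows the standard deferred-import pattern for optional dependencies; a minimal sketch (illustrative only; `pearson_corrcoef` and `mean` are hypothetical stand-ins, not the actual lm-eval functions):

```python
# Sketch of limiting a heavy optional dependency: the scipy import lives
# inside the one function that needs it, so importing this module (and
# calling any other metric) works even when scipy is not installed.

def pearson_corrcoef(xs, ys):
    """Metric that pulls in scipy only when it is actually called."""
    from scipy.stats import pearsonr  # deferred import
    return pearsonr(xs, ys)[0]

def mean(xs):
    """Pure-Python metric; never touches scipy."""
    return sum(xs) / len(xs)
```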

Testing

@haileyschoelkopf I could use some advice on how to test these changes - I attempted to run the unit tests after doing `pip install lm_eval[all]` per the CONTRIBUTING doc but was unable to do so. I can also drop the second commit to minimize this PR's impact if we can't test it, or if the maintainers prefer that.

(venv) [nathan@nathan-redhat lm-evaluation-harness (dep-update)]$ python -m pytest --ignore=tests/tests_master --ignore=tests/extra
=========================================================================== test session starts ============================================================================
platform linux -- Python 3.11.9, pytest-8.2.2, pluggy-1.5.0
rootdir: /home/nathan/instructlab/repos/lm-evaluation-harness
configfile: pyproject.toml
plugins: xdist-3.6.1, cov-5.0.0
collected 74 items / 1 error / 1 skipped                                                                                                                                   

================================================================================== ERRORS ==================================================================================
______________________________________________________________ ERROR collecting tests/models/test_openvino.py ______________________________________________________________
ImportError while importing test module '/home/nathan/instructlab/repos/lm-evaluation-harness/tests/models/test_openvino.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/lib64/python3.11/importlib/__init__.py:126: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/models/test_openvino.py:6: in <module>
    from optimum.intel import OVModelForCausalLM
E   ModuleNotFoundError: No module named 'optimum'
========================================================================= short test summary info ==========================================================================
ERROR tests/models/test_openvino.py
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
======================================================================= 1 skipped, 1 error in 27.48s =======================================================================

@CLAassistant

CLAassistant commented Jul 12, 2024

CLA assistant check
All committers have signed the CLA.

@nathan-weinberg
Contributor Author

nathan-weinberg commented Jul 15, 2024

I ran the same pytest command as the CI locally and had a successful run

(venv) [nathan@nathan-redhat lm-evaluation-harness (dep-update)]$ python -m pytest --showlocals -s -vv -n=auto --ignore=tests/models/test_neuralmagic.py --ignore=tests/models/test_openvino.py
=========================================================================== test session starts ============================================================================
platform linux -- Python 3.11.9, pytest-8.2.2, pluggy-1.5.0 -- /home/nathan/instructlab/repos/lm-evaluation-harness/venv/bin/python

...

============================================================================= warnings summary =============================================================================
tests/models/test_neuron_optimum.py::test_wrap_constant_batch_size
  /home/nathan/instructlab/repos/lm-evaluation-harness/tests/models/test_neuron_optimum.py:21: FutureWarning: `torch.testing.assert_allclose()` is deprecated since 1.12 and will be removed in a future release. Please use `torch.testing.assert_close()` instead. You can find detailed upgrade instructions in https://github.com/pytorch/pytorch/issues/61844.
    torch.testing.assert_allclose(out, tensor)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================================================== 69 passed, 1 skipped, 1 warning in 160.69s (0:02:40) ===========================================================
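One of the warnings above flags `torch.testing.assert_allclose` as deprecated; migrating to the replacement is a one-line change (a sketch, assuming torch is installed):

```python
# Migration for the FutureWarning shown in the test output above:
import torch

out = torch.tensor([1.0, 2.0, 3.0])
expected = torch.tensor([1.0, 2.0, 3.0])

# Old, deprecated since torch 1.12:
#   torch.testing.assert_allclose(out, expected)
# Current replacement:
torch.testing.assert_close(out, expected)
```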

Will retry after fixing lint errors

@tiran

tiran commented Jul 15, 2024

@nathan-weinberg Your PR has lots of unrelated changes. I recommend moving the code reformatting and newline-at-EOF fixes to a separate PR.

@nathan-weinberg
Contributor Author

> @nathan-weinberg Your PR has lots of unrelated changes. I recommend moving the code reformatting and newline-at-EOF fixes to a separate PR.

Will wait to hear from the maintainers. These issues seem to be present in the current upstream codebase, so I committed a fix just to be able to keep working on this branch. I can certainly cherry-pick that commit into a new PR - thoughts @haileyschoelkopf @lintangsutawika?

@lintangsutawika
Contributor

Sorry about that.
#2104 should fix this and then you can pull the latest main.

@nathan-weinberg nathan-weinberg force-pushed the dep-update branch 2 times, most recently from fedc1bd to 9080caa on July 15, 2024 at 15:59
Collaborator

@haileyschoelkopf haileyschoelkopf left a comment


Thanks @nathan-weinberg for the PR, will merge once tests pass! (And thanks for adding the note about which tests to exclude in the contributor guide!)

I moved `weighted_f1_score` into `lm_eval.api.metrics`.
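For reference, the support-weighted F1 that `weighted_f1_score` computes can be written out by hand; a pure-Python sketch of what `sklearn.metrics.f1_score(..., average="weighted")` does (illustrative, not the harness's actual implementation, and written without scikit-learn so it runs standalone):

```python
from collections import Counter

def weighted_f1_score(items):
    """Support-weighted mean of per-class F1 over (gold, pred) pairs;
    equivalent in spirit to sklearn's f1_score(..., average="weighted")."""
    golds, _preds = zip(*items)
    support = Counter(golds)  # how many gold examples carry each label
    total = 0.0
    for label in support:
        tp = sum(1 for g, p in items if g == label and p == label)
        fp = sum(1 for g, p in items if g != label and p == label)
        fn = sum(1 for g, p in items if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += support[label] * f1  # weight each class's F1 by its support
    return total / len(items)
```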

@nathan-weinberg
Contributor Author

Is any additional testing needed?

@nathan-weinberg nathan-weinberg force-pushed the dep-update branch 2 times, most recently from 6aa9e2a to 9f4bc7a on July 18, 2024 at 22:23
@nathan-weinberg
Contributor Author

nathan-weinberg commented Jul 19, 2024

Hi @haileyschoelkopf just checking in to see if this can be merged - if you need anything from my side please let me know!

I think I may have lost your commit when rebasing - feel free to re-add it, sorry about that.

Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
This allows shared functions to be defined only once while keeping the YAML function importing working.

Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
Collaborator

@haileyschoelkopf haileyschoelkopf left a comment


Hey! Sorry, no idea how this got lost; I could have sworn I merged this at the time. Anyway, back from ICML now and merging. Thanks for this fix!

@haileyschoelkopf haileyschoelkopf merged commit 7f15cce into EleutherAI:main Aug 1, 2024
9 checks passed
mansicer added a commit to mansicer/lm-evaluation-harness that referenced this pull request Aug 1, 2024
* Fix: support PEFT/LoRA with added tokens (EleutherAI#1828)

* resize model embeddings

* resize only

* tokenizer help

* load tokenizer before model

* add comment and run precommit lint

* Add log message

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fixed incorrect check for task type (replace `~` with `not`) (EleutherAI#1865)
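The `~`-vs-`not` fix above matters because Python's bitwise complement never yields a falsy value on a bool; a quick demonstration:

```python
# `~` is the integer bitwise complement, not logical negation: on a bool it
# returns -2 or -1, both truthy, so a check like `if ~flag:` takes the
# branch no matter what the flag's value is.
flag = True
assert ~flag == -2           # bitwise complement of True (i.e. of int 1)
assert bool(~flag) is True   # truthy: `if ~flag:` always fires
assert (not flag) is False   # `not` gives the intended negation
```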

* fixed docs typos (EleutherAI#1863)

* Update polemo2_out.yaml (EleutherAI#1871)

* Unpin vllm in dependencies (EleutherAI#1874)

* Fix outdated links to the latest links in `docs` (EleutherAI#1876)

* [HFLM]Use Accelerate's API to reduce hard-coded CUDA code (EleutherAI#1880)

* Fix `batch_size=auto` for HF Seq2Seq models (EleutherAI#1765) (EleutherAI#1790)

* fix auto-batch size bug for seq2seq models

* run linter

* Fix Brier Score (EleutherAI#1847)

`gold_one_hot` needs to follow the dimension of predictions so that it still works when `--limit` is used and the indexes in gold do not cover all gold indexes.
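The shape issue the fix addresses can be sketched as follows (names illustrative, not the actual harness code): size the one-hot encoding by the number of prediction columns, not by whichever labels happen to appear in the possibly truncated gold list.

```python
# With --limit, gold may only contain labels {0, 2} while the prediction
# matrix has 4 columns; sizing the one-hot by num_classes keeps the two
# arrays dimensionally aligned for the Brier score.
def one_hot(gold, num_classes):
    return [[1.0 if j == g else 0.0 for j in range(num_classes)] for g in gold]
```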

* Fix for bootstrap_iters = 0 case (EleutherAI#1715) (EleutherAI#1789)

* add handling for bootstrap_iters=0 case

* add more detail to docstring

* run precommit

* add mmlu tasks from pile-t5 (EleutherAI#1710)

* add mmlu tasks from pile-t5

* Update _mmlu_flan_cot_fewshot_template_yaml

* Update _mmlu_flan_cot_zeroshot_template_yaml

* Update _mmlu_flan_generative_template_yaml

* Update _mmlu_flan_loglikelihood_template_yaml

* Update _default_template_yaml

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Bigbench fix (EleutherAI#1686)

* edit process multiple-choice

* split template yaml

* remove

* modified multiple_choice tasks

* update

* Update multiple_choice_template_b_yaml

* Update multiple_choice_template_a_yaml

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Rename `lm_eval.logging -> lm_eval.loggers` (EleutherAI#1858)

* rename lm_eval.logging module

* fix evaluation tracker args

* Updated vllm imports in vllm_causallms.py (EleutherAI#1890)

* Reorder vllm imports in vllm_causallms.py

* Update vllm_causallms.py

* [HFLM]Add support for Ascend NPU (EleutherAI#1886)

* [HFLM]Add support for Ascend NPU

Co-authored-by: jiaqiw09 <jiaqiw960714@gmail.com>
Co-authored-by: zhabuye <2947436155@qq.com>

* bump accelerate dependency version to 0.26.0 for NPU compat.

---------

Co-authored-by: jiaqiw09 <jiaqiw960714@gmail.com>
Co-authored-by: zhabuye <2947436155@qq.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* `higher_is_better` tickers in output table (EleutherAI#1893)

* Higher is better tickers in output table

* add extra check for `higher_is_better` not being None already

* Update lm_eval/evaluator.py

* fixup format I messed up

* add comment (and retrigger tests)

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Add dataset card when pushing to HF hub (EleutherAI#1898)

* dataset card initial

* few fixes

* adds groups for math, mmlu, gpqa

* added summary agrs

* moved sanitize_list to utils

* readme update

* recreate metadata moved

* multiple model support

* results latest split fix

* readme update and small refactor

* fix grouping

* add comments

* added pathlib

* corrected pathlib approach

* check whether to create a metadata card

* convert posix paths to str

* default hf org from token

* hf token value error

* Add logs after successful upload

* logging updates

* dataset card example in the readme

---------

Co-authored-by: Nathan Habib <nathan.habib@huggingface.com>
Co-authored-by: Alina Lozovskaia <alinailozovskaya@gmail.com>

* Making hardcoded few shots compatible with the chat template mechanism (EleutherAI#1895)

* init test 1

* fix

* this format seems to be working - need to update all other tasks with the new format

* bbh with few shot format

* fix fewshot bbh

* add mmlu flan cot

* samples of cot

* kmmlu

* fix gsm8k

* update keys for mmlu

* minerva math

* bbh

* fix

* fix samples

* small fixes to templates

* last prompt format change

* fixing prompt

* fixed minerva math format

* rm accidental commited file

* added doc for few shot samples

* Update lm_eval/loggers/evaluation_tracker.py

* Update lm_eval/loggers/evaluation_tracker.py

* Update docs/new_task_guide.md

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* added check in sampler per code review

* added the system from a function, plus an example in minerva math

* style

* Apply suggestions from code review

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fix unit tests 1

* forcing use of test split

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Try to make existing tests run little bit faster (EleutherAI#1905)

* Fix fewshot seed only set when overriding num_fewshot (EleutherAI#1914)

Fix EleutherAI#1906

* Complete task list from pr 1727 (EleutherAI#1901)

* added tasks and task family descriptors

* continue work on task list w/ links; slightly reorganize README

* Apply suggestions from code review

* Rename file so that it'll preview in Github when viewing lm_eval/tasks folder

* Update new_task_guide.md

* Update README.md

* run linter

* Add language column to task table; Add missing tasks to task table; fix nq_open and storycloze READMEs

* fix typo

* Apply suggestions from code review

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* apply format

---------

Co-authored-by: Harish Vadaparty <harishvadaparty@gmail.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Add chat template (EleutherAI#1873)

* initial chat template

* tokenizer attribute check

* variable rename

* interface update

* system instruction

* system inst default update

* fewshot as multiturn

* typing update

* indent update

* added comments

* Adding a fewshot in a more readable way

* linting

* Moved apply chat template to LM

* multiturn alternation fix

* cache key update

* apply chat template method fix

* add system prompt hash to cache_key

* tokenizer name property for cache_key

* property name fix

* linting backward compatibility fix

* docs and errors update

* add documentation on adding chat template compatibility to model_guide

* fewshot as multiturn check fix

* saving system inst and chat template in results

* eval tracker update

* docs update

* Apply suggestions from code review

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Multiple Choice Questions and Large Languages Models: A Case Study with Fictional Medical Data (EleutherAI#1867)

* glianorex tasks

* Create README.md

* Update README.md

* Update README.md

* fix formatting

* fix internal formatting

* Modify pre-commit hook to check merge conflicts accidentally committed not at current merge commit (EleutherAI#1927)

* [add] fld logical formula task (EleutherAI#1931)

* Add new Lambada translations (EleutherAI#1897)

* added tasks and task family descriptors

* configs for the new lambada translations

* continue work on task list w/ links; slightly reorganize README

* Apply suggestions from code review

* Rename file so that it'll preview in Github when viewing lm_eval/tasks folder

* Update new_task_guide.md

* Update README.md

* run linter

* Add language column to task table; Add missing tasks to task table; fix nq_open and storycloze READMEs

* fix typo

* update `lm_eval/tasks/README.md` with task description

---------

Co-authored-by: Harish Vadaparty <harishvadaparty@gmail.com>
Co-authored-by: anthony <anthonydipofi@gmail.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Implement NoticIA (EleutherAI#1912)

* Noticia

* test

* Final testes implementation

* Fixes

* Fix linters

* Add The Arabic version of the PICA benchmark (EleutherAI#1917)

* Update siqa.yaml (EleutherAI#1909)

* Update basque-glue (EleutherAI#1913)

* Update README.md

* Update bec.yaml

* Update bhtc.yaml

* Update coref.yaml

* Update qnli.yaml

* Update vaxx.yaml

* Update wic.yaml

* Test output table layout consistency (EleutherAI#1916)

* sort metrics in output table

* update docstring in `consolidate_results`

* add tests for verifying consistency of table output

* update tests to account for floating point inconsistencies

* updated tests based on `pythia-14m`

* Update __main__.py (EleutherAI#1939)

* Add the Arabic version with refactor to Arabic pica to be in alghafa folder (EleutherAI#1940)

* Results filenames handling fix (EleutherAI#1926)

* results filenames handling moved to utils

* zeno results handling fix

* tasks_for_model backward compatibility

* results files logic moved to tasks_for_model

* moved sanitize_model_name to utils

* Remove AMMLU Due to Translation (EleutherAI#1948)

* Update README.md

* Delete lm_eval/tasks/ammlu directory

* add include_defaults kwarg to taskmanager, add tests for include_path (EleutherAI#1856)

* add hacky add_bos_token forcing for Gemma to VLLM too (EleutherAI#1857)

* Update interface.md (EleutherAI#1955)

* Fix self.max_tokens in anthropic_llms.py (EleutherAI#1848)

Fix bug where `self.max_tokens` was not set

* `samples` is newline delimited (EleutherAI#1930)

* `samples` is newline delimited

* updated git and pre-commit

* appease pre-commit

* nit

* Revert back for now

* Revert for now

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* Fix `--gen_kwargs` and VLLM (`temperature` not respected) (EleutherAI#1800)

* Update vllm_causallms.py

* adjust

---------

Co-authored-by: lintangsutawika <lintang@eleuther.ai>

* make write_out.py explicitly error if no splits match (EleutherAI#1796)

Co-authored-by: lintangsutawika <lintang@eleuther.ai>

* fix: add directory filter to os.walk to ignore 'ipynb_checkpoints' (EleutherAI#1956)

* fix: add filter to os.walk to ignore 'ipynb_checkpoints

* Update __init__.py

* Update __init__.py

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>
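The `ipynb_checkpoints` filter mentioned above relies on `os.walk` letting callers prune the `dirs` list in place; a hedged sketch (function name and extension are illustrative):

```python
import os

def walk_yaml_files(root):
    """Collect YAML files while skipping notebook checkpoint directories.
    Assigning to dirs[:] during os.walk prevents descent into them."""
    found = []
    for _dirpath, dirs, files in os.walk(root):
        dirs[:] = [d for d in dirs if d != ".ipynb_checkpoints"]
        found.extend(f for f in files if f.endswith(".yaml"))
    return found
```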

* add trust_remote_code  for piqa (EleutherAI#1983)

Signed-off-by: changwangss <chang1.wang@intel.com>

* Fix self assignment in neuron_optimum.py (EleutherAI#1990)

* [New Task] Add Paloma benchmark (EleutherAI#1928)

* init paloma benchmark

* pre-process in utils function

* add `task_alias`

* updated task aliases

* Update paloma_dolma-v1_5.yaml

* Update paloma_twitterAAE_HELM_fixed.yaml

* Update paloma_dolma_100_programing_languages.yaml

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* Fix Paloma Template yaml (EleutherAI#1993)

* init paloma benchmark

* pre-process in utils function

* add `task_alias`

* updated task aliases

* Update paloma_dolma-v1_5.yaml

* Update paloma_twitterAAE_HELM_fixed.yaml

* Update paloma_dolma_100_programing_languages.yaml

* update on names

* fix paloma template issue

---------

Co-authored-by: Zafir Stojanovski <zaf.stojano@gmail.com>
Co-authored-by: Zafir Stojanovski <zafir.stojanovski@icloud.com>
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* Log `fewshot_as_multiturn` in results files (EleutherAI#1995)

* log fewshot_as_multiturn in general tracker args

* Update evaluator.py

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* Added ArabicMMLU (EleutherAI#1987)

* Added ArabicMMLU

* Rename `ammlu` to `arabicmmlu`

* Fix Datasets `--trust_remote_code` (EleutherAI#1998)

* Add BertaQA dataset tasks (EleutherAI#1964)

* add bertaqa tasks

* rename basquetrivia-->bertaqa ; make template stub not .yaml

* add bertaqa entry to lm_eval/tasks/README.md

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* add tokenizer logs info (EleutherAI#1731)

* add tokenizer logs info

* add no tokenizer case

* Update lm_eval/logging_utils.py

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/logging_utils.py

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* add updates

* fix conflict

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Hotfix breaking import (EleutherAI#2015)

* add arc_challenge_mt (EleutherAI#1900)

* add arc_challenge_mt

* add README

* add icelandic

* Remove `LM` dependency from `build_all_requests` (EleutherAI#2011)

* refactored `lm.apply_chat_template`

* nit

* fix weird type error

* fixed!

* skip failing test

* pre-commit run all

* add type hints

* nit

* nit

* fixup

* Added CommonsenseQA task (EleutherAI#1721)

* Initial configuration

* Using the validation set for the test set, because the test set on HF doesn't have labels

* Probably just makes more sense to have validation be validation

* fix format ; add docs to tasks/README.md

* fix format

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Factor out LM-specific tests (EleutherAI#1859)

* separate out optimum/neuralmagic tests to separate job

* fix vllm tests

* fix bug in --trust_remote_code

* use datasets.config instead intentionally

* fix remote code issue?

* Update interface.md (EleutherAI#1982)

* Update interface.md

update interface to remove link to really outdated commit of evaluator.py

* switch to relative referencing?

* Update interface.md

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Fix `trust_remote_code`-related test failures (EleutherAI#2024)

* make MMLU trust remote code to fix tests

* remove trust remote code

* Fixes scrolls task bug with few_shot examples (EleutherAI#2003)

Bug:

```
python -m scripts.write_out --task scrolls_quality --output_base_path ~/workspace/
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/lm-evaluation-harness/scripts/write_out.py", line 92, in <module>
    main()
  File "/lm-evaluation-harness/scripts/write_out.py", line 51, in main
    task_dict = tasks.get_task_dict(task_names, task_manager)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 423, in get_task_dict
    task_name_from_string_dict = task_manager.load_task_or_group(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 271, in load_task_or_group
    collections.ChainMap(*map(self._load_individual_task_or_group, task_list))
  File "/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 162, in _load_individual_task_or_group
    return load_task(task_config, task=name_or_config, group=parent_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 148, in load_task
    task_object = config["class"]()
                  ^^^^^^^^^^^^^^^^^
  File "/lm-evaluation-harness/lm_eval/tasks/scrolls/task.py", line 120, in __init__
    super().__init__()
  File "/lm-evaluation-harness/lm_eval/api/task.py", line 703, in __init__
    self._config = TaskConfig(**config)
                   ^^^^^^^^^^^^^^^^^^^^
TypeError: lm_eval.api.task.TaskConfig() argument after ** must be a mapping, not NoneType
```
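The traceback boils down to unpacking `None` with `**`; a minimal sketch of the failure and the usual guard (names illustrative, not the actual fix in the harness):

```python
def load_task(config=None):
    # `TaskConfig(**config)` with config=None raises
    # "TypeError: argument after ** must be a mapping, not NoneType";
    # normalizing a missing config to an empty dict first avoids it.
    config = {} if config is None else config
    return dict(**config)
```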

* fix cache (EleutherAI#2037)

* Add chat template to `vllm` (EleutherAI#2034)

* add chat template

* refactor token padding

* nit

* nit

* check on failing test

* check transformers version

* remove transformers pin

* add ids to test

* nit

* fixup

* fix bos bug

* nit

* fixup! fix bos bug

* increase tolerance for table test

* don't detokenize vllm logprobs

* Update lm_eval/models/utils.py

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* pre-commit run --all-files

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fail gracefully upon tokenizer logging failure (EleutherAI#2038)

* ship with exact_match function already used ; don't call evaluate.load() on import (EleutherAI#2045)

* update to v0.4.3 (EleutherAI#2046)

* fix wandb logger module import in example (EleutherAI#2041)

* Fix strip whitespace filter (EleutherAI#2048)

* batch commit

* :Revert "batch commit"

This reverts commit d859d1c.

* batch commit

* checkout from main

* checkout from main

* checkout from main

* checkout from main

* checkout from main

* cleanup

* cleanup

* cleanup

* cleanup

* cleanup

* cleanup

* update gemma-2 default BOS behavior (EleutherAI#2049)

* Update hellaswag.yaml (EleutherAI#2029)

* Adds Open LLM Leaderboard Taks (EleutherAI#2047)

* adds leaderboard tasks

* Delete lm_eval/tasks/leaderboard/leaderboard_chat_template.yaml

* add readme

* Delete lm_eval/tasks/leaderboard/mmlu_pro/mmlu_pro_chat_template.yaml

* modify readme

* fix bbh task

* fix bbh salient task

* modify the readme

* Delete lm_eval/tasks/leaderboard/ifeval/README.md

* Delete lm_eval/tasks/leaderboard/math/README.md

* add leaderboard to the tasks repertory

* add announcement about new leaderboard tasks

* linting

* Update README.md

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* installs ifeval dependency in new_task github workflow

---------

Co-authored-by: Nathan Habib <nathan.habib@huggingface.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* EleutherAI#1442 inverse scaling tasks implementation (EleutherAI#1589)

* initial_implementation (test has to be proceeded)

* minor fix

* revised task name and implemented new task

* minor fixes

* new tasks implement

* minor fix

* added 'prompt injection' task

* delete prompt injection task (will be implemented at next PR)

* trust remote code

* Update lm_eval/tasks/inverse_scaling/README.md

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* added readme

* Update lm_eval/tasks/README.md

* Update lm_eval/tasks/inverse_scaling/_inverse_scaling_mc_yaml

* Update lm_eval/tasks/inverse_scaling/README.md

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/tasks/inverse_scaling/_inverse_scaling_mc_yaml

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update README.md

* precommit?

* run precommit on readme

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Fix TypeError in samplers.py by converting int to str (EleutherAI#2074)

Co-authored-by: yhjo <yhjo@suresofttech.com>

* Group agg rework (EleutherAI#1741)

* add greoup_config arg

* add a group config that allows disabling table for group score and group aggregate in general

* fixed size configuration

* adjust config

* add group config

* adjust mmlu to use group_config

* fixed args input in aggregate_subtask_metrics

* fixed issues related to printing alias of group and updated yaml

* update all mmlu variants to include group_config

* edit format

* modify mmlu tasks

* adjust group to also be a configurable group

* add configurable group

* simplify get_task_list

* adjust group scoring with using ConfigurableGroup

* adjust args

* update mmlu

* update mmlu

* update to work with new group and task configuration

* readd group_agg

* readd files

* move prepare_print_tasks to evaluator_utils

* sort set to False by default, fix predict_only arg

* add version for groups

* reversed task list

* update additional condition when loading a group in a group yaml

* update truthfulqa

* add description regarding tags replacing group

* replace group to tag

* fixed conditional statement

* remove warning

* update loading of task group and newly added tags

* reformat with pre-commit

* fixed info log

* update

* fix bug

* fix bug

* use task id to differentiate tasks

* convert all groups to configurable groups

* use task_id

* reformat

* add task_id for python tasks as well

* add task_id for python tasks as well

* add task_id for python tasks as well

* revert truthfulqa

* revert mmlu tasks

* new mmlu config

* new group config parameter `tag_to_task`

* Update truthfulqa_mc2.yaml

* reformate

* add _process_group_config

* adjust task_id

* add get_subtask_list function to get proper subtask list

* group config to_dict update

* remove tag check

* update mmlu

* fix config passing issues

* add test yaml

* format fix

* add documentation

* corner case for single tag being called

* fix indentation

* formatting

* update all mmlu variants

* Update docs/task_guide.md

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* remove group_alias

* Update docs/task_guide.md

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* remove version for metadata

* Update docs/task_guide.md

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* update mmlu/

* removed " " in make_table

* change how aggregate_metric is loaded

* change how aggregate_metric is loaded

* update aggregate_metric arg

* update format

* update format

* some docs fixes

* add groups for agieval, aexams, aclue

* add more explicit aggregation groups

* add more groupings / tags distinctions

* add more groupings

* more groupings

* add many explicit group configs

* add many explicit group configs

* add more explicit group configs

* add more explicit group configs

* add more error msgs, agg_metric -> agg_metric_list

* some docs updates

* update task_id to be updateable and uses group:task format

* make KMMLU a tag for now

* update docs

* don't duplicate task names

* fix merge conflicts?

* giving this a try

* clean up diff

* switch mmlu variants over to using

* don't use to-be-deprecated group: config field in overview notebook

* Python tasks which subclass ConfigurableTask now run

* update mmlu

* pre-commit format

* fixed sorting for multi-level printing

* move group api to separate file

* fix bbh aggregation filter usage

* track api/group.py

* adjust group and tags loading

* make explicit group configs for leaderboard and other newer tasks

* fix arabicmmlu

* update

* change arabicmmlu template name???

* update group alias

* fix printing bugs

* check table printing is correct ; update tests

* use mmlu_stem to have a group included in print tests

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* we run with bootstrap_iters=0 for printing tests (EleutherAI#2080)

* Easier unitxt tasks loading and removal of unitxt library dependency (EleutherAI#1933)

* Updated unitxt loading

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>

* Revert change to general Readme

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>

* Adjust fda,squadv2,squad_completion and swde to work accept config in the constructor

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>

* Fix scrolls

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Update documentation

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Enforce backward compatibility

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Format unitxt class

Signed-off-by: elronbandel <elron.bandel@ibm.com>

---------

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
Signed-off-by: elronbandel <elron.bandel@ibm.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Allow gating EvaluationTracker HF Hub results; customizability (EleutherAI#2051)

* batch commit

* Revert "batch commit"

This reverts commit d859d1c.

* batch commit

* checkout from main

* checkout from main

* checkout from main

* checkout from main

* checkout from main

* cleanup

* cleanup

* cleanup

* cleanup

* cleanup

* cleanup eval results

* cleanup

* add check for gated repo

* fix jsonline issue

* fix

* add try catch when gating the details repo

* add doc

* adds back hub_repo_name

* readds hub repo name

* Minor doc fix: leaderboard README.md missing mmlu-pro group and task (EleutherAI#2075)

leaderboard README.md missing mmlu-pro group and task

* fix: utf-8 encoding for logged sample files was missing (EleutherAI#2082)

* Update utils.py (EleutherAI#2085)

Group configs with no aggregation print an empty space as the score in the result table.
Example:
```
|    Tasks     |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|--------------|-------|------|-----:|--------|---|-----:|---|-----:|
|group         |    N/A|      |      |        |   |      |   |      |
| - task 0     |Yaml   |none  |     0|acc     |↑  |0.4000|±  |0.0910|
| - task 1     |Yaml   |none  |     0|acc     |↑  |0.3333|±  |0.0875|
| - task 2     |Yaml   |none  |     0|acc     |↑  |0.2667|±  |0.0821|
| - task 3     |Yaml   |none  |     0|acc     |↑  |0.3333|±  |0.0875|
```

So the `v` variable in `make_table` needs to check whether the value is a float or a string.
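As a minimal sketch of the fix (the function name is hypothetical, not the actual `make_table` internals):

```python
def format_value(v):
    """Format one result-table cell: floats get four decimal places,
    while strings (e.g. the blank score of a group with no
    aggregation) pass through unchanged."""
    if isinstance(v, float):
        return f"{v:.4f}"
    return str(v)

print(format_value(0.4))   # 0.4000
print(format_value(""))    # blank cell for an unaggregated group
```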

* batch_size may be str if 'auto' is specified (EleutherAI#2084)

* Prettify lm_eval --tasks list (EleutherAI#1929)

* add  and ; move task list newline logic to new TaskManager.list_all_tasks() method

* format table list into markdown table; add config location column

* add Output Type column

* add logic for printing table of tags separately

* merge with main and fix conflicts ; update docstrings

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* make RougeScorer only initialized once (EleutherAI#2090)
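The single-initialization pattern behind this commit can be sketched with a cached factory; `RougeScorer` here is a stand-in for `rouge_score.rouge_scorer.RougeScorer`, used so the example runs without the library installed:

```python
from functools import lru_cache

class RougeScorer:
    """Stand-in for rouge_score.rouge_scorer.RougeScorer, whose
    construction is expensive enough to be worth doing only once."""
    instances = 0

    def __init__(self, rouge_types):
        RougeScorer.instances += 1
        self.rouge_types = rouge_types

@lru_cache(maxsize=None)
def get_scorer(rouge_types):
    # rouge_types must be hashable (a tuple), so repeated metric
    # calls with the same types reuse a single cached instance
    return RougeScorer(rouge_types)

a = get_scorer(("rouge1", "rougeL"))
b = get_scorer(("rouge1", "rougeL"))
assert a is b and RougeScorer.instances == 1
```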

* Update default.yaml (EleutherAI#2092)

* Add new dataset MMLU-SR tasks (EleutherAI#2032)

* add mmlusr tasks

* renamed all tasks names in mmlusr

* edit format and readme

* added mmlu_sr

* mmlu_sr -> mmlusr

* update

---------

Co-authored-by: lintangsutawika <lintang@eleuther.ai>

* Irokobench: Benchmark Dataset for African languages (EleutherAI#2042)

* add afrixnli to task

* add chat completion

* remove chat completion -untested

* afrimmlu added

* afrimmlu folder update

* afrimmlu folder update

* updated prompt

* remove print

* add afrimgsm -direct

* add squad metric

* fix bash script

* remove direct util, update common yaml

* remove print

* add few shot. metric fixes

* fix direct path, add bash script for gpt models

* added translate test

* update afrixnli tasks

* update afrixnli tasks

* update metrics for afrixnli

* prompt translations fix

* prompt translations fix

* filter and metric fix -mgsm

* remove squad metric

* remove squad metric

* add f1 score to mgsm

* add f1 score to mgsm

* update native-direct with lin

* change f1 function

* add lin to utils

* add utils

* remove test limit

* remove test configs

* add swahili to mmlu

* change eng to ewe in ewe yaml mmlu

* add squad metric to mgsm, remove whitespace filter

* added translate test

* added afrixnli_translate

* fix exact match valueError

* fix exact match valueError

* restructure mmlu folder

* spacing

* remove afrimmlu_translate folder

* add utility

* format task name, clean ups

* modified mgsm

* update on afrimgsm

* update on afrimgsm

* removed utils

* other mgsm varieties

* other mgsm varieties

* adding translate direct

* Update translate_direct_yaml

* add manual xnli prompt, add multichoice for openai models, and adapt multichoice metric for openai model

* edit for open models

* Update translate_direct_yaml

* add verbalizer for xnli

* change xnli from multiple choice to generate

* add manual accuracy scores

* revert xnli to multiple choice

* change afrimgsm utils

* revert xnli to multiple_choice

* cleanups and readmes

* remove openai fixes and unused regex

* pr review changes

* revert metrics.py, task.py and extraction.py to main version

---------

Co-authored-by: Israel Abebe Azime <azime@cg.uni-saarland.de>
Co-authored-by: Israel Abebe Azime <se.israel.abebe@gmail.com>

* docs: remove trailing sentence from contribution doc (EleutherAI#2098)

Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

* Added MedConceptsQA Benchmark (EleutherAI#2010)

* Added MedConceptsQA Benchmark

* pre-commit factor

* update group name

* update in naming

* changed name

* Changed mcqa to med_concepts_qa prefix

* Added med_concepts_qa to README.md

* Changed config files according to the new format

* Updated README

---------

Co-authored-by: lintangsutawika <lintang@eleuther.ai>

* make recurrent_gemma model types included in the force-BOS case (EleutherAI#2105)

* formatting (EleutherAI#2104)

* docs: align local test command to match CI (EleutherAI#2100)

Also add 'test_logs/' to .gitignore

Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

* Fixed colon in Belebele _default_template_yaml (EleutherAI#2111)

* [python] fix haerae tasks (EleutherAI#2112)

* fix: broken discord link in CONTRIBUTING.md (EleutherAI#2114)

Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

* docs: update truthfulqa tasks (EleutherAI#2119)

* fix caching module (hotfix for now) (EleutherAI#2124)

* Refactor API models (EleutherAI#2008)

* refactor pad_token handling to fn

* fix docs

* add pad_token_handling to vllm

* start on API superclass

* don't detokenize the returned logits

* streamline vllm tokenizer

* add type hint

* pre-commit

* seems to be in working order

* add model to init

* refactor api models

* nit

* cleanup

* add pbar

* fix type hints

* change optional dependencies

* json encode chat template

* add type hints

* deal with different prompt input requirements

* nits

* fix

* cache inside async

* fix

* fix

* nits

* nits

* nits

* nit

* fixup

* fixup

* nit

* add dummy retry

* add dummy retry

* handle imports; skip failing test

* add type hint

* add tests

* add dependency to tests

* add package names to exception

* nit

* docs; type hints

* handle api key

* nit

* tokenizer bug

* fix tokenizer

* nit

* nit

* add better error messages

* nit

* remove decorator

* CI: install api dep

* revert evaluator.py

* consolidate

* consolidate

* nits

* nit

* fix typealias

* nit

* nit

* nit

* Update lm_eval/models/api_models.py

typo

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/models/openai_completions.py

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/models/anthropic_llms.py

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/models/api_models.py

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fix typo

* add news section

* add info for API

* pre-commit

* typo

* fix bug: unpack loglikelihood requests

* fix bug: shared gen_kwargs mutated

* nit: handle copy properly

* Update README.md

* Update README.md

* Update README.md

* Update api_models.py

* Update README.md

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* bugfix and docs for API (EleutherAI#2139)

* encoding bugfix

* encoding bugfix

* overload loglikelihood rather than loglikelihood_tokens

* add custom tokenizer

* add docs

* Update API_guide.md

fix link; add note

* Update API_guide.md

typo

* pre-commit

* add link in readme

* nit

* nit

* nit

* Update API_guide.md

nits

* Update API_guide.md

* Update API_guide.md

* Update API_guide.md

* Update API_guide.md

* Update README.md

* Update docs/API_guide.md

* Update docs/API_guide.md

* Update API_guide.md

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* [Bugfix] add temperature=0 to logprobs and seed args to API models (EleutherAI#2149)

* add temperature for log probs

* add seed

* nit

* add new args to test

* added warning for api chat models
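A hedged sketch of what such a request body might look like (field names follow the OpenAI-style completions API; the helper name is hypothetical and the exact payload in `api_models.py` may differ):

```python
def build_logprob_payload(prompt, seed=1234):
    """Hypothetical helper: loglikelihood-style API requests pin
    temperature to 0 so scoring is deterministic, and pass a seed
    for reproducibility where the backend supports it."""
    return {
        "prompt": prompt,
        "max_tokens": 0,   # generate nothing; just score the prompt
        "echo": True,      # return logprobs for the prompt tokens
        "logprobs": 1,
        "temperature": 0,
        "seed": seed,
    }

payload = build_logprob_payload("The capital of France is Paris")
```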

* refactor: limit usage of `scipy` and `skilearn` dependencies (EleutherAI#2097)

* refactor: move scipy and sklearn module imports to func imports

Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
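The first commit's pattern, roughly: move heavy imports from module level into the function body, so `import lm_eval` no longer pulls in scikit-learn (a sketch, not the exact harness code):

```python
def matthews_corrcoef_metric(items):
    """Compute MCC over (gold, pred) pairs. The sklearn import is
    deferred to call time, so the dependency is only required when
    this metric is actually evaluated."""
    from sklearn.metrics import matthews_corrcoef  # deferred import

    golds, preds = zip(*items)
    return matthews_corrcoef(golds, preds)
```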

* refactor: consolidate weighted_f1_score func into lm_eval utils

Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
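The consolidated helper might look roughly like this in `lm_eval/utils.py`, with each task's local `utils.py` re-importing it so existing YAML `!function utils.weighted_f1_score` references keep resolving (a sketch under those assumptions):

```python
def weighted_f1_score(items):
    """Weighted-average F1 over (gold, pred) pairs; defined once in
    lm_eval.utils instead of duplicated in each task's utils module."""
    from sklearn.metrics import f1_score  # deferred, like the other metrics

    golds, preds = zip(*items)
    return f1_score(golds, preds, average="weighted")
```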

* lint: allow for utils file to have unused imports

this allows shared functions to be defined only
once while keeping the YAML function importing
mechanism working

Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

---------

Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

---------

Signed-off-by: changwangss <chang1.wang@intel.com>
Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
Signed-off-by: elronbandel <elron.bandel@ibm.com>
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
Co-authored-by: Nick Doiron <ndoiron@mapmeld.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: Zafir Stojanovski <zafir.stojanovski@icloud.com>
Co-authored-by: zhabuye <74179177+zhabuye@users.noreply.github.com>
Co-authored-by: Edward Gan <efuzzy@gmail.com>
Co-authored-by: DongGeon Lee <dg.lee@postech.ac.kr>
Co-authored-by: Huazhong Ji <hzji210@gmail.com>
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: jiaqiw09 <jiaqiw960714@gmail.com>
Co-authored-by: zhabuye <2947436155@qq.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
Co-authored-by: KonradSzafer <61851539+KonradSzafer@users.noreply.github.com>
Co-authored-by: Nathan Habib <nathan.habib@huggingface.com>
Co-authored-by: Alina Lozovskaia <alinailozovskaya@gmail.com>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
Co-authored-by: LSinev <LSinev@users.noreply.github.com>
Co-authored-by: anthony-dipofi <anthonydipofi@gmail.com>
Co-authored-by: Harish Vadaparty <harishvadaparty@gmail.com>
Co-authored-by: Maxime <672982+maximegmd@users.noreply.github.com>
Co-authored-by: MorishT <106973776+MorishT@users.noreply.github.com>
Co-authored-by: Iker García-Ferrero <i.garciaferrerosanpelayo@gmail.com>
Co-authored-by: khalil <90086758+khalil-Hennara@users.noreply.github.com>
Co-authored-by: Zafir Stojanovski <zaf.stojano@gmail.com>
Co-authored-by: Sadra Barikbin <sadraqazvin1@yahoo.com>
Co-authored-by: Nikita Lozhnikov <nikitml@gmail.com>
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
Co-authored-by: johnwee1 <91670254+johnwee1@users.noreply.github.com>
Co-authored-by: Wang, Chang <491521017@qq.com>
Co-authored-by: Yazeed Alnumay <61038456+Yazeed7@users.noreply.github.com>
Co-authored-by: Julen Etxaniz <juletxara@gmail.com>
Co-authored-by: achervyakov <77295913+artemorloff@users.noreply.github.com>
Co-authored-by: Stella Biderman <stellabiderman@gmail.com>
Co-authored-by: jonabur <135807120+jonabur@users.noreply.github.com>
Co-authored-by: Brendan Murphy <bmurphy592@gmail.com>
Co-authored-by: Steven Basart <xksteven@users.noreply.github.com>
Co-authored-by: Ogundepo Odunayo <ogundepoodunayo@gmail.com>
Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
Co-authored-by: Hanwool Albert Lee <88315152+h-albert-lee@users.noreply.github.com>
Co-authored-by: Choyunhui <a01022371341@gmail.com>
Co-authored-by: yhjo <yhjo@suresofttech.com>
Co-authored-by: Elron Bandel <elronbandel@gmail.com>
Co-authored-by: Pankaj Mathur <pankymathur@gmail.com>
Co-authored-by: meg <90473723+meg-huggingface@users.noreply.github.com>
Co-authored-by: Wonung Kim <waneon.kim@gmail.com>
Co-authored-by: SuperCat <37853425+SkySuperCat@users.noreply.github.com>
Co-authored-by: Jess <jessicaojo19@gmail.com>
Co-authored-by: Israel Abebe Azime <azime@cg.uni-saarland.de>
Co-authored-by: Israel Abebe Azime <se.israel.abebe@gmail.com>
Co-authored-by: Nathan Weinberg <31703736+nathan-weinberg@users.noreply.github.com>
Co-authored-by: Ben Shoham Ofir <33639234+Ofir408@users.noreply.github.com>
Co-authored-by: jab13x <117719136+jab13x@users.noreply.github.com>
Co-authored-by: Jungwhan Kim <53588015+jungwhank@users.noreply.github.com>
Co-authored-by: Jennifer Cwagenberg <candiedcode@gmail.com>
@nathan-weinberg nathan-weinberg deleted the dep-update branch August 2, 2024 16:22
jmercat pushed a commit to TRI-ML/lm-evaluation-harness that referenced this pull request Sep 25, 2024
…rAI#2097)

* refactor: move scipy and sklearn module imports to func imports

Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

* refactor: consolidate weighted_f1_score func into lm_eval utils

Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

* lint: allow for utils file to have unused imports

this allows shared functions to be defined only
once while keeping the YAML function importing
mechanism working

Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

---------

Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
Successfully merging this pull request may close these issues.

Limiting scipy integration