forked from EleutherAI/lm-evaluation-harness
main to based #3
Open
sedrick-keh-tri wants to merge 282 commits into based-fork-2 from main
Conversation
* use `@ray.remote` with distributed vLLM
* update versions
* bugfix
* unpin vllm
* fix pre-commit
* added version assertion error
* Revert "added version assertion error" (this reverts commit 8041e9b)
* added version assertion for DP
* expand DP note
* add warning
* nit
* pin vllm
* fix typos
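The `@ray.remote` change in the commit above refers to running vLLM workers as Ray actors for data-parallel evaluation. A minimal sketch of that pattern (the worker class and its contents are illustrative, not the harness's actual code):

```python
import ray

ray.init(ignore_reinit_error=True)

@ray.remote  # in a real setup this would reserve a GPU, e.g. @ray.remote(num_gpus=1)
class VLLMWorker:
    """Hypothetical data-parallel worker; a real one would build a vllm.LLM engine."""

    def __init__(self, model_name: str):
        self.model_name = model_name

    def generate(self, prompts: list[str]) -> list[str]:
        # placeholder: echo prompts instead of calling an engine
        return [f"{self.model_name}: {p}" for p in prompts]

# fan prompts out across two actors and gather the results
workers = [VLLMWorker.remote("my-model") for _ in range(2)]
futures = [w.generate.remote(["prompt A", "prompt B"]) for w in workers]
print(ray.get(futures))
```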
…ity (EleutherAI#1487)
* setting trust_remote_code
* dataset list no notebooks
* respect trust remote code
* Address changes, move cli options and change datasets
* fix task for tests
* headqa
* remove kobest
* pin datasets and address comments
* clean up space
* add french-bench
* rename arc easy
* linting
* update datasets for no remote code exec
* fix string delimiter
* add info to readme
* trim trailing whitespace
* add detailed groups
* add info to readme
* remove orangesum title from fbench main
* Force PPL tasks to be 0-shot

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
* Fix padding * Fix elif in model loading * format
* Add new tasks of GPQA
* Add README
* Remove unused functions
* Remove unused functions
* Linters
* Add flexible match
* update
* Remove duplicate function
* Linter
* update
* Update lm_eval/filters/extraction.py (Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>)
* register multi_choice_regex
* Update
* run precommit

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
* Start adding eq-bench * Start adding to yaml and utils * Get metric working * Add README * Handle cases where answer is not parseable * Deal with unparseable answers and add percent_parseable metric * Update README
* init wmdp yaml file * Add WMDP Multiple-choice * fix linter issues * Delete lm_eval/tasks/wmdp/_wmdp.yaml --------- Co-authored-by: Lintang Sutawika <lintang@sutawika.com>
…used by cot which hardcodes fewshot prompt (EleutherAI#1502)
…eutherAI#1533)
* Remove unused `decontamination_ngrams_path` and all mentions (still no alternative path provided)
* Fix improper import of LM and usage of evaluator in one of scripts
* update type hints in instance and task api
* raising errors in task.py instead of asserts
* Fix warnings from ruff
* raising errors in __main__.py instead of asserts
* raising errors in tasks/__init__.py instead of asserts
* raising errors in evaluator.py instead of asserts
* evaluator: update type hints and remove unused variables in code
* Update lm_eval/__main__.py (Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>)
* Update lm_eval/__main__.py (Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>)
* Update lm_eval/api/task.py (Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>)
* Update lm_eval/api/task.py (Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>)
* Update lm_eval/api/task.py (Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>)
* Update lm_eval/evaluator.py (Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>)
* pre-commit induced fixes

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
…g document and update wandb_args description (EleutherAI#1536) * Update openai completions and docs/CONTRIBUTING.md * Update wandb args description * Update docs/interface.md --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
* Add compatibility for vLLM's new Logprob object * Fix * Update lm_eval/models/vllm_causallms.py * fix format? * trailing whitespace --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
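The Logprob compatibility commit above deals with vLLM versions where per-token logprobs are returned as objects rather than plain floats. A hedged sketch of the usual defensive pattern (not the harness's exact code):

```python
def extract_logprob(value):
    """Return a float whether `value` is a raw float (older vLLM)
    or an object exposing a `.logprob` attribute (newer vLLM)."""
    return getattr(value, "logprob", value)
```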
…leutherAI#1551) * update gen_kwargs in code2-text-go.yaml * update gen_kwargs in rest code2-text
* Support jinja templating for "description" * Update task_guide.md * Update lm_eval/api/task.py * fix format? * whitespace errors * fix whitespace * fix bad variable reference --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
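The jinja templating commit above lets a task's `description` field be rendered as a template before it is prepended to prompts. A minimal illustration with `jinja2` (the template text and context variable here are made up for the example):

```python
from jinja2 import Template

# hypothetical description string containing a Jinja placeholder
description = "The following are questions about {{ subject }}.\n\n"
print(Template(description).render(subject="college chemistry"))
```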
* add Arabic EXAMS benchmark * fixed the linter issue, and add more information on the readme * Update README.md --------- Co-authored-by: Lintang Sutawika <lintang@sutawika.com>
* add agieval * fix typo * add cloze / math exactmatch agieval tasks, rename * update exact-match agieval tasks, allow for multiple-correct answers * add more detail to readme * don't parse_math_answer twice --------- Co-authored-by: Alex Bäuerle <alex@a13x.io>
…ng the checkpoint.
swde and fda
* Update IFEval dataset to official one

  This PR updates the IFEval dataset to the one hosted under the Google org: https://huggingface.co/datasets/google/IFEval. Note the main change is an updated prompt from this commit in the GitHub repo: google-research/google-research@26d8ccd
* Update ifeval.yaml

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
* multiple chat template support
* help doc update
* add transformers link to docstring
* model args update
* comment update
* statement simplification
* simplified chat_template property
* docs update
* removed template arg from HFLM class
* interface doc update
* model guide update
* interface doc update
* reuse apply_chat_template variable
* model guide refactor
* interface doc update
* removed old definition
* last nits
* last nits
* last nits
* better wording
* last nits
* Remove unnecessary Optional
* Apply suggestions from code review (Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>)
* return variable rename

---------
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
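The chat-template support above builds on the tokenizer's own chat template in `transformers`; the underlying call looks roughly like this (the model name and messages are only examples):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")  # example model
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2 + 2?"},
]
# render the conversation into the model's expected prompt format
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```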
* fix: arguments data * fix based on comment * Update zeno_visualize.py updated all output types --------- Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
* mela * Update mela_en.yaml * Create _mela.yaml --------- Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>
* fix the regex string in yaml file * Update samplers.py --------- Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>
ACLUE BibTeX typo: reported to the ACL Anthology and fixed here, as the title in the PDF is correct.
* Created DUP eval code for gsm8k
* asdiv
* Fixed fewshot=8 issue
* added results to .gitignore
* reverted unnecessary changes and moved results + gsm8k_dup out of repo to prepare for pull req
* fixed whitespace and unintentional hardcoded version change information
* created mbpp task
* Reverted changes re. mbpp to save for a future pull req
* reverted metrics.py to previous commit
* updated asdiv readme to include information about new asdiv_cot_llama task
* Apply suggestions from code review

---------
Co-authored-by: Alexander Detkov <alexander.d.detkov@gmail.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
* chat template hotfix * pre-commit
…#2258) * Update evaluator.py * update error msg
* max_length - 1 (generation always >= 1) * vllm: fix rolling prefix_token * nit: add comment * fixup! max_length should be handled for loglikelihoods
* max_length - 1 (generation always >= 1) * vllm: fix rolling prefix_token * nit: add comment * fixup! max_length should be handled for loglikelihoods * Revert "fixup! max_length should be handled for loglikelihoods" This reverts commit 432d1a3.
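The `max_length - 1` commits above reserve one position of the context window for the first generated token when truncating prompts. The idea, as a hedged sketch with illustrative names:

```python
def truncate_prompt(prompt_tokens: list[int], max_length: int) -> list[int]:
    # generation always produces at least one token, so at most
    # max_length - 1 prompt tokens can fit in the context window
    return prompt_tokens[-(max_length - 1):]
```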
* default chat template method fix * move chat_template to TemplateLM * remove hotfix * handle openai `chat_template` * Update lm_eval/api/model.py Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * add 'max_tokens' to gen_kwargs * pre-commit --------- Co-authored-by: KonradSzafer <szafer.konrad@gmail.com> Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
…leutherAI#2232)
* arabic leaderboard yaml file is added
* arabic toxigen is implemented
* Dataset library is imported
* arabic sciq is added
* util file of arabic toxigen is updated
* arabic race is added
* arabic piqa is implemented
* arabic open qa is added
* arabic copa is implemented
* arabic boolq is added
* arabic arc easy is added
* arabic arc challenge is added
* arabic exams benchmark is implemented
* arabic hellaswag is added
* arabic leaderboard yaml file metrics are updated
* arabic mmlu benchmarks are added
* arabic mmlu group yaml file is updated
* alghafa benchmarks are added
* acva benchmarks are added
* acva utils.py is updated
* light version of arabic leaderboard benchmarks are added
* bugs fixed
* bug fixed
* bug fixed
* bug fixed
* bug fixed
* bug fixed
* library import bug is fixed
* doc to target updated
* bash file is deleted
* results folder is deleted
* leaderboard groups are added
* full arabic leaderboard groups are added, plus some bug fixes to the light version
* Create README.md (README.md for arabic_leaderboard_complete)
* Create README.md (README.md for arabic_leaderboard_light)
* Delete lm_eval/tasks/arabic_leaderboard directory
* Update README.md
* Update README.md (adding the Arabic leaderboards to the library)
* Update README.md (10% of the training set)
* Update README.md (10% of the training set)
* revert .gitignore to prev version
* Update lm_eval/tasks/README.md (Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>)
* updated main README.md
* Update lm_eval/tasks/README.md
* specify machine translated benchmarks (complete)
* specify machine translated benchmarks (light version)
* add alghafa to the related task names (complete and light)
* add 'acva' to the related task names (complete and light)
* add 'arabic_leaderboard' to all the groups (complete and light)
* all dataset - not a random sample
* added more accurate details to the readme file
* added mt_mmlu from okapi
* Update lm_eval/tasks/README.md (Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>)
* Update lm_eval/tasks/README.md (Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>)
* updated mt_mmlu readme
* renaming 'alghafa' full and light
* renaming 'arabic_mmlu' light and full
* renaming 'acva' full and light
* update readme and standardize dir/file names
* running pre-commit

---------
Co-authored-by: shahrzads <sayehban@ualberta.ca>
Co-authored-by: shahrzads <56282669+shahrzads@users.noreply.github.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
* add WIP hf vlm class
* add doc_to_image
* add mmmu tasks
* fix merge conflicts
* add lintang's changes to hf_vlms.py
* fix doc_to_image
* added yaml_path for config-loading
* revert
* add line to process str type v
* update
* modeling cleanup
* add aggregation for mmmu
* rewrite MMMU processing code based on only MMMU authors' repo (doc_to_image still WIP)
* implemented doc_to_image
* update doc_to_image to accept list of features
* update functions
* readd image processed
* update args process
* bugfix for repeated images fed to model
* push WIP loglikelihood code
* commit most recent code (generative ; qwen2-vl testing)
* preliminary image_token_id handling
* small mmmu update: some qs have >4 mcqa options
* push updated modeling code
* use processor.apply_chat_template
* add mathvista draft
* nit
* nit
* ensure no footguns in text<>multimodal LM<>task incompatibility
* add notification to readme regarding launch of prototype!
* fix compatibility check
* reorganize mmmu configs
* chat_template=None
* add interleave chat_template
* add condition
* add max_images; interleave=true
* nit
* testmini_mcq
* nit
* pass image string; convert img
* add vllm
* add init
* vlm add multi attr
* fixup
* pass max images to vllm model init
* nit
* encoding to device
* fix HFMultimodalLM.chat_template ?
* add mmmu readme
* remove erroneous prints
* use HFMultimodalLM.chat_template ; restore tasks/__init__.py
* add docstring for replace_placeholders in utils
* fix `replace_placeholders`; set image_string=None
* fix typo
* cleanup + fix merge conflicts
* update MMMU readme
* del mathvista
* add some sample scores
* Update README.md
* add log msg for image_string value

---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
Co-authored-by: Baber Abbasi <baber@eleuther.ai>
Co-authored-by: Baber <baber@hey.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
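The `replace_placeholders` utility mentioned above swaps a generic image placeholder in the prompt text for the model's own image token, capped at `max_images`. A hypothetical sketch of that behaviour (names and logic are illustrative, not the actual implementation):

```python
def replace_placeholders(text: str, placeholder: str, image_token: str, max_images: int) -> str:
    """Replace up to `max_images` occurrences of `placeholder` with `image_token`;
    placeholders beyond the limit are dropped."""
    parts = text.split(placeholder)
    out = parts[0]
    for i, part in enumerate(parts[1:]):
        out += (image_token if i < max_images else "") + part
    return out

print(replace_placeholders("<image> A cat. <image> A dog.", "<image>", "<|image_pad|>", 1))
```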
* Update README.md

  I encountered Git buffer size limits when trying to download the full commit history of the repository, such as:

  ```
  error: RPC failed; curl 18 transfer closed with outstanding read data remaining
  error: 5815 bytes of body are still expected
  fetch-pack: unexpected disconnect while reading sideband packet
  fatal: early EOF
  ```

  Downloading only the latest version of the repository makes installation faster and avoids these errors.
* Fix linting issue
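The README change above is about downloading only the latest version of the repository rather than its full history; one common way to do that is a shallow clone. A sketch of that workaround (this exact command is an assumption, shown via Python's `subprocess` for consistency with the other snippets; the underlying command is just `git clone --depth 1 ...`):

```python
import subprocess

# fetch only the most recent commit to avoid transferring the full history
subprocess.run(
    ["git", "clone", "--depth", "1",
     "https://github.com/EleutherAI/lm-evaluation-harness.git"],
    check=True,
)
```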
* feat(neuron): align with latest optimum-neuron
* feat(neuron): support pre-exported neuron models
* fix(neuron): correctly use max_length
* fix(neuron): adapt loglikelihood

  The evaluation of log likelihood was not working for neuron models using continuous batching, such as all cached neuron Llama models.
* refactor(neuron): remove dead code
No description provided.