[Repr] Provide reproduce environment and descriptions for llava-1.5 (#62)

* Refactor logging in lmms_eval package

* Refactor variable names in lmms_eval package

* Update README.md with new features and installation instructions

* Update supported models and datasets

* Delete otter.py file

* Fix capitalization in README.md

* Update image sizes and add new features

* Refactor README.md to improve readability and add new features

* Add description for lmms-eval in README.md

* Update accelerator support in README.md

* Update lmms-eval README with improved description and additional features

* Update README.md with improved task grouping description

* change `Otter-AI/MME` to `lmms-lab/MME`

* Update README.md

* Update README.md

* Remove unused code in mme.yaml

* Squashed commit of the following:

commit 6b20902
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (#57)

    * add instructblip

    * minicpm_v

    * remove <image> from qwen-vl

    * speed up postprocessing

    * Optimize build context speed

    ---------

    Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit 21050ba
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (#56)

    * refactor vizwizvqa task

    * Delete vqav2_test and vqav2_val YAML files

    * Refactor vqav2_process_results functions

    * Add a pack for vqav2

    * refactor okvqa

    * roll back vizwiz_vqa

    * Fix exact_match calculation in ok_vqa_process_results

    * Update OKVQA dataset name in readme

    * add model_specific_prompt_kwargs

    * add model_specific_prompt_kwargs to vizwiz_vqa

    * add model_specific_prompt_kwargs for vqav2

    * lint

    * fix a small bug for eval_logger

    * Refactor make_table function to display points as "  -  " if value is None

    * Merge commit 'c5e52a785d3cc87a866be9b880deb477d9f73fb7'

    * Refactor ok_vqa_aggreate_submissions function

    * Merge commit 'e5aa0a9601d6d8ce727315e4b0a8f13f06f26bff'

    * Refactor VQA submission file saving

    * Update file utils

    * Merge commit '560deca9f72483ca091795d6dc2537d4c54b32b0'

    * Refactor file path handling and submission generation

    * OKVQA path

    * vizwizvqa file

    * pack cmmmu

    * fix a small metric bug for cmmmu

    * Add higher_is_better flag to submission metric

    * Add CMMMU dataset to README.md

    * Add logging and refactor submission file generation in docvqa utils.py

    * pack docvqa

    * add traceback to print detailed error

    * Refactor docvqa_test_aggregate_results to accept additional arguments

    * Add metric check in evaluator.py and update test.yaml and val.yaml

    * add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

    * merge textvqa

    * textvqa

    * Modify submission file generation for COCO test results

    * Update test result storage path

    * update coco cap file name

    * Update COCO 2017 Caption dataset name

    * ferret

    * Add Ferret dataset

    * Refactor hb_doc_to_text function to include model-specific prompts

    * Add IconQA and its subtasks

    * Refactor image list creation in doc_to_visual function

    * Add process_results function to default template

    * Update process_results function in iconqa utils.py

    * refactor flickr30k

    * change aggregation function

    * Fix formatting issues and update logging message

    * Fix llava can not handle only text question (no visuals)

    * Fix qwen can not handle no image question (no visuals)

    * Add fuyu prepare accelerator scripts

    * refactor mme

    * naming consistency

    * aggregation_submissions consistency

    * flickr30k naming consistency

    * remove submissions for mme

    * remove unused submission function

    * Refactor infovqa_test.yaml and infovqa_val.yaml

    * Refactor code for improved readability and maintainability

    * stvqa

    * remane sqa

    * Update lmms_eval textcaps files and utils.py

    * Update default prompt for text captions

    * Refactor textcaps_aggregation_result function

    * Add generate_submission_file function and update mathvista_aggregate_results signature

    * Update nocaps_test.yaml and nocaps_val.yaml

    * refractor internal_eval

    * Add internal evaluation datasets

    * pack multidocvqa

    * mmvet

    * Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re evaluating

    * Refractor llava wild

    * Refractor llava-bench-coco

    * Add JSON file generation for gpt evaluation details

    * mmmu

    * Remove MMBench English and Chinese tasks

    * Remove unnecessary return statement in mmbench_aggregate_test_results function

    * Fix distributed process group initialization

    * Update dataset paths and group names in mmbench test configs

    * Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

    * Add torch module import

    * lint

    * Remove IconQA dataset from README.md

    * Add Multi-DocVQA and its submodules

    * Add new datasets and update task names

    * Refactor flickr_aggregation_result function to accept additional arguments

    * Add timeout kwargs in Accelerator constructor

    * Add encoding to be utf-8 for cmmmu

    * Fix llava try and catch, remove torch.distributed.init in main

    * Ds prepare script for llava

    ---------

    Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit ba0e7f5
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

* add llava main in pyproject

* Update README.md

* Remove unnecessary dependencies and add specific version for llava_repr

* Add dependencies for llava_repr***

* Update README.md

* add some docs on models and command line commands

* remove some lines

* typo

* Update model_guide.md

* Update model_guide.md

* Update README.md

* Update README.md

* Update README.md

* Fix refcocog dataset path

* Record gpt response in eval info

* Resolve conflict

* Fix hallusionbench gpt json saving path

* Rename hallubench gpt output path

* Change remove image to check by type instead of check by names

* More robust check by type

* Add timeout to API requests

* Remove unnecessary img in data

* Forcing an empty commit.

* Testing

* Delete unnecessary things

* Fix error logging in get_chat_response function

* Fix seedbench2 image issue in doc_to_text

* Add conditional exclude for internal eval

* Squashed commit of the following:

commit faf9cf65cf5b1e036ee3a74428e8bb1490e8b2eb
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:49:36 2024 +0000

    Add conditional exclude for internal eval

commit e3729eb925b718a44b6eb225ef9b41c7fd2408e0
Merge: a3cae8e ffb9eb2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:24:29 2024 +0000

    Merge branch 'dev/readme' into kc/final_fix

commit 50b697a7ae93b0547484e1cd753722c1d2513349
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 02:47:31 2024 +0000

    Fix seedbench2 image issue in doc_to_text

commit 17425b5dce41cf67b96c5875139b57d6c7a423df
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:32:49 2024 +0000

    Delete unnecessary things

commit 1bc17d54e79e79d11419ba89e7d8e55bc8cfa21b
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:31:42 2024 +0000

    Testing

commit a20bbc30ab576d3e2a587c70af1b7c06575bcd8b
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:29:30 2024 +0000

    Forcing an empty commit.

commit e2b657694b888ef59b9f896415e7c4c82497e7bf
Merge: 786f2b5 1700786
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:56 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 6447d521842b9f83f5119cdcd7714c8f6053ca73
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:20 2024 +0000

    Remove unnecessary img in data

commit 8ac333a2e9ebbe6318d536b6589f767f71fbc092
Merge: 4240785 888c1c1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:41:24 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 9e542ce049f68f49a237be165e3ad9cde7408ac0
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:40:51 2024 +0000

    More robust check by type

commit f90ccf7b94b130e118b4eca321f68b81e7ab5850
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:00:57 2024 +0000

    Change remove image to check by type instead of check by names

commit f651a77707a4c723ebffb07f2a87743bf42ecea7
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 12:33:02 2024 +0000

    Rename hallubench gpt output path

commit a683559c704806b7abde5e4c8355f556f3e65866
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 09:32:52 2024 +0000

    Fix hallusionbench gpt json saving path

commit 8e246e2466f3dd14a5e34f720269d7991a6dcf6b
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:51:13 2024 +0000

    Resolve conflict

commit 67f00dc4652d09c662e5202ff7e5fbf7bebcdaf6
Merge: 9cf86fa 93534dc
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:37:21 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 53b7a845fe8412a652905101ec036c84e77a20c2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:55:03 2024 +0000

    Record gpt response in eval info

commit 920b4112c4508e9a8afe824678958f2e78189e4e
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:49:01 2024 +0000

    Fix refcocog dataset path

* Fix small bugs in list_with_num

* Revise list_with_num model args

* Dev/readme rm rolling (#60)

* remove log_likelyhood_rolling

* Update time efficiency benchmark in README.md

* add task guide

---------

Co-authored-by: jzhang38 <a1286225768@gmail.com>
Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>

* Remove unnecessary code and update dependencies

* Fix logging utils bug on wandb grouping

* Add reproduce envs

* Squashed commit of the following:

commit 74fff73053b88a90d0f4229a9c748256080fea08
Merge: 2475639 f89a736
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 22:12:12 2024 +0800

    Merge branch 'main' into kc/final_fix

commit 0c640a636e3882859a17e30a5c3504850a3d02d6
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 22:11:04 2024 +0800

    Add reproduce envs

commit 7f2b2c3
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Mar 3 21:19:15 2024 +0800

    [Fix] wandb group logging missing columns (#61)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

    * Update README.md with new features and installation instructions

    * Update supported models and datasets

    * Delete otter.py file

    * Fix capitalization in README.md

    * Update image sizes and add new features

    * Refactor README.md to improve readability and add new features

    * Add description for lmms-eval in README.md

    * Update accelerator support in README.md

    * Update lmms-eval README with improved description and additional features

    * Update README.md with improved task grouping description

    * change `Otter-AI/MME` to `lmms-lab/MME`

    * Update README.md

    * Update README.md

    * Remove unused code in mme.yaml

    * add llava main in pyproject

    * Update README.md

    * Remove unnecessary dependencies and add specific version for llava_repr

    * Add dependencies for llava_repr***

    * Update README.md

    * add some docs on models and command line commands

    * remove some lines

    * typo

    * Update model_guide.md

    * Update model_guide.md

    * Update README.md

    * Update README.md

    * Update README.md

    * Fix refcocog dataset path

    * Record gpt response in eval info

    * Resolve conflict

    * Fix hallusionbench gpt json saving path

    * Rename hallubench gpt output path

    * Change remove image to check by type instead of check by names

    * More robust check by type

    * Remove unnecessary img in data

    * Forcing an empty commit.

    * Testing

    * Delete unnecessary things

    * Fix seedbench2 image issue in doc_to_text

    * Add conditional exclude for internal eval

    * Fix small bugs in list_with_num

    * Revise list_with_num model args

    * Fix logging utils bug on wandb grouping

    ---------

    Co-authored-by: Bo Li <drluodian@gmail.com>
    Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Co-authored-by: jzhang38 <a1286225768@gmail.com>

commit bebff9fad2a60bc0ac52ddc430e5d9e4e0ef6c24
Merge: 83358a4 5e1c9c7
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 07:25:48 2024 +0000

    Merge branch 'main' into kc/final_fix

commit 5042bb0c2ed4f830dda6bcd14231b1f8763aa95f
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 07:23:19 2024 +0000

    Fix logging utils bug on wandb grouping

commit c82042b
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Mar 3 13:01:11 2024 +0800

    [Fix] refcocog dataset path, record gpt prompt in internal eval, build context issue (#59)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

    * Update README.md with new features and installation instructions

    * Update supported models and datasets

    * Delete otter.py file

    * Fix capitalization in README.md

    * Update image sizes and add new features

    * Refactor README.md to improve readability and add new features

    * Add description for lmms-eval in README.md

    * Update accelerator support in README.md

    * Update lmms-eval README with improved description and additional features

    * Update README.md with improved task grouping description

    * change `Otter-AI/MME` to `lmms-lab/MME`

    * Update README.md

    * Update README.md

    * Remove unused code in mme.yaml

    * add llava main in pyproject

    * Update README.md

    * Remove unnecessary dependencies and add specific version for llava_repr

    * Add dependencies for llava_repr***

    * Update README.md

    * add some docs on models and command line commands

    * remove some lines

    * typo

    * Update model_guide.md

    * Update model_guide.md

    * Update README.md

    * Update README.md

    * Update README.md

    * Fix refcocog dataset path

    * Record gpt response in eval info

    * Resolve conflict

    * Fix hallusionbench gpt json saving path

    * Rename hallubench gpt output path

    * Change remove image to check by type instead of check by names

    * More robust check by type

    * Remove unnecessary img in data

    * Forcing an empty commit.

    * Testing

    * Delete unnecessary things

    * Fix seedbench2 image issue in doc_to_text

    * Add conditional exclude for internal eval

    * Fix small bugs in list_with_num

    * Revise list_with_num model args

    ---------

    Co-authored-by: Bo Li <drluodian@gmail.com>
    Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Co-authored-by: jzhang38 <a1286225768@gmail.com>

commit d78a3d7a53f5285a7eac39ce8f04e9854fdb3e73
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 05:58:08 2024 +0000

    Revise list_with_num model args

commit 8eefaec8489d48613de9395eb8e8150224985e01
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 05:09:15 2024 +0000

    Fix small bugs in list_with_num

* Update commands.md

* Add repr_scripts for reference

* Add timeout for gpt4V

* Remove unnecessary dependencies

* Add reproduce into readme

* Revise seedbench process_result

* Fix exclude dc hardcode postprocess logic error

* Fix metric repeat issue

* Update dataset runtime and add environment info

* Revise val submission file saving path

* Put the correct query into the gpt extraction

* Update sleep time in utils.py

* update

---------

Co-authored-by: Bo Li <drluodian@gmail.com>
Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
Co-authored-by: jzhang38 <a1286225768@gmail.com>
4 people authored Mar 5, 2024
1 parent 7f2b2c3 commit 06a3bc3
Showing 34 changed files with 290 additions and 270 deletions.
23 changes: 13 additions & 10 deletions README.md
@@ -22,18 +22,17 @@ You can evaluate the models on multiple datasets with a single command. No model
### Accelerator support and Tasks grouping.
We support the usage of `accelerate` to wrap the model for distributed evaluation, supporting multi-gpu and tensor parallelism. With **Task Grouping**, all instances from all tasks are grouped and evaluated in parallel, which significantly improves the throughput of the evaluation.

### Efficiency benchmark
Below are the total runtime on different datasets using 4 x A100 40G.
|Dataset|LLaVA-v1.5-7b|LLaVA-v1.5-13b|
|Dataset (#num)|LLaVA-v1.5-7b|LLaVA-v1.5-13b|
|-------|-------------|--------------|
|mme | 2 mins 43 seconds | 3 mins 27 seconds |
|gqa | 10 mins 43 seconds | 14 mins 23 seconds |
|scienceqa_img| 1 mins 58 seconds | 2 mins 52 seconds |
|ai2d | 3 mins 17 seconds | 4 mins 12 seconds |
|coco2017_cap_val| 14 mins 13 seconds | 19 mins 58 seconds |
|mme (2374) | 2 mins 43 seconds | 3 mins 27 seconds |
|gqa (12578) | 10 mins 43 seconds | 14 mins 23 seconds |
|scienceqa_img (2017) | 1 mins 58 seconds | 2 mins 52 seconds |
|ai2d (3088) | 3 mins 17 seconds | 4 mins 12 seconds |
|coco2017_cap_val (5000) | 14 mins 13 seconds | 19 mins 58 seconds |

### Prepared HF datasets.
We are hosting more than 40 (and it's increasing) datasets on [huggingface/lmms-lab](https://huggingface.co/lmms-lab), we carefully converted these datasets from original sources and included all variants, versions and splits. Now they can be directly accessed without any burden of data preprocessing. They also serve for the purpose of visualizing the data and grasping the sense of evaluation tasks distribution.
We are hosting more than 40 (and increasing) datasets on [huggingface/lmms-lab](https://huggingface.co/lmms-lab). We carefully converted these datasets from their original sources and included all variants, versions and splits, so they can now be accessed directly without any data-preprocessing burden. They also serve to visualize the data and to give a sense of how the evaluation tasks are distributed.
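As a quick, hypothetical illustration (not part of this commit), the hosted datasets can be pulled directly with the `datasets` library; the split name and any token requirement are assumptions based on the task configs shown later in this diff.

```python
# Minimal sketch: load one of the hosted evaluation datasets directly.
# The "test" split follows the MME task config; a HF token may be needed
# if the dataset is gated (dataset_kwargs: token: True).
from datasets import load_dataset

mme = load_dataset("lmms-lab/MME", split="test")
print(len(mme))       # number of evaluation instances
print(mme[0].keys())  # inspect the available fields
```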

<p align="center" width="100%">
<img src="https://i.postimg.cc/8PXFW9sk/WX20240228-123110-2x.png" width="100%" height="80%">
@@ -45,6 +44,8 @@ Including prompt pre-processing, output post-processing, answer extraction, mode
### Reproducible results (for LLaVA series models) and Logging Utilities.
We provide a set of pre-defined configurations & environments for llava-1.5, which can be directly used to reproduce the results in the paper.

You can refer to the [repr_scripts.sh](https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/dev/readme/miscs/repr_scripts.sh) we provide to see how to build and set up the environments to reproduce the results from the paper. However, this environment is not recommended when you evaluate your own model or other models, since it only installs the packages necessary to run llava and uses a lower pytorch version that may result in lower speed.

With `lmms-eval`, all evaluation details are recorded, including log samples and results, and report tables are generated for the terminal output and for Weights & Biases Runs/Tables.

> Development will be continuing on the main branch, and we encourage you to give us feedback on what features are desired and how to improve the library further, or ask questions, either in issues or PRs on GitHub.
@@ -70,6 +71,8 @@ cd LLaVA
pip install -e .
```

You can check the [environment install script](miscs/repr_scripts.sh) and [torch environment info](miscs/repr_torch_envs.txt) to reproduce LLaVA-1.5's paper results. We found that differences in torch/cuda versions can cause small variations in the results, so we provide a [results check](miscs/llava_result_check.md) across different environments.

If you want to test on caption datasets such as `coco`, `refcoco`, and `nocaps`, you will need `java==1.8.0` for the pycocoeval API to work. If you don't have it, you can install it with conda
```
conda install openjdk=8
@@ -209,10 +212,10 @@ Please refer to our [documentation](docs/README.md).

# Acknowledgement

The API, togegher with many code blocks of this project come from [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness). We recommend you to read through the [docs of lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs) for relevant informations.
lmms_eval is a fork of [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness). We recommend reading through the [docs of lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs) for relevant information.

Below are the changes we made to the original API:

- Build context now only pass in idx and process image and doc during the model responding phase. This is due to the fact that dataset now contains lots of images and we can't store them in the doc like the original lm-eval-harness other wise the memory would explode.
- Build context now only passes in the doc idx, and images and docs are processed during the model responding phase. This is because the datasets now contain lots of images and we can't store them in the docs like the original lm-eval-harness, otherwise the CPU memory would explode.
- Instance.args (lmms_eval/api/instance.py) now contains a list of images to be inputted to lmms.
- lm-eval-harness supports all HF language models as a single model class. Currently this is not possible for lmms because the input/output formats of lmms in HF are not yet unified. Therefore, we have to create a new class for each lmms model. This is not ideal and we will try to unify them in the future.
17 changes: 0 additions & 17 deletions demo.tape

This file was deleted.

7 changes: 3 additions & 4 deletions docs/README.md
@@ -6,7 +6,6 @@ Majority of this documentation is adapted from [lm-eval-harness](https://github.

## Table of Contents

* To learn about the command line flags, see the [commands](https://github.com/EvolvingLMMs-Lab/lmms-eval/tree/main/docs/commands.md)
* To learn how to add a new moddel, see the [Model Guide](https://github.com/EvolvingLMMs-Lab/lmms-eval/tree/main/docs/model_guide.md).
* For a crash course on adding new tasks to the library, see our [New Task Guide](https://github.com/EvolvingLMMs-Lab/lmms-eval/tree/main/docs/new_task_guide.md).
* To learn more about pushing the limits of task configuration that the Eval Harness supports, see the [Task Configuration Guide](https://github.com/EvolvingLMMs-Lab/lmms-eval/tree/main/docs/task_guide.md).
* To learn about the command line flags, see the [commands](commands.md)
* To learn how to add a new model, see the [Model Guide](model_guide.md).
* For a crash course on adding new tasks to the library, see our [Task Guide](task_guide.md).
2 changes: 1 addition & 1 deletion docs/commands.md
@@ -12,7 +12,7 @@ This mode supports a number of command-line arguments, the details of which can

* `--model_args` : Controls parameters passed to the model constructor. Accepts a string containing comma-separated keyword arguments to the model class of the format `"arg1=val1,arg2=val2,..."`, such as, for example `--model_args pretrained=liuhaotian/llava-v1.5-7b,batch_size=1`. For a full list of what keyword arguments, see the initialization of the corresponding model class in `lmms_eval/models/`.

* `--tasks` : Determines which tasks or task groups are evaluated. Accepts a comma-separated list of task names or task group names. Must be solely comprised of valid tasks/groups.
* `--tasks` : Determines which tasks or task groups are evaluated. Accepts a comma-separated list of task names or task group names. Must be solely comprised of valid tasks/groups. You can use `--tasks list` to see all the available tasks. If you add your own tasks but they do not show up in the list, you can try setting `--verbosity=DEBUG` to view the error message. You can also use `--tasks list_with_num` to check every task and the number of questions each task contains. However, `list_with_num` will download all the available datasets and may require a lot of memory and time.

* `--batch_size` : Sets the batch size used for evaluation. Can be a positive integer or `"auto"` to automatically select the largest batch size that will fit in memory, speeding up evaluation. One can pass `--batch_size auto:N` to re-select the maximum batch size `N` times during evaluation. This can help accelerate evaluation further, since `lm-eval` sorts documents in descending order of context length.

14 changes: 1 addition & 13 deletions docs/model_guide.md
@@ -19,9 +19,7 @@ Now, we'll create a new file where we'll be adding our model:
touch lmms_eval/models/<my_model_filename>.py
```

As a rule of thumb, we recommend you to use `lmms_eval/models/qwen_vl.py` and `lmms_eval/models/instructblip.py` as reference implementations for your model. You can copy and paste the contents of one of these files into your new file to get started.

**Tip: this filename should not shadow package names! For example, naming your file `anthropic.py` is disallowed since the API's name on pypi is `anthropic`, but naming it `anthropic_llms.py` works with no problems.**
**As a rule of thumb, we recommend using `lmms_eval/models/qwen_vl.py` and `lmms_eval/models/instructblip.py` as reference implementations for your model. You can copy and paste the contents of one of these files into your new file to get started.**

## Interface

@@ -35,11 +33,6 @@ class MyCustomLM(lmms):
def loglikelihood(self, requests: list[Instance]) -> list[tuple[float, bool]]:
#...


def loglikelihood_rolling(self, requests: list[Instance]) -> list[tuple[float, bool]]:
#...


def generate_until(self, requests: list[Instance]) -> list[str]:
#...
#...
@@ -61,11 +54,6 @@ All three request types take as input `requests` of type `list[Instance]` that h
- In each `Instance.args` there will be 6 elements, which are `contexts, doc_to_target, doc_to_visual, doc_id, task, split`. `contexts` refers to the formatted question and is the text input for the LMM. Sometimes it might contain an image token and needs to be handled differently for different models. `doc_to_target` is a function reference that gets the answer from the doc. This will be the continuation of the answer, and only tokens belonging to this part should be counted for the loglikelihood.
- Each request will have, as result, `(ll, is_greedy): Tuple[float, int]` returned, where `ll` is a floating point number representing the log probability of generating the target string conditioned on the input, and `is_greedy` is either the value `0` or `1`, with it being `1` if and only if the target string *would be generated by greedy sampling from the LM* (that is, if the target string is the *most likely* N-token string to be output by the LM given the input).

- `loglikelihood_rolling`
- Each request contains `Instance.args : Tuple[str]`, which is an input string to the model whose *entire* loglikelihood, conditioned on purely the EOT token, will be calculated.
- This is used to evaluate *perplexity* on a data distribution.
- It should return `(ll,) : Tuple[float]` , a.k.a. solely the *loglikelihood* of producing each piece of text given no starting input.
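To make this interface concrete, here is a minimal, hypothetical skeleton of a new model class (not taken from this commit). The import paths and the `register_model` decorator are assumptions about the package layout, and the unpacking of `Instance.args` follows the six-element layout described above.

```python
# Hypothetical sketch of a custom lmms model; names and import paths are assumed.
from typing import List, Tuple

from lmms_eval.api.instance import Instance
from lmms_eval.api.model import lmms                 # assumed base-class location
from lmms_eval.api.registry import register_model    # assumed registry helper


@register_model("my_custom_lmm")
class MyCustomLM(lmms):
    def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]:
        results = []
        for contexts, doc_to_target, doc_to_visual, doc_id, task, split in (
            req.args for req in requests
        ):
            # Render the document's images, score the target continuation with
            # the underlying model, and report whether greedy decoding would
            # have produced it (placeholder values shown here).
            ll, is_greedy = 0.0, False
            results.append((ll, is_greedy))
        return results

    def generate_until(self, requests: List[Instance]) -> List[str]:
        # Return one generated string per request, produced from the formatted
        # context and its images (placeholder output shown here).
        return ["" for _ in requests]
```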




113 changes: 113 additions & 0 deletions docs/task_guide.md
@@ -0,0 +1,113 @@
# Task Configuration

`lmms_eval` is meant to be an extensible and flexible framework within which many different evaluation tasks can be defined. All tasks in the new version of the harness are built around a YAML configuration file format.

These YAML configuration files, along with the current codebase commit hash, are intended to be shareable such that providing the YAML config enables another researcher to precisely replicate the evaluation setup used by another, in the case that the prompt or setup differs from standard `lmms_eval` task implementations.

While adding a standard evaluation task on a new dataset can be occasionally as simple as swapping out a Hugging Face dataset path in an existing file, more specialized evaluation setups also exist. Here we'll provide a crash course on the more advanced logic implementable in YAML form available to users.

## Good Reference Tasks

Contributing a new task can be daunting! Luckily, much of the work has often been done for you in a different, similarly evaluated task. Good examples of task implementations to study include:

Generation-based tasks:

- MME (`lmms_eval/tasks/mme/mme.yaml`)

```yaml
dataset_path: lmms-lab/MME
dataset_kwargs:
token: True
task: "mme"
test_split: test
output_type: generate_until
doc_to_visual: !function utils.mme_doc_to_visual
doc_to_text: !function utils.mme_doc_to_text
doc_to_target: "answer"
generation_kwargs:
max_new_tokens: 16
temperature: 0
top_p: 0
num_beams: 1
do_sample: false
# The return value of process_results will be used by metrics
process_results: !function utils.mme_process_results
# Note that the metric name can be either a registered metric function (such as the case for GQA) or a key name returned by process_results
metric_list:
- metric: mme_percetion_score
aggregation: !function utils.mme_aggregate_results
higher_is_better: true
- metric: mme_cognition_score
aggregation: !function utils.mme_aggregate_results
higher_is_better: true
model_specific_prompt_kwargs:
default:
pre_prompt: ""
post_prompt: "\nAnswer the question using a single word or phrase."
qwen_vl:
pre_prompt: ""
post_prompt: " Answer:"
metadata:
- version: 0.0
```
You can pay special attention to the `process_results` and `metric_list` fields, which are used to define how the model output is post-processed and scored.
Also, the `model_specific_prompt_kwargs` field is used to define model-specific prompt configurations. The default is set to follow Llava.
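To illustrate the shape of these hooks, here is a hedged Python sketch of a `process_results` function and a matching aggregation function of the kind referenced from YAML via `!function utils....`. The doc field name (`answer`) and the metric key are illustrative assumptions, not code from this commit.

```python
# Hypothetical post-processing hooks; field and metric names are assumed.
def example_process_results(doc, results):
    # `doc` is the dataset row; `results` holds the raw model output(s).
    prediction = results[0].strip().lower()
    target = doc["answer"].strip().lower()
    # The returned keys must match the metric names listed in metric_list.
    return {"example_exact_match": 1.0 if prediction == target else 0.0}


def example_aggregate_results(scores):
    # Referenced via `aggregation: !function utils....`; receives the list of
    # per-document values returned under the metric key above.
    return sum(scores) / max(len(scores), 1)
```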

PPL-based tasks:
- Seedbench (`lmms_eval/tasks/seedbench/seedbench_ppl.yaml`)

```yaml
dataset_path: lmms-lab/SEED-Bench
dataset_kwargs:
token: True
task: "seedbench_ppl"
test_split: test
output_type: multiple_choice
doc_to_visual: !function utils.seed_doc_to_visual
doc_to_text: !function utils.seed_doc_to_text_mc
doc_to_choice : !function utils.seed_doc_to_choice
doc_to_target: !function utils.seed_doc_to_mc_target
# Note that the metric name can be either a registered metric function (such as the case for GQA) or a key name returned by process_results
metric_list:
- metric: acc
metadata:
- version: 0.0
```

## Configurations

Tasks are configured via the `TaskConfig` object. Below, we describe all fields usable within the object, and their role in defining a task.

### Parameters

Task naming + registration:
- **task** (`str`, defaults to None) — name of the task.
- **group** (`str`, *optional*) — name of the task group(s) a task belongs to. Enables one to run all tasks with a specified tag or group name at once.

Dataset configuration options:
- **dataset_path** (`str`) — The name of the dataset as listed by HF in the datasets Hub.
- **dataset_name** (`str`, *optional*, defaults to None) — The name of what HF calls a “config” or sub-task of the benchmark. If your task does not contain any data instances, just leave this to default to None. (If you're familiar with the HF `datasets.load_dataset` function, these are just the first 2 arguments to it.)
- **dataset_kwargs** (`dict`, *optional*) — Auxiliary arguments that `datasets.load_dataset` accepts. This can be used to specify arguments such as `data_files` or `data_dir` if you want to use local datafiles such as json or csv.
- **training_split** (`str`, *optional*) — Split in the dataset to use as the training split.
- **validation_split** (`str`, *optional*) — Split in the dataset to use as the validation split.
- **test_split** (`str`, *optional*) — Split in the dataset to use as the test split.
- **fewshot_split** (`str`, *optional*) — Split in the dataset to draw few-shot exemplars from. Assert that this is not None if num_fewshot > 0. **This function is not well tested so far**
- **process_docs** (`Callable`, *optional*) — Optionally define a function to apply to each HF dataset split, to preprocess all documents before being fed into prompt template rendering or other evaluation steps. Can be used to rename dataset columns, or to process documents into a format closer to that expected by a prompt template.

Prompting / in-context formatting options:
- **doc_to_text** (`Union[Callable, str]`, *optional*) — Column name or function to process a sample into the appropriate input for the model.
- **doc_to_visual** (`Union[Callable, str]`, *optional*) — Function to process a sample into the appropriate input images for the model (a hypothetical sketch of such helpers appears after this parameter list).
- **doc_to_target** (`Union[Callable, str]`, *optional*) — Column name or function to process a sample into the appropriate target output for the model. For multiple choice tasks, this should return the index of the correct answer choice.
- **doc_to_choice** (`Union[Callable, str]`, *optional*) — Column name or function to process a sample into a list of possible string choices for `multiple_choice` tasks. Left undefined for `generate_until` tasks.

Runtime configuration options:
- **num_fewshot** (`int`, *optional*, defaults to 0) — Number of few-shot examples before the input. **This function is not well tested so far**
- **batch_size** (`int`, *optional*, defaults to 1) — Batch size.

**So far some models (such as qwen) may not support batch size > 1. Some models (such as llava) will generate different scores for different batch sizes. We recommend setting batch size to 1 for final benchmarking runs.**

Scoring details:
- **metric_list** (`str`, *optional*, defaults to None) — A list of metrics to use for evaluation.
- **output_type** (`str`, *optional*, defaults to "generate_until") — Selects the type of model output for the given task. Options are `generate_until`, `loglikelihood`, and `multiple_choice`.
- **generation_kwargs** (`dict`, *optional*) — Auxiliary arguments for the `generate` function from HF transformers library. Advanced keyword arguments may not be supported for non-HF LM classes.
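Since several of the fields above are function references, the following is a hedged Python sketch of what `doc_to_visual` and `doc_to_text` helpers might look like. The column names (`image`, `question`), the prompt wording, and the optional kwargs argument are illustrative assumptions rather than code from this commit.

```python
# Hypothetical doc_to_visual / doc_to_text helpers referenced from YAML via !function.
def example_doc_to_visual(doc):
    # Return a list of PIL images for this document (field name assumed).
    return [doc["image"].convert("RGB")]


def example_doc_to_text(doc, model_specific_prompt_kwargs=None):
    # Wrap the raw question with any model-specific pre/post prompts, mirroring
    # the model_specific_prompt_kwargs block shown in the MME YAML above.
    pre, post = "", ""
    if model_specific_prompt_kwargs:
        pre = model_specific_prompt_kwargs.get("pre_prompt", "")
        post = model_specific_prompt_kwargs.get("post_prompt", "")
    return f"{pre}{doc['question']}{post}"
```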
33 changes: 33 additions & 0 deletions llava_repr_requirements.txt
@@ -0,0 +1,33 @@
llava@git+https://github.com/haotian-liu/LLaVA@v1.1.3
accelerate>=0.21.0
black==24.1.0
datasets==2.16.1
evaluate>=0.4.0
jsonlines
numexpr
peft>=0.2.0
pybind11>=2.6.2
pytablewriter
rouge-score>=0.0.4
sacrebleu>=1.5.0
scikit-learn>=0.24.1
sqlitedict
torch==2.0.1
openai>=1.0.0
pycocoevalcap
tqdm-multiprocess
transformers>=4.36.2
zstandard
pillow
pyyaml
sympy
mpmath
Jinja2
openpyxl
Levenshtein
hf_transfer
tenacity
wandb>=0.16.0
transformers-stream-generator
tiktoken
pre-commit
2 changes: 2 additions & 0 deletions lmms_eval/__main__.py
@@ -298,6 +298,8 @@ def cli_evaluate_single(args: Union[argparse.Namespace, None] = None) -> None:
if results is not None:
if args.log_samples:
samples = results.pop("samples")
else:
samples = None
dumped = json.dumps(results, indent=4, default=_handle_non_serializable)
if args.show_config:
print(dumped)
2 changes: 1 addition & 1 deletion lmms_eval/api/instance.py
@@ -4,7 +4,7 @@

@dataclass
class Instance:
request_type: Literal["loglikelihood", "loglikelihood_rolling", "generate_until"]
request_type: Literal["loglikelihood", "generate_until"]
arguments: tuple
idx: int
metadata: Tuple[str, int, int] = field(default_factory=lambda: (None, None, None)) # TODO: better typehints here
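For context, a small hypothetical example of constructing such an `Instance` for a `generate_until` request follows; the contents of `arguments` and the meaning of `metadata` are assumptions based on the type hints and on docs/model_guide.md, not code from this commit.

```python
# Hypothetical Instance construction; the argument tuple layout is assumed.
from lmms_eval.api.instance import Instance

instance = Instance(
    request_type="generate_until",
    arguments=("Describe the image.", {"max_new_tokens": 16}, None, 0, "mme", "test"),
    idx=0,
    metadata=("mme", 0, 1),  # assumed to be (task_name, doc_id, repeats)
)
```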